Upload
independent
View
0
Download
0
Embed Size (px)
Citation preview
ARTICLE IN PRESS
Information Systems 29 (2004) 551ndash582
$This work
Education Un
COFIN 2001 P
for Data Qual
(httpwwwd
ject is a joint res
Roma lsquolsquoLa Sap
carsquorsquo and DEImdash
Correspond
06853-00849
E-mail addr
(M Scannapiec
marchetdisu
uniroma1it (M
0306-4379$ - se
doi101016jis
The DaQuinCIS architecture a platform for exchanging andimproving data quality in cooperative information systems$
Monica Scannapieco Antonino Virgillito Carlo Marchetti Massimo MecellaRoberto Baldoni
Dipartimento di Informatica e Sistemistica Universit a di Roma lsquolsquoLa Sapienzarsquorsquo Via Salaria 113 Roma 00198 Italy
Received 30 May 2003 received in revised form 1 November 2003 accepted 15 December 2003
Abstract
In cooperative information systems the quality of data exchanged and provided by different data sources is
extremely important A lack of attention to data quality can imply data of low quality to spread all over the cooperative
system At the same time improvement can be based on comparing data correcting them and thus disseminating high
quality data In this paper we present an architecture for managing data quality in cooperative information systems by
focusing on two specific modules the Data Quality Broker and the Quality Notification Service The Data Quality
Broker allows for querying and improving data quality values The Quality Notification Service is specifically targeted
to the dissemination of changes on data quality values
r 2003 Published by Elsevier Ltd
Keywords Cooperative information system Data quality XML model Data integration Publish amp subscribe
1 Introduction
A Cooperative Information System (CIS) is alarge scale information system that interconnects
is supported by the Italian Ministry for
iversity and Research (MIUR) through the
roject lsquolsquoDaQuinCISmdashMethodologies and Tools
ity inside Cooperative Information Systemsrsquorsquo
isuniroma1it~dq) The DaQuinCIS pro-
earch project carried out by DISmdashUniversita di
ienzarsquorsquo DISComdashUniversita di Milano lsquolsquoBicoc-
Politecnico di Milano
ing author Tel +39-06499-18479 fax +39-
esses monscandisuniroma1it
o) virgidisuniroma1it (A Virgillito)
niroma1it (C Marchetti) mecelladis
Mecella) baldonidisuniroma1it (R Baldoni)
e front matter r 2003 Published by Elsevier Ltd
200312004
various systems of different and autonomousorganizations geographically distributed andsharing common objectives [1] Among the differ-ent resources that are shared by organizationsdata are fundamental in real world scenarios anorganization A may not request data from anorganization B if it does not lsquolsquotrustrsquorsquo B data ie ifA does not know that the quality of the data thatB can provide is high As an example in ane-Government scenario in which public adminis-trations cooperate in order to fulfill servicerequests from citizens and enterprises [2] admin-istrations very often prefer asking citizens for datarather than other administrations that have storedthe same data because the quality of such data isnot known Therefore lack of cooperation mayoccur due to lack of quality certification
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582552
Uncertified quality can also cause a deteriora-tion of the data quality inside single organizationsIf organizations exchange data without knowingtheir actual quality it may happen that data of lowquality spread all over the CIS On the other handCISrsquos are characterized by high data replicationie different copies of the same data are stored bydifferent organizations From a data qualityperspective this is a great opportunity improve-ment actions can be carried out on the basis ofcomparisons among different copies in ordereither to select the most appropriate one or toreconcile available copies thus producing a newimproved copy to be notified to all interestedorganizations
In this paper we propose an architecture formanaging data quality in CISrsquos The architectureaims to avoid dissemination of low qualified datathrough the CIS by providing a support for dataquality diffusion and improvement To the best ofour knowledge this is a novel contribution in theinformation quality area that aims at integratingin the specific context of CISrsquos both modeling andarchitectural issues
The TDQM CIS methodological cycle [3] de-fines the methodological framework in which the
Fig 1 The phases of the TDQM CIS cycle grey p
proposed architecture fits The TDQM CIS cyclederives from the extension of the TDQM cycle [4]to the context of CISrsquos it consists of the fivephases of Definition Measurement ExchangeAnalysis and Improvement (see Fig 1)
The Definition phase implies the definition of amodel for the data exported by each cooperatingorganization and for the quality data associated tothem Quality is a multi-dimensional concept data
quality dimensions characterize properties that areinherent to data [56] Both data and quality datacan be exported as XML documents as furtherdetailed in Section 3 The Measurement phaseconsists of the evaluation of the data qualitydimensions for the exported data The Exchange
phase implies the exact definition of the exchangedinformation consisting of data and of appropriatequality data corresponding to values of dataquality dimensions The Analysis phase regardsthe interpretation of the quality values containedin the exchanged information by the organizationthat receives it Finally the Improvement phaseconsists of all the actions that enable improve-ments of cooperative data by exploiting theopportunities offered by the cooperative environ-ment An example of improvement action is based
hases are specifically considered in this work
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 553
on the analysis phase described above interpretingthe quality of exchanged information gives theopportunity of sending accurate feedbacks to datasource organizations which can then implementcorrection actions to improve their quality
The architecture presented in this paper consistsof different modules supporting each of thedescribed phases Nevertheless in this paper wefocus on two specific modules namely the DataQuality Broker and the Quality NotificationService which mainly support the Exchange andImprovement phases of the TDQM CIS cycle Amodel specifically addressing the Definition phaseis also discussed In Fig 1 the phases of theTDQM CIS cycle specifically addressed in thispaper are grey-colored
The structure of the paper is as follows InSection 2 a framework for Cooperative Informa-tion Systems is described and the proposedarchitecture is outlined furthermore a runningexample for exemplifying the contributions isintroduced the example stems from the Italiane-Government scenario In Section 3 the modelfor both the exchanging of data and their quality ispresented In Section 4 the Data Quality Broker isdescribed and its distributed design is discussed InSection 5 the Quality Notification Service isillustrated In Section 6 related research work ispresented Finally Section 7 concludes the paperand points out the possible future extensions andimprovement of the architecture
1 lsquolsquoDaQuinCISmdashMethodologies and Tools for Data Quality
inside Cooperative Information Systemsrsquorsquo (httpwwwdis
uniroma1it~dq) is a joint research project carried out by
DISmdashUniversita di Roma lsquolsquoLa Sapienzarsquorsquo DISComdashUniversita
di Milano lsquolsquoBicoccarsquorsquo and DEImdashPolitecnico di Milano
2 A cooperative framework for data quality
In current government and business scenariosorganizations start cooperating in order to offerservices to their customers and partners Organi-zations that cooperate have business links (ierelationships exchanged documents resourcesknowledge etc) connecting each other Specifi-cally organizations exploit business services (egthey exchange data or require services to be carriedout) on the basis of business links and thereforethe network of organizations and business linksconstitutes a cooperative business system
As an example a supply chain in which someenterprises offer basic products and some others
assemble them in order to deliver final products tocustomers is a cooperative business system Asanother example a set of public administrationswhich need to exchange information about citizensand their health state in order to provide socialaids is a cooperative business system derived fromthe Italian e-Government scenario [2]
A cooperative business system exists indepen-dently of the presence of a software infrastructuresupporting electronic data exchange and serviceprovisioning Indeed cooperative information sys-tems are software systems supporting cooperativebusiness systems in the remaining of this paperthe following definition of CIS is considered
A cooperative information system is formed by a
set of organizations fOrg1yOrgng that cooperate
through a communication infrastructure N which
provide software services to organizations as wells
as reliable connectivity Each organization Orgi is
connected to N through a gateway Gi on which
software services offered by Orgi to other organiza-
tions are deployed A user is a software or human
entity residing within an organization and using the
cooperative systemSeveral CISrsquos are characterized by a high degree
of data replicated in different organizations as anexample in an e-Government scenario the perso-nal data of a citizen are stored by almost alladministrations But in such scenarios the differ-ent organizations can provide the same data withdifferent quality levels thus any user of data mayappreciate to exploit the data with the highestquality level among the provided ones
Therefore only the highest quality data shouldbe returned to the user limiting the disseminationof low quality data Moreover the comparison ofthe gathered data values might be used to enforcea general improvement of data quality in allorganizations
In the context of the DaQuinCIS project1 weare proposing an architecture for the managementof data quality in CISrsquos this architecture allows
ARTICLE IN PRESS
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
Orgn
CooperativeGateway
Orgn
RatingService
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
OrgnOrgn
RatingService
CooperativeGateway
Fig 2 The DaQuinCIS architecture
M Scannapieco et al Information Systems 29 (2004) 551ndash582554
the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2
Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q
model is described in Section 3In order to produce data and quality data
according to the D2Q model each organizationdeploys on its cooperative gateway a Quality
Factory service that is responsible for evaluatingthe quality of its own data The design of the
Quality Factory is out of the scope of this paperand has been addressed in [7]
The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering
function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function
The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4
The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the
ARTICLE IN PRESS
-BusinessName-Code
Business
-FiscalCode-Address-Name
Owner
1
(a)
-FiscalCode-Address-Name
Owner
-EnterpriseName-Code
Enterprise
1
(b)
-EntName-Code-LegalAddress
Enterprise
-FiscalCode-Address-Name
Holder
1
(c)
Fig 3 Local schemas used in the running example conceptual
representation (a) Local schema of INAIL (b) local schema of
INPS (c) Local schema of CoC
M Scannapieco et al Information Systems 29 (2004) 551ndash582 555
Quality Notification Service is deployed as a peer-to-peer system
The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]
21 Running example
We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues
In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information
2Examples of business events are the starting of a new
businesscompany or the closing down variations in legal
status board composition and senior management variations
in the number of employees as well as opening a new location
filling for a patent etc
that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation
Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality
In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown
3 The D2Q model
According to the DaQuinCIS architecture allcooperating organizations export their application
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582556
data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35
31 Data quality dimensions
Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]
In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]
In the following the general concept of schema
element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency
311 Accuracy
In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically
Accuracy is the distance between v and v0 being v0
the value considered as correctLet us consider the following example Citizen
is a schema element with an attribute Name and p
is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names
Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another
Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined
312 Completeness
Completeness is the degree to which values of a
schema element are present in the schema element
instanceAccording to such a definition we can consider
(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present
A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others
In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values
As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific
ARTICLE IN PRESS
3Basic types are the ones provided by the most common
programming languages and SQL that is Integer Real
Boolean String Date Time Interval Currency Any
M Scannapieco et al Information Systems 29 (2004) 551ndash582 557
citizen has an e-mail address which has not beenstored this is a case of low completeness
313 Currency
The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-
Birth can be considered invariant Currency canbe defined as
Currency is the distance between the instant when
a value changes in the real world and the instant
when the value itself is modified in the information
systemMore complex definitions of currency have been
proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work
Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]
314 Consistency
Consistency implies that two or more values donot conflict each other
A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as
Consistency is the degree to which the values of
the attributes of an instance of a schema element
satisfy the specific set of semantic rules defined on
the schema elementAs an example if we consider Citizen with
attributes Name DateOfBirth Sex and DateOf-
Death some possible semantic rules to be checkedare
the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency
the value of pDateOfBirth must precede thevalue of pDateOfDeath
Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])
32 The D2Q data model
The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q
quality schemas corresponding to the qualityportion of the D2Q model
Definition 31 (Data class) A data class
d ethnamedp1y pnTHORN consists of
a name named a set of properties pi frac14 namei typeiS i frac14
1y n nX1 where namei is the name of theproperty pi and typei can be
3
either a basic type3
3
or a data class
3
or a type set-of XS where XS can beeither a basic type or a data class
Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features
a set of nodes ND each of which is a data class a set of leaves T D that are all basic type
properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14
LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582552
Uncertified quality can also cause a deteriora-tion of the data quality inside single organizationsIf organizations exchange data without knowingtheir actual quality it may happen that data of lowquality spread all over the CIS On the other handCISrsquos are characterized by high data replicationie different copies of the same data are stored bydifferent organizations From a data qualityperspective this is a great opportunity improve-ment actions can be carried out on the basis ofcomparisons among different copies in ordereither to select the most appropriate one or toreconcile available copies thus producing a newimproved copy to be notified to all interestedorganizations
In this paper we propose an architecture formanaging data quality in CISrsquos The architectureaims to avoid dissemination of low qualified datathrough the CIS by providing a support for dataquality diffusion and improvement To the best ofour knowledge this is a novel contribution in theinformation quality area that aims at integratingin the specific context of CISrsquos both modeling andarchitectural issues
The TDQM CIS methodological cycle [3] de-fines the methodological framework in which the
Fig 1 The phases of the TDQM CIS cycle grey p
proposed architecture fits The TDQM CIS cyclederives from the extension of the TDQM cycle [4]to the context of CISrsquos it consists of the fivephases of Definition Measurement ExchangeAnalysis and Improvement (see Fig 1)
The Definition phase implies the definition of amodel for the data exported by each cooperatingorganization and for the quality data associated tothem Quality is a multi-dimensional concept data
quality dimensions characterize properties that areinherent to data [56] Both data and quality datacan be exported as XML documents as furtherdetailed in Section 3 The Measurement phaseconsists of the evaluation of the data qualitydimensions for the exported data The Exchange
phase implies the exact definition of the exchangedinformation consisting of data and of appropriatequality data corresponding to values of dataquality dimensions The Analysis phase regardsthe interpretation of the quality values containedin the exchanged information by the organizationthat receives it Finally the Improvement phaseconsists of all the actions that enable improve-ments of cooperative data by exploiting theopportunities offered by the cooperative environ-ment An example of improvement action is based
hases are specifically considered in this work
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 553
on the analysis phase described above interpretingthe quality of exchanged information gives theopportunity of sending accurate feedbacks to datasource organizations which can then implementcorrection actions to improve their quality
The architecture presented in this paper consistsof different modules supporting each of thedescribed phases Nevertheless in this paper wefocus on two specific modules namely the DataQuality Broker and the Quality NotificationService which mainly support the Exchange andImprovement phases of the TDQM CIS cycle Amodel specifically addressing the Definition phaseis also discussed In Fig 1 the phases of theTDQM CIS cycle specifically addressed in thispaper are grey-colored
The structure of the paper is as follows InSection 2 a framework for Cooperative Informa-tion Systems is described and the proposedarchitecture is outlined furthermore a runningexample for exemplifying the contributions isintroduced the example stems from the Italiane-Government scenario In Section 3 the modelfor both the exchanging of data and their quality ispresented In Section 4 the Data Quality Broker isdescribed and its distributed design is discussed InSection 5 the Quality Notification Service isillustrated In Section 6 related research work ispresented Finally Section 7 concludes the paperand points out the possible future extensions andimprovement of the architecture
1 lsquolsquoDaQuinCISmdashMethodologies and Tools for Data Quality
inside Cooperative Information Systemsrsquorsquo (httpwwwdis
uniroma1it~dq) is a joint research project carried out by
DISmdashUniversita di Roma lsquolsquoLa Sapienzarsquorsquo DISComdashUniversita
di Milano lsquolsquoBicoccarsquorsquo and DEImdashPolitecnico di Milano
2 A cooperative framework for data quality
In current government and business scenariosorganizations start cooperating in order to offerservices to their customers and partners Organi-zations that cooperate have business links (ierelationships exchanged documents resourcesknowledge etc) connecting each other Specifi-cally organizations exploit business services (egthey exchange data or require services to be carriedout) on the basis of business links and thereforethe network of organizations and business linksconstitutes a cooperative business system
As an example a supply chain in which someenterprises offer basic products and some others
assemble them in order to deliver final products tocustomers is a cooperative business system Asanother example a set of public administrationswhich need to exchange information about citizensand their health state in order to provide socialaids is a cooperative business system derived fromthe Italian e-Government scenario [2]
A cooperative business system exists indepen-dently of the presence of a software infrastructuresupporting electronic data exchange and serviceprovisioning Indeed cooperative information sys-tems are software systems supporting cooperativebusiness systems in the remaining of this paperthe following definition of CIS is considered
A cooperative information system is formed by a
set of organizations fOrg1yOrgng that cooperate
through a communication infrastructure N which
provide software services to organizations as wells
as reliable connectivity Each organization Orgi is
connected to N through a gateway Gi on which
software services offered by Orgi to other organiza-
tions are deployed A user is a software or human
entity residing within an organization and using the
cooperative systemSeveral CISrsquos are characterized by a high degree
of data replicated in different organizations as anexample in an e-Government scenario the perso-nal data of a citizen are stored by almost alladministrations But in such scenarios the differ-ent organizations can provide the same data withdifferent quality levels thus any user of data mayappreciate to exploit the data with the highestquality level among the provided ones
Therefore only the highest quality data shouldbe returned to the user limiting the disseminationof low quality data Moreover the comparison ofthe gathered data values might be used to enforcea general improvement of data quality in allorganizations
In the context of the DaQuinCIS project1 weare proposing an architecture for the managementof data quality in CISrsquos this architecture allows
ARTICLE IN PRESS
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
Orgn
CooperativeGateway
Orgn
RatingService
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
OrgnOrgn
RatingService
CooperativeGateway
Fig 2 The DaQuinCIS architecture
M Scannapieco et al Information Systems 29 (2004) 551ndash582554
the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2
Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q
model is described in Section 3In order to produce data and quality data
according to the D2Q model each organizationdeploys on its cooperative gateway a Quality
Factory service that is responsible for evaluatingthe quality of its own data The design of the
Quality Factory is out of the scope of this paperand has been addressed in [7]
The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering
function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function
The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4
The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the
ARTICLE IN PRESS
-BusinessName-Code
Business
-FiscalCode-Address-Name
Owner
1
(a)
-FiscalCode-Address-Name
Owner
-EnterpriseName-Code
Enterprise
1
(b)
-EntName-Code-LegalAddress
Enterprise
-FiscalCode-Address-Name
Holder
1
(c)
Fig 3 Local schemas used in the running example conceptual
representation (a) Local schema of INAIL (b) local schema of
INPS (c) Local schema of CoC
M Scannapieco et al Information Systems 29 (2004) 551ndash582 555
Quality Notification Service is deployed as a peer-to-peer system
The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]
21 Running example
We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues
In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information
2Examples of business events are the starting of a new
businesscompany or the closing down variations in legal
status board composition and senior management variations
in the number of employees as well as opening a new location
filling for a patent etc
that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation
Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality
In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown
3 The D2Q model
According to the DaQuinCIS architecture allcooperating organizations export their application
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582556
data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35
31 Data quality dimensions
Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]
In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]
In the following the general concept of schema
element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency
311 Accuracy
In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically
Accuracy is the distance between v and v0 being v0
the value considered as correctLet us consider the following example Citizen
is a schema element with an attribute Name and p
is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names
Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another
Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined
312 Completeness
Completeness is the degree to which values of a
schema element are present in the schema element
instanceAccording to such a definition we can consider
(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present
A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others
In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values
As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific
ARTICLE IN PRESS
3Basic types are the ones provided by the most common
programming languages and SQL that is Integer Real
Boolean String Date Time Interval Currency Any
M Scannapieco et al Information Systems 29 (2004) 551ndash582 557
citizen has an e-mail address which has not beenstored this is a case of low completeness
313 Currency
The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-
Birth can be considered invariant Currency canbe defined as
Currency is the distance between the instant when
a value changes in the real world and the instant
when the value itself is modified in the information
systemMore complex definitions of currency have been
proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work
Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]
314 Consistency
Consistency implies that two or more values donot conflict each other
A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as
Consistency is the degree to which the values of
the attributes of an instance of a schema element
satisfy the specific set of semantic rules defined on
the schema elementAs an example if we consider Citizen with
attributes Name DateOfBirth Sex and DateOf-
Death some possible semantic rules to be checkedare
the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency
the value of pDateOfBirth must precede thevalue of pDateOfDeath
Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])
32 The D2Q data model
The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q
quality schemas corresponding to the qualityportion of the D2Q model
Definition 31 (Data class) A data class
d ethnamedp1y pnTHORN consists of
a name named a set of properties pi frac14 namei typeiS i frac14
1y n nX1 where namei is the name of theproperty pi and typei can be
3
either a basic type3
3
or a data class
3
or a type set-of XS where XS can beeither a basic type or a data class
Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features
a set of nodes ND each of which is a data class a set of leaves T D that are all basic type
properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14
LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 553
on the analysis phase described above interpretingthe quality of exchanged information gives theopportunity of sending accurate feedbacks to datasource organizations which can then implementcorrection actions to improve their quality
The architecture presented in this paper consistsof different modules supporting each of thedescribed phases Nevertheless in this paper wefocus on two specific modules namely the DataQuality Broker and the Quality NotificationService which mainly support the Exchange andImprovement phases of the TDQM CIS cycle Amodel specifically addressing the Definition phaseis also discussed In Fig 1 the phases of theTDQM CIS cycle specifically addressed in thispaper are grey-colored
The structure of the paper is as follows InSection 2 a framework for Cooperative Informa-tion Systems is described and the proposedarchitecture is outlined furthermore a runningexample for exemplifying the contributions isintroduced the example stems from the Italiane-Government scenario In Section 3 the modelfor both the exchanging of data and their quality ispresented In Section 4 the Data Quality Broker isdescribed and its distributed design is discussed InSection 5 the Quality Notification Service isillustrated In Section 6 related research work ispresented Finally Section 7 concludes the paperand points out the possible future extensions andimprovement of the architecture
1 lsquolsquoDaQuinCISmdashMethodologies and Tools for Data Quality
inside Cooperative Information Systemsrsquorsquo (httpwwwdis
uniroma1it~dq) is a joint research project carried out by
DISmdashUniversita di Roma lsquolsquoLa Sapienzarsquorsquo DISComdashUniversita
di Milano lsquolsquoBicoccarsquorsquo and DEImdashPolitecnico di Milano
2 A cooperative framework for data quality
In current government and business scenariosorganizations start cooperating in order to offerservices to their customers and partners Organi-zations that cooperate have business links (ierelationships exchanged documents resourcesknowledge etc) connecting each other Specifi-cally organizations exploit business services (egthey exchange data or require services to be carriedout) on the basis of business links and thereforethe network of organizations and business linksconstitutes a cooperative business system
As an example a supply chain in which someenterprises offer basic products and some others
assemble them in order to deliver final products tocustomers is a cooperative business system Asanother example a set of public administrationswhich need to exchange information about citizensand their health state in order to provide socialaids is a cooperative business system derived fromthe Italian e-Government scenario [2]
A cooperative business system exists indepen-dently of the presence of a software infrastructuresupporting electronic data exchange and serviceprovisioning Indeed cooperative information sys-tems are software systems supporting cooperativebusiness systems in the remaining of this paperthe following definition of CIS is considered
A cooperative information system is formed by a
set of organizations fOrg1yOrgng that cooperate
through a communication infrastructure N which
provide software services to organizations as wells
as reliable connectivity Each organization Orgi is
connected to N through a gateway Gi on which
software services offered by Orgi to other organiza-
tions are deployed A user is a software or human
entity residing within an organization and using the
cooperative systemSeveral CISrsquos are characterized by a high degree
of data replicated in different organizations as anexample in an e-Government scenario the perso-nal data of a citizen are stored by almost alladministrations But in such scenarios the differ-ent organizations can provide the same data withdifferent quality levels thus any user of data mayappreciate to exploit the data with the highestquality level among the provided ones
Therefore only the highest quality data shouldbe returned to the user limiting the disseminationof low quality data Moreover the comparison ofthe gathered data values might be used to enforcea general improvement of data quality in allorganizations
In the context of the DaQuinCIS project1 weare proposing an architecture for the managementof data quality in CISrsquos this architecture allows
ARTICLE IN PRESS
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
Orgn
CooperativeGateway
Orgn
RatingService
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
OrgnOrgn
RatingService
CooperativeGateway
Fig 2 The DaQuinCIS architecture
M Scannapieco et al Information Systems 29 (2004) 551ndash582554
the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2
Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q
model is described in Section 3In order to produce data and quality data
according to the D2Q model each organizationdeploys on its cooperative gateway a Quality
Factory service that is responsible for evaluatingthe quality of its own data The design of the
Quality Factory is out of the scope of this paperand has been addressed in [7]
The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering
function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function
The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4
The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the
ARTICLE IN PRESS
-BusinessName-Code
Business
-FiscalCode-Address-Name
Owner
1
(a)
-FiscalCode-Address-Name
Owner
-EnterpriseName-Code
Enterprise
1
(b)
-EntName-Code-LegalAddress
Enterprise
-FiscalCode-Address-Name
Holder
1
(c)
Fig 3 Local schemas used in the running example conceptual
representation (a) Local schema of INAIL (b) local schema of
INPS (c) Local schema of CoC
M Scannapieco et al Information Systems 29 (2004) 551ndash582 555
Quality Notification Service is deployed as a peer-to-peer system
The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]
21 Running example
We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues
In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information
2Examples of business events are the starting of a new
businesscompany or the closing down variations in legal
status board composition and senior management variations
in the number of employees as well as opening a new location
filling for a patent etc
that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation
Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality
In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown
3 The D2Q model
According to the DaQuinCIS architecture allcooperating organizations export their application
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582556
data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35
31 Data quality dimensions
Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]
In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]
In the following the general concept of schema
element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency
311 Accuracy
In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically
Accuracy is the distance between v and v0 being v0
the value considered as correctLet us consider the following example Citizen
is a schema element with an attribute Name and p
is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names
Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another
Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined
312 Completeness
Completeness is the degree to which values of a
schema element are present in the schema element
instanceAccording to such a definition we can consider
(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present
A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others
In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values
As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific
ARTICLE IN PRESS
3Basic types are the ones provided by the most common
programming languages and SQL that is Integer Real
Boolean String Date Time Interval Currency Any
M Scannapieco et al Information Systems 29 (2004) 551ndash582 557
citizen has an e-mail address which has not beenstored this is a case of low completeness
313 Currency
The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-
Birth can be considered invariant Currency canbe defined as
Currency is the distance between the instant when
a value changes in the real world and the instant
when the value itself is modified in the information
systemMore complex definitions of currency have been
proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work
Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]
314 Consistency
Consistency implies that two or more values donot conflict each other
A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as
Consistency is the degree to which the values of
the attributes of an instance of a schema element
satisfy the specific set of semantic rules defined on
the schema elementAs an example if we consider Citizen with
attributes Name DateOfBirth Sex and DateOf-
Death some possible semantic rules to be checkedare
the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency
the value of pDateOfBirth must precede thevalue of pDateOfDeath
Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])
32 The D2Q data model
The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q
quality schemas corresponding to the qualityportion of the D2Q model
Definition 31 (Data class) A data class
d ethnamedp1y pnTHORN consists of
a name named a set of properties pi frac14 namei typeiS i frac14
1y n nX1 where namei is the name of theproperty pi and typei can be
3
either a basic type3
3
or a data class
3
or a type set-of XS where XS can beeither a basic type or a data class
Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features
a set of nodes ND each of which is a data class a set of leaves T D that are all basic type
properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14
LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
Orgn
CooperativeGateway
Orgn
RatingService
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Quality NotificationService (QNS)
Data QualityBroker (DQB)
QualityFactory (QF)
back-end systems
internals of theorganization
Org1
CooperativeGateway
Org2
CooperativeGateway
Org2
CommunicationInfrastructure
CooperativeGateway
OrgnOrgn
RatingService
CooperativeGateway
Fig 2 The DaQuinCIS architecture
M Scannapieco et al Information Systems 29 (2004) 551ndash582554
the diffusion of data and related quality andexploits data replication to improve the overallquality of cooperative data Each organizationoffers services to other organizations on its owncooperative gateway and also specific services toits internal back-end systems Therefore coopera-tive gateways interface both internally and ex-ternally through services Moreover thecommunication infrastructure itself offers somespecific services Services are all identical and peerie they are instances of the same softwareartifacts and act both as servers and clients ofthe other peers depending of the specific activitiesto be carried out The overall architecture isdepicted in Fig 2
Organizations export data and quality dataaccording to a common model referred to asData and Data Quality ethD2QTHORN model It includesthe definitions of (i) constructs to represent data(ii) a common set of data quality properties (iii)constructs to represent them and (iv) the associa-tion between data and quality data The D2Q
model is described in Section 3In order to produce data and quality data
according to the D2Q model each organizationdeploys on its cooperative gateway a Quality
Factory service that is responsible for evaluatingthe quality of its own data The design of the
Quality Factory is out of the scope of this paperand has been addressed in [7]
The Data Quality Broker poses on behalf of arequesting user a data request over other co-operating organizations also specifying a set ofquality requirements that the desired data have tosatisfy this is referred to as quality brokering
function Different copies of the same data receivedas responses to the request are reconciled and abest-quality value is selected and proposed toorganizations that can choose to discard their dataand to adopt higher quality ones this is referred toas quality improvement function
The Data Quality Broker is in essence a peer-to-peer data integration system [8] which allows topose quality-enhanced query over a global schemaand to select data satisfying such requirementsThe Data Quality Broker is described in Section 4
The Quality Notification Service is a publishsubscribe engine used as a quality message busbetween services andor organizations Morespecifically it allows quality-based subscriptionsfor users to be notified on changes of the quality ofdata For example an organization may want tobe notified if the quality of some data it usesdegrades below a certain threshold or when highquality data are available The Quality Notifica-tion Service is described in Section 5 Also the
ARTICLE IN PRESS
-BusinessName-Code
Business
-FiscalCode-Address-Name
Owner
1
(a)
-FiscalCode-Address-Name
Owner
-EnterpriseName-Code
Enterprise
1
(b)
-EntName-Code-LegalAddress
Enterprise
-FiscalCode-Address-Name
Holder
1
(c)
Fig 3 Local schemas used in the running example conceptual
representation (a) Local schema of INAIL (b) local schema of
INPS (c) Local schema of CoC
M Scannapieco et al Information Systems 29 (2004) 551ndash582 555
Quality Notification Service is deployed as a peer-to-peer system
The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]
21 Running example
We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues
In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information
2Examples of business events are the starting of a new
businesscompany or the closing down variations in legal
status board composition and senior management variations
in the number of employees as well as opening a new location
filling for a patent etc
that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation
Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality
In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown
3 The D2Q model
According to the DaQuinCIS architecture allcooperating organizations export their application
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582556
data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35
31 Data quality dimensions
Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]
In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]
In the following the general concept of schema
element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency
311 Accuracy
In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically
Accuracy is the distance between v and v0 being v0
the value considered as correctLet us consider the following example Citizen
is a schema element with an attribute Name and p
is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names
Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another
Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined
312 Completeness
Completeness is the degree to which values of a
schema element are present in the schema element
instanceAccording to such a definition we can consider
(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present
A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others
In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values
As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific
ARTICLE IN PRESS
3Basic types are the ones provided by the most common
programming languages and SQL that is Integer Real
Boolean String Date Time Interval Currency Any
M Scannapieco et al Information Systems 29 (2004) 551ndash582 557
citizen has an e-mail address which has not beenstored this is a case of low completeness
313 Currency
The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-
Birth can be considered invariant Currency canbe defined as
Currency is the distance between the instant when
a value changes in the real world and the instant
when the value itself is modified in the information
systemMore complex definitions of currency have been
proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work
Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]
314 Consistency
Consistency implies that two or more values donot conflict each other
A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as
Consistency is the degree to which the values of
the attributes of an instance of a schema element
satisfy the specific set of semantic rules defined on
the schema elementAs an example if we consider Citizen with
attributes Name DateOfBirth Sex and DateOf-
Death some possible semantic rules to be checkedare
the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency
the value of pDateOfBirth must precede thevalue of pDateOfDeath
Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])
32 The D2Q data model
The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q
quality schemas corresponding to the qualityportion of the D2Q model
Definition 31 (Data class) A data class
d ethnamedp1y pnTHORN consists of
a name named a set of properties pi frac14 namei typeiS i frac14
1y n nX1 where namei is the name of theproperty pi and typei can be
3
either a basic type3
3
or a data class
3
or a type set-of XS where XS can beeither a basic type or a data class
Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features
a set of nodes ND each of which is a data class a set of leaves T D that are all basic type
properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14
LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
-BusinessName-Code
Business
-FiscalCode-Address-Name
Owner
1
(a)
-FiscalCode-Address-Name
Owner
-EnterpriseName-Code
Enterprise
1
(b)
-EntName-Code-LegalAddress
Enterprise
-FiscalCode-Address-Name
Holder
1
(c)
Fig 3 Local schemas used in the running example conceptual
representation (a) Local schema of INAIL (b) local schema of
INPS (c) Local schema of CoC
M Scannapieco et al Information Systems 29 (2004) 551ndash582 555
Quality Notification Service is deployed as a peer-to-peer system
The Rating Service associates trust values toeach data source in the CIS These values are usedto determine the reliability of the quality evalua-tion performed by organizations The RatingService is a centralized service to be provided bya third-party organization The design of theRating Service is out of the scope of this paperin this paper for sake of simplicity we assume thatall organizations are reliable ie provide trustablevalues with respect to quality of data Theinterested reader can find further details on theRating Service design and implementation in [9]
21 Running example
We use a running example in order to show theproposed ideas The example is a simplifiedscenario derived from the Italian Public Adminis-tration setting (interested reader can refer to [10]for a precise description of the real worldscenario) Specifically three agencies namely theSocial Security Agency referred to as IstitutoNazionale Previdenza Sociale (INPS) the Acci-dent Insurance Agency referred to as IstitutoNazionale Assistenza Infortuni sul Lavoro (IN-AIL) and the Chambers of Commerce referred toas CoC (Camere di Commercio) together offer asuite of services designed to help businesses andenterprises to fulfill their obligations towards thePublic Administration Typically businesses andenterprises must provide information to the threeagencies when well-defined business events2 occurduring their lifetime and they must be able tosubmit enquiries to the agencies regarding variousadministrative issues
In order to provide such services different dataare managed by the three agencies some data areagency-specific information about businesses (egemployees social insurance taxes tax reportsbalance sheets) whereas others are information
2Examples of business events are the starting of a new
businesscompany or the closing down variations in legal
status board composition and senior management variations
in the number of employees as well as opening a new location
filling for a patent etc
that is common to all of them Common itemsinclude (i) features characterizing the businessincluding one or more identifiers headquarter andbranches addresses legal form main economicactivity number of employees and contractorsinformation about the owners or partners and (ii)milestone dates including date of business start-upand (possible) date of cessation
Each agency makes a different use of differentpieces of the common information As a conse-quence each agency enforces different types ofquality control that are deemed adequate for thelocal use of the information Because each businessreports the common data independently to eachagency the copies maintained by the threeagencies may have different levels of quality
In order to improve the quality of the data andthus to improve the service level offered tobusinesses and enterprises the agencies can deploya cooperative information system according to theDaQuinCIS architecture therefore each agencyexports its data according to a local schema InFig 3 three simplified versions of the localschemas are shown
3 The D2Q model
According to the DaQuinCIS architecture allcooperating organizations export their application
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582556
data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35
31 Data quality dimensions
Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]
In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]
In the following the general concept of schema
element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency
311 Accuracy
In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically
Accuracy is the distance between v and v0 being v0
the value considered as correctLet us consider the following example Citizen
is a schema element with an attribute Name and p
is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names
Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another
Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined
312 Completeness
Completeness is the degree to which values of a
schema element are present in the schema element
instanceAccording to such a definition we can consider
(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present
A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others
In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values
As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific
ARTICLE IN PRESS
3Basic types are the ones provided by the most common
programming languages and SQL that is Integer Real
Boolean String Date Time Interval Currency Any
M Scannapieco et al Information Systems 29 (2004) 551ndash582 557
citizen has an e-mail address which has not beenstored this is a case of low completeness
313 Currency
The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-
Birth can be considered invariant Currency canbe defined as
Currency is the distance between the instant when
a value changes in the real world and the instant
when the value itself is modified in the information
systemMore complex definitions of currency have been
proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work
Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]
314 Consistency
Consistency implies that two or more values donot conflict each other
A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as
Consistency is the degree to which the values of
the attributes of an instance of a schema element
satisfy the specific set of semantic rules defined on
the schema elementAs an example if we consider Citizen with
attributes Name DateOfBirth Sex and DateOf-
Death some possible semantic rules to be checkedare
the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency
the value of pDateOfBirth must precede thevalue of pDateOfDeath
Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])
32 The D2Q data model
The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q
quality schemas corresponding to the qualityportion of the D2Q model
Definition 31 (Data class) A data class
d ethnamedp1y pnTHORN consists of
a name named a set of properties pi frac14 namei typeiS i frac14
1y n nX1 where namei is the name of theproperty pi and typei can be
3
either a basic type3
3
or a data class
3
or a type set-of XS where XS can beeither a basic type or a data class
Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features
a set of nodes ND each of which is a data class a set of leaves T D that are all basic type
properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14
LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582556
data and quality data (ie data quality dimensionvalues evaluated for the application data) accord-ing to a specific data model The model forexporting data and quality data is referred to asData and Data Quality ethD2QTHORN model In thissection we first introduce the data quality dimen-sions used in the model (Section 31) then wedescribe the D2Q model with respect to the datafeatures (Section 32) and the quality features(Section 33) Finally we describe implementationissues of the D2Q model in Section 35
31 Data quality dimensions
Data quality dimensions are properties of datasuch as correctness or degree of updating Thedata quality dimensions used in this work concernonly data values instead they do not deal withaspects concerning quality of logical schema anddata format [6]
In this section we propose some data qualitydimensions to be used in CISrsquos The need forproviding such definitions stems from the lack of acommon reference set of dimensions in the dataquality literature [5] and from the peculiarities ofCISrsquos we propose four dimensions inspired by thealready proposed definitions and stemming fromreal requirements of CISrsquos scenarios that weexperienced [2]
In the following the general concept of schema
element is used corresponding for instance to anentity in an Entity-Relationship schema or to aclass in a Unified Modeling Language diagramWe define (i) accuracy (ii) completeness (iii)currency and (iv) consistency
311 Accuracy
In [6] accuracy refers to the proximity of a valuev to a value v0 considered as correct Morespecifically
Accuracy is the distance between v and v0 being v0
the value considered as correctLet us consider the following example Citizen
is a schema element with an attribute Name and p
is an instance of Citizen If pName has a valuev frac14 JHN while v0 frac14 JOHN this is a case of lowaccuracy as JHN is not an admissible valueaccording to a dictionary of English names
Accuracy can be checked by comparing datavalues with reference dictionaries (eg namedictionaries address lists domain related diction-aries such as product or commercial categorieslists) As an example in the case of values on stringdomain edit distance functions can be used tosupport such a task [11] such functions considerthe minimum number of operations on individualcharacters needed in order to transform a stringinto another
Accuracy is more difficult to measure fornumerical data values for which more complexnotions of edit distances need to be defined
312 Completeness
Completeness is the degree to which values of a
schema element are present in the schema element
instanceAccording to such a definition we can consider
(i) the completeness of an attribute value asdependent from the fact that the attribute ispresent or not (ii) the completeness of a schemaelement instance as dependent from the number ofthe attribute values that are present
A recent work [12] distinguishes other kinds ofcompleteness namely schema completeness col-umn completeness and population completenessThe measures of such completeness types can givevery useful information for a lsquolsquogeneralrsquorsquo assessmentof the data completeness of an organization butas it will be clarified in Section 33 our focus inthis paper is on associating a quality evaluation onlsquolsquoeachrsquorsquo data a cooperating organization makesavailable to the others
In evaluating completeness it is important toconsider the meaning of null values of an attributedepending on the attribute being mandatoryoptional or inapplicable a null value for amandatory attribute is associated with a lowercompleteness whereas completeness is not affectedby optional or inapplicable null values
As an example consider the attribute E-mail ofthe Citizen schema element A null value forE-mail may have different meanings First thespecific citizen may have no e-mail address andtherefore the attribute is inapplicable this case hasno impact on completeness Second the specific
ARTICLE IN PRESS
3Basic types are the ones provided by the most common
programming languages and SQL that is Integer Real
Boolean String Date Time Interval Currency Any
M Scannapieco et al Information Systems 29 (2004) 551ndash582 557
citizen has an e-mail address which has not beenstored this is a case of low completeness
313 Currency
The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-
Birth can be considered invariant Currency canbe defined as
Currency is the distance between the instant when
a value changes in the real world and the instant
when the value itself is modified in the information
systemMore complex definitions of currency have been
proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work
Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]
314 Consistency
Consistency implies that two or more values donot conflict each other
A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as
Consistency is the degree to which the values of
the attributes of an instance of a schema element
satisfy the specific set of semantic rules defined on
the schema elementAs an example if we consider Citizen with
attributes Name DateOfBirth Sex and DateOf-
Death some possible semantic rules to be checkedare
the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency
the value of pDateOfBirth must precede thevalue of pDateOfDeath
Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])
32 The D2Q data model
The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q
quality schemas corresponding to the qualityportion of the D2Q model
Definition 31 (Data class) A data class
d ethnamedp1y pnTHORN consists of
a name named a set of properties pi frac14 namei typeiS i frac14
1y n nX1 where namei is the name of theproperty pi and typei can be
3
either a basic type3
3
or a data class
3
or a type set-of XS where XS can beeither a basic type or a data class
Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features
a set of nodes ND each of which is a data class a set of leaves T D that are all basic type
properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14
LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
3Basic types are the ones provided by the most common
programming languages and SQL that is Integer Real
Boolean String Date Time Interval Currency Any
M Scannapieco et al Information Systems 29 (2004) 551ndash582 557
citizen has an e-mail address which has not beenstored this is a case of low completeness
313 Currency
The currency dimension refers only to datavalues that may vary in time as an example valuesof Address may vary in time whereas DateOf-
Birth can be considered invariant Currency canbe defined as
Currency is the distance between the instant when
a value changes in the real world and the instant
when the value itself is modified in the information
systemMore complex definitions of currency have been
proposed in the literature [13] they are especiallysuited for specific types of data ie they take intoaccount elements such as lsquolsquovolatilityrsquorsquo of data (egdata such as stock price quotes that change veryfrequently) However in this work we stay withthe provided general definition of currency leavingits extensions to future work
Currency can be measured either by associatingto each value an lsquolsquoupdating time-stamprsquorsquo [14] or alsquolsquotransaction timersquorsquo like in temporal databases [15]
314 Consistency
Consistency implies that two or more values donot conflict each other
A semantic rule is a constraint that must holdamong values of attributes of a schema elementdepending on the application domain modeled bythe schema element Consistency can be defined as
Consistency is the degree to which the values of
the attributes of an instance of a schema element
satisfy the specific set of semantic rules defined on
the schema elementAs an example if we consider Citizen with
attributes Name DateOfBirth Sex and DateOf-
Death some possible semantic rules to be checkedare
the values of Name and Sex for an instance p areconsistent If pName has a value v frac14 JOHN andthe value of pSex is FEMALE this is a case oflow internal consistency
the value of pDateOfBirth must precede thevalue of pDateOfDeath
Consistency has been widely investigated both inthe database area through integrity constraints(eg [16]) and in the statistics area through editschecking (eg [17])
32 The D2Q data model
The D2Q model is a semistructured model thatenhances the semantics of the XML data model[18] in order to represent quality data In thefollowing of this section we provide a formaldefinition for a data class and a D2Q data schemaD2Q data schemas define the data portion of theD2Q model instead in Section 33 we define D2Q
quality schemas corresponding to the qualityportion of the D2Q model
Definition 31 (Data class) A data class
d ethnamedp1y pnTHORN consists of
a name named a set of properties pi frac14 namei typeiS i frac14
1y n nX1 where namei is the name of theproperty pi and typei can be
3
either a basic type3
3
or a data class
3
or a type set-of XS where XS can beeither a basic type or a data class
Definition 32 (D2Q Data Schema) A D2Q dataschema SD is a direct node- and edge-labelledgraph with the following features
a set of nodes ND each of which is a data class a set of leaves T D that are all basic type
properties a set of edges EDDND ethNDT D) a set of nodes and leaves labels LD frac14
LNDLT D where LND is the set of node labelseach node label consisting of dataclass name data classS LT D is the set ofleaves labels each leaf label consisting ofproperty name basic typeS
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
EnterpriseEnterpriseClass
OwnerOwnerClass
Namestring Codestring
1 1
M Scannapieco et al Information Systems 29 (2004) 551ndash582558
a set of edge labels LED defined as follows Letn1 n2SAED Let n1 be a data class and let l12be the label on the edge n1 n2S Such a label isl12 frac14number of occurrences of n2 ie thelabel specifies the cardinality of n2 in therelationship with n1
4
FiscalCodestring
Addressstring Namestring
11
1
Fig 4 A D2Q data schema of the running example
A D2Q data schema must have the twofollowing properties
Closure Property If SD contains a data class dhaving as a property another data class d0 thenalso d0 belongs to SD
Aciclicity No cycle exists in a D2Q schemaIn Fig 4 the D2Q data model of the INPS local
schema of the running example is shown Specifi-cally two data classes enterprise and owner arerepresented The edge connecting enterprise toowner is labelled with indicating the n-arycardinality of owner the edges connecting eachdata class with the related basic type properties arelabelled with 1 indicating the cardinality of suchproperties
33 The D2Q quality model
This section describes the model we use torepresent quality data ie the D2Q quality modelfor which we define the concepts of quality class
and quality schemaBesides basic data types we define a set of types
that are called quality types Quality types corre-spond to the set of values that quality dimensionscan have We indicate as
-
4W
for m
01 1x and
1minim
taccuracy tcompleteness tconsistency tcurrency
the quality types considered in our modelMore specifically each quality type is defined
over a domain of values referred to as DomaccuracyDomcompleteness Domconsistency and Domcurrency Suchdomains are defined according to the metrics usedto measure quality values moreover each domainis enhanced with the special symbol gt to indicate
e assume that it is possible to specify the following values
inimum and maximum cardinality of relationships 11 (where stands for n-ary relationship) 0 xy (where
y are any fixed integer different from 0 and 1) 0y 1yIf no minimum cardinality is specified it is assumed the
um cardinality is 1
that no quality value is defined for that dimensionFor instance if we adopt a 3-value scale tomeasure accuracy then Domaccuracy frac14 fHighMedium Lowgtg
A quality class is associated to each data classand to each property We use the followingnotation if d ethnamed p1ypiypnTHORN is a data classthen ld is the associated quality class and ldpi
isthe quality class associated to pi In the followingwe will first define ldpi
where pi is a basic typeproperty then we define the quality class ld
Definition 33 (ldpi-Quality class associated to a
(basic type) property of a data class) Letd ethnamedp1y piypnTHORN be a data class Let pi
be a basic type property of dThe quality class ldpi
is defined by
a name nameldpi
four properties kaccuracy kcompleteness kconsistencykcurrency such that ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
Definition 34 (ld-Quality class associated to adata class) Let d ethnamed p1ypnTHORN be adata class A quality class ld ethnameld kaccuracykcompletenesskconsistency kcurrencypl1yplnTHORN consists of
a name nameld a set of properties ethkaccuracykcompleteness
kconsistencykcurrencypl1y plnTHORN such that
3
kaccuracy kcompleteness kconsistency kcurrency are suchthat ethkaccuracy taccuracyTHORNethkcompleteness tcompletenessTHORNethkconsistency tconsistencyTHORNethkcurrency tcurrencyTHORN
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 559
3
ethpl1yplnTHORN is a set of properties pli frac14 nameli typeli S that biunivocally corresponds toethp1ypnTHORN (properties of d) and is defined as
ndash
if pi is a basic type property then typeli is
ldpi where ldpi
is the quality class asso-ciated to d pi
ndash
if pi is a set of XS type where X is abasic type then typeli is set of ldpi
Swhere ldpi
is the quality class associated tod pi
ndash
if pi is a data class d0 then typeli is ld0 whereld0 is the quality class associated to d0
ndash
if pi is a set of XS type where X is a dataclass d0 then typeli is set of ld0S where ld0is the quality class associated to d0
Definition 35 (D2Q Quality Schema) A D2Q
quality schema SQ is a direct node- and edge-
labelled graph with the following features
a set of nodes NQ each of which is a qualityclass
a set of leaves T Q that are quality typeproperties
a set of edges EQDNQ ethNQT QTHORN a set of node and leaves labels LQ frac14 LNQT Q
where LNQ is the set of node labels each nodelabel consisting of quality class namequality classS LT Q is the set of leaves labels
Enterprise_Quality
Owner_QualityOwnerQualityClass
FiscalCode_QualityFiscalCodeQualityClass
Address_QualityAddressQualityClass
QualityN
Owner_Accuracyτ
hellip
hellip hellip
FiscalCode_AccuracyτaccuracyAddress_Accuracyτaccuracy
EnterprAccur
1 1 1
accuracy
Fig 5 A D2Q quality schema
consisting of quality property namequality typeS
a set of edge labels LEQ defined as follows Letn1 n2SAEQ Let n1 be a quality class andlet l12 be the label on the edge n1 n2Ssuch a label is defined in dependance on n2 asfollows
3
Ente
Namame
a
hellipise_acy
of
if n2 is a leaf then l12 is the NULL string as wedo not need to specify any cardinality
3
if n2 is a quality class then l12 frac14number
of occurrences of n2 ie the label specifiesthe cardinality of n2 in the relationshipwith n1
Let us consider again the INPS local schema of therunning example as seen the D2Q data schema isshown in Fig 4 In Fig 5 we instead represent theD2Q quality schema Specifically two qualityclasses Enterprise Quality and Owner Quality
are associated to the Enterprise and Owner dataclasses Accuracy nodes are shown for bothdata classes and related properties Note thatalso consistency currency and completenessnodes are present in the graph though notshown this however does not prevent from thepossibility of not specifying quality valuesas the special symbol gt has been added to the
rpriseQualityClass
e_QualityClass
Name_QualityName_QualityClass
Name_ccuracy
Code_QualityCodeQualityClass
hellip
hellip
hellip
Code_Accuracyτ
Name_Accuracyτ
1
1
accuracy
accuracy
the running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582560
domains of data quality dimension types to such ascope
34 The D2Q model data + quality
In the previous section the D2Q quality schemahas been defined according to a correspondencewith the D2Q data schema In this section weformally define the relationship between D2Q
data schema and the D2Q quality schemaSpecifically we define the notion of quality
association between nodes of the two graphs
Definition 36 (Quality association) Let us callA frac14 fNDT Dg the set of all nodes of a D2Q
data schema SD let us call B frac14 NQ the setof all non-leaf nodes of a D2Q quality schemaSQ
A quality association is a functionqualityAss A-B such that
(8xAA( a unique yAB such that y frac14qualityAssethxTHORN) 4 (8yAB (a unique xAA suchthat y frac14 qualityAssethxTHORN) ) qualityAss is abiunivocal function
It is easy to define now a D2Q schema thatcomprises both data and associated quality defini-tions
Enterprise
Owner
NameCode
Owner_Quality
quality
qualityhelliphellip
quality
quality
1 1
Fig 6 A D2Q schema of t
Definition 37 (D2Q Schema) Given the D2Q
data schema SD and the D2Q quality schema SQa D2Q Schema S is a direct node- andedge-labelled graph with the followingfeatures
a set of nodes N frac14 fNDNQg deriving fromthe union of nodes from SD and nodes from SQ
a set of leaves T frac14 T DT Q a set of edges E frac14 fEDEQEDQg where
EDQDfNDT Dg NQ specifically specialedges connecting nodes from SD to nodes fromSQ are introduced whereas qualityAss isdefined such edges are added to the union ofedges from SD and edges from SQ
a set of node labels LN frac14 LDLQ a set of edge labels LE frac14 LEDLEQLEDQ
where LEDQ frac14 fqualityg is a set of quality
labels
The D2Q schema related to our exampleschemas in Figs 4 and 5 is shown in Fig 6 (forthe sake of clearness types are not shown)
341 D2Q schema instances
The instances of both D2Q data schemas andD2Q quality schemas can be easily derivedaccording to the class-based structure of the D2Q
model
Enterprise_Quality
Name_Quality
Enterprise_accuracy
Name_accuracy
Code_Quality
Code_accuracy
1
1
1
hellip
hellip
hellip
he running example
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 561
More specifically
data classes instances are data objects similarly quality classes instances are quality
objects quality association values corresponds to qual-
ity links
A D2Q data schema instance is a graph withnodes corresponding to data objects and leavescorresponding to properties values More specifi-cally nodes are labelled with data objects namesand leaves are labelled with a couple basic type
property name basic type property valueSEdges are not labelled at all
Similarly a D2Q quality schema instance is agraph with quality objects nodes and leavescorresponding to quality properties valuesLeaves are labelled by a couple quality type
property name quality type property valueSEdges are not labelled at all
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
[APTA]
Code[AssociazionePromozione hellip]
Owner1
Owner2
EntNameFiscalCode
Name
[MCLMSM73H17H501F] [Via dei Gracchi 71
00192 Roma Italy]
[Massimo Mecella]
Address
FiscalCode
[SCNMNC74S60G230T]
[S
Enterprise1
Code_Qua
EntN
FiscalC
Enterp
Code_Accuracy[high] EntNa
[high
quality
Fig 7 D2Q schema instance
Finally a D2Q schema instance comprises thetwo previous ones plus edges actually connectingdata objects and quality object corresponding toquality links labelled with a quality label
Fig 7 shows the graph representation of a D2Q
schema instance of the running example TheEnterprise1 object instance of the Enterprise
class has two owners ie Owner1 and Owner2objects of the Owner class the associatedquality values are also shown (only quality linksrelated to data objects are shown for the sake ofsimplicity)
35 Implementation of the D2Q model
In this section we describe how D2Q schemasare translated into XML Schemas [1920] thatcorrespond to the local schemas to be queried inorder to retrieve data with the associated quality inthe CIS
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
Name
[Via Mazzini 27 84016 Pagani (SA) Italy]Monica
cannapieco]
Address
lity
ame_Quality
ode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
rise_Quality
me_Accuracy]
Owner1_Quality
Address_Accuracy[high]
Address_Currency[medium]
Name_Accuracy[high]
Owner2_Quality
FiscalCode_QualityName
FiscalCode_Accuracy[high]
Address_Quality
Address_Accuracy[medium]
Name_Accuracy[medium]
quality
quality
of the running example
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
Fig 8 An example of definition of a quality type as an XML
Schema type
M Scannapieco et al Information Systems 29 (2004) 551ndash582562
The results of queries posed over local schemasare D2Q schema instances that are exchanged asXML documents the implementation of D2Q
schema instances as XML documents is alsodescribed in the following of this section
351 From a D2Q schema to an XML schema
D2Q schemas represent in a formal and well-defined way (i) the schema of data exported byeach organization in the CIS (ii) the schema ofquality data and (iii) the associations relating datato their quality From an implementation point ofview we have chosen to export D2Q schemas asXML Schemas [1920]
Generally speaking the XML Schema grammarallows one to define data types and to declareelements that use these data types we exploit thisproperty in order to represent a D2Q schema as anXML Schema We chose to have two distinctdocuments representing respectively a D2Q dataschema and the associated D2Q quality schemaThis is motivated by making as flexible as possiblethe choice of exporting only data as local schemasor both data and quality Indeed for the end userit may be easier to use predefined functions thatallow to access quality data instead of directlyquerying them
Starting from a D2Q schema there are severalways to translate it into an XML Schema Wehave chosen a translation that maps the graphstructure of the D2Q schema into the tree structureof the data model underlying XQuery which is thequery language we chose to adopt We could havechosen to directly represent a D2Q schema with agraph structure indeed XML allows to have agraph data model by using mechanisms such asXLink [21] or XPointer [22] Our idea was insteadto have a query mechanisms as direct as possibleand in this sense the tree structure of XQuery wasthe best candidate
The translation of the graph structure into atree structure is implemented by replicating objectswhereas necessary Indeed in the D2Q datamodel a data class can be a property of differentdata classes ie it can have different parentsIn these cases the data class is replicatedand at object level the same instances can be
identified by explicitly introducing Object Identi-fiers (OIDs)
Quality links are implemented by Quality ObjectIdentifiers (QOIDrsquos) both on data and qualityobjects
The principal choices for the representation ofD2Q schemas as XML Schemas were
representing data and quality classes and alsotheir properties as XML elements
limiting the use of XML attributes only torepresent control information ie object iden-tifier and quality object identifier together withinformation to represent quality association
definition of XML Schema types that corre-spond to basic and quality types defined in theD2Q model
definition of a root element for each D2Q dataschema
definition of a root element for each D2Q
quality schema
An example of definition of a quality type of theD2Q model as an XML Schema type is shown inFig 8 where the quality type accuracy Type isdefined Notice that quality types are defined assimple types that are a restriction of XML Schemasimple types
The two XML schemas corresponding to theD2Q schema shown in Fig 6 are shown in Fig 9for the data portion and in Fig 10 for the qualityportion
352 From a D2Q schema instance to an XML file
The translation rules from a D2Q schemainstance to an XML file reflects the rules givenfor the D2Q schema translation described in theprevious section In synthesis two XML docu-ments correspond to a D2Q schema instance one
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
Fig 9 XML Schema of the D2Q data schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 563
is an instance of the D2Q data schema and theother one is an instance of the D2Q qualityschema They are obtained from the respectiveD2Q schemas according to the following rules
a root element is defined to contain all dataobjects
a root element is defined to contain all qualityobjects
an object d instance of a data class maps into anXML element with an OID attribute identify-ing the object and a qOID attribute pointing tothe quality
a property of a data class is mapped into anXML element
similarly an object dq instance of a qualityclass is mapped into an XML element with aqOID identifying it
a property of a quality class is also mapped intoan XML element
We show in Fig 11 a portion of the XML filecorresponding to the D2Q data schema instanceshown in Fig 9
4 The Data Quality Broker
The Data Quality Broker ethDQBTHORN is the corecomponent of the architecture It is implementedas a peer-to-peer distributed service each organi-zation hosts a copy of the Data Quality Brokerthat interacts with other copies We first providean overview of the high-level behavior of the DataQuality Broker in Sections 41 and 42 then inSections 43 and 44 we explain the details of itsdesign and implementation
41 Overview
The general task of the Data Quality Broker isto allow users to select data in the CIS accordingto their quality Users can issue queries on a D2Q
global schema specifying constraints on qualityvalues Then the Data Quality Broker selectsamong all the copies of the requested data that arein the CIS only those satisfying all specifiedconstraints
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
Fig 10 XML Schema of the D2Q quality schema of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582564
Quality constraints can be related to quality ofdata andor quality of service (QoS) Specificallybesides the four data quality dimensions pre-viously introduced two QoS dimensions aredefined namely (i) a responsiveness defined bythe average round-trip delay evaluated byeach organization periodically pinging gateway
in the CIS and (ii) availability of a sourceevaluated by pinging a gateway and consideringif a timeout expires in such a case the source isconsidered not available for being currentlyqueried
A user can request data with a specified quality(eg lsquolsquoreturn enterprises with accuracy greater
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
Fig 11 A portion of the XML file corresponding to data shown in the D2Q schema instance of the running example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 565
mediumrsquorsquo) or declare a limit on the query proces-sing time restricting the query only to availablesources (eg lsquolsquoreturn only enterprises from the firstresponding sourcersquorsquo) or include both kinds ofconstraints (eg lsquolsquoreturn enterprises with accuracygreater than medium provided by the firstresponding sourcersquorsquo)
The Data Quality Broker performs two specifictasks
query processing and quality improvement
The semantics of query processing is described inSection 42 while how query processing is performedis described in Section 432 The quality improve-ment feature consists of notifying organizations withlow quality data about higher quality data that areavailable through the CIS This step is enacted eachtime copies of the same data with different qualityare collected during the query processing activityThe best quality copy is sent to all organizationshaving lower quality copies Organizations involvedin the query have the choice of updating or not theirdata This gives a non-invasive feedback that allowsto enhance the overall quality of data in the CISwhile preserving organizationsrsquo autonomy The de-tails of how the improvement feature is realized areprovided in Section 43
42 Query processing
The Data Quality Broker performs queryprocessing according to a global-as-view (GAV)approach by unfolding queries posed over aglobal schema ie replacing each atom of theoriginal query with the corresponding view onlocal data sources [238] Both the global schemaand local schemas exported by cooperating orga-nizations are expressed according to the D2Q
model The adopted query language is XQuery[24] The unfolding of an XQuery query issued onthe global schema can be performed on the basis ofwell-defined mappings with local sources Detailsconcerning such a mapping are beyond the scopeof this paper (details can be found in [25]) In thefollowing of this paper we assume that global andlocal schemas with their mappings exist Specifi-cally we focus on instance level problems assum-ing that all issues concerning semanticheterogeneities have been solved
Under our hypothesis data sources have dis-tinct copies of the same data having differentquality levels ie there are instance-level conflictsWe resolve these conflicts at query execution timeby relying on quality values associated to datawhen a set of different copies of the same data arereturned we look at the associated quality values
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582566
and we select the copy to return as a result on thebasis of such values
Specifically the detailed steps we follow toanswer a query with quality constraints are
(1)
Table
Map
Glob
Enter
Enter
Enter
Hold
Hold
y
y
Let Q be a query posed on the global schemaG
(2)
The query Q is unfolded according to thestatic mapping that defines each concept ofthe global schema in terms of the localsources such a mapping is defined in orderto retrieve all copies of same data that areavailable in the CIS Therefore the query Q isdecomposed in Q1yQn queries to be posedover local sources
(3)
-Name-ID-LegalAddress
Enterprise
The execution of the queries Q1yQn
returns a set of results R1yRn On such aset an extensional correspondence property ischecked namely given Ri and Rj the itemscommon to both results are identified byconsidering the items referring to the sameobjects This step is performed by using arecord matching algorithm based on qualityvalues exported by organizations [9]
(4)
-FiscalCode-Address-Name
Holder
1
Fig 12 Graphical representation of the global schema
The result to be returned is built as follows (i)if no quality constraint is specified a bestquality default semantics is adopted Thismeans that the result is constructed byselecting the best quality values (this step isdetailed in 432) (ii) if quality constraintsare specified the result is constructedby checking the satisfiability of theconstraints on the whole result Notice thatthe quality selection is performed at aproperty level on the basis of best (satisfying)quality values
1
ping between global and local schemas
al schema INPS schema
prise Enterprise
priseID EnterpriseCode
priseLegalAddress X
er Owner
erName OwnerName
y
y
421 Example of query processing
In the scenario proposed in Section 21 let usassume that the global schema derived from theINPS INAIL and CoC local schemas is the oneproposed in Fig 12 Some examples of mappinginfo between global and local schemas are shownin Tables 1
Let us consider the following XQuery query tobe posed on the global schema
for $b in document(lsquolsquoGlobalSchemaxmlrsquorsquo)
Enterprise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
On the basis of mapping information theunfolding of concepts in the global schema isperformed by substituting the semantically corre-
INAIL schema CoC schema
Business Enterprise
BusinessCode EnterpriseCode
X EnterpriseLegalAddress
Owner Holder
OwnerName HolderName
y y
y y
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
Accuracy
item
2
Acc
Name=
medium
Acc
Address
frac14medium
Acc
Namefrac14
high
Italyrsquorsquo
Acc
Address
frac14high
Acc
Namefrac14
low
Acc
Address
frac14low
M Scannapieco et al Information Systems 29 (2004) 551ndash582 567
sponding concept in the local schema The schemamapping information shown in 1 is a simplifiedversion of the actual mapping information thatincludes the specification of path expressionsexactly locating concepts in the local schemasTherefore the unfolding process requires somespecific technicalities we have implemented thecomplete unfolding and refolding process detailsof which can be found in [25] As an example thefollowing local query is posed on INAIL
iecorsquorsquo
84016Paganirsquorsquo
piecorsquorsquo
84016Pagani(SA)
Italyrsquorsquo
for $b in document(lsquolsquoInailxmlrsquorsquo)Enter-
prise
where $bCode =lsquolsquoAPTArsquorsquo
returnHolderSf$b=Namegf$b=Addressg=HolderS
Table
2
Queryresultsin
asample
scenario
Organization
Holderitem
1Accuracy
item
1Holderitem
2
INAIL
lsquolsquoMaim
oMecellarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanap
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
INPS
lsquolsquoMMecallarsquorsquo
Acc
Namefrac14
low
lsquolsquoMonicaScanna
lsquolsquoVia
dei
Gracchi71RomaItalyrsquorsquo
Acc
Address
frac14medium
lsquolsquoVia
Mazzini27
CoC
lsquolsquoMassim
oMecellarsquorsquo
Acc
Namefrac14
high
lsquolsquoMScannarsquorsquo
lsquolsquoVia
dei
Gracchi7100192RomaItalyrsquorsquo
Acc
Address
frac14high
lsquolsquoVia
Mazzini27
Results produced by executing XQuery querieslocally to INPS INAIL and CoC include bothdata and associated quality Table 2 illustratessuch results Each organization returns twoholders for the specified business with theassociated accuracy values As an example accu-racy of properties Name and Address of the firstHolder item provided by INAIL are respectivelylow and high
Once results are gathered from INPS INAILand CoC a record matching process is performedIn [9] we proposed an algorithm for recordmatching based on the Sorted NeighborhoodMethod (SNM) The SNM consists of threedistinct steps (i) choice of the matching key (ii)sorting of records according to the chosen keythus originating clusters of records that have thesame key value as aggregating element (iii)moving of a fixed size window through the list ofrecords in each cluster and comparisons only ofthe records included in the window We adopt thisalgorithm but we make automatic the choice ofthe matching key by relying onto quality valuesassociated to data
Specifically data objects returned by an organi-zation are tentatively mapped with data objectsreturned by the other ones In the example byusing Name as a key for matching it is easy to findtwo clusters namely the Holder item 1 cluster andthe Holder item 2 cluster Inside each cluster an
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582568
instance level reconciliation is performed on thebasis of quality values
According to the shown accuracy values theresult is composed as follows
HolderName frac14 lsquolsquoMassimo Mecellarsquorsquo
is selected from CoC being its accuracy frac14max(Acc Name(INAIL) Acc Name(INPS) Ac-
c Name(CoC)) frac14 Acc Name(CoC) frac14 highand
HolderAddress frac14 lsquolsquoVia dei Gracchi 71 00192 Roma Italyrsquorsquo
is selected from INAIL being its accuracy frac14max(Acc Address(INAIL) Acc Address(INPS)Acc Address(CoC)) frac14 Acc Address(INPS) frac14high
Similar considerations lead to the selection ofthe Holder item 2 provided by INPS
43 Internal architecture and deployment
In this section we first detail the design of theData Quality Broker by considering its interac-tion modes and its constituting modules then anexample of query processing execution is shown
431 Interaction modes and application interface
The Data Quality Broker exports the followinginterface to its clients
invoke(query mode) for submitting a queryQ over the global schema including quality
Broker Org1 Orgm
Invoke Q
Invoke Q
D1
Dm
Db=Compare D1hellipDm
Db
Invoke Q hellip
hellip
hellip
(a) (b
Fig 13 Broker invocation mode
constraints The behavior of the broker can beset with the mode parameter such a parametercan assume either the BASIC or OPTIMIZED
valueIn the basic mode the broker accesses all
sources that provide data satisfying the queryand builds the result whereas in the optimized
mode the broker accesses only those organiza-tions that probably will provide good qualitydata Such information can be derived on thebasis of statistics on quality values Note that inthe optimized mode the broker does notcontact all sources and does not have to waitfor all responses however this optimizationcan lead to non-accurate results Conversely inthe basic mode all sources are always con-tacted it takes more time to retrieve the finalresult but it is always the most accurate one
Fig 13 shows the sequence of operationsperformed in each of the two modes in basicmode (Fig 13(a)) all organizations storing datasatisfying a query are queried and all resultshave to be waited for before proceeding Afterresults are retrieved a comparison phase isrequired to choose the best data In optimizedmode (Fig 13(b)) only the organization Orgj
that has the best overall statistical qualityevaluation is involved in the query and nocomparison is required
propose(data) for notifying all sources thathave previously sent data with lower quality
Broker Orgj
Invoke Q
Dj
Dj
Invoke Q
)
s (a) Basic (b) optimised
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 569
than the selected one with higher qualityresults This function can be only invoked inthe basic mode Each time a query is poseddifferent copies of same data are received asanswers the broker submits to organizationsthe best quality copy R selected among thereceived ones The propose() operation isinvoked by a peer instance of the broker inorder to provide the quality feedback
432 Modules of the Data Quality Broker
The Data Quality Broker is internally composedof five interacting modules (Fig 14) The DataQuality Broker is deployed as a peer-to-peerservice each organization hosts an independentcopy of the broker and the overall service isrealized through an interaction among the variouscopies performed via the Transport Enginemodule The modules Query Engine TransportEngine and Schema Mapper are general and canbe installed without modifications in each organi-zation The module Wrapper has to be customizedfor the specific data storage system The moduleComparator can also be customized to implementdifferent criteria for quality comparison
Query Engine ethQETHORN receives a query from theclient and processes it The query is executedissuing sub-queries to local data sources The QE
can also perform some query optimizations duringthe processing
Wrapper ethWrTHORN translates the query from thelanguage used by the broker to the one of thespecific data source In this work the wrapper is a
DBQ
invoke() propose()
QE
TE
Cmp
SM
WrDB
Fig 14 Broker modules
read-only access module that is it returns datastored inside organizations without modifyingthem
Transport Engine ethTETHORN communication facilitythat transfers queries and their results between theQE and data source wrappers It also evaluates theQoS parameters and provides them to the QE forbeing used in query optimization
Schema Mapper ethSMTHORN stores the static map-pings between the global schema and the localschemas of the various data sources
Comparator ethCMPTHORN compares the quality va-lues returned by the different sources to choose thebest one
In the following we detail the behaviour of QE
and TE These are the modules that required mostdesign effort due to their higher complexity TheSM stores mappings between the global andlocal schemas the Wr is a classical databasewrapper and the CMP compares quality valuevectors according to ranking methods known inliterature such as simple additive weighting(SAW) [26] or analytical hierarchy process(AHP) [27]
The Query Engine is responsible for queryexecution and optimization The copy of theQuery Engine local with the user receives thequery and splits the query in sub-queries (local tothe sources) by interacting with the SchemaMapper (which manages the mappingbetween the global schema and the local schemas)Then the local QE also interacts with the TE inorder to (i) send local queries to other copies ofthe QE and receive the answers (ii) benotified about QoS parameters to be used foroptimization
The Transport Engine provides general connec-tivity among all Data Quality Broker instances inthe CIS Copies of the TE interact with each otherin two different scenarios corresponding to twodifferent interaction types
Query execution (requestreply interaction) therequesting TE sends a query to the local TE ofthe target data source and remains blockedwaiting for the result A request is alwayssynchronous but it can be always stopped fromthe outside for example when the QE decides
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582570
to block the execution of a query to enforceQoS constraints In this case the results arediscarded
Quality feedback (asynchronous interaction)when a requesting QE has selected the bestquality result of a query it contacts the localTErsquos to enact quality feedback propagationThe propose() operation is executed as acallback on each organization with the bestquality selected data as a parameter Thepropose() can be differently implemented byeach organization a remote TE simply invokesthis operation whose behaviour depends fromthe strategy that an organization chooses forthe improvement of its own data on the basis ofexternal notifications
The other function performed by the TE is theevaluation of the QoS parameters This feature isencapsulated into the TE as it can be easilyimplemented exploiting TErsquos communication cap-abilities Specifically the TE periodically pings datasources in order to calculate responsiveness More-over the timeout value necessary to calculate theavailability QoS parameter is configured in the TE
433 Query processing in the basic mode
In this section we give details of basic modequery invocations in order to clarify the behaviorof the Data Quality Broker during a querySpecifically we describe the interaction amongpeer Data Quality Brokers and among modulesinside each Data Quality Broker when a querywith quality requirements is submitted by a user
After receiving the query the Query Engine(local to the client) interacts with the SchemaMapper and generates a set of sub-querieswhich in turn are sent through the local TransportEngine to all Transport Engines of the involvedorganizations All Transport Engines transfer thequery to local Query Engines that execute localqueries by interacting with local Wrappers Queryresults are passed back to the local TransportEngine that passes all results to the local QueryEngine This module selects the best quality resultthrough the Comparator and passes it to the clientOnce the best quality data is selected the feedbackprocess starts
A second phase involves all sources thathave previously sent data with lower quality thanthe selected one each source may adoptthe proposed data (updating its local datastore) or not Indeed criteria to decide whetherupdating the local data store with the feedbackdata may vary from one organization to anothersuch criteria can be set by providing properimplementations of the propose() operation
Fig 15 depicts an invocation by a client ofthe organization Org1 specifically the invoke
ethlsquolsquoSelect APTA from Enterprises with accur-acyXmediumrsquorsquo BASIC) For the sake of clarityonly involved modules are depicted The sequ-ence of actions performed during the queryinvocation is described in the following Weindicate with a subscript the specific instance of amodule belonging to a particular organization forexample QEi indicates the Query Engine oforganization Orgi
(1)
QE1 receives the query Q posed on the globalschema as a parameter of the invoke()
function
(2)
QE1 retrieves the mapping with local sources
from SM1
(3)
QE1 performs the unfolding that returns
queries for Org2 and Org3 that are passedto TE1
(4)
TE1 sends the request to Org2 and Org3
(5)
The recipient TErsquos pass the request to the
Wrrsquos modules Each Wr accesses the localdata store to retrieve data
(6)
Wr3 retrieves a query result eg Ra with
accuracy 07 and Wr2 retrieves another queryresult Rb with accuracy 09
(7)
Each TE sends the result to TE1
(8)
TE1 sends the collected results (Ra Rb) to
QE1 QE1 selects through CMP1 the resultwith the greatest accuracy ie Org2rsquos result(Rb with accuracy 09)
(9)
QE1 sends the selected result to the client
(10)
QE1 starts the feedback process sending
the selected result (Rb with accuracy 09) toTE1
(11)
TE1 sends it to Org3
(12)
TE3 receives the feedback data and makes a
call propose(Rb with accuracy 09)
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5DQB
QE
TE
SM DB
DQB
TE WR
DB
DQB
TE WR
DB
invoke()1
2
3
4
4
5
5
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
DQB
TE
Cmp DBTE WR
DB
DQB
TE WR
DB
8
7
7
6
6
10
9
11 12
propose()
DQB
QE8
(a)
(b)
Fig 15 Details of query invocation in the basic mode (a) Step 1ndash5 (b) Step 6ndash12
5httpjavasuncomxmljaxrpc6httpmsdnmicrosoftcomnetframework
M Scannapieco et al Information Systems 29 (2004) 551ndash582 571
44 Implementation
We have implemented a prototype of thebroker working in the basic query invocationmode In our prototype the Data Quality Brokeris realized as a Web Service [28] deployed on allthe organizations A Web Service-based solution issuitable for integrating heterogeneous organiza-tions as it allows to deploy the Data QualityBroker on the Internet regardless from the accesslimitations imposed by firewalls interoperabilityissues etc Moreover by exploiting standard Web
Service technologies the technological integrationamong different organizations is straightforwardspecifically we tested interoperability amongdifferent versions of the Data Quality Brokerrealized using respectively the JAX-RPC5 frame-work and the NET one6
As we discussed above the component respon-sible for the communication between the DataQuality Broker instances is the Transport Engine
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582572
Besides the operations previously described (iepropose() and invoke()) it exposes also theping() operation which is used to estimate thequery responsiveness Such operations are invokedthrough standard SOAP messages sent overHTTP specifically invocations to multipleTransport Engines are performed in parallelRemote invocations to the invoke() operationrequire the client Transport Engine to wait for allresults
If we assume that Data Quality Broker instancesare subject to faults this can cause the client to waitindefinitely for responses from faulty instancesFrom the distributed system theory stems that incommunication systems such as the Internetwhere delays on message transfer cannot bepredicted it is impossible to distinguish a faultysoftware component from a very slow one [2930]Then the client Transport Engine sets a timeoutfor each invoked source The timeout value isestimated for each single source on the basis of itsresponsiveness After the timeout expires no resultis returned for that query This is a best-effortsemantics that does not guarantee to always returnthe best quality value but rather enforces thecorrect termination of each query invocationReplication techniques can be exploited in orderto enhance the availability of each web serviceensuring more accurate results even in presence offaults [29]
7Let us remark that each attribute has an associated set of
quality dimensions
5 The quality notification service
The quality notification service (QNS) is used toinform interested users when changes in qualityvalues occur within the CIS
The interaction between the Quality Notifica-tion Service and a user follows the publishsubscribe paradigm a user willing to be notifiedfor quality changes subscribes to the QualityNotification Service by submitting the features ofthe events to be notified for through a specificsubscription language When a change in qualityhappens an event is published by the QualityNotification Service ie all the users which have aconsistent subscription receive a notification
The Quality Notification Service can be in theCIS to control the quality of critical data eg inorder to keep track of its quality changes and bealways aware when quality degrades under acertain threshold which makes it no longer suitedfor the use it is devoted to The Quality Notifica-tion Service can also exploited by other architec-tural components to maintain updated theinformation they use to perform their services
In this section we first introduce the QNSspecification ie (i) the language accepted by theQuality Notification Service to let user specify theset of notifications they are interested in and (ii)the form of quality change notifications receivedby subscribed users then we motivate and detailthe internal architecture of the service Theexplanation makes use of the running exampleintroduced in Section 21
51 Specification
Fig 16 illustrates a high-level depiction of theQuality Notification Service and its relationshipswith users and with the Quality Factory A quality
change occurs within an organization each time thevalue of a given quality dimension of a propertyvaries7 Quality changes are managed by theQuality Factory of each specific organization in away that depends on the internal policies andinfrastructures of the organization Upon eachquality change the Quality Factory reports to theQuality Notification Service the data objectinvolved in the change (according to the datarepresentation of the global schema) along withthe associated quality data As aforementionedthe Quality Notification Service is thus in chargeof notifying the new data to all interested usersaccording to the values of the quality data
Subscription language The Quality NotificationService subscription language has to allow quality-based subscriptions where quality dimensions areconstrained by threshold values In order toprovide users with a flexible subscription languagethe subscription granularity should allow toexpress interest in quality changes of sets of data
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
subscriber
QNS
QualityFactory
∆Qev
publisher
sub∆Q
∆Qev
Fig 16 Quality Notification Service
8For the sake of simplicity we omit to present aspects related
to user unsubscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582 573
objects As examples an user could be interested inmonitoring accuracy changes of enterprises lo-cated in Rome whose records are stored through-out the whole CIS whether another user could willto subscribe for currency changes of a specificenterprise whose records are stored in a specificorganization
Therefore a generic Quality Notification Servicesubscription is a triple in the form
subDQ frac14 d frac12 S exprD exprQS
where
d is a data class expressed on the global schema S is an optional set of sources for the data class
d ie organizations containing data objectsbelonging to d according to the mapping rules
exprD is an expression used to specify the set ofdata objects of class d whose quality changesare relevant to the user It is a conjunction ofconstraints on some properties of d A con-straint is a triple p op vS where p is aproperty of d op is an operator (egfrac14o gty) depending on the type of p and v
is a (possibly empty) value of the same type of p exprQ is an expression used to select notifica-
tions for quality changes occurred on dataobjects selected by dfrac12 S exprdS according tothe values of the quality data Even thisexpression is a conjunction of constraints iea triple pqd op tS where pqd is a qualitydimension of property p of data class d op is acomparison operator ethfrac14o gtpXTHORN and v is anumeric value ranging from 0 to 1
At least d must be specified while other para-meters can be left empty (ie set to gt)8
Once a user has subscribed for some notifica-tions he will receive them in the same form theyare produced by quality factory within eachorganization that is described below
Notifications of quality changes The QualityFactory component of each organization pushesevents to the Quality Notification Service in thefollowing form
DQevfrac14 d d fyqd vSiygS
where d is a data class of the global schema d is adata object of class d and the last element is a setof pairs qd vS where qd is a quality dimensionand v is the value of the associated quality datathat express the new values of the qualityattributes of data object d
Subscription matching and containment Asaforementioned users will receive only thosenotifications that match their subscriptions Anotification N frac14 dN dN fyqd vSiygS pro-duced by a source s matches a subscription S frac14dSfrac12 S exprD exprQS iff
(1)
the data classes of N and S coincide ie dN frac14dS and if S is specified (ie Sagt) then sAS
(2)
the data object dN satisfies exprD of S
(3)
the set fyqd vSiyg satisfies exprQ of S
Notification matching against a set of subscrip-tions can be efficiently evaluated using well knownalgorithms eg [3132]
We now introduce the concept of containment
between subscriptions As shown in the followingsubscription containment is useful in order toreduce both the use of memory and the networktraffic due to the services provided by the QualityNotification Service Intuitively a subscription S1
contains another subscription S2 if all notificationsthat match S2 also match S1 More formally giventwo subscriptions S1 frac14 d1frac12 S1 exprD1
exprQ1S
and S2 frac14 d2frac12 S2 exprD2 exprQ2
S then S1 con-
tains S2 iff
(1)
d1 is an ancestor node of d2 in the D2Q acyclic
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
Table 3
Examples of subscription to Quality Notification Service
Subscription
S1 Enterprise=HoldergtgtSS2 Enterprise=HoldergtNameAccuracyp medium SS3 Enterprise=HolderFiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
S4 Enterprise=Holder fINPS INAILgFiscalCode frac14lsquolsquoMCLMSM73H17H501F rsquorsquoyyNameAccuracyp medium S
9M
is con
numb
check
M Scannapieco et al Information Systems 29 (2004) 551ndash582574
graph and if both S1agt and S2agt thenS2DS1
(2)
for each data object d that satisfies exprD2 d
also satisfies exprD1
(3)
analogously for each set fyqd vSiyg thatsatisfies exprQ2
also exprQ1is satisfied by
fyqd vSiyg
It is possible to show that subscription contain-ment is a transitive relationship among subscrip-tions ie if S1 contains S2 that contains S3 then S1
contains S3 Furthermore it is also possible toshow that checking the containment of a subscrip-tion within another requires polynomial time9
Quality Notification Service Running ExamplesIn order to illustrate the defined language and thenotions of matching and containment we nowpresent some examples based on the global schemaof the running example
Consider the set of example subscriptions inTable 3 Subscription S1 refers to each qualitychange notifications for data objects belonging tothe class Holder (son of class Enterprise) fromany data source in the system The followingsubscriptions limit the set of interesting qualitychange notifications In particular S2 subscribesfor notifications of quality changes in the Name
attribute of objects of class Holder when theaccuracy of this attribute falls under or is equal tomedium Subscription S3 further restricts the set ofdata objects of interest to holders whose fiscal code
ore precisely if the complexity of checking if a constraint
tained within another is constant being n the sum of the
er of constraints contained in two subscriptions then the
costs Oethn2THORN
is equal to a specific value Finally subscription S4
is the same as the previous one but involves onlydata objects stored in two specific sources
It is easy to see that subscription S2 is containedin subscription S1 and contains S3 that in turncontains S4 Therefore S1 contains all others
Concerning matching a valid quality changeevent generated by a Quality Factory might be
Enterprise=Holder Name frac14 lsquolsquoMMecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
The afore event is received by Quality NotificationService that notifies all interested users Thenotification has the same form of the event pushedby QF In the case of the example subscriptionsgiven in Table 3 it is easy to see that thenotification matches S1 S2 and S3 Matchingwith subscription S4 depends on the organizationin which the quality change occurs
52 Overview of Quality Notification Service
Implementation
The Quality Notification Service is implementedand deployed as a distributed service Eachorganization hosts an independent instance ofthe Quality Notification Service that adopts thearchitecture described in this section A QualityNotification Service instance accepts subscriptionsonly from users inside the organization it resides in(ie its local users) and receives notifications fromthe local Quality Factory Quality NotificationService instances communicate among each otherin a peer-to-peer fashion That is a QualityNotification Service10 acts as a producer ofinformation when it has to propagate to otherQuality Notification Services the notificationsproduced by the Quality Factory of its organiza-tion On the other hand QNS acts as a consumer
of information when it receives notificationsproduced in other organizationsrsquo Quality Notifica-
10 In the following for simplicity we refer to a Quality
Notification Service instance as lsquolsquoa Quality Notification
Servicersquorsquo whenever this is not confusing
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 575
tion Services to be dispatched to its local usersClearly a Quality Notification Service can be aproducer only for data classes of the global schemafor which its organization is a source
The design of the internal architecture of theQuality Notification Service has to face two mainproblems (i) to cope with the heterogeneity ofplatforms and network infrastructures building thesystem and (ii) to scale to the large size we canexpect in a CIS context
Concerning heterogeneity a standard technolo-gical solution ie Web Services is exploited forthe communications among Quality NotificationService instances in order to achieve independencefrom the specific technological platform of eachorganization
Concerning scale the high number of possibleusers each possibly issuing several subscriptionsand the high rate of notifications possibly pro-duced by the Quality Factories are factors thatcould impact the performance of the system interms of computational resources and networkbandwidth In particular the high number ofsubscriptions that could be issued throughout thewhole system imposes a high memory andcomputational overhead for storing and matchingeach notification while possible frequent notifica-tions may generate a large network traffic
Both problems of heterogeneity and scale arefaced through a layered architecture in which alayer (namely the External Diffusion Layer EDL)deals with diffusing subscriptions and notificationsamong all organizations while another layer(namely the Internal Interface Layer IIL) dealswith handling subscriptions and notifications forall local users The subscriptions of local users inan organization are stored inside the IIL That iseach Quality Notification Service instancelsquolsquoknowsrsquorsquo only its local usersrsquo subscriptions buthas only a partial knowledge about the subscrip-tions issued by users of other organizationsKnowledge about non-local subscriptions is pro-pagated selectively by exploiting containmentrelations as we detail in the following limitingmemory consumption and network traffic
In the following we describe in further detailhow the Quality Notification Service works bypresenting two solutions namely Merge Subscrip-
tions and Diffusion Trees exploited in the layeredarchitecture of the Quality Notification Service inorder to deal with scalability
Merge subscriptions Under a complete absenceof lsquolsquoknowledgersquorsquo about subscriptions when aQuality Notification Service acts as a producereach notification should be blindly flooded to allother Quality Notification Services in order tosurely reach all interested users This wouldcause much unnecessary network traffic aseach notification would be received also byQuality Notification Services with no interestedlocal users
To avoid this the EDL of the Quality Notifica-tion Service in the generic organization Oi main-tains a particular subscription namely the merge
subscription denoted as Mi Mi contains all andonly the subscriptions of all the local users of Oi
and is computed from the union of the latter Forexample the merge of subscriptions S2 S3 and S4
in the examples of Table 3 is the subscription S2
itself Mi allows to receive all and only notifica-tions matching some subscriptions issued by someusers of Oi
The merge subscription of each consumerorganization is sent to all the producers ofnotifications possibly matching the merge sub-scription itself On the consumer side the schemamapping is exploited to determine the set ofpossible producers for a subscription This allowsfor further reducing the Quality NotificationService instances involved in the communicationwith a given consumer only to those that canactually generate interesting notifications
Upon a new notification a Quality NotificationService acting as a producer has to check whetherthe notification matches the merge subscription ofsome other Quality Notification Services Let usshow this with the example depicted in Fig 17Consider a producer Quality Notification ServiceQNSCoC and two consumers QNSINPS andQNSINAIL Suppose QNSINPS manages three usersrespectively interested in subscriptions S1 S2 andS3 of example in Table 3 while QNSINAIL managestwo users interested in subscription S3 and S4The merge subscriptions for QNSINPS andQNSINAIL are respectively subscription S1 andS3 These are stored in QNSCoC Now consider the
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
QNSINPS
S1S2S3
QNSINAIL
S3S4
QNS1 S1QNS2 S3
QNSCoC
QFe1e2
e1
e1
e2
U13
U12
U11
U21
U22
Fig 17 Exploiting Merge Subscriptions
M Scannapieco et al Information Systems 29 (2004) 551ndash582576
two following quality changes events generated inCoC for the dataclass class holder
e1 Enterprise=Holder Name frac14 lsquolsquoM MecallarsquorsquoAddress frac14
lsquolsquoVia dei Gracchi 71 Romarsquorsquo FiscalCode frac14
lsquolsquoMCLMSM73H17H501F rsquorsquo fnameAccuracy frac14 lowSgS
e2 Enterprise=HolderName frac14 lsquolsquoMScannapiecorsquorsquoAddress frac14
lsquolsquoVia Mazzini 27 Pagani ethSATHORNrsquorsquoFiscalCode frac14
lsquolsquoSCNMNC74S70G230T rsquorsquo fnameAccuracy frac14 mediumSgS
e1 matches S1 S2 and S3 while e2 matches onlyS1 Due to the afore propagation of mergesubscriptions QNSCoC can avoid the sending ofe2 to QNSINAIL as it can state from the mergesubscriptions that QNSINAIL has no users inter-ested in e2 On the other hand e1 is sent to bothconsumers Quality Notification Services as theyall have users interested in it However note thate2 is forwarded only to user U11 in QNSINPS ande1 is not forwarded to user U22 in QNSINAILHowever the additional matching check requiredto determine the users of a notification is handledby consumer Quality Notification Services inde-pendently from the producer
Let us finally point out that also subscriptiontraffic is reduced due to the use of merge
subscriptions Indeed only changes in subscriptionthat provoke a change in a merge subscription ofan organization have to be propagated to produ-cers For example if a new user in QNS1
subscribes to subscription S2 this is not commu-nicated to QNSp because it does not affect themerge subscription that remains S1
Overall the use of merge subscriptions requiresan additional computational step when a newsubscription and a new notification are issued Asvaluable counterpart the step allows to let flowbetween consumer and producer organizationsonly the necessary subscriptions and notificationscutting off all the lsquolsquouselessrsquorsquo inter-organizationnetwork traffic
Nevertheless even the necessary traffic gener-ated by notifications could be enough to overloadsome Quality Notification Services The followingsection describes a mean to reduce the probabilityof this event
Diffusion trees If a large number of QualityNotification Service consumers is interested innotifications coming from the same producerQuality Notification Service the latter could beforced to open a large number of concurrentnetwork connections to reach all of them In thepresence of a high rate of notifications thescalability of the system could result compromisedTo reduce the impact of this phenomena eachproducer Quality Notification Service once deter-
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
QNSp
QNS2
QNS5
QNS1
QNS4
QNS6
QNS3
QFe1
e1lt65gt
e1lt43gt
e1lt gte1lt gt
e1lt gt
e1lt gt
Fig 18 Diffusion tree example
M Scannapieco et al Information Systems 29 (2004) 551ndash582 577
mined all the consumers of a notification does notdirectly send it to them but it rather calculates adiffusion tree rooted at the producer and span-ning all the consumers The fan out of each nodeof the three is evaluated on the basis of informa-tion about the network connectivity capabilities ofeach network node Each notification is thus sentfrom a node of the tree to its sons along with theinformation about the subtree of further consu-mers interested in the notification (see Fig 18)This approach enables (i) to limit the number ofconcurrent network connections managed by aproducer and (ii) to distribute the load ofnotification propagation among all consumersaccording to their capabilities Therefore QualityNotification Services at intermediate levels of thetree act as forwarders of the notification for thelower levels That is upon receiving anotification they also send it again to the QualityNotification Services at the immediate lower levelin the tree
53 Quality notification service internal
architecture
Fig 19 depicts the internal architecture of ageneric Quality Notification Service instance andshows also the internal fine-grained functionalcomponents of each of the Quality NotificationService layers In the remainder of this section wedetail these components
Internal Interface Layer (IIL) The IIL consumerside comprises the following components
Interpreter receives subscriptions from localusers interprets the subscription language and
stores them into the Local Subscription Tableie a data structure storing all the subscriptionsof all local users of an organization
Local Matcher receives quality change notifica-tions from the quality factory and from theEDL and then matches them against the LocalSubscription Table in order to return notifica-tions to interested local users
Local Dispatcher implements the dispatchingmechanisms used to actually send notificationsto local users
The implementation of the subscription anddispatching mechanisms depend on the specificinternal communication infrastructure of eachorganization ie the Interpreted and the LocalDispatcher can be customized for each QualityNotification Service instance taking into accountspecific technological issues (eg pre-existence of apublishsubscribe middleware)
On the producer side the IIL receives qualitychange events from the local Quality Factory ofthe organization The data is first checked insidethe organization via the Local Matcher to bedispatched to interested local users The data isalso transferred to components of the EDL topropagate it outside the organization
External diffusion layer ethEDLTHORN As aforemen-tioned this layer is responsible for inter-organiza-tion notification diffusion We explained in Section52 that the communication among differentorganization is based on Web Services technologyThis means that each EDL instance exposes a WebService interface containing two methods namelynotify() (on the consumer side) and change-
Sub() (on the producer side) that are used toreceive from other Quality Notification Services anotification and a change of merge subscriptionrespectively
The producer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
External Matcher it checks the Quality Notifi-cation Service instances to which a notificationhas to be sent by matching the notificationagainst the merge subscriptions of otherQuality Notification Services The External
Merge Table data structure maintains the
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
Quality Factory
LocalSubscription
Table
Interpreter Local Dispatcher
Local Matcher
ExternalMergeTableMerger
ForwarderMerge
SubscriptionTable
External Matcher
Tree Builder
Consumer Producer
notify()subscribe()unsubscribe()
Web Service
changeSub()changeSub()
Web Service
External Diffusion Layer
Internal Interface Layer
SchemaMapper
notify()notify() notify()
Fig 19 Quality Notification Service
11Let us remark that a merge subscription is in the general
case obtained from the union of a set of user subscriptions that
do not satisfy containment Therefore a merge subscription can
be represented as a set of user subscriptions which are stored in
the Merge Subscription Table
M Scannapieco et al Information Systems 29 (2004) 551ndash582578
merge subscriptions of consumer organizationsreceived from the Merger component oftheir Quality Notification Service (seebelow) As a consequence such organizationsare all and only the possible consumers(ie they have at least one subscribed user)for data classes for which Oi is a producerFig 17 shows the External Merge Table forQNSCoC
Tree Builder Determines the actual diffusionpath followed by notifications and starts thenotification diffusion Notifications are propa-gated by invoking the notify() method onconsumer Quality Notification Services Eachinvocation contains also the specification of theremaining parts of the diffusion tree to whichthe notification has to be forwarded by thereceiver as shown in Fig 18
The consumer side of the EDL of a QualityNotification Service instance comprises the follow-ing components
Merger determines the merge subscription forthe organization when a subscribeunsubscriberequest occurs A data structure namely the
Merge Subscriptions Table maintains the mergesubscription of the organization as seen by eachproducer Quality Notification Service11 TheMerger sends updates to producers wheneverthe merge subscription changes As aforemen-tioned producers are identified using theSchema Mapper
Forwarder responsible for sending the notifica-tion to all the Quality Notification Services atlower levels of the diffusion tree by invokingtheir notify() method
Let us conclude this section by pointing out that inessence Quality Notification Service is a content-
based publishsubscribe system ie users cansubscribe by specifying information about thecontent of the notifications they are interested inImplementing content-based pubsub systems inlarge scale distributed systems is a hard task inparticular when aiming to implement efficient
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 579
matching and routing algorithms (see Section 6 forfurther details) However in this work the designof solutions for facing these issues has beensimplified by taking into account the existence ofother items of the DaQuinCIS framework (inparticular of the D2Q model of the SchemaMapper and of the Quality Factory)
6 Related work
Data Quality has been traditionally investigatedin the context of single information systemsOnly recently there is a growing attentiontowards data quality issues in the context ofmultiple and heterogeneous information systems[33334]
In cooperative scenarios the main data qualityissues regard [3] (i) assessment of the quality of thedata owned by each organization (ii) methods andtechniques for exchanging quality information(iii) improvement of quality within each cooperat-ing organization and (iv) heterogeneity due to thepresence of different organizations in general withdifferent data semantics
For the assessment (i) issue some of the resultsalready achieved for traditional systems can beborrowed In the statistical area a lot of work hasbeen done since the late 60rsquos Record linkagetechniques have been proposed most of thembased on the Fellegi and Sunter model [35] Alsoedit imputation methods based on the Fellegi andHolt [36] model have been proposed in the samearea Instead in the database area record match-ing techniques (eg [37]) and data cleaning tools(eg [38]) have been proposed as a contribution todata quality assessment
Methods and techniques for exchanging qualityinformation (ii) have been only partially addressedin the literature In [33] an algorithm to performquery planning based on the evaluation of datasourcesrsquo quality specific queriesrsquo quality and queryresultsrsquo quality is described The approach of [33]is different from our approach mainly because weassume to have quality metadata associated toeach quality value this gives us the possibility ofmore accurate answers in selecting data on thebasis of quality This is specifically important for
CISrsquos in which the requirements on data to beexchanged are very strict
When considering the issue of exchanging dataand the associated quality a model to export bothdata and quality data needs to be defined Someconceptual models to associate quality informa-tion to data have been proposed that include anextension of the Entity-Relationship model [39]and a data warehouse conceptual model withquality features described through the DescriptionLogic formalism [40] Both models are thought fora specific purpose the former to introduce qualityelements in relational database design the latter tointroduce quality elements in the data warehousedesign Whereas in the present paper the aim is toenable quality exchanging in a generic CISindependently of the specific data model andsystem architecture In [41] the problem of thequality of web-available information has beenfaced in order to select data with high qualitycoming from distinct sources every source has toevaluate some pre-defined data quality para-meters and to make their values available throughthe exposition of metadata Our proposal isdifferent as we propose an ad-hoc service thatbrokers data requests and replies on the basis ofdata quality information Moreover we also takeinto account improvement features (iii) that arenot considered in [41]
The heterogeneity (iv) issues are taken intoaccount by data integration works Data integra-tion is the problem of combining data residing atdifferent sources and providing the user with aunified view of these data [8] As described in [42]when performing data integration two differenttypes of conflicts may arise semantic conflicts dueto heterogeneous source models and instance-level
conflicts due to what happens when sources recordinconsistent values on the same objects The DataQuality Broker described in this paper is a systemsolving instance-level conflicts Other notableexamples of data integration systems within thesame category are AURORA [42] and the systemdescribed in [43] AURORA supports conflicttolerant queries ie it provides a dynamicmechanism to resolve conflicts by means of definedconflict resolution functions The system describedin [43] describes how to solve both semantic and
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582580
instance-level conflicts The proposed solution isbased on a multidatabase query language calledFraQL which is an extension of SQL with conflictresolution mechanisms Similarly to both suchsystems the Data Quality Broker supports dy-namic conflict resolution but differently fromthem the Data Quality Broker relies onto qualitymetadata for solving instance-level conflicts Asystem that also takes into account metadata forinstance-level conflict resolution is described in[44] Such a system adopts the ideas of the ContextInterchange framework [45] therefore contextdependent and independent conflicts are distin-guished and accordingly to this very specificdirection conversion rules are discovered on pairsof systems
For what concerns publishsubscribe commu-nication systems research has focused on algo-rithms to face large-size scenarios considering forexample efficient information diffusion [46ndash48] orefficient matching in content-based systems[3149] while other works take into account QoSparameters such as reliable message delivery [50]Descriptions of practical experiences with pubsubsystems can be found in [5152] To the best of ourknowledge the Quality Notification Service is thefirst content-based pubsub infrastructure specifi-cally designed for a cooperative context exploitinga data integration approach As a consequenceand differently from lsquolsquogenericrsquorsquo content-based pubsub-systems Quality Notification Service is en-abled to exploit other services such as the SchemaMapper to improve matching and routing ofsubscriptions and notifications
7 Conclusions and future work
Managing data quality in CISrsquos merges issuesfrom many research areas of computer sciencesuch as databases software engineering distrib-uted computing security and information sys-tems This implies that the proposal of integratedsolutions is very challenging In this paper anoverall architecture to support data quality man-agement in CISrsquos has been proposed The specificcontribution of the paper are (i) a model for dataand quality data exported by cooperating organi-
zations (ii) the design of a service for querying andimproving quality data and (iii) the design of aservice notifying changes on quality data tointerested organizations
Up to now the basic mode of the Data QualityBroker has been implemented We plan toinvestigate in future work the research issuesdescribed throughout the paper concerning theoptimized invocation mode We are currentlyinvestigating techniques for query optimizationbased on both data quality parameters and QoSparameters The optimization based on dataquality parameters can be performed on the basisof statistics on quality values As an examplesome values could be calculated on current dataexported by organizations (eg lsquolsquothe mediumaccuracy of citizens currently provided by organi-zation A is highrsquorsquo) As another example someother values could be calculated on thebasis of past performances of organizations inproviding good quality data (eg lsquolsquothe accuracy ofcitizens provided by organization A in thelast N queries is highrsquorsquo) In such a way the numberof possible query answers is reduced by eliminat-ing those that may not be conform toquality requirements according to the statisticalevaluation
The complete development of a framework fordata quality management in CISrsquos requires thesolution of further issues
An important aspect concerns the techniques tobe used for quality dimension measurement Inboth statistical and machine learning areas sometechniques could be usefully integrated in theproposed architecture The general idea is tohave quality values estimated with a certainprobability instead of a deterministic qualityevaluation In this way the task of assigningquality values to each data value could beconsiderably simplified
We have included an improvement feature atquery processing time in the Data Quality Brokernevertheless we still need to investigate someaspects of the instance reconciliation semanticsWersquod like to study how such an on-line improve-ment could be combined with an off-line improve-ment consisting of record matching proceduresperiodically run To this aim the record matching
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582 581
algorithm already used within the Data QualityBroker can also be used as a stand-alone algorithmperforming off-line improvement
Some further open issues concern data qualitymanagement Data ownership is an important oneFrom a data quality perspective assigning aresponsibility on data helps to actually engageimprovement actions as well as with reference toour architecture to trust sources providing dataIn some cases laws help to assign responsibilitieson data typologies but this is not always possibleWe are currently working on a trust model thefirst results of which are published in [9]
Acknowledgements
This work has been carried out in the context ofthe DaQuinCIS project therefore we would like tothanks all people involved in the project especiallyBarbara Pernici Pierluigi Plebani and Sara TucciPiergiovanni
Special thanks are due to Carlo Batini TizianaCatarci and Maurizio Lenzerini for their supportand useful discussions
Finally special thanks are due to the students whoeffectively realized the DaQuinCIS prototype duringtheir master thesis Luca De Santis Diego MilanoValeria Mirabella and Gabriele Palmieri Withoutthem all such a work would be not a reality
References
[1] G De Michelis E Dubois M Jarke F Matthes
J Mylopoulos M Papazoglou K Pohl J Schmidt
C Woo E Yu Cooperative Information Systems A
Manifesto in M Papazoglou G Schlageter (Eds)
Cooperative Information Systems Trends amp Directions
Accademic Press New York 1997
[2] C Batini M Mecella Enabling Italian e-Government
through a cooperative architecture IEEE Computer 34 (2)
[3] P Bertolazzi M Scannapieco Introducing Data Quality
in a Cooperative Context in Proceedings of the 6th
International Conference on Information Quality
(ICIQrsquo01) Boston MA USA 2001
[4] R Wang A Product Perspective on Total Data Quality
Management Commun ACM 41 (2)
[5] Y Wand R Wang Anchoring data quality dimensions in
ontological foundations Commun ACM 39 (11)
[6] T Redman Data Quality for the Information Age Artech
House 1996
[7] C Cappiello C Francalanci B Pernici P Plebani
M Scannapieco Data quality assurance in cooperative
information systems a multi-dimension quality certificate
in Proceedings of the ICDTrsquo03 International Workshop
on Data Quality in Cooperative Information Systems
(DQCISrsquo03) Siena Italy 2003
[8] M Lenzerini Data integration a theoretical perspective
in Proceedings of the 21st ACM Symposium on Principles
of Database Systems (PODS 2002) Madison Wisconsin
USA 2002
[9] L De Santis M Scannapieco T Catarci Trusting data
quality in cooperative information systems in Proceedings
of the 11th International Conference on Cooperative
Information Systems (CoopISrsquo03) Catania Italy 2003
[10] P Aimetti C Batini C Gagliardi M Scannapieco The
lsquolsquoServices To Enterprisesrsquorsquo Project an Experience of Data
Quality Improvement in the Italian Public Administration
in Experience Paper at the 7th International Conference on
Information Quality (ICIQrsquo02) Boston MA USA 2001
[11] P Hall G Dowling Approximate string matching ACM
Comput Surveys 12 (4)
[12] L Pipino Y Lee R Wang Data quality assessment
Commun ACM 45 (4)
[13] D Ballou R Wang H Pazer G Tayi Modeling
information manufacturing systems to determine informa-
tion product quality Manage Sci 44 (4)
[14] P Missier M Scannapieco C Batini Cooperative
Architectures Introducing Data Quality Technical Report
14-2001 Dipartimento di Informatica e Sistemistica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Roma Italy 2001
[15] A Tansell R Snodgrass J Clifford S Gadia AS (Eds)
Temporal Databases Benjamin-Cummings Menlo Park
CA 1993
[16] E Codd Relational database a practical foundation for
productivity Commun ACM 25 (2)
[17] R Bruni A Sassano Errors detection and correction in
large scale data collecting in Proceedings of the Fourth
International Conference on Advances in Intelligent Data
Analysis Cascais Portugal 2001
[18] M Fernandez A Malhotra J Marsh M Nagy
N Walshand XQuery 10 and XPath 20 Data Model
W3C Working Draft available from http
wwww3orgTRquery-datamodel (November 2002)
[19] H Thompson D Beech M Maloney N Mendelsohn
XML Schema Part 1 Structures W3C Recommendation
available from httpwwww3orgTRxmlschema-1
(May 2001)
[20] P Biron A Malhotra XML Schema Part 2 Datatypes
W3C Recommendation available from http
wwww3orgTRxmlschema-2 (May 2001)
[21] S DeRose E Maler D Orchard XML Linking Language
(XLink) Version 10 W3C Recommendation available
from httpwwww3orgTRxlink (June 2001)
[22] S DeRose R Daniel P Grosso E Maler J Marsh
N Walsh XML Pointer Language (XPointer) W3C
Recommendation available from httpwwww3org
TRxptr (August 2002)
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313
ARTICLE IN PRESS
M Scannapieco et al Information Systems 29 (2004) 551ndash582582
[23] J Ullman Information Integration Using Logical Views
in Proceedings of the 6th International Conference on
Database Theory (ICDT rsquo97) Delphi Greece 1997
[24] S Boag D Chamberlin M Fernandez D Florescu
J Robie J Simeon XQuery 10 An XML Query
Language W3C Working Draft available from
httpwwww3orgTRxquery (November 2002)
[25] D Milano Tesi di Laurea in Ingegneria Informatica
Universita di Roma lsquolsquoLa Sapienzarsquorsquo Facolta di Ingegneria
(2003 (in Italian) (the thesis is available by writing an
e-mail to monscandisuniroma1it))
[26] C Hwang K Yoon Multiple Attribute Decision Making
in Lectures Notes in Economics and Mathematical
Systems Vol 186 Springer Berlin 1981
[27] T Saaty The Analytic Hierarchy Process McGraw-Hill
New York 1980
[28] A Buchmann F Casati L Fiege M Hsu M Shan
(Eds) Proceedings of the 3rd VLDB International Work-
shop on Technologies for e-Services (VLDB-TES 2002)
Hong Kong China 2002
[29] T Chandra S Toueg Unreliable failure detectors for
reliable distributed systems J ACM (JACM) 43 (2) (1996)
225ndash267
[30] M Fischer N Lynch M Paterson Impossibility of
distributed consensus with one faulty process J ACM
(JACM) 32 (2) (1985) 374ndash382
[31] MK Aguilera RE Strom DC Sturman M Astley
TD Chandra Matching events in a content-based
subscription system in Symposium on Principles of
Distributed Computing 1999 pp 53ndash61
[32] F Fabret HA Jacobsen F Llirbat J Pereira
KA Ross D Shasha Filtering algorithms and imple-
mentation for very fast publishsubscribe systems SIG-
MOD Record (ACM Special Interest Group on
Management of Data) 30 (2) (2001) 115ndash126
[33] F Naumann U Leser J Freytag Quality-driven integra-
tion of heterogenous information systems in Proceedings
of 25th International Conference on Very Large Data Bases
(VLDBrsquo99) Edinburgh Scotland UK 1999
[34] L Berti-Equille Quality-extended query processing for
distributed processing in Proceedings of the ICDTrsquo03
International Workshop on Data Quality in Cooperative
Information Systems (DQCISrsquo03) Siena Italy 2003
[35] I Fellegi A Sunter A theory for record linkage J Am
Stat Assoc 64
[36] IP Fellegi D Holt A systematic approach to automatic
edit and imputation J Am Stat Assoc 71 (1976) 17ndash35
[37] M Hernandez S Stolfo Real-world data is dirty data
cleansing and the mergepurge problem J Data Min
Knowledge Discovery 1 (2)
[38] H Galhardas D Florescu D Shasha E Simon An
extensible framework for data cleaning in Proceedings of
the 16th International Conference on Data Engineering
(ICDE 2000) San Diego CA USA 2000
[39] R Wang H Kon S Madnick Data quality requirements
analysis and modeling in Proceedings of the Ninth
International Conference on Data Engineering (ICDE
rsquo93) Vienna Austria 1993
[40] M Jarke M Lenzerini Y Vassiliou P Vassiliadis (Eds)
Fundamentals of Data Warehouses Springer Berlin
1995
[41] G Mihaila L Raschid M Vidal Using quality of data
metadata for source selection and ranking in Proceedings
of the Third International Workshop on the Web and
Databases (WebDBrsquo00) Dallas Texas 2000
[42] L Yan M Ozsu Conflict tolerant queries in AURORA
in Proceedings of the Fourth International Conference on
Cooperative Information Systems (CoopISrsquo99) Edin-
burgh Scotland UK 1999
[43] K Sattler S Conrad G Saake Interactive example-
driven integration and reconciliation for accessing data-
base integration Inform Syst 28
[44] K Fan H Lu S Madnick D Cheung Discovering and
reconciling value conflicts for numerical data integration
Inform Syst 28
[45] S Bressan CH Goh K Fynn M Jakobisiak
K Hussein H Kon T Lee S Madnick T Pena J
Qu A Shum M Siegel The context interchange mediator
prototype in Proceedings of the ACM SIGMOD Inter-
national Conference on Management of Data Tucson
Arizona USA 1997
[46] P Eugster S Handurukande R Guerraoui A Kermar-
rec P Kouznetsov Lightweight probabilistic broadcast
in Proceedings of the International Conference on
Dependable Systems and Networks (DSN 2001) July
2001
[47] R Preotiuc-Pietro J Pereira F Llirbat F Fabret
K Ross D Shasha Publishsubscribe on the web at
extreme speed in Proceedings of the ACM SIGMOD
Conference on Management of Data Cairo Egypt 2000
[48] M Castro P Druschel A Kermarrec A Rowston
Scribe A large-scale and decentralized application-level
multicast infrastructure IEEE J Select Areas Commun
20 (8)
[49] A Campailla S Chaki EM Clarke S Jha H Veith
Efficient filtering in publish-subscribe systems using
binary decision diagrams in Proceedings of the Interna-
tional Conference on Software Engineering 2001
pp 443ndash452
[50] KP Birman The process group approach to reliable
distributed computing Commun ACM 12 (36) (1993)
36ndash53
[51] KP Birman A review of experiences with reliable
multicast Software-Pract Exper 9 (29) (1999) 741ndash774
[52] R Piantoni C Stancescu Implementing the Swiss
exchange trading system in Proceedings of the 27th
Annual International Symposium on Fault-Tolerant
Computing (FTCSrsquo97) June 1997 pp 309ndash313