[IEEE 2009 WRI World Congress on Software Engineering - Xiamen, China (2009.05.19-2009.05.21)] 2009 WRI World Congress on Software Engineering - Putting Feedback into Incremental Schema

Putting Feedback into Incremental Schema Matching

Zhao Cao, Kan Li, Yushu Liu

Beijing Laboratory of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing, 100081, P.R. China

{zhaoyang,likan,liuyushu}@bit.edu.cn

Abstract: The goal of schema matching is to identify correspondences between the elements of two schemas. Current state-of-the-art schema matching systems compute the candidate matches once and present them to the user in a single shot, without taking the user’s feedback into consideration. In this paper, we first provide a similarity and dissimilarity measurement between two schemas. Second, we propose an algorithm that lays out the schema in a style that is easy for the user to understand. Third, we propose an interactive and incremental matching model that takes the user’s feedback into account and provides a confirmation plan in real time to decrease the dissimilarity as quickly as possible. Finally, we integrate state-of-the-art schema matching algorithms into a general interactive and incremental schema matching framework.

Keywords: schema matching, feedback, interactive, database

I. INTRODUCTION

Schema matching is the task of building relationships between the elements of two schemas in heterogeneous and distributed data sources. It is recognized as one of the basic operations required by the process of data integration, data exchange[1][2] and data warehousing.

Schema matching has drawn significant research attention. After two decades of research in this area, various schema matching algorithms and prototypes [3][4][5] have been proposed in the database and artificial intelligence literature. In addition, a number of commercial visual programming tools are available that help an engineer produce mappings, such as Altova MapForce[7], BEA WebLogic Workshop[8], IBM WebSphere[9], Stylus Studio[10], and Microsoft BizTalk Mapper[11].

Schemas used in enterprise applications often contain thousands of elements. Because of its cognitive complexity, schema matching has traditionally been performed by human experts[6], and mapping development is costly and labor-intensive. Yet, despite years of intensive research, hardly any of the commercial mapping tools incorporate schema matching techniques.

Figure 1. Overview of schema mapping process

Generally, it is necessary for a human to verify and fine-tune the mappings generated by schema mapping tools. The user interacts with the tool, examines the candidate mappings produced, indicates which ones are correct and which are not, and creates additional mappings that the tool has missed. The process is iterative, as Figure 1 shows, and user involvement is a critical step. However, research in this area has been dominated by the algorithms that compute candidate matches and by the automation of this process, and it ignores the user. Little or no research has examined the human side of this problem, even though humans struggle to perform mappings even for small representations. For mapping tools to move beyond research labs, we need to begin focusing on the user’s needs during the mapping process.

Some users have difficulty understanding the mappings suggested by tools and eventually ignore them, relying solely on their own exploration of the schema. Past schema matching approaches have the following problems:

• They calculate the candidates only once;
• They display all the correspondences in a single shot;
• They do not consider the feedback from the user’s confirmation;
• They do not consider the contribution of each match;
• They merely combine several techniques without exploiting the advantages of each technique.

These five problems lead to low efficiency and overburden the user. It is unlikely that improved precision and recall will yield big productivity gains for the data architect who is developing an engineered mapping between independently developed schemas. This is especially true for mapping tasks that are unrelated to previous ones, where there are no validated mappings to reuse. We believe that the biggest productivity gains will come from a better user interface[19][20] rather than from more accurate schema matching algorithms. Examples include helping the user focus on the schema elements of interest by dynamically reorganizing them to fit on one screen, and providing workflow assistance to track what the user knows about elements that he has already examined.

All the existing approaches can provide the candidate mappings in an interactive and easy-to-operate style, but they take limited or no user feedback into consideration. In addition, some matches are more important than others: confirming such an essential match provides much heuristic information to the next candidate

World Congress on Software Engineering

978-0-7695-3570-8/09 $25.00 © 2009 IEEE

DOI 10.1109/WCSE.2009.373

332



computation and pruning. Finally, the current confirmation process is unordered: the system does not propose the next candidate match for confirmation but leaves the choice to the user or to random selection. We can instead present the user with the candidate whose confirmation most decreases the dissimilarity of the two schemas, in an order that helps the user understand the meaning of the two schemas.

We are interested in investigating how users understand and perform mappings between different data representations, how their cognitive load during this process can be reduced, and how to integrate state-of-the-art algorithms into a general framework.

In this paper we propose to take the user’s feedback into consideration to calculate and prune the candidates in real time and to optimize the user interface in incremental schema matching. The remainder of this paper is organized as follows. Section II discusses related work. Section III presents the similarity and dissimilarity measurement between two schemas, and Section IV presents our schema display and layout algorithm. Section V presents our algorithm that takes feedback into consideration. Finally, Section VI concludes the paper and presents our future work.

II. RELATED WORK

In state-of-the-art schema matching systems, schema matches are discovered by considering a wide variety of evidence that may indicate a match. This evidence includes similarity of the data, similarity of the schema and metadata information, preservation of constraints, and transitive similarity based on other known mappings[3][4][5]. For example, Cupid[12] is a general system that encompasses a variety of techniques such as linguistic analysis, structural matching and context dependencies. COMA and its successor COMA++[27] rely on fragment-based matching for XML data: the data is fragmented and the matching algorithms are then deployed over the fragments. Clio[13] focuses on providing a declarative way of specifying mappings between schemas. The use of multiple learners to infer mappings between a source schema and a target schema has been proposed in LSD[28], iMAP[26], and COMA[27]. The effectiveness of such an approach lies in the fact that the learners handle different types of information. The overall accuracy of the system can thus be improved when the mappings predicted by different learners are combined.

An alternative approach to handling schema heterogeneity in schema matching is reported in [23], in which a set of schema evolution operators is considered, and schema matching is viewed as a search problem in the space of schemas induced by applying sequences of these operators to one of the schemas to be matched.

A model management system[17] is a component that supports the creation, reuse, evolution, and execution of mappings between schemas represented in a wide range of meta-models. It is rather different from the mapping approach of Clio. Most of the work concerns two model management operations, compose[24] and inverse[25], both of which concern the semantics of query answering.

The previous algorithms and tools concentrate on the correctness of the matching algorithms. In [20], however, Falconer and Storey investigate the human decision-making process during the schema mapping task in order to reduce the cognitive load on mapping users. It is clear that cognitive support plays an important role in schema matching.

In [19], Robertson et al. proposed a series of visualization improvements that enable practical use of much larger schemas and maps, such as highlight propagation, automatic scrolling, coalescing trees, multi-selection and bendable links.

Bernstein first proposed incremental schema matching in [16]. It was integrated with a prototype version of Microsoft BizTalk Mapper. The tool suggests candidate matches for a selected schema element and allows convenient navigation between the candidates.

COMA++[27] automatically generates mappings between the source and target schemas, and draws lines between matching terms. Users can also define their own element matches by using the “edit” mode. In this mode, the user can select a source element and target element, and create a mapping between the two elements with a strength of 1.0. This mode also allows users to remove automatically generated mappings. The current mapping state can be saved at any time during the mapping procedure.

Once a user has verified a mapping, PROMPT’s algorithm uses this verified mapping to perform structural analysis based on the graph structure of the ontologies. This analysis usually results in further mapping suggestions. The process is repeated until the user determines that the mapping is complete.

III. DISSIMILARITY MEASUREMENT

There are various methods to compute the similarity between two strings, words, or articles, but until now there has been no measurement of the similarity or the distance between two schemas.

Similarity captures the corresponding aspects or features of two schemas, but we cannot obtain the similarity until we have the matching result, because the features of a schema cannot be extracted in a general way. The dissimilarity of two schemas is 1 if we have not found any corresponding features, and it decreases as more and more corresponding features are found. If two schemas match completely, the similarity is 1 and the dissimilarity is 0. After we get the matching result, we can compute the similarity. We first define the similarity between two schemas s and t as in Formula 1.

$$\mathrm{Similarity}(s,t) = \frac{\mathrm{map}(s,t)}{\left| G(s) \cup G(t) \right|} \tag{1}$$

where map(s,t) is the number of matches discovered, G(s) and G(t) are the sets of schema elements of schemas s and t respectively, $G(s) \cup G(t)$ is the union of the schema elements of s and t, and $|G(s) \cup G(t)|$ is the number of elements in that union.

We cannot get the similarity between s and t before we obtain the mapping results. In this situation there is little need to compute the similarity, so our model uses the dissimilarity defined in Formula 2.

$$\mathrm{Dissimilarity}(s,t) = 1 - \frac{\mathrm{map}(s,t)}{\left| G(s) \cup G(t) \right|} \tag{2}$$

Formulas 1 and 2 apply when the user matches the schemas manually. In automatic schema mapping, however, the system computes candidate mappings for each element and assigns a credibility to each candidate mapping, so these formulas cannot be used in that scenario.
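As an illustration of Formulas 1 and 2 in the manual scenario, the following is a minimal sketch. The function names, the element names and the prefixing of element names by schema are assumptions made for this example; the text does not specify how elements of the two schemas are identified within the union.

```python
def similarity(matches, source_elements, target_elements):
    """Formula 1: number of discovered matches over |G(s) U G(t)|."""
    union = set(source_elements) | set(target_elements)
    if not union:
        raise ValueError("both schemas are empty")
    return len(matches) / len(union)


def dissimilarity(matches, source_elements, target_elements):
    """Formula 2: 1 - Similarity(s, t)."""
    return 1.0 - similarity(matches, source_elements, target_elements)


# Hypothetical schemas; names are prefixed so the two element sets stay distinct.
s = ["s.id", "s.name", "s.phone"]
t = ["t.cust_id", "t.full_name", "t.email"]
confirmed = [("s.id", "t.cust_id"), ("s.name", "t.full_name")]

print(similarity(confirmed, s, t))      # 2 matches over 6 elements
print(dissimilarity(confirmed, s, t))
print(dissimilarity([], s, t))          # no matches yet
```

With no matches the dissimilarity is 1, and each manually confirmed match lowers it by one over the size of the union.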

A schema element has a set of candidate mappings: because an element may be similar to more than one element in the target schema, the mapping can be 1 to n. Every mapping has a strength between 0 and 1 that expresses its credibility. The more elements one element is mapped to, the less credibility each mapping has, so the similarity can be calculated as in Formula 3.

$$\mathrm{Similarity}(s,t) = \sum_{i=1}^{|M|} \sum_{j=1}^{|M_i|} \frac{\omega_{M_{ij}}}{|M_i|} \tag{3}$$

where M is the element set of s or t, $|M|$ is the size of the set M, $M_i$ is the set of candidate mappings for the i-th element, $|M_i|$ is the size of the set $M_i$, $M_{ij}$ is the j-th candidate mapping in $M_i$, and $\omega_{M_{ij}}$ is the strength or credibility of candidate mapping $M_{ij}$.

Consequently, the dissimilarity of two schemas in the automatic schema matching scenario can be calculated as in Formula 4.

$$\mathrm{Dissimilarity}(s,t) = 1 - \sum_{i=1}^{|M|} \sum_{j=1}^{|M_i|} \frac{\omega_{M_{ij}}}{|M_i|} \tag{4}$$

In the above formulas, the set M is the set of elements for which matches have been obtained. In the automatic and interactive schema mapping scenarios, the strength of a confirmed mapping is 1 and the strength of a non-confirmed mapping is computed automatically by the system.
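The strength-weighted measures of Formulas 3 and 4 can be sketched as follows. The dictionary layout, the function names and the example strengths are assumptions made for illustration; the sketch reads Formula 3 as a sum over elements of the average candidate strength and applies no further normalization.

```python
def weighted_similarity(candidates):
    """Formula 3: for each element i, add (sum_j w_ij) / |M_i|.

    `candidates` maps an element to a dict {target: strength in [0, 1]}.
    Elements without candidates contribute nothing.
    """
    total = 0.0
    for strengths in candidates.values():
        if strengths:
            total += sum(strengths.values()) / len(strengths)
    return total


def weighted_dissimilarity(candidates):
    """Formula 4: 1 - Similarity(s, t)."""
    return 1.0 - weighted_similarity(candidates)


# Hypothetical state before confirmation: two uncertain candidates for one element.
before = {"s.name": {"t.full_name": 0.6, "t.nick": 0.2}}
# After the user confirms one candidate, its strength is 1 and it stands alone.
after = {"s.name": {"t.full_name": 1.0}}

print(weighted_dissimilarity(before))
print(weighted_dissimilarity(after))
```

In this example, confirming the mapping raises the element's contribution from the average of two uncertain strengths to 1, which is how a confirmation lowers the dissimilarity.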

IV. SCHEMA DISPLAY AND LAYOUT

Most schema matching tools represent the schema as a tree and list the schema elements in arbitrary order. As stated in [16], most of the time spent on schema matching goes into understanding the meaning of the schema. To make the schema easier to understand, we introduce two techniques into our system: displaying the schema as an ER diagram, and displaying the diagram in a layout that is easy for the user to understand.

We provide two modes for schema display. One is the classical tree style; the other displays the model as an Entity-Relationship diagram that represents the relationships within the schema, because many database designers are more familiar with ER diagrams and the relationships in an ER diagram are explicit.

Schema mapping tools such as Clio provide automatic scrolling to the candidate matching element when the user selects a schema element in the source schema. However, the layout of elements is arbitrary: closely related elements may be placed far apart, which makes the schema hard for the user to understand. In our system, we present the ER diagram in an easily understood layout. We also adapt the layout as confirmation proceeds, to reduce the amount of scrolling.

The approach includes the following steps:

Step 1: Convert the schema into a graph;
Step 2: Analyze the graph, and place the nodes and edges in an easily understood layout;
Step 3: Display the schema according to the computed graph layout.

In the classical tree-style mode, we display related elements as near to each other as possible so as to decrease scrolling and make the schema easier to understand.

The main challenges lie in the first and second steps. For the first step, many algorithms exist for converting a schema into a graph. In our approach, if data instances are available, we assign a weight to each edge to express the correlation between its two nodes. The weight can be computed using information theory.
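One possible way to weight the edges from data instances is sketched below. Using empirical mutual information as the edge weight is an assumption of this example (the text only says the weight comes from information theory), and the function names and sample table are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two equally long columns."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts.
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def weighted_schema_graph(table):
    """Return {(col_a, col_b): weight} for every pair of columns of one table.

    `table` maps a column name to its list of instance values.
    """
    cols = sorted(table)
    return {
        (a, b): mutual_information(table[a], table[b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
    }


# Hypothetical instance data: city determines zip, flag is unrelated to both.
table = {
    "city": ["A", "A", "B", "B"],
    "zip": ["1", "1", "2", "2"],
    "flag": ["x", "y", "x", "y"],
}
graph = weighted_schema_graph(table)
for edge, weight in sorted(graph.items()):
    print(edge, round(weight, 3))
```

A layout step could then place strongly weighted pairs, such as city and zip here, close together.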

V. PUTTING FEEDBACK INTO CONSIDERATION

A characteristic feature of past approaches to schema matching is that they attempt to calculate the set of correspondences between all schema elements in a single shot. Invariably, the results presented to the engineer include many false positives and uncertain candidates, especially for large schemas. Such false positives and uncertain candidates require a lot of manual clean-up and confirmation. During clean-up and confirmation the user gives various kinds of feedback to the system, but existing algorithms do not take this feedback into consideration. We consider how to use this feedback in refining and generating candidate mappings.

In the following, we present the main ideas of how to take feedback into consideration in the interactive and incremental schema matching method. First, we describe the types of feedback from users and analyze the effect of each type. We then describe how to generate the confirmation plan according to the user’s interaction input and the current mapping status. Finally, we describe how to integrate existing schema matching algorithms into a unified framework based on a general interactive model.


A. Types of feedback and their effect

Schema matching algorithms generally combine heuristic measurements based on three general criteria: syntactic similarity between concept terms, semantic similarity between concept terms, and structural similarity.

In an interactive schema matching system, feedback from the users includes the following types:

• Confirmation of correct mappings
• Pruning of incorrect mappings
• Extra information from the schema or added by the user

Confirming a correct mapping sets the strength of this mapping to 1 and deletes the other candidate mappings for this element. This type of feedback decreases the uncertainty of schema matching. In particular, it provides evidence and heuristic information to the mapping system, which can be used to generate new candidates, prune incorrect candidates and re-compute the weights of related candidate mappings. For example, if two key elements are matched, we can use the duplicate-based schema matching method[21] to match the schemas, provided duplicates can be detected at very low cost. If data instances are not available, we use graph theory to generate new candidates or re-compute the weight of each candidate. If two important nodes in the two graphs are matched, the similarity of their neighboring nodes increases.
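The effect of a confirmation on the candidate table can be sketched as follows. The fixed additive `boost` for neighboring pairs, the adjacency dictionaries and all element names are assumptions made for this example; the text does not give a concrete update rule.

```python
def apply_confirmation(candidates, neighbors_s, neighbors_t, element, target, boost=0.1):
    """Return a new candidate table after the user confirms element -> target."""
    updated = {e: dict(c) for e, c in candidates.items()}
    # Confirmed mapping: strength 1, other candidates for this element removed.
    updated[element] = {target: 1.0}
    # Neighbours of the confirmed pair become more plausible matches
    # (hypothetical rule: add a fixed boost, capped at 1).
    for ns in neighbors_s.get(element, ()):
        for nt in neighbors_t.get(target, ()):
            if ns in updated and nt in updated[ns]:
                updated[ns][nt] = min(1.0, updated[ns][nt] + boost)
    return updated


# Hypothetical candidate table and schema-graph adjacency.
candidates = {
    "order.id": {"po.number": 0.5, "po.date": 0.3},
    "order.date": {"po.date": 0.4, "po.number": 0.2},
}
neighbors_s = {"order.id": ["order.date"]}
neighbors_t = {"po.number": ["po.date"]}

result = apply_confirmation(candidates, neighbors_s, neighbors_t, "order.id", "po.number")
print(result)
```

The function returns a copy instead of mutating its input, so the state before the confirmation remains available if the user later revises the decision.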

Pruning incorrect matches also decreases the uncertainty of schema matching. If we are sure that some evidence or heuristic information leads to incorrect mappings, we do not use that information in the next computation.

The third type of information is helpful for future mapping generation but is easily overlooked. It includes the following.

The first is the declaration of relationships within a schema. Many schemas have lost some of their constraints, such as key and foreign key constraints or the types of some schema elements. If this information is provided to the schema mapping system, the system can prune many false mappings and increase the reliability of some candidate mappings. Our system therefore gives the user a convenient way to enter this type of information as needed.
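As a small illustration of how user-declared constraints might prune candidates, the sketch below drops candidate pairs whose declared types differ. Strict type equality as the compatibility test, the function name and the sample declarations are assumptions of this example; a real system would likely use a richer compatibility relation.

```python
def prune_by_declared_types(candidates, declared_types):
    """Drop candidates whose declared types differ; keep pairs with unknown types."""
    pruned = {}
    for element, strengths in candidates.items():
        src_type = declared_types.get(element)
        pruned[element] = {
            target: w
            for target, w in strengths.items()
            if src_type is None
            or declared_types.get(target) is None
            or declared_types[target] == src_type
        }
    return pruned


# Hypothetical candidates and user-declared element types.
candidates = {
    "order.date": {"po.date": 0.4, "po.number": 0.3, "po.note": 0.2},
}
declared_types = {"order.date": "date", "po.date": "date", "po.number": "int"}

pruned = prune_by_declared_types(candidates, declared_types)
print(pruned)
```

Candidates whose type is not declared are kept, since missing information is not evidence of a mismatch.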

The second is the data instances available for each schema. Instances are another important type of heuristic information for schema matching. We can use instance values, instance patterns and machine learning methods to generate candidate mappings. In existing systems, all these methods are used once at the beginning of the matching process, whereas some of them can be applied more efficiently and effectively after some matches have been confirmed through interaction between the user and the mapping tool. For example, once we have confirmed the mapping between two schema elements that are the keys of their tables in a relational model, we can use the duplicate-based method to calculate new candidates and verify the pre-calculated candidates. In addition, we can inject selected data instances into the system before schema matching. These data instances can be used to assist the mapping and to debug it, illustrating why two elements are matched by the system.
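The idea of exploiting a confirmed key mapping can be sketched as follows. This is a simplified stand-in and not the duplicate-based algorithm from the literature: it assumes that rows sharing a key value describe the same real-world object and scores each column pair by the fraction of such row pairs with equal values. The function name, the row layout and the sample data are hypothetical.

```python
def duplicate_based_scores(rows_s, rows_t, key_s, key_t):
    """Score column pairs by value agreement on rows that share a key value."""
    index_t = {row[key_t]: row for row in rows_t}
    pairs = [(row, index_t[row[key_s]]) for row in rows_s if row[key_s] in index_t]
    if not pairs:
        return {}
    scores = {}
    for col_s in rows_s[0]:
        if col_s == key_s:
            continue
        for col_t in rows_t[0]:
            if col_t == key_t:
                continue
            agree = sum(1 for a, b in pairs if a[col_s] == b[col_t])
            scores[(col_s, col_t)] = agree / len(pairs)
    return scores


# Hypothetical instances; the user has confirmed the key mapping id -> cid.
rows_s = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": "Rome"},
    {"id": 3, "name": "Cy", "city": "Lima"},
]
rows_t = [
    {"cid": 1, "full_name": "Ann", "town": "Oslo"},
    {"cid": 2, "full_name": "Bob", "town": "Paris"},
]

scores = duplicate_based_scores(rows_s, rows_t, "id", "cid")
for pair, score in sorted(scores.items()):
    print(pair, score)
```

The resulting scores could serve either as new candidate strengths or as a check on candidates computed before the key was confirmed.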

B. Confirmation plan generation

Existing schema mapping tools generate a huge number of candidates for each element and draw so many lines between schema elements that users are overwhelmed. Some users even abandon the candidate mappings provided by the system and build the mappings themselves. Providing a good interface is therefore essential in schema mapping tools. COMA, Clio and BizTalk Mapper introduce highlighting and scrolling techniques, and BizTalk Mapper additionally lists the candidate mappings for the selected schema element ranked by the similarity or reliability of the mapping. These techniques make the mapping confirmation process more convenient, but the large number of lines can still exhaust the user’s patience. We should provide a simple but useful interface to the user.

In the matching confirmation process, confirming some candidates provides much information for decreasing the dissimilarity, while confirming others provides little information for automatic mapping verification and generation. For example, confirming a key element provides much information, whereas confirming a common element, or an element with low mutual information with other elements, provides less information for the next candidate generation.

Our system provides two modes. One is the classical mode, as in Clio, which presents all the candidates on one screen; the other is the workflow mode, in which the order of confirmation is directed by the system automatically and in real time. The candidate elements to be confirmed are sorted according to the contribution of each mapping. How to measure this information contribution is the main challenge in the process.

In our approach, we use information theory and graph theory to compute the information contribution of a mapping: the former when a large number of data instances is available, the latter when no instances exist.

If a large number of data instances exists, the contribution of each match can be calculated according to the algorithm used in the system. If no data instances are provided, we use graph theory to compute the contribution of each mapping.

After the contribution of each mapping is computed, we sort the candidate mappings in descending order and show the top-k mappings by contribution to the user for confirmation. The value of k can be configured according to the user’s needs.
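The top-k selection step can be sketched as follows. The contribution measure used here (degree of the source element in the schema graph multiplied by the candidate strength) is a hypothetical stand-in for the graph-theoretic contribution, which the text leaves open; the function name and sample data are likewise assumptions.

```python
import heapq


def confirmation_plan(candidates, degree, k=3):
    """Return the k candidate mappings with the highest estimated contribution.

    Contribution is a hypothetical stand-in: degree of the source element in
    the schema graph, multiplied by the candidate's strength.
    """
    scored = [
        (degree.get(element, 0) * w, element, target)
        for element, strengths in candidates.items()
        for target, w in strengths.items()
    ]
    top = heapq.nlargest(k, scored)  # sorted from highest to lowest score
    return [(element, target) for _, element, target in top]


# Hypothetical candidates and node degrees in the source schema graph.
candidates = {
    "order.id": {"po.number": 0.5},
    "order.note": {"po.comment": 0.9},
    "order.date": {"po.date": 0.5},
}
degree = {"order.id": 4, "order.note": 1, "order.date": 2}

plan = confirmation_plan(candidates, degree, k=2)
print(plan)
```

In this example the well-connected key element is proposed first even though its strength is lower than that of the isolated note field, which mirrors the observation that key elements carry more information.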

C. How to refine and generate the candidates

Various schema matching methods have been proposed so far; each type of method has its own advantages and disadvantages in different application fields and scenarios.

Current schema matching heuristic methods include the following types:


• Schema-based: the heuristic information includes linguistic (lexical, acronyms) similarity, constraints (types, keys) and structure (nesting context, neighborhood) similarity.

• Instance-based: the heuristic information includes instance values, value patterns and machine learning.

• Reuse-based: the heuristic information includes thesauri and validated matches.

• Action-based: the heuristic information includes recent matches and implicit scope.

How to integrate all these algorithms into a unified approach is a pressing problem. Based on an analysis of the advantages and disadvantages of these algorithms and of their application scenarios, we use the traditional approach for the initial candidate generation; during the confirmation process, we then use different algorithms to update the weights of candidate mappings, generate new candidates and prune incorrect candidates.

The main challenge of interactive and incremental schema matching is to make the algorithms efficient, rather than simply re-running the existing algorithms at each step, so as to avoid unnecessary computation cost. Our main idea is to reuse intermediate results and to select the most appropriate algorithm for re-computing the candidates according to the feedback provided by the user.
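Reusing intermediate results across feedback steps can be sketched as a small cache around the initial matcher. The class name, the use of a string-similarity ratio from Python's `difflib` as the initial matcher, and the `calls` counter are assumptions made for illustration; they stand in for whichever matching algorithms the framework integrates.

```python
from difflib import SequenceMatcher


class IncrementalMatcher:
    """Caches base scores so that feedback does not trigger a full re-run."""

    def __init__(self, source, target):
        self.source, self.target = list(source), list(target)
        self.base = {}          # (s, t) -> cached name similarity
        self.confirmed = {}     # s -> t
        self.rejected = set()   # {(s, t)}
        self.calls = 0          # number of base-score computations performed

    def _base_score(self, s, t):
        # Compute the (hypothetical) initial score once, then reuse it.
        if (s, t) not in self.base:
            self.calls += 1
            self.base[(s, t)] = SequenceMatcher(None, s.lower(), t.lower()).ratio()
        return self.base[(s, t)]

    def candidates(self, s):
        if s in self.confirmed:
            return {self.confirmed[s]: 1.0}
        return {
            t: self._base_score(s, t)
            for t in self.target
            if (s, t) not in self.rejected
        }

    def confirm(self, s, t):
        self.confirmed[s] = t

    def reject(self, s, t):
        self.rejected.add((s, t))


matcher = IncrementalMatcher(["name", "phone"], ["full_name", "telephone"])
first = matcher.candidates("name")
matcher.reject("name", "telephone")
second = matcher.candidates("name")    # served from the cache
matcher.confirm("name", "full_name")
third = matcher.candidates("name")
print(first, second, third, matcher.calls)
```

After the rejection and the confirmation, the candidates for the affected element are rebuilt from cached scores, so no base score is computed a second time.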

VI. CONCLUSION AND FUTURE WORK

In this paper, we have presented an interactive and incremental schema mapping approach. We provide a similarity measurement and a dissimilarity measurement for two schemas. The main idea of the approach is to feed the user’s feedback into the mapping system; we also provide the user with a real-time confirmation plan that decreases the dissimilarity as quickly as possible by pruning incorrect mappings and updating the strength of each candidate match. Finally, we integrate existing schema matching algorithms into a general framework that utilizes the advantages of each algorithm in different application scenarios.

Significant future work is suggested by this approach on both the theoretical and practical fronts. In the future, we want to study the following problems:

• Refine the similarity measurement for automatically generated mappings;

• Build the mathematical model of the feedback;

• Build the mathematical model for the confirmation.

ACKNOWLEDGMENT

The research was supported in part by the Ministerial Level Advanced Research Foundation and the Beijing Key Discipline Program. We thank Pei Sun, Xin Sun, Fei Song and Shibin Zhou for their helpful discussions and advice.

REFERENCES

[1] M. Lenzerini. Data Integration: A Theoretical Perspective. In PODS, pp. 233–246, 2002.

[2] P. G. Kolaitis. Schema mappings, data exchange, and metadata management. In PODS, pp. 61–75, 2005.

[3] E. Rahm, P. A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, Vol 10(4), pp. 334–350, 2001.

[4] P. Shvaiko and J. Euzenat. A survey of schema-based matching approaches. Journal of Data Semantics, Vol 4, pp. 146 – 171, 2005.

[5] H. Do, S. Melnik, and E. Rahm. Comparison of schema matching evaluations. In Proc. of the 2nd International Workshop on Web Databases (German Informatics Society), pp. 221-237, 2002.

[6] R. Hull. Managing semantic heterogeneity in databases: A theoretical perspective. In Proc. of the ACM SIGACT-SIGMOD-SIGART PODS, pp. 51–61. ACM Press, 1997.

[7] Altova MapForce. http://www.altova.com/productsmapforce.html.

[8] BEA WebLogic Workshop. http://www.bea.com/framework.jsp?CNT=index.htm&FP=/content/products/workshop/.

[9] C. Lau. Developing XML Web Services with Websphere Studio Application Developer. IBM Systems Journal, July 2002.

[10] Stylus Studio. http://www.stylusstudio.com/.

[11] Microsoft BizTalk Server 2004. BizTalk Mapper. http://msdn.microsoft.com/ library/en-us/introduction/htm/ebiz intro story jgtg.asp, 2004.

[12] Jayant Madhavan, P. A. B., Erhard Rahm. Generic schema matching with Cupid. In Proc. VLDB, pp. 49-58, 2001.

[13] Laura M.Hass, M. A. H., Howard Ho. Clio Grows Up: From Research Prototype to Industrial Tool. SIGMOD 2005, pp. 805-810, 2005.

[14] P. A. Bernstein, S. M., P. Mork. Interactive Schema Translation with Instance-Level Mappings. In Proc.VLDB, pp. 1283-1286, 2005.

[15] L. Chiticariu, W. T. Debugging Schema Mappings with Routes. In Proc. VLDB, pp 79-90, 2006.

[16] P. A. Bernstein. Incremental Schema Matching. In Proc. VLDB, pp 1167-1170, 2006.

[17] P. A. Bernstein, S. M. Model management 2.0: manipulating richer mappings. In Proc. of the 2007 ACM SIGMOD, pp. 1-12, 2007.

[18] A. Gal. Why is Schema Matching Tough and What Can We Do About It?. SIGMOD Record, Vol 35(4), pp. 2-5, 2007.

[19] George G. Robertson, M. P. C., John E. Churchill. Visualization of Mappings Between Schemas. In Proc. of the SIGCHI conference on Human factors in computing systems, pp. 431 – 439, 2005.

[20] S.M. Falconer and M. Storey. Cognitive Support for Human-Guided Mapping Systems. Tech. Report DCS-318-IR, 2007, Univ. of Victoria.

[21] A. Bilke and F. Naumann. Schema matching using duplicates. In Proc. of ICDE, pp. 69–80, 2005.

[22] M. Sayyadian, Y. Lee, A. Doan, and A. Rosenthal. Tuning schema matching software using synthetic scenarios. In Proceedings of the International conference on very Large Data Bases (VLDB), pp. 994–1005, 2005.

[23] G. H. L. Fletcher and C. M. Wyss. Relational data mapping in MIQIS (demo). In Proc. of the 2005 ACM SIGMOD, pp. 912-914, 2005.

[24] R. Fagin, P.G. Kolaitis, and L. Popa, W.C. Tan. Composing Schema Mappings: Second-order Dependencies to the Rescue. ACM TODS, Vol 30(4), pp. 994-1055, 2005.

[25] R. Fagin, P.G. Kolaitis, L. Popa, and W.C. Tan. Quasi-inverses of Schema Mappings. In Proc. of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pp. 123-132, 2007.

[26] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, and P. Domingos. iMAP: discovering complex semantic matches between database schemas. In Proc. of the 2004 ACM SIGMOD, pp. 383-394, 2004.

[27] H. Do and E. Rahm. COMA - a system for flexible combination of schema matching approaches. In VLDB, pp. 610-621, 2002.

[28] A. Doan, P. Domingos, and A. Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of 2001 ACM SIGMOD international conference on Management of data, pp. 509-520, 2001.
