27
Page 1 Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence Integrating Multiple Data Sources Integrating Multiple Data Sources using a Standardized XML Dictionary using a Standardized XML Dictionary TR TR Labs Labs - Winnipeg - Winnipeg Ramon Lawrence Ramon Lawrence University of Manitoba University of Manitoba [email protected] [email protected] Supervisor: Dr. Ken Barker Supervisor: Dr. Ken Barker

Page 1 Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence Integrating Multiple Data Sources using a Standardized XML

Embed Size (px)

Citation preview

Page 1

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integrating Multiple Data Sources Integrating Multiple Data Sources using a Standardized XML Dictionaryusing a Standardized XML Dictionary

Integrating Multiple Data Sources Integrating Multiple Data Sources using a Standardized XML Dictionaryusing a Standardized XML Dictionary

TRTRLabsLabs - Winnipeg - WinnipegTRTRLabsLabs - Winnipeg - Winnipeg

Ramon LawrenceRamon LawrenceUniversity of ManitobaUniversity of Manitoba

[email protected]@cs.umanitoba.caSupervisor: Dr. Ken BarkerSupervisor: Dr. Ken Barker

Ramon LawrenceRamon LawrenceUniversity of ManitobaUniversity of Manitoba

[email protected]@cs.umanitoba.caSupervisor: Dr. Ken BarkerSupervisor: Dr. Ken Barker

Page 2

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Outline

Introduction, Motivation, and Background Integration architecture components Integration architecture Example integration Applications to the WWW Future work and conclusions Demonstration of Unity

Page 3

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Introduction

Integration of data is required when accessing multiple databases within an organization or on the WWW.

Our focus is automatically combining database schema using schema integration.

Schema integration requires knowledge of data semantics and use of metadata.

Page 4

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Motivation

Organizations have several database systems which must interoperate.

Users often access multiple Web databases whose knowledge must be integrated and presented in a useful form.

Data warehouses and OLAP systems require data semantics to be understood and data to be cleansed and summarized.

Page 5

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Background

Schema integration involves combining diverse database schema into an integrated view by resolving conflicts.

Schema conflicts include naming, structural, and semantic conflicts.

Schema integration is required for database interoperability, but it is currently a manual process.

Page 6

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

LDBS

GTS

MDBS Architecture

Global Transaction Manager (GTM)

•processes global transactions•insures information in all LDBSs is consistent•submits subtransactions to the GTSs for each LDBS

Global Transaction Servers (GTSs)•one for each LDBS•converts subtransactions from the GTM into a form usable by the LDBS and vice versa

Local Database Systems (LDBSs)•databases combined into MDBS•unchanged as still process local transactions

GTM

Global Transactions

Local Transactions

subtransactions

GTSGTS GTS

LDBS LDBS LDBS

Page 7

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Previous Work

Research systems: integrating systems by logical rules (Sheth) defining global dictionaries (Castano) Carnot Project using the Cyc knowledge base

Industrial systems and standards: Metadata Interchange Specification (MDIS) XML, BizTalk, E-commerce portals

Page 8

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Architecture Objective

The objective of our architecture is to provide a system for automatically integrating diverse relational schemas into a multidatabase

Desirable properties: individual mappings - information sources integrated

one-at-a-time and independently global view constructed for query transparency handles schema conflicts - including semantic,

structural, and naming conflicts automated global integration - global view

constructed efficiently and automatically

Page 9

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

The Idea

The major idea is that schema conflicts can be resolved if we:

eliminate all naming conflicts define a language capable of determining schema

equivalence and performing transformations

With these two properties, schema conflicts can be resolved automatically at the global level

Page 10

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Architecture Components: The Global Dictionary

A global dictionary (GD) provides standardized terms to capture data semantics.

Hierarchy of terms related by IS-A or Has-A links Contains base set of common database

concepts, but new concepts can be added

A GD term is a single, unambiguous semantic definition.

Several GD entries for a single English word are required if the word has multiple definitions.

Page 11

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Architecture Components:Using the Global Dictionary

GD terms are used to build semantic names to describe the semantics of schema elements.

Semantic names have the form: semantic name = “[“CT [[;CT] | [,CT]] “]” CN CT = context term, CN = concept name each CT and CN is a single term from the GD

Semantic names are included in specifications describing a data source.

Page 12

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon LawrenceArchitecture Components:

X-Specs

Database metadata and semantic names are combined into specifications called X-Specs:

stored and transmitted using XML contains information on a relational schema organized into database, table, and field levels stores semantic names to describe and integrate

schema elements

Page 13

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon LawrenceArchitecture Components:

Integrating X-Specs

Each database to be integrated is described using a X-Spec.

Identical concepts in different databases are identified by similar semantic names.

Concepts with identical (or hierarchially related) semantic names are combined regardless of their physical representation in the individual databases.

Page 14

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Architecture

Our integration architecture consists of two separate phases:

capture process: X-Specs are constructed for each data source independently

integration process: X-Specs are combined using the integration algorithm which matches semantic names using the global dictionary

Page 15

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Architecture:The Capture Process

Capture process involves: automatically extracting the schema information

and metadata using a specification editor assigning semantic names to each schema

element (tables and fields) to capture their semantics

Page 16

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Architecture:The Capture Process

RelationalSchema

GlobalDictionary

X-SpecSpecification

Editor

AutomaticExtraction

DBA Lookupof terms

Page 17

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Architecture:The Integration Process

Integration process involves: automatically identifying identical concepts by

matching semantic names constructing a global view of database concepts

consisting of a hierarchy of concept terms resolving structural differences during query

generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)

Page 18

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Architecture:The Integration Process

Client

RDBMS

Integration Site

Subtransactions

Client………….

RDBMS……..

X-Spec X-Spec

Page 19

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Architecture Benefits

The benefits of the two phase architecture are:

Dynamic integration: schemas integrated as needed

X-Specs are constructed only once and independent of each other

Automatic conflict resolution by integrating based on semantic name rather than physical structure

Users are isolated from system names and organization by querying through a global view using semantic names for concepts

Page 20

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Example

Two claims databases to be integrated: ABC Company: Claims_tb(claim_id, claimant,

net_amount, paid_amount) XYZ Company: T_claims(id, customer, claim_amt),

T_payments(cid, pid, amount)

First step is to construct X-Specs for each database.

Page 21

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Example:ABC Database X-Spec

Type System Name Semantic Name

Table Claims_tb [Claim]

Field Claim_id [Claim] Id

Field Claimant [Claim;Claimant] NameField Net_amount [Claim] Amount

Field Paid_amount [Claim;Payment] Amount

Page 22

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Example:XYZ Database X-Spec

Type System Name Semantic Name

Table T_claims [Claim]

Field id [Claim] IdField customer [Claim;Customer] Name

Field claim_amt [Claim] Amount

Table T_payments [Claim;Payment]Field cid [Claim] IdField pid [Claim;Payment] Id

Field amount [Claim;Payment] Amount

Page 23

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Example:Integrated View

Global view after integration: [Claim]

Id Net amount [Customer]

name

[Payment] id amount

Page 24

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Integration Example:Discussion

Important points: system and field names are not presented to the

user who queries based on semantic names database structure is not shown to the user different physical representations for the same

concept are combined (e.g. payment (attribute) in ABC with payment table in XYZ database)

hierarchially related concepts (customer vs. claimant) are combined based on their IS-A relationship in the global dictionary

Page 25

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Applications to the WWW

Integrating diverse data sources is involved in constructing a data warehouse and other operational systems.

The WWW is a diverse organizations of databases which users access.

Automatically integrating web data sources by a browser or portal reduces query complexity and integration of results for the user.

Page 26

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Conclusions

Automatic integration of database schema is possible by using a global dictionary of terms and constructing semantic names for schema elements.

Integration of data sources has applications to the WWW and construction of data warehouses.

Page 27

Integrating Multiple Data Sources using a Standardized XML Dictionary Ramon Lawrence

Future Work

The integration architecture is evolving with standards on XML and captures metadata information in XML documents.

The system is being tested on sample problems, and a query mechanism is work-in-progress.

We are refining a prototype of the system called Unity.