View
217
Download
0
Category
Preview:
Citation preview
Automatic Integration of Automatic Integration of Relational Database SystemsRelational Database Systems
Automatic Integration of Automatic Integration of Relational Database SystemsRelational Database Systems
Ramon LawrenceRamon LawrenceUniversity of ManitobaUniversity of Manitoba
umlawren@cs.umanitoba.caumlawren@cs.umanitoba.ca
Ramon LawrenceRamon LawrenceUniversity of ManitobaUniversity of Manitoba
umlawren@cs.umanitoba.caumlawren@cs.umanitoba.ca
Page 2
Outline Introduction, Motivation, and Background Our integration approach The integration architecture
Standard dictionary, X-Specs, query processor
Example integration Northwind, Southstorm databases
Querying the integrated databases Generating SQL queries from semantic queries
Unity implementation Contributions, Conclusions, and Future Work
Page 3
Database Terminology Database system - is a database and a system to manage
the data. Schema - is a description of the data organization and
format in a database. Schema integration - is the process of combining local
schemas into a global, integrated view by resolving conflicts present between the schemas.
Data integration - is the process of combining data at the entity-level. It requires resolving representational conflicts and determining equivalent keys.
Multidatabase system (MDBS) - is a collection of autonomous, local databases participating in a global database system to share data.
Page 4
What is Integration? Two levels of integration:
Schema integration - the description of the data Data integration - the individual data instances
Integration problems include: Different data models and conflicts within a model Incompatible concept representations Different user or view perspectives Naming conflicts (homonym, synonym)
Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).
Page 5
Why is Integration Required? There are many integration environments:
Operational systems within an organization System integration during company merger Data warehouses, Intranets, and the WWW
Users require information from many data sources which often do not work together.
Companies require a global view of their entire operations which may be present in numerous operational databases for different departments and distributed geographically.
E-commerce demands integration of web databases with production systems.
Page 6
What is the Current Solution? Manual Integration Algorithms:
Allow designer to detect and resolve conflicts Manipulate information using semantic models
Knowledge bases/Artificial Intelligence: Cyc knowledge base and Carnot project
Global Dictionaries and Lexical Semantics: Wordnet, Clio, Summary schemas model Concept hierarchies (Castano)
Page 7
What is the Current Solution? (2) SQL and multidatabase query languages:
SQL, MSQL, IDL, DIRECT, SchemaSQL Requires user to understand DB structure & semantics
Wrapper and mediator systems: Information Manifold, TSIMMIS, Infomaster Use query languages or description logics Focus on query rewriting and reformulation
Industrial standards: XML, BizTalk, E-commerce portals Apply to limited domains/industries Require standard structures and database changes
Page 8
Previous Work Summary Current techniques for database integration have
some of these problems: Require integrator to understand all databases Integration process is manual Do not hide system complexity from the user Force changes on the existing database systems Construct global view manually Suffer from query imprecision (query containment)
Page 9
Our Approach Our approach combines standardization and
query mapping algorithms. The major idea is that schema conflicts can be
resolved if we: Eliminate all naming conflicts Define a language capable of determining schema
equivalence and performing transformations
Naming conflicts are eliminated by accepting a standard term dictionary.
Not a knowledge base or set of mediated views Leverages semantic information in English words
Integration Architecture
Architecture Components: 1) Integrated Context View
• user’s view of integration 2) X-Spec Editor
• stores schema & metadata• uses XML
3) Standard Dictionary• terms to express semantics
4) Integration Algorithm• combines X-Specs into integrated context view
5) Query Processor• accepts query on view• determines data source mappings and joins• executes queries and formats results
Local Transactions
X-Spec
X-Spec Editor
Standard Dictionary
Integration Algorithm
Integrated Context View
Query Processor and ODBC Manager
Database
Client
Subtransactions
Client
Multidatabase Layer
Database
X-Spec
Page 11
Architecture Components The architecture consists of four components:
A standard dictionary (SD) to capture data semantics SD terms are used to build semantic names describing
semantics of schema elements.
X-Specs for storing data semantics Database metadata and semantic names stored using XML
Integration Algorithm Matches concepts in different databases by semantic names. Produces an integrated view of all database concepts.
Query Processor Allows the user to formulate queries on the view. Translates from semantic names in integrated view to SQL
queries and integrates and formats results. Involves determining correct field and table mappings and
discovery of join conditions and join paths
Page 12
The integration architecture consists of three separate processes:
Capture process: independently extracts database schema information and metadata into a XML document called a X-Spec.
Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view.
Query process: allows the user to formulate queries on the integrated view that are mapped by the query processor to structural queries (SQL) and the results are integrated and formatted.
Integration Processes
Page 13
Integration Architecture:The Capture Process
RelationalSchema
StandardDictionary
X-SpecSpecification
Editor
AutomaticExtraction
DBA Lookupof terms
Capture process involves: Automatically extracting the schema information and
metadata using a specification editor Assigning semantic names to each schema element
(tables and fields) to capture their semantics
Page 14
Architecture Components: The Standard Dictionary A standard dictionary (SD) provides
standardized terms to capture data semantics. Hierarchy of terms related by IS-A or Has-A links Contains base set of common database concepts, but
new concepts can be added
A SD term is a single, unambiguous semantic definition.
Several SD entries for a single English word are required if the word has multiple definitions.
The top-level dictionary terms are those proposed by Sowa.
Page 15
Architecture Components: Dictionary vs. Knowledge Base
The standard dictionary differs from a knowledge base such as Cyc because:
Not intended to be a general English dictionary or contain knowledge facts about the world Dictionary is evolved as new terms are required Not all English words are used
Dictionary provides the systems with no “knowledge” Since no facts are stored, system cannot deduce new facts Dictionary terms are just semantic place holders, integrators
determine the semantics of the database not the system
Simplified organization Dictionary is organized as a tree for efficiency and simplicity in
determining related concepts
Re-use of terms Terms are re-used in semantic names
Page 16
Architecture Components:Using the Standard Dictionary SD terms are used to build semantic names
describing semantics of schema elements. Semantic names have the form:
semantic name := [CT_Type] | [CT_Type] CN CT_Type := CT | CT {; CT} | CT {,CT} CT := context term, CN := concept name each CT and CN is a single term from the SD
Semantic names are included in specifications describing a database.
Page 17
Northwind & Southstorm Integration Example
Northwind Database SchemaTables Fields
Categories CategoryID, CategoryNameCustomers CustomerID, CompanyNameEmployees EmployeeID, LastName, FirstNameOrderDetails OrderID, ProductID, UnitPrice, QuantityOrder OrderID, CustomerID, EmployeeID, OrderDate,
ShipviaProducts ProductID, ProductName, SupplierID, CategoryIDShippers ShipperID, CompanyNameSuppliers SupplierID, CompanyName
Page 18
Northwind & Southstorm Integration Example (2)
Southstorm Database SchemaTables Fields
Orders_tb Order_num, Cust_name, Emp_name, Item1_id, Item1_qty,Item1_price, Item2_id, Item2_qty, Item2_price
Integration Example (3)Northwind Semantic Name Mappings
Type Semantic Name System Name Type Semantic Name System NameT [Category] Categories T [Order] OrdersF [Category] Id CategoryID F [Order] Id OrderIDF [Category] Name CategoryName F [Order;Customer] Id CustomerIDT [Customer] Customers F [Order;Employee] Id EmployeeIDF [Customer] Id CustomerID F [Order] Date OrderDateF [Customer] Name CompanyName F [Order;Shipper] Id ShipviaT [Employee] Employees T [Product] ProductsF [Employee] Id EmployeeID F [Product] Id ProductIDF [Employee] Last Name LastName F [Product] Name ProductNameF [Employee] First Name FirstName F [Product;Supplier] Id SupplierIDT [Order;Product] OrderDetails F [Product;Category] Id CategoryIDF [Order] Id OrderID T [Shipper] ShippersF [Order;Product] Id ProductID F [Shipper] Id ShipperIDF [Order;Product] Price UnitPrice F [Shipper] Name ShipperNameF [Order;Product]
QuantityQuantity T [Supplier] Suppliers
F [Supplier] Id SupplierIDF [Supplier] Name SupplierName
Page 19
Page 20
Northwind & Southstorm Integration Example (4)
Southstorm Semantic Name Mappings
Type Semantic Name System Name
Table [Order] Orders_tb
Field [Order] Id Order_num
Field [Order;Customer] Name Cust_name
Table [Order;Employee] Name Emp_name
Field [Order;Product] Id Item1_id
Field [Order;Product] Quantity Item1_qty
Table [Order;Product] Price Item1_price
Field [Order;Product] Id Item2_id
Field [Order;Product] Quantity Item2_qty
Field [Order;Product] Price Item2_price
Page 21
What is a semantic name? A semantic name is a universal, semantic identifier
in a domain. Similar to a field name in the Universal Relation. Semantics are guaranteed unique by construction. System has mechanism for comparing semantics across
domains even though it does not understand them. (Exploiting semantics in English words.)
Important definitions: context - a semantic name is a context if it maps to a table concept - a semantic name is a concept if it maps to a field context closure - of semantic name Si denoted Si
* is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2 , … TN} starting with T1.
Page 22
Architecture Components:X-Specs Database metadata and semantic names are
combined into specifications called X-Specs: Stored and transmitted using XML Contains information on a relational schema Organized into database, table, and field levels Stores semantic names to describe and integrate
schema elements
Southstorm X-Spec<?xml version="1.0" ?><Schema name = "Southstorm_xspec.xml” xmlns="urn:schemas-microsoft-com:xml-data" xmlns:dt="urn:schemas-microsoft-com:datatypes">
<ElementType name="[Order]" sys_name = "Orders_tb" sys_type="Table"> <element type = "[Order] Id" sys_name = "Order_num" sys_type = "Field"/> <element type = "[Order] Total Amount" sys_name = "Order_total" sys_type = "Field"/> <element type = "[Order;Customer] Name" sys_name = "Cust_name" sys_type = "Field"/> <element type = "[Order;Customer;Address] Address Line 1" sys_name="Cust_address" sys_type="Field"/> <element type = "[Order;Customer;Address] City" sys_name = "Cust_city" sys_type = "Field"/> <element type = "[Order;Customer;Address] Postal Code" sys_name="Cust_pc" sys_type="Field"/> <element type = "[Order;Customer;Address] Country" sys_name="Cust_country" sys_type="Field"/> <element type = "[Order;Product] Id" sys_name = "Item1_id" sys_type = "Field"/> <element type = "[Order;Product] Quantity" sys_name = "Item1_quantity" sys_type = "Field"/> <element type = "[Order;Product] Price" sys_name = "Item1_price" sys_type = "Field"/> <element type = "[Order;Product] Id" sys_name = "Item2_id" sys_type = "Field"/> <element type = "[Order;Product] Quantity" sys_name = "Item2_quantity" sys_type = "Field"/> <element type = "[Order;Product] Price" sys_name = "Item2_price" sys_type = "Field"/></ElementType></Schema>
Page 23
Page 24
Architecture Components:Integrating X-Specs Each database to be integrated is described
using a X-Spec. Identical concepts in different databases are
identified by similar semantic names. Concepts with identical (or hierarchially related)
semantic names are combined regardless of their physical representation in the individual databases.
Page 25
Integration Architecture:The Integration Process Integration process involves:
Automatically identifying identical concepts by matching semantic names
Constructing a global view of database concepts consisting of a hierarchy of concept terms
Resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
Page 26
Integration Product:The Integrated Context View The product of the integration is a structurally-neutral
hierarchy of concepts called an integrated context view. Define a context view (CV) as follows:
If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV.
For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si.
A semantic name Si can only occur once in the CV.
A context view (CV) is a valid Universal Relation. Each field is assigned a semantic name which uniquely
identifies its semantic connotation.
Page 27
Northwind & Southstorm Integration Example
Integrated Context ViewIntegrated View
TermData Source Mappings
(not visible to user)Integrated View
TermData Source Mappings
(not visible to user)
V (view root) N/A V (view root) (cont.) N/A
- [Category] NW.Categories - [Order] NW.Orders, SS.Orders_tb
- Id NW.Categories.CategoryID -Id NW.[Orders,OrderDetails].OrderID, SS.Orders_tb.Order_num
- Name NW.Categories.CategoryName - [Customer]
- [Customer] NW.Customers - Id NW.Orders.CustomerID
- Id NW.Customers.CustomerID - Name SS.Orders_tb.Cust_name
- Name NW.Customers.CompanyName - [Employee]
- [Employee] NW.Employees - Id NW.Orders.EmployeeID
- Id NW.Employees.EmployeeID - Name SS.Orders_tb.Emp_name
- [Name] - [Product] NW.OrderDetails
- First Name NW.Employees.FirstName - Id NW.OrderDetails.ProductID, SS.Orders_tb.Item[1,2]_id
- Last Name NW.Employees.LastName - Price NW.OrderDetails.UnitPrice, SS.Orders_tb.Item[1,2]_price
- [Product] NW.Products - Quantity NW.OrderDetails.Quantity, SS.Orders_tb.Item[1,2]_qty
- Id NW.Products.PrdouctID - [Shipper] NW.Shippers
- Name NW.Products.ProductName - Id NW.Shippers.ShipperID
- [Supplier] - Name NW.Shippers.ShipperName
- Id NW.Products.SupplierID - [Supplier] NW.Suppliers
- [Category] - Id NW.Suppliers.SupplierID
- Id NW.Products.CategoryID - Name NW.Suppliers.SupplierName
Page 28
Architecture Components: The Query Processor The query processor:
Allows the user to formulate queries on the view. Translates from semantic names in the context view to
structural queries (SQL) on databases. Involves determining correct field and table mappings and
discovery of join conditions and join paths
Retrieves query results and formats them for display to the user.
Client-side query processing: Perform joins between databases using common keys. Data value formatting and transformation
Page 29
The Query Processor:Determining field/table mappings For each database (D) in the context view
For each semantic name (S) in query If S has only one semantic name mapping in D Then
Add field mapping to query and its parent table
Else If S has multiple mappings but all in one table Then Add each field mapping to query and the parent table
Else S has multiple mappings in more than one table Then If any field mapping has a table already in query take that one Else take field mapping with best semantic name match Else take first mapping found
End If
Next
Next
Page 30
The Query Processor:Constructing Join Graphs Given a set of fields (F) and tables (T) to access,
joins are applied to connect the tables. A join graph is an undirected graph where:
Each node Ni is a table in the database. There is a link from node Ni to node Nj if there is a join
between the two tables.
A join path is a sequence of joins connecting two nodes in the graph.
A join tree is a set of joins connecting two or more nodes.
A join matrix M stores the shortest join paths between any two nodes (tables).
Page 31
The Query Processor:Join Graph for Northwind
O rders
O rderD etails
Em p loyeesProducts
C ustom ers
C ategoriesProducts
Products
Suppliers
Shippers
1 N
N
1
1N
1N1 N
1
N
1
N
Products
Page 32
The Query Processor:Join Discovery Results Join discovery in a database with a connected,
acyclic join graph and a join matrix M: There exists only one join tree for any set of tables . The joins required to connect a table set T is found by
taking any Ti of T and unioning the join paths in M[Ni,N1], M[Ni,N2], ... M[Ni,Nn] where N1,N2,..Nn are the nodes corresponding to the set of tables T.
For a cyclic join graph: There may exist more than one join tree for a set of
tables and each tree may have different semantics. Can allow the user to uniquely determine join tree by
graphically displaying join conditions to the user as they browse the context view.
Page 33
Advanced Query Processing Advanced query processor features include:
global keys and joins - a mechanism for specifying when a field stores a global key such as a social security number.
result normalization - a procedure for normalizing query results returned from each individual database. (e.g. Southstorm)
data integration - transforming data representational conflicts at the global level. For example, “M” and “F” may represent “Male” and
“Female” in one database, and another may represent these concepts using “0” and “1”.
Page 34
Northwind & Southstorm Query Examples Example 1: Retrieve all order ids ([Order] Id) and
customers ([Customer] Name): SS: SELECT Order_num, Cust_name FROM Orders_tb NW: SELECT OrderID, CompanyName FROM Orders,
Customers WHERE Orders.CustomerID = Customers.CustomerID
Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids.
SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb NW: SELECT OrderID, ProductID FROM OrderDetails Note: In NW, selects from two different order id mappings. In
SS, result normalization is required.
Page 35
Integration Example:Discussion Important points:
System table and field names are not presented to the user who queries based on semantic names.
Database structure is not shown to the user. Field and table mappings are automatically
determined based on X-Spec information. Join conditions are inserted as needed when available
to join tables. Different physical representations for the same
concept are combined. Hierarchically related concepts are combined based
on their IS-A relationship in the standard dictionary.
Page 36
Unity Overview Unity is a software package that implements the
integration architecture with a GUI. Developed using Microsoft Visual C++ 6 and
Microsoft Foundation Classes (MFC). Unity allows the user to:
Construct and modify standard dictionaries Build X-Specs to describe data sources Integrate X-Specs into an integrated view Transparently query integrated systems using ODBC
and automatically generate SQL transactions
Unity is available for demonstration and distribution.
Page 41
Architecture Discussion The architecture automatically integrates relational
schemas into a multidatabase. Desirable properties:
Individual mappings - information sources integrated one-at-a-time and independently
Integrated view constructed for query transparency - user queries system by semantics instead of structure
Handles schema conflicts - including semantic, structural, and naming conflicts
Automated integration - integrated view constructed efficiently and automatically
No wrapper or mediator software is required Transparent querying - users issue semantic queries which
are translated to SQL by the query processor
Page 42
Contributions Architecture contributions:
Has an unique application of a standard dictionary which is not a knowledge base
Separates the capture and integration processes Allows transparent querying without structure Provides algorithms for dynamically extracting
database data (creating relevant views) Algorithms for mediation of global level conflicts
(global keys, normalization, etc.) Arguably simpler method for capturing data
semantics than using description logic An implementation, Unity, which demonstrates the
practical benefits of the architecture
Page 43
Conclusions Automatic database integration is possible by
using a standard term dictionary and defining semantic names for schema elements.
Integration of data sources has applications to the WWW and construction of data warehouses.
Users are able to transparently query integrated systems by concept instead of structure.
Page 44
Future Work The integration architecture is evolving with
standards on XML and captures metadata information in XML documents.
We are constantly refining Unity. Develop an integration component for a web browser
The query processor is being extended to resolve more complex queries and conflicts.
Test the system in large industrial projects. Allow distributed updates and global updates on
all databases.
Page 45
References Publications:
Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, January 2000.
Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000.
Integrating Relational Database Schemas using a Standardized Dictionary, To appear in SAC’2001 - ACM Symposium on Applied Computing, March, 2001.
Sponsors: NSERC, TRLabs
Further Information: http://www.cs.umanitoba.ca/~umlawren/
Recommended