Mohammed AlSolh Supervised By: Dr.Ghassan Qadah Survey On Temporal Data And Change Management in Data Warehouses

Survey On Temporal Data And Change Management in Data Warehouses


DESCRIPTION

A Survey on the following papers: [22] Felix Naumann & Stefano Rizzi 2013 – Fusion Cubes [30] MaSM: Efficient Online Updates in Data Warehouses [37] Managing Late Measurements In Data Warehouses (Matteo Golfarelli & Stefano Rizzi 2007) [38] A SchemaGuide for Accelerating the View Adaptation Process [35] Temporal Query Processing in Teradata [33] Toward Propagating the Evolution of Data Warehouse on Data Marts [31] Wrembel and Bebel (2007)


Page 1: Survey On Temporal Data And Change Management in Data Warehouses

Mohammed AlSolh
Supervised By: Dr. Ghassan Qadah

Survey On Temporal Data And Change Management in Data Warehouses

Page 2: Survey On Temporal Data And Change Management in Data Warehouses

Outline

1 Terminologies

2 Explore Methods

3 Conclusion

Page 3: Survey On Temporal Data And Change Management in Data Warehouses

1 Terminologies

Page 4: Survey On Temporal Data And Change Management in Data Warehouses

Data Warehouse & Data Marts

• A data warehouse contains data from several databases maintained by different business units, together with historical and summary information

• It is a database used for reporting and data analysis, where the data is arranged into hierarchical groups, often called dimensions, and represented as facts and aggregate facts

• Data warehouses can be subdivided into data marts. Data marts store subsets of data from a warehouse

Page 5: Survey On Temporal Data And Change Management in Data Warehouses

Temporal database

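A temporal database is a database with built-in support for handling data involving time. Temporal data is data that keeps track of changes over time.

It supports the following time attributes:

• Valid time: the period during which a fact is true in the real world

• Transaction time: the period during which a fact is recorded as current in the database

• Bitemporal data: combines both valid time and transaction time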
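As a rough illustration of bitemporal data, the sketch below stores both time periods on each row and answers the question "what was valid on a given day, as the database knew it on another day". The class name, the field names, the `FOREVER` sentinel and the sample rows are all assumptions made for this example, not part of any surveyed system.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # assumption: open-ended periods end at date.max


@dataclass(frozen=True)
class BitemporalRow:
    key: str
    value: str
    valid_from: date  # valid time: when the fact holds in the real world
    valid_to: date
    tx_from: date     # transaction time: when the row was current in the database
    tx_to: date


def as_of(rows, key, valid_on, known_on):
    """Return the value valid on `valid_on` as the database knew it on `known_on`."""
    for r in rows:
        if (r.key == key
                and r.valid_from <= valid_on < r.valid_to
                and r.tx_from <= known_on < r.tx_to):
            return r.value
    return None


rows = [
    # Recorded on 2013-01-10: the customer lives in Dubai since 2013-01-01.
    BitemporalRow("c1", "Dubai", date(2013, 1, 1), FOREVER,
                  date(2013, 1, 10), date(2013, 3, 5)),
    # Correction recorded on 2013-03-05: the city was actually Sharjah.
    BitemporalRow("c1", "Sharjah", date(2013, 1, 1), FOREVER,
                  date(2013, 3, 5), FOREVER),
]
```

Asking for the same valid date with two different `known_on` dates returns the value before and after the correction, which is the distinction that transaction time adds on top of valid time.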

Page 6: Survey On Temporal Data And Change Management in Data Warehouses

Multidimensional data model

The multidimensional data model is designed to answer complex queries in real time. It produces a cube, which is like a 3D spreadsheet. It represents its information as dimensions and facts, which are usually maintained in a star schema.

Page 7: Survey On Temporal Data And Change Management in Data Warehouses

Multidimensional Model Terms

• Fact – A business performance measurement, typically numeric and additive

• Dimension – An object that includes attributes allowing the user to explore the measures from different perspectives of analysis

• Measures – The numeric values stored in the fact table

• Hierarchy – A collection of levels within a dimension that form a hierarchy, such as country/state/city

• Property – An additional descriptive attribute of a dimension

Page 8: Survey On Temporal Data And Change Management in Data Warehouses

Multidimensional Model Operations

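• Drill-down/Drill-up: drill-down moves from a summary category to its individual categories; drill-up (roll-up) moves from individual categories back to the summary

• Pivot: in the example, cities were laid out vertically and products horizontally; after the operation the two dimensions have swapped places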
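The operations on this slide can be sketched on a tiny cube held in a dictionary: aggregating from the city level to the country level, and swapping the row and column dimensions. The sample figures, the city-to-country mapping and the function names are made up for illustration.

```python
from collections import defaultdict

# (city, product) -> units sold; all values are made-up sample data
sales = {
    ("Dubai", "Laptop"): 5, ("Dubai", "Phone"): 7,
    ("Sharjah", "Laptop"): 2, ("Sharjah", "Phone"): 3,
    ("Amman", "Laptop"): 4, ("Amman", "Phone"): 1,
}
city_to_country = {"Dubai": "UAE", "Sharjah": "UAE", "Amman": "Jordan"}


def drill_up(cells, mapping):
    """Aggregate the first dimension one level up (city -> country)."""
    out = defaultdict(int)
    for (city, product), units in cells.items():
        out[(mapping[city], product)] += units
    return dict(out)


def pivot(cells):
    """Swap the two dimensions: rows become columns and vice versa."""
    return {(col, row): value for (row, col), value in cells.items()}


by_country = drill_up(sales, city_to_country)
by_product_first = pivot(sales)
```

Drilling down is simply going back to the finer-grained `sales` cells, which is why the detailed data has to be kept alongside the aggregates.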

Page 9: Survey On Temporal Data And Change Management in Data Warehouses

Schema & Data Changes

• Database schema changes

– With loss of data, by simply changing the schema

– Without loss of data, where the data is evolved while keeping the attributes of the old schema (evolution)

– Without loss of data, by changing schemas while keeping the old versions of the schemas (versioning)

• Data changes in warehouses

– Transient data: deletions and updates are applied without maintaining the old data

– Periodic data: deletions and updates are handled by adding new records

– Semi-periodic data: the same as periodic, but only a recent collection of changes is kept

– Snapshots of the complete data, which are popular in data marts

Page 10: Survey On Temporal Data And Change Management in Data Warehouses

Materialized Views

• View maintenance

– The process of updating a materialized view in response to changes to the underlying data

• View adaptation

– View adaptation aims to leverage the previously materialized view to generate the new view, since rebuilding the materialized view from scratch may be expensive

• Materialized view

– A stored view query result, which works like a cache

– Used by the query optimizer to speed up querying
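To make view maintenance concrete, here is a minimal sketch of a materialized `SUM ... GROUP BY` result that is updated incrementally when base rows are inserted or deleted, instead of being recomputed. The class and method names are hypothetical; real systems handle many more view shapes.

```python
from collections import defaultdict


class SumByKeyView:
    """Materialized SUM(amount) GROUP BY key over base rows of (key, amount)."""

    def __init__(self, base_rows):
        # Initial materialization: one full pass over the base data.
        self.totals = defaultdict(int)
        for key, amount in base_rows:
            self.totals[key] += amount

    def apply_insert(self, key, amount):
        # Incremental maintenance: touch only the affected group.
        self.totals[key] += amount

    def apply_delete(self, key, amount):
        self.totals[key] -= amount


def recompute(base_rows):
    """Rebuild the view from scratch, used here only to check the result."""
    return dict(SumByKeyView(base_rows).totals)
```

The incremental methods cost one dictionary update per changed row, while `recompute` scans the whole base table; that gap is the motivation for both view maintenance and view adaptation.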

Page 11: Survey On Temporal Data And Change Management in Data Warehouses

2 Explore Methods

Page 12: Survey On Temporal Data And Change Management in Data Warehouses

View Adaptation

Page 13: Survey On Temporal Data And Change Management in Data Warehouses

A SchemaGuide for Accelerating the View Adaptation Process

• An efficient process for view adaptation in XML databases, built upon a fragment-based view representation: the materialized data is segmented into fragments, and algorithms update only those materialized fragments that have been affected by the view definition changes

– Their adaptation process

• Call an optimized containment check for the most suitable fragment that contains the requested fragment

• Adapt the XFM structure to the fragments found

• Find a materialized fragment that is affected by the change

• Search for existing materialized fragments that can be reused and mapped to the affected materialized fragment

– The authors report a significant improvement over recomposing the materialized view (the reported figure is up to 2.6%)

Page 14: Survey On Temporal Data And Change Management in Data Warehouses

Keeping the data warehouse in sync with the sources

Page 15: Survey On Temporal Data And Change Management in Data Warehouses

Multi-version Data Warehouse

1. Automatic detection of structural and content changes in the data sources, and their reflection in the data warehouse by keeping a sequence of persistent versions

• Their solution supports

– monitoring external data sources with respect to content and structural changes

– automatic generation of processes monitoring external data sources

– applying discovered external data source changes to a selected DW version

– describing the structure of every DW version

– querying multiple DW versions at the same time and presenting the results coming from multiple versions

– visualizing the schema

Page 16: Survey On Temporal Data And Change Management in Data Warehouses

MaSM (Materialized Sort Merge)

2. Efficient Online Updates in Data Warehouses

• An approach for supporting online updates by making use of SSDs to cache incoming updates

• Models the problem of query processing with differential updates as a type of outer join between the data residing on disks and the updates residing on SSDs

• Presents algorithms for performing such joins and periodic migrations; for example, the updates are migrated to disks only when the system load is low or when the updates reach a certain threshold (e.g., 90%) of the SSD size
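The idea of joining disk-resident data with cached updates at query time can be sketched as a merge of two key-sorted streams. This is a simplified illustration and not the MaSM algorithm itself: it assumes at most one pending update per key, and the operation names, function names and migration rule parameters are chosen for this example.

```python
def merge_scan(main_rows, updates):
    """Merge sorted (key, value) rows with sorted (key, op, value) updates.

    op is "insert", "modify" or "delete". Yields the up-to-date rows in key
    order, much like an outer join between the main data and the update cache.
    """
    i = j = 0
    while i < len(main_rows) or j < len(updates):
        if j == len(updates) or (i < len(main_rows)
                                 and main_rows[i][0] < updates[j][0]):
            yield main_rows[i]              # row has no pending update
            i += 1
        elif i == len(main_rows) or updates[j][0] < main_rows[i][0]:
            key, op, value = updates[j]     # update has no matching row
            if op == "insert":
                yield (key, value)
            j += 1
        else:                               # same key on both sides
            key, op, value = updates[j]
            if op != "delete":
                yield (key, value)          # the cached value wins
            i += 1
            j += 1


def should_migrate(cached_bytes, ssd_capacity, load_is_low, threshold=0.9):
    """Migrate when the system is idle or the cache passes the threshold."""
    return load_is_low or cached_bytes / ssd_capacity >= threshold
```

Because both inputs are sorted on the same key, the merge needs only one sequential pass over each, so queries see fresh data without the updates having been written into the main store yet.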

Page 17: Survey On Temporal Data And Change Management in Data Warehouses

Data changes in the data mart

Page 18: Survey On Temporal Data And Change Management in Data Warehouses

Data changes in the data mart

• Changes can first be applied in the warehouse and then in the data marts under it

• Changes in the data mart are divided into

– dimensional data changes

– factual data changes

– schema changes

Page 19: Survey On Temporal Data And Change Management in Data Warehouses

Dimensional Data Changes

• These are changes in a hierarchy; the changed element can be a dimension, a level or a property

• Kimball proposes three solutions to changes in ROLAP multidimensional models

Page 20: Survey On Temporal Data And Change Management in Data Warehouses

ROLAP multidimensional models

• In the Type I solution, he simply proposes to overwrite old tuples in dimension tables with new data. This keeps the data mart up to date, but changes cannot be tracked

• In the Type II solution, each change produces a new record in the dimension table. Surrogate keys must be used; the history of changes is kept alongside the new data

• The Type III solution is based on augmenting the schema of the dimension table so that it represents both the current and the previous value for each level or attribute subject to change

• Type VI (I+II+III) keeps the complete history. The more data is kept in the hierarchy, the more expensive it becomes, and additional timestamps must be kept
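The first three solutions can be sketched on a dimension table held as a list of dictionaries. The column names (`surrogate_key`, `natural_key`, `is_current`, the `previous_` prefix) are assumptions made for this example; they are one common way to lay such tables out, not something prescribed by the slide.

```python
def scd_type1(dim, key, attr, new_value):
    """Type I: overwrite in place, so the old value is lost."""
    for row in dim:
        if row["natural_key"] == key:
            row[attr] = new_value


def scd_type2(dim, key, attr, new_value):
    """Type II: retire the current row and append a new one with a new surrogate key."""
    current = next(r for r in dim
                   if r["natural_key"] == key and r["is_current"])
    current["is_current"] = False
    next_key = max(r["surrogate_key"] for r in dim) + 1
    new_row = dict(current, surrogate_key=next_key, is_current=True)
    new_row[attr] = new_value
    dim.append(new_row)


def scd_type3(dim, key, attr, new_value):
    """Type III: keep the previous value in an extra column."""
    for row in dim:
        if row["natural_key"] == key:
            row["previous_" + attr] = row[attr]
            row[attr] = new_value


def make_dim():
    """Small sample dimension used to try the three functions."""
    return [{"surrogate_key": 1, "natural_key": "C1",
             "city": "Dubai", "is_current": True}]
```

Facts loaded before a Type II change keep pointing at the old surrogate key, which is why surrogate keys are required for that solution and why it is the only one of the three that preserves the full history.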

Page 21: Survey On Temporal Data And Change Management in Data Warehouses

Changes in Factual data

• Such changes happen, for example, when measurements contain errors, such as sea levels that were captured incorrectly and fixed later

• Facts are classified by their conceptual role into flow facts and stock facts

Page 22: Survey On Temporal Data And Change Management in Data Warehouses

Managing Late Measurements In Data Warehouses

• A proposal to couple valid time and transaction time, distinguishing two different solutions for managing late measurements

• Flow model - delta solution: each new measurement for an event is represented as a delta (current registration minus previous registration) with respect to the previous measurement. Transaction time is modeled by adding to the schema a new temporal dimension that represents when each registration was made in the data mart. Queries are answered by summing all registrations for each event; historical queries are answered by selectively summing the registrations made for an event up to the time queried

• Stock model - consolidated solution: late measurements are represented by recording the consolidated value for the event. Transaction time is modeled by two temporal dimensions (two timestamps) that delimit the time interval during which each registration is current, that is, its currency interval
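The two solutions can be compared on the same sequence of corrections. In this sketch transaction times are plain integers (day numbers), and the tuple layouts, function names and sample values are assumptions made for illustration.

```python
OPEN = float("inf")  # assumption: marks a registration that is still current

# Delta solution: (event, transaction_time, delta)
delta_regs = [("e1", 1, 100), ("e1", 5, -20), ("e1", 9, 5)]

# Consolidated solution: (event, value, current_from, current_to)
cons_regs = [("e1", 100, 1, 5), ("e1", 80, 5, 9), ("e1", 85, 9, OPEN)]


def delta_value(regs, event, as_of=None):
    """Sum all deltas for the event, or only those registered up to `as_of`."""
    return sum(d for (e, tx, d) in regs
               if e == event and (as_of is None or tx <= as_of))


def consolidated_value(regs, event, as_of=None):
    """Pick the registration that is current now, or was current at `as_of`."""
    for e, value, tx_from, tx_to in regs:
        if e != event:
            continue
        if as_of is None:
            if tx_to == OPEN:
                return value
        elif tx_from <= as_of < tx_to:
            return value
    return None
```

Both layouts answer the same up-to-date and historical questions; the delta form gets there by aggregation, which suits additive flow measures, while the consolidated form gets there by selection, which suits stock measures that should not be summed over time.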

Page 23: Survey On Temporal Data And Change Management in Data Warehouses

Schema changes in the data Mart

Page 24: Survey On Temporal Data And Change Management in Data Warehouses

Schema changes in the data mart

• Two approaches are followed for handling schema changes:

– Schema evolution: old information is maintained without data loss, but the old schema is lost

– Schema versioning: separate versions are stored, and the user can access different schema versions

Page 25: Survey On Temporal Data And Change Management in Data Warehouses

Schema Evolution

• Propagating the Evolution of Data Warehouse on Data Marts

• Operators to support changing the data mart schema

– Evolution operators for the data warehouse

• Basic operations and composite operations

– Evolution operators for the data mart

• Mapping function

– A set of rules for the evolution

Page 26: Survey On Temporal Data And Change Management in Data Warehouses

Propagating the Evolution of Data Warehouse on Data Marts

• The mapping functions are embedded into the Extract-Transform-Load (ETL) process from the data warehouse to the data mart. Examples of these functions:

• Fact(Table): returns a set of facts from the data warehouse tables

• Dim(Table): returns a superset of dimensions from the data warehouse table; each superset contains all dimensions of the data mart cube

Page 27: Survey On Temporal Data And Change Management in Data Warehouses

Propagating the Evolution of Data Warehouse on Data Marts

Propagation operations

• Add_Dim(Dname, Fi, T): adds a new dimension named “Dname” to the data mart fact “Fi”. It takes the primary key of table T from the data warehouse and a subset of the textual attributes contained in T

• Add_Fact(Fname, T, set(D)): adds a new fact “Fname” with dimensions set(D); the fact measures are the numeric attributes of T

Page 28: Survey On Temporal Data And Change Management in Data Warehouses

Propagating the Evolution of Data Warehouse on Data Marts

A set of rules applies to the data warehouse to data mart mapping process, such as:

• If a table T to be added to the data warehouse has foreign keys in another existing data warehouse table that concerns a fact, T adds a new dimension to that fact, and the attributes of T become the attributes of the dimension

• If T does not have foreign keys in another table in the data warehouse, has foreign keys pointing to other tables that load dimensions in the data mart, and has numeric attributes, then T will probably create a new fact
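The two rules above can be read as a small decision procedure that proposes a propagation operation for a newly added table. The sketch below is one possible reading under stated assumptions: the `Table` class, the `D_` and `F_` naming scheme and the way foreign-key information is passed in are all hypothetical, and only the operation names Add_Dim and Add_Fact come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    name: str
    numeric_attrs: list = field(default_factory=list)
    foreign_keys_to: list = field(default_factory=list)  # tables T points to


def propose_operations(t, fact_tables_referencing_t, dimension_sources):
    """Return proposed operations as tuples shaped like the slide signatures."""
    # Rule 1: T is referenced from a table that feeds a fact -> new dimension.
    if fact_tables_referencing_t:
        return [("Add_Dim", "D_" + t.name, fact, t.name)
                for fact in fact_tables_referencing_t]
    # Rule 2: T is not referenced, points to dimension sources and has
    # numeric attributes -> probably a new fact.
    dims = [d for d in t.foreign_keys_to if d in dimension_sources]
    if dims and t.numeric_attrs:
        return [("Add_Fact", "F_" + t.name, t.name, tuple(dims))]
    return []
```

A table that matches neither rule yields no proposal, which mirrors the tentative wording of the second rule ("will probably create a new fact"): the output is a suggestion for the designer rather than an automatic schema change.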

Note: Commercially, SQL Compare and Oracle Change Management Pack support evolving schemas: they compare schemas and generate scripts

Page 29: Survey On Temporal Data And Change Management in Data Warehouses

Schema Versioning

• Decision makers may have based their decisions on an old schema, and changes (possibly including measure changes) may have appeared after they executed their queries. To run the same query again and produce the same result, non-volatility is required. With changes at the schema level, a versioning approach is therefore needed

Page 30: Survey On Temporal Data And Change Management in Data Warehouses

Schema Versioning

• A comprehensive approach to versioning is presented in the multiversion data warehouse [31]. The authors propose two metamodels: one for managing a multi-version data mart and one for detecting changes in the operational sources. Along with “real” versions, which are the versions used in the application domain, they introduce “alternative” versions, which are used for simulating and managing hypothetical business scenarios in what-if analysis settings

• Commercially, several database management systems (DBMSs) offer support for valid and transaction time: Oracle 11g, IBM DB2 10 for z/OS, and Teradata Database 14. Part 2 (SQL/Foundation) of SQL:2011 was just released

Page 31: Survey On Temporal Data And Change Management in Data Warehouses

Querying temporal data

• Cross-version querying

– Multiversion Data Warehouse: allows users either to specify a time interval for the data warehouse or to specify the versions to query

• Temporal querying

– Temporal queries on Teradata

• Native temporal implementation

• Rewriting approach

Page 32: Survey On Temporal Data And Change Management in Data Warehouses

Querying temporal data

Disadvantages of the native approach

– Since temporal data is stored in a new data type, the SQL execution code needs to be modified for joins and aggregation on temporal data

– Query optimization needs to be adapted to the new temporal tables

– Some duplication might occur in the code of DBMS functions in order to support temporal data

Page 33: Survey On Temporal Data And Change Management in Data Warehouses

Rewrite approach

– There is no impact on the execution code

– There is a small impact on the optimizer

– No duplication, as it adds a step before the query optimizer

– But it adds complexity to the query structure

Page 34: Survey On Temporal Data And Change Management in Data Warehouses

Rewrite approach

– Rewrites modify projection, selection and join

• Select *, for example, will exclude the time dimension if the qualifier is CURRENT

• For the CURRENT and SEQUENCED qualifiers, the corresponding time predicate is added

• For joins, temporal qualifiers are applied before the join. This works directly for inner joins; for outer joins the qualifiers must be applied to each table separately, and the join is then performed on the derived tables

– Their study showed that rewrites added only 5% to the execution time of the query in comparison to the native implementation
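A rewrite for the CURRENT qualifier can be sketched as plain SQL generation: the projection lists only the non-temporal columns and a time predicate is added to the selection. The example runs on SQLite purely for illustration; the table, the `valid_from`/`valid_to` columns and the `rewrite_current` helper are assumptions and do not reproduce Teradata's actual rewrite rules.

```python
import sqlite3


def rewrite_current(table, value_columns):
    """Rewrite a 'current' SELECT * into ordinary SQL over period columns."""
    cols = ", ".join(value_columns)  # projection: temporal columns are excluded
    return (f"SELECT {cols} FROM {table} "
            "WHERE valid_from <= :today AND :today < valid_to")  # time predicate


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy "
             "(id INTEGER, premium INTEGER, valid_from TEXT, valid_to TEXT)")
conn.executemany("INSERT INTO policy VALUES (?, ?, ?, ?)", [
    (1, 100, "2012-01-01", "2013-01-01"),   # superseded version
    (1, 120, "2013-01-01", "9999-12-31"),   # version that is still valid
])

sql = rewrite_current("policy", ["id", "premium"])
current_rows = conn.execute(sql, {"today": "2013-06-01"}).fetchall()
```

Since the rewritten statement is ordinary SQL, the existing execution engine can run it unchanged, which is the main argument for the rewrite approach; the price is a more complex query text, as the outer-join case shows.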

Page 35: Survey On Temporal Data And Change Management in Data Warehouses

Fusion Cubes

Page 36: Survey On Temporal Data And Change Management in Data Warehouses

Fusion Cubes

• A framework to support self-service business intelligence in multidimensional cubes that can be dynamically extended both in their schema and their instances

• it can include both stationary and situational data

Page 37: Survey On Temporal Data And Change Management in Data Warehouses

Situational Query

• A user poses an OLAP-like situational query, one that cannot be answered on stationary data only

• The system discovers potentially relevant data sources

• The system fetches relevant situational data from selected sources

• The system integrates situational data with the user’s data, if any

• The system visualizes the results and the user employs them for making her decision

• The user stores and shares the results

Page 38: Survey On Temporal Data And Change Management in Data Warehouses

Situational Query

Page 39: Survey On Temporal Data And Change Management in Data Warehouses

fusion cube architecture

Page 40: Survey On Temporal Data And Change Management in Data Warehouses

Situational Query

• Integration of external data sources can involve

– Data in RDF format, to be integrated into the data warehouse on the fly

– Social networks; there is an implementation called MicroStrategy that analyses social networks

– Blogs, where we can do “opinion mining”; integrating them is challenging because the data is unstructured

Page 41: Survey On Temporal Data And Change Management in Data Warehouses

Drill Beyond

• The Drill-Beyond operator can go beyond:

– The schema: a user can click on a dimension or a fact that is not available

– The instances: a user can request a new instance for an attribute, such as a new country, so that its values will be retrieved

• Query formulation can involve different technologies for situational data:

– SPARQL for querying RDF data

– Web APIs that provide data in XML or JSON format

Page 42: Survey On Temporal Data And Change Management in Data Warehouses

Integration

• Once the data is available, it has to be integrated with the stationary data to formulate the fusion cubes

– Extract the structure of the different situational data

• Google Fusion Tables (Gonzalez et al., 2010) offers cloud-based storage of basic relational tables that can be shared with others, annotated and fused with other data; this helps in extracting relations from unstructured data

– Map the schemas of the data sources

• XFM, for example

– Reconcile with stationary data to formulate the fusion cube

• Google Refine (code.google.com/p/google-refine/), which provides functionality for working with messy data: cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases

Page 43: Survey On Temporal Data And Change Management in Data Warehouses

Support On Commercial Systems

• There are many commercial systems available to support the fusion cubes:

• illo system (illo.com) allows non-business users to store small cubes in a cloud-based environment, analyze the data using powerful mechanisms, and provide visual analysis results that can be shared with others and annotated

• The Stratosphere project (www.stratosphere.eu), which explores the power of massively parallel computing for Big Data analytics

Page 44: Survey On Temporal Data And Change Management in Data Warehouses

Support On Commercial Systems

Page 45: Survey On Temporal Data And Change Management in Data Warehouses

3 Conclusion

Page 46: Survey On Temporal Data And Change Management in Data Warehouses

Conclusion

• We have explored different methodologies for handling temporal data and changes of schema, factual data and dimensional data in data warehouses. Researchers are coming up with bright ideas that help in different scenarios to facilitate dynamic features in a data warehouse and speed up its query processing. We encourage commercial systems to look into these new methodologies and make them available to the public for practical use

Page 47: Survey On Temporal Data And Change Management in Data Warehouses

Conclusion

Methods compared:

• ROLAP multidimensional models

• A SchemaGuide for Accelerating the View Adaptation Process

• Multi-version Data Warehouse

• MaSM (Materialized Sort Merge)

• Managing Late Measurements In Data Warehouses

• Propagating the Evolution of Data Warehouse on Data Marts

• Temporal Queries On Teradata

• Fusion Cubes

Features covered (YES marks per feature):

• View adaptation: YES YES

• DW in sync with the sources: YES YES YES

• Changes in the data mart

– Dimensional data changes: YES YES

– Factual data changes: YES YES

– Schema changes, schema evolution: YES YES

– Schema changes, schema versioning: YES

• Software prototype: YES YES YES

Page 48: Survey On Temporal Data And Change Management in Data Warehouses

Issues & Future Work

• There is a need to formulate queries that can span different database schemas and produce results at once. We suggest an approach that uses an extra attribute in the data warehouse to store heterogeneous and non-normalized data in an XML format, which is highly extensible

• The data warehouse design needs to allow dynamic schema updates without physically modifying or replicating the current store. As above, an XML extension can help in storing attributes that change frequently. Although this might be a drawback for performance, fixed dimensions and properties can be identified, structured and queried appropriately; in a second phase of query processing, the XML data can then be explored and embedded in the multidimensional data form

• Queries that span different versions in a version-based schema can be time consuming. Versioning could be restructured so that unchanged values are stored in a different version than the changed values; this can significantly reduce the space and time requirements and make querying more feasible

Page 49: Survey On Temporal Data And Change Management in Data Warehouses

Main References

• [22] Felix Naumann & Stefano Rizzi 2013 – Fusion Cubes

• [30] MaSM: Efficient Online Updates in Data Warehouses http://www.cs.cmu.edu/~chensm/papers/MaSM-sigmod11.pdf

• [37] Managing Late Measurements In Data Warehouses (Matteo Golfarelli & Stefano Rizzi 2007)

• [38] A SchemaGuide for Accelerating the View Adaptation Process (Jun Liu, Mark Roantree, and Zohra Bellahsene 2010)

• [35] Temporal Query Processing in Teradata (Mohammed Al-Kateb, Ahmad Ghazal, Alain Crolotte 2013)

• [33] Toward Propagating the Evolution of Data Warehouse on Data Marts (Saïd Taktak and Jamel Feki, 2012)

• [31] Wrembel and Bebel (2007)

Page 50: Survey On Temporal Data And Change Management in Data Warehouses

Supporting References

• [1] Ramakrishnan (DBMS 3rd ed) Chapter 25

• [2] http://en.wikipedia.org/wiki/Data_warehouse

• [3] Introduction to Information Systems (Marakas & O'Brien 2009)

• [4] Kimball, The Data Warehouse Toolkit 2nd Ed (2002) Chapter 1

• [5] http://en.wikipedia.org/wiki/Temporal_database

• [6] http://www.olapcouncil.org/research/glossaryly.htm

• [7] http://en.wikipedia.org/wiki/OLAP_cube

• [8] http://docs.oracle.com/cd/B12037_01/olap.101/b10333/multimodel.htm

• [9] Multidimensional Database Technology: Bach Pedersen, Torben; S. Jensen, Christian (December 2001)

• [10] TSQL2 Language Specification https://cs.arizona.edu/~rts/initiatives/tsql2/finalspec.pdf

• [11] Sybase Infocenter http://infocenter.sybase.com/help/index.jsp?topic=/com.sybase.infocenter.dc00269.1571/doc/html/bde1279401694270.html

• [12] (Roddick, 1995)

• [13] (Grandi, 2002)

• [15] (Devlin, 1997)

• [16] "Information technology -- Database languages -- SQL -- Part 2: Foundation (SQL/Foundation)," International Standards Organization, December 2011

• [18] Gupta, Maintenance of Materialized Views: Problems, Techniques, and Applications

• [20] De Amo & Halfeld Ferrari Alves (2000)

• [21] Avoiding re-computation: View adaptation in data warehouses (1997) M Mohania

• [23] http://en.wikipedia.org/wiki/Resource_Description_Framework

Page 51: Survey On Temporal Data And Change Management in Data Warehouses

Questions?

Page 52: Survey On Temporal Data And Change Management in Data Warehouses

Thank You!