37
Copyright, Harris Corporation & Ophir Frieder, 1998 1 Very Large Databases And Data Warehousing

Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Embed Size (px)

Citation preview

Page 1: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19981

Very Large DatabasesAnd

Data Warehousing

Page 2: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19982

Objectives

• Learn how space is used within a database.

• Learn about different types of database usage patterns.

• Define a variety of terms and phrases commonly used in the context of Very Large Databases (VLDBs).

• Discuss vendor capabilities with regard to VLDBs.

Page 3: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19983

Space Usage

• Consider the following all too typical sentences:

“Our system has been successfully demonstrated on database systems as large as 1TB”

“We at <XYZ company> have extensive experience with very large databases, i.e., greater than 500GB.”

“The <XYZ system> shall be capable of executing insert queries into a database of size at least 10GB within 1.5 seconds.”

Page 4: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19984

Space Usage, Cont.

• Such sentences are imprecise and misleading for a variety of reasons.

• What do the following sentence fragments mean?“...database systems as large as 1TB.”

“...with very large databases, i.e., greater than 500GB.?

“...database of size at least 10GB.”

Page 5: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19985

Space Usage, Cont.

• When interpreting such numbers, the following must be considered:

Platform Size vs. Data Size: It is typical for the supporting platform to have significantly more storage space than that required by the actual data set (5x or more). The extra space is used for a variety of reasons; mirroring, logs, temporary storage, or performance.

Index Size: An index on a table may take considerable space, even more than the table, depending on characteristics of the index; number and types of columns, and padding in index pages.

Page 6: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19986

Space Usage, Cont.

Use of Extra Space: Padding disk pages can reduce lock contention, thereby improving performance. Joins and sorts typically require “large” temporary storage areas.

Redundant Data: De-normalizing (for performance) will typically introduce redundant data. Mirroring data (for integrity) will duplicate data.

Page 7: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19987

Space Usage

• Note that all of these are all legitimate uses of space.

• However, frequently it is typical to try and get a measure of the “core” data size, which would exclude indices, free space, and redundancy.

• Lesson => define your terms, and ask questions!

Page 8: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19988

Database Types& Usage Patterns

– On-Line Transaction Processing (OLTP)

– Decision Support System (DSS)

– Data Warehousing

– Data Mart

– On-Line Analytical Processing (OLAP)

– Data Mining

– Very Large Databases (VLDBs)

Page 9: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 19989

OLTP & DSS

On-Line Transaction Processing: OLTP applications are update-intensive and generally consist of shorter transactions that access a small portion of a database, often through a primary key or index.

Decision Support System: DSS applications typically consist of long and often complex read-only queries that access large portions of the database.

– Transaction Processing Performance Council

Page 10: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199810

Data Warehousing

A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision needs. Data entering a data warehouse comes from an operational environment in almost every case. The data warehouse is always a physically separate store of data transformed from the application data found in the operational environment. In addition, there are four levels of data in the warehouse: old detail, current detail, lightly summarized data, and highly summarized data.

– W.H. Inman, the “father” of the data warehousing concept

Page 11: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199811

Data Warehousing, Cont.

Data Warehouse

Source Source Source

Integration

Source

DSS User DSS User

Page 12: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199812

Data Warehousing, Cont.

• Purpose:– To separate OLTP and DSS operations into different systems, thereby

allowing each to be tuned appropriately.

• Data Sources:– A data warehouse typically has several sources of data, often times OLTP

systems.

– Sources do not have to be homogenous (different/same data, different/same formats, etc.)

Page 13: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199813

Data Warehousing, Cont.

• Data Integration:

– Integration is done periodically, not continuously.

– Data is filtered; not all data is used by the DSS analyst.

– Data is put into a consistent format.

– Data is often times simply appended.

– Data is typically given a time-stamp; when the data was integrated, or when it was valid.

– Once in the warehouse, data is typically not modified.

Page 14: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199814

Data Warehousing, Cont.

• Internal Components:– Current Detail Data: The most recently made copy of data from the

operational environment (may in fact be relatively old).

– Old Detail Data: As new data is copied to the current detail data segment, the previous contents of the current detail data are copied to this segment. Perhaps on a cheaper medium.

– Lightly Summarized Data: In anticipation of queries by DSS analysts, current detail data is summarized (e.g., regional sales by week).

– Highly Summarized Data: A further refinement of lightly summarized data (e.g., national sales by month).

• Note that these are basically forms of de-normalization.

Page 15: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199815

Data Warehousing, Cont.

• Subject Orientation:

– Many operational systems are structured around functions or transactions.

– A system for bank tellers might be structured around major functions such as opening & closing accounts, granting loans, adding & deleting customers, withdrawing and depositing money, etc.

– A system for bank managers might be structured around assets and liabilities; accounts, loans, investments, customers.

Page 16: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199816

Data Mart

A data mart is a body of DSS data for a department that has an architectural foundation of a data warehouse.

– W.H. Inman

Page 17: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199817

Data Mart, Cont.

• Purpose:– To eliminate many of the problems resulting from the size, and from the

number of uses of many data warehouses.

• Data Sources:– A data warehouse is typically the source of data for a data mart.

• Data Integration:– Data is filtered, customized, and refined for use by a specific

“department.”

Page 18: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199818

Data Mart, Cont.

• Relative To A Data Warehouse, A Data Mart Is Typically:– smaller

– has fewer users

– easier to manipulate and analyze

Page 19: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199819

On-Line Analytical Processing

On-Line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access to a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user.

– The OLAP Council

Page 20: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199820

On-Line AnalyticalProcessing, Cont.

• OLAP Functionality Is Characterized By:– Multi-dimensional analysis of data.

– Supports end-user analytical and navigational activities.

– Calculations and modeling applied across dimensions, through hierarchies and/or across members.

– Trend analysis over sequential time periods.

– Analysis of historical and projected data in various “what-if” scenarios.

– Implies, at least to a certain extent, a large or very large repository that has been configured for DSS applications.

Page 21: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199821

Data Mining

Historically, also known as Knowledge Discovery in Databases (KDD), which is defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

– Frawley, Piatetsk-Shapiro, and Matheus

Page 22: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199822

Data Mining, Cont.

Given a set of facts (data) F, a language L, and some measure of certainty C, we define a pattern as a statement S in L that describes relationships among a subset Fs of F with a certainty c, such that S is simpler (in some sense) than the enumeration of all facts in Fs. A pattern that is interesting (according to a user-imposed interest measure) and certain enough (again according to the user’s criteria) is called knowledge. The output of a program that monitors the set of facts in a database and produces patterns in this sense is discovered knowledge.

– Frawley, Piatetsk-Shapiro, and Matheus

Page 23: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199823

Data Mining, Cont.

• Numerous definitions exist, most share the following similarities:– Data is too large to manually analyze; typically data is in a warehouse or

similar large repository.

– Data is being analyzed in a DSS or OLAP manner.

– The goal of data mining is to extract rules or patterns that are easily understood by a person.

– Patterns are deduced from data, and then presented to a user for evaluation.

– Assumes that specialized tools are used to extract the patterns.

Page 24: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199824

Very Large Databases(VLDBs)

• Question:– What constitutes a very large database?

– For that matter, what constitutes a small, medium or large database?

• Surveys And Studies:– The Gartner Group (‘96)

– Transaction Processing Performance Council (TPC)

– Winter Corporation

Page 25: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199825

The Gartner Group

– Oracle Magazine, November/December 1996 (sizes given in gigabytes).

OLTP Data WarehouseSmall 0-10 0-30

Medium 10-50 30-100Large 50-200 100-500

Very Large 200+ 500+

Page 26: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199826

Winter CorporationSurvey - 1998

• OLTP Systems:

– IBM S/390 running DB2 dominates in database size, most rows/records, and, to a lesser extend, peak on-line activity for OLTP systems.

– The largest OLTP database, which is 16.8TB in total size (11.3TB for data and 5.5TB for indices), belongs to United Parcel Service (UPS). This system resides on an IBM S/390 running DB2.

– Compared with other environments, OLTP systems in Unix environments were significantly smaller, the largest being 600GB.

Page 27: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199827

Winter CorporationSurvey - 1998, Cont.

• DSS Systems:

– The largest database, in terms of both total database size and number of rows/records, belongs to The Dialog Corporation. This system resides on Hitachi and Sun platforms running a proprietary database system. The database is 6.3TB in size and has 1.5 trillion rows/records.

– For DSS databases, eleven of the top ten largest were in Unix environments.

– The database with the top peak on-line activity is owned by JC Penny. The system resides on an NCR 5100M running a Taradata database management system, and executes as many as 784 concurrent queries.

Page 28: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199828

Winter CorporationSurvey - 1998, Cont.

• No Shows:

– There were no reports of a database over 542GB on NT.

– None of the top-ten contenders in any category made use of Sybase.

– No databases were reported that used object-oriented systems.

Page 29: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199829

Winter CorporationSurvey - 1998, Cont.

• Oracle & Informix:

– Outside of Unix, systems using Oracle and Informix were significantly smaller - they never appeared in the top three of any other category.

– Oracle and Informix do much better in all three categories (size, rows/records, and peak activity) for Unix systems.

Page 30: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199830

Transaction Processing Performance Council (TPC)

• TPC is an organization whose purpose is to develop and maintain database

benchmarking materials.

• TPC also enforces a relatively strict process for using their materials.

• TPC materials are used and referenced heavily by virtually all major database hardware and software vendors.

• Results of the benchmarks are published by TPC, and updated regularly as new results are submitted.

• Not surprisingly, TPC benchmark results also appear regularly in numerous vendor press releases.

Page 31: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199831

Transaction Processing Performance Council (TPC)

• TPC has developed several different sets of benchmark materials. Two of the most frequently referenced are TPC-C and TPC-D, which are OLTP and DSS benchmarks, respectively.

• Currently, results reported for the TPC-D benchmark include databases consisting of 30GB, 100GB, 300GB, 1TB, and 3TB of data.

• Historically, TPC has not interpreted or ranked the results. However, such a ranking is published by Ideas International, and can be accessed at www.ideasinternational.com).

Page 32: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199832

Transaction Processing Performance Council (TPC)

• TPC Metrics:

– Power Metric: A measure of the raw query execution power of the system when connected with a single active user.

– Throughput Metric: A measure of the ability of the system to process the most queries in the least amount of time.

– Price/Performance Metric: The ratio between total system price and the query-per-hour-rating, which is a composite of the power and throughput metrics.

Page 33: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199833

Highlights From TheTPC-D Results As Of 4/21/98

• TPC-D 100GB:

– NCR WorldMark 4700 running a Teradata database had the highest ranking on power and throughput.

– IBM/6000 running DB2 had the three highest scores in price/performance.

– IBM/6000 running DB2 also came in third and fourth, for both power and throughput.

– Although it did not capture the top spot in any of the three categories, Oracle, running on a variety of platforms, captured six of the top ten spots on power, five of the top ten spots on throughput, and five of the top ten spots on price/performance.

– Informix ranked within the top ten in each category, appearing in the sixth position on power, the ninth position on throughput, and the sixth and tenth positions on price/performance.

Page 34: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199834

Transaction Processing Performance Council (TPC)

• TPC-D 300GB:

– NCR WorldMark 5150 running a Taradata database had the highest ranking on power and throughput.

– DG AViiON AV20000 running Oracle came in first on price/performance.

– As with the 100GB category, Oracle appeared very strong in all three categories, capturing six of the top ten spots in power, seven of the top ten spots on throughput, and six of the top ten spots in price/performance.

– Informix appeared slightly stronger in the 300GB category, capturing three of the top ten spots in power, throughput, and price/performance.

Page 35: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199835

Transaction Processing Performance Council (TPC)

• TPC-D 1TB:

– Sun Ultra Enterprise 6000 running an INFORMIX Dynamic server captured the top spot in all three metrics; power, throughput, and price/performance.

– Although they did not capture the top spot on any of the three metrics, both NCR WorldMark 5150 running Teradata, and IBM RS/600 running DB2 both appeared within the top four in all three metrics.

Page 36: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199836

Transaction Processing Performance Council (TPC)

• Other Notes:

– For these TPC-D results, twenty competitors submitted at the 100GB level, twelve submitted at 300GB, and four submitted at 1TB.

– Although several entries using Sybase were submitted to the TPC-C benchmark, none were submitted for TPC-D.

– No results were submitted, for either TPC-C or TPC-D, using purely object-oriented systems such as ObjectStore.

Page 37: Copyright, Harris Corporation & Ophir Frieder, 19981 Very Large Databases And Data Warehousing

Copyright, Harris Corporation &

Ophir Frieder, 199837

Space Related Terminology

Kilobyte (KB) 2^10 or 1024 bytes

Megabyte (MB) 2^20 or 1024 kilobytes

Gigabyte (GB) 2^30 or 1024 gigabytes

Terabyte (TB) 2^40 or 1024 megabytes

Petabyte (PB) 2^50 or 1024 terabytes

Exabyte (EB) 2^60 or 1024 petabytes

Zetabyte (ZB) 2^70 or 1024 exabytes

Yottabyte (YB) 2^80 or 1024 zetabytes