
8/14/2019 ch3_DW_arc

http://slidepdf.com/reader/full/ch3dwarc 1/13

DATA WAREHOUSING ARCHITECTURE

1. Starter

2. Six Steps to Develop the Architecture

3. The Data Warehouse Infrastructure

4. Data Warehouse System Infrastructure

5. Data Layer components

6. Ongoing Maintenance: Warehouse Infrastructure

7. What is Data Warehouse Architecture?

  7.1 Components

8. Different possible wrong architectures

  8.1 “Virtual” Data Warehouse

  8.2 Data Mart in a box Architecture

DATA WAREHOUSING

The goal is to enable users to make informed decisions rapidly so their companies can respond, make changes, and remain competitive.


SUSHIL KULKARNI 2

[email protected]

1. Starter

The architecture of a data warehouse is very complex and involves many elements, because it consists of many different systems that must be connected as well as processed. Constructing any corporate data warehouse requires a technical infrastructure that includes an operating system, hardware platform, database management system, and network. DBMS selection is a little more complicated than for a straightforward operational system because of the unusual challenges of the data warehouse, especially the need to support very complex queries that cannot be predicted in advance. This is explored in more detail later in the chapter.

In this chapter we will answer the following questions:

1.  How does the data get into the data warehouse?

2.  The warehouse requires ongoing processes to feed it, and these processes require their own infrastructure. What infrastructure is required for a data warehouse?

3.  IT departments often overlook this infrastructure when they plan for the data warehouse. Different layers are required for storing the data, so what are the data layers?

4.  How can the data be cleaned, and what are the steps? How will ongoing data loads, cleansing, and summarizing be accomplished?

5.  How will users get information out of the warehouse? The choice of query tool becomes very important, and depends upon a multiplicity of factors.

2. Six Steps to Develop the Architecture

The following steps are used to develop the architecture. They are to be performed in the order in which they are given:

1. The most important step in developing an effective data warehouse architecture is to enlist the full support and commitment of an executive of the company as project sponsor.

2. Next, you must staff an architecture team with strong personnel. It is not necessarily the technology you choose for your architecture but the personnel designing and developing the architecture that make the project successful.

3. Prototype/benchmark all the technologies you are interested in using. Design and develop a prototype that can be used to test all of the different technologies that are being considered.

4. Give the architecture team enough time to build the architecture infrastructure before development begins. For a large organization, this can be anywhere from six months to a year or more.


5. Make sure you train the development staff on the use of the architecture before development begins. Spend time letting the development team get full exposure to the capabilities and components of the architecture.

6. Provide the architecture team an opportunity to enhance and improve the architecture as the project moves forward. No matter how much time is spent up front developing an architecture, it will not be perfect the first time around.

As we examine the architecture of a data warehouse, we will look at it from three views: the overall data warehouse infrastructure, data layer components, and ongoing maintenance infrastructure.

3. The Data Warehouse Infrastructure

The data warehouse consists of the following architectural components, which compose the data warehouse infrastructure:

o  System infrastructure: Hardware, software, network, database management system, and personnel components of the infrastructure.

o  Metadata layer: Data about data. This includes, but is not limited to, definitions and descriptions of data items and business rules.

o  Data discovery: The process of understanding the current environment so it can be integrated into the warehouse.

o  Data acquisition: The process of loading data from the various sources.

o  Data distribution: The dissemination/replication of data to distributed data marts for specific segmented groups.

o  User analysis: Includes the infrastructure required to support user queries and analysis.

4. Data Warehouse System Infrastructure

The technical architecture of a data warehouse is an important component because it is the base for building all the other data warehouse components. This is why the technical architecture is called the infrastructure.

The infrastructure foundation upon which the data warehouse is built is often called the platform. It is made up of the following components:

Hardware, including operating system: Should be open, meaning that a variety of tools are able to run on the platform, and data is able to flow to/from the platform with a minimal amount of effort. Most of the hardware of a data warehouse will consist of a number of large machines. Large machines have 6 to 8 or even 12 CPUs, with a gigabyte or more of memory and many gigabytes or even a terabyte of disk space.

Network: Should minimize complexity and maximize bandwidth. Should connect (directly) all components and locations of the corporate enterprise that need access to the data warehouse.

Software: The most important software component of a data warehouse is the Database Management System (DBMS) (seen in the first chapter). However, there are other important software components as well: the monitoring, administration, and network management tools used to maintain the database; the software used to support user access; the data modeling tools used by the development staff to design, implement, and maintain the data warehouse; and third-party utilities.

Personnel: This may seem like an odd component of data warehouse architecture, but it is the most important (and the most expensive!). There are a number of technology choices on the market today, and each of these technologies has good features and bad features. Therefore, choosing the right components is not an exact science. The key factor in whether or not the technology will work is the skill level of the individuals designing and developing the architecture. Good components and experienced architects/developers will make the difference in the end.

5. Data Layer components

Figure A illustrates the overall high-level data architectural components required for the typical data warehouse effort. For building a data warehouse, we typically build at least two separate databases: an interim "staging area" and the warehouse itself.

When the data is loaded into the warehouse, if it comes over from the legacy system as is with no transformation, it is considered Level 0.

Alternatively, some scrubbing can take place on the legacy system (if the load on that system allows it). Some tools enable you to enter scrub rules, and the tool will generate code, which can then be executed in various ways.

If some initial scrubbing occurs on the legacy side, then the data coming over to the interim staging area is called Level 1, because it has already been processed once. Each time data is scrubbed, the data level is incremented.

The data is placed in the interim environment so scrubbing can take place. Often, primary keys must be resolved. Sometimes one row of data in a warehouse table will be sourced from more than one legacy system. The primary key is pieced together in the interim environment. This must take place before any other scrubbing can occur; you must have a proper identifier for each row before proceeding. Often, more than one iteration of scrubbing occurs in the staging area. Scrubbing takes place at Level 2, the data from all the legacy systems is aggregated at Level 3, and the final result is obtained at Level 4.
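The key resolution step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two source systems (`billing_rows`, `crm_rows`), their field names, and the `resolve_key` helper are hypothetical, and the sketch assumes both legacy systems carry a customer number that can be normalized to one value.

```python
from collections import defaultdict

# Hypothetical extracts from two legacy systems; field names are assumptions.
billing_rows = [
    {"cust_no": " 0042", "balance": 120.0},
    {"cust_no": "0077", "balance": 0.0},
]
crm_rows = [
    {"customer_number": "42", "name": "Acme Ltd"},
    {"customer_number": "77", "name": "Bolt Inc"},
]


def resolve_key(raw: str) -> str:
    """Normalize a legacy identifier into a single warehouse key."""
    return f"CUST-{int(raw.strip()):06d}"


def integrate(billing, crm):
    """Piece one staging row together from both sources, keyed by the resolved key."""
    staged = defaultdict(dict)
    for row in billing:
        staged[resolve_key(row["cust_no"])]["balance"] = row["balance"]
    for row in crm:
        staged[resolve_key(row["customer_number"])]["name"] = row["name"]
    return dict(staged)


staging = integrate(billing_rows, crm_rows)
```

Because every row is keyed before anything else happens, later scrub routines can address a row by one identifier regardless of which legacy system supplied each field.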


Figure B shows an alternate way with four data levels. Level 0 is the straight extract, taken as is from the legacy environment. Level 1 takes place in the interim staging area; its main purpose is to create a primary key, a single field that will serve as the unique identifier for each row. Then, Level 2 cleans up miscellaneous data anomalies, such as replacing non-standard project codes with the approved values. Another set of scrub routines then summarizes the information that will be stored in the data warehouse. This summarization is Level 3. Some warehouses may have one or more levels of summarization. Level 4 shows more granular summaries being calculated.

Most data warehouses in the real world don't have all of the levels shown. As stated previously, if scrubbing takes place before the data is shipped to the interim environment, Level 0 is not even shown. It is never represented in the warehouse or stored.
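The progression from Level 0 to Level 3 can be illustrated with a small Python sketch. It is a simplified model rather than a prescribed implementation: the sample rows, the `APPROVED_CODES` mapping, and the function names are all hypothetical.

```python
from collections import defaultdict

# Level 0: straight extract, taken as is (hypothetical legacy rows).
level0 = [
    {"system": "A", "id": 1, "project": "prj-1", "hours": 5},
    {"system": "B", "id": 1, "project": "PRJ1", "hours": 3},
    {"system": "A", "id": 2, "project": "PRJ2", "hours": 4},
]

# Assumed mapping from non-standard project codes to approved values.
APPROVED_CODES = {"prj-1": "PRJ1"}


def to_level1(rows):
    """Level 1: add a single-field primary key that is unique per row."""
    return [{**r, "pk": f"{r['system']}-{r['id']}"} for r in rows]


def to_level2(rows):
    """Level 2: replace non-standard project codes with approved values."""
    return [
        {**r, "project": APPROVED_CODES.get(r["project"], r["project"])}
        for r in rows
    ]


def to_level3(rows):
    """Level 3: summarize hours per project."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["project"]] += r["hours"]
    return dict(totals)


level1 = to_level1(level0)
level2 = to_level2(level1)
level3 = to_level3(level2)
```

Each function takes the previous level as input and returns a new list, so any intermediate level can be inspected or stored separately, which mirrors the idea that the level number rises each time the data is processed.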

[Figures A and B: data level diagrams. Labels recovered from the figures, in extraction order: Level 2: Scrub; Level 0: Extract without scrub; Level 3: Aggregate; Level 4: Final Result; Level 1: Create Primary key Integrate; Level 3: Scrub; Level 1: Extract and Cleaning; Level 4: Aggregate; Level 2: Integrate.]

6. Ongoing Maintenance: Warehouse Infrastructure

Data warehouses are fed periodically, and repeatable processes must be in place for this to occur. The following processes are part of this iterative cycle:

o  Extract data from source system database

o  Export data from source environment to warehouse platform

o  Copy data into interim staging area database

o  Perform necessary scrubs

o  Process errors (such as moving rows with bad data into an error table, flagging problem rows, etc.)

o  Perform summarization/aggregation

o  Load data from interim staging area into the warehouse

o  Perform backup if required

o  Propagate subscribed data to data marts as required
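A single pass through part of the cycle listed above can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production loader: the table names (`staging`, `load_errors`, `warehouse_sales`), the sample rows, and the validation rules are assumptions made for the example, and the backup and data mart steps are left out.

```python
import sqlite3

# Hypothetical rows already extracted and exported from a source system.
extracted = [("north", "100.5"), ("south", "n/a"), ("north", "20"), ("", "7")]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE staging (region TEXT, amount TEXT);
    CREATE TABLE load_errors (region TEXT, amount TEXT, reason TEXT);
    CREATE TABLE warehouse_sales (region TEXT PRIMARY KEY, total REAL);
    """
)

# Copy data into the interim staging area.
conn.executemany("INSERT INTO staging VALUES (?, ?)", extracted)


def is_number(text):
    """Return True when the text can be parsed as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


# Scrub and process errors: move rows with bad data into an error table.
clean = []
for region, amount in conn.execute("SELECT region, amount FROM staging").fetchall():
    if not region:
        conn.execute(
            "INSERT INTO load_errors VALUES (?, ?, ?)", (region, amount, "missing region")
        )
    elif not is_number(amount):
        conn.execute(
            "INSERT INTO load_errors VALUES (?, ?, ?)", (region, amount, "bad amount")
        )
    else:
        clean.append((region, float(amount)))

# Summarize, then load from the staging area into the warehouse table.
totals = {}
for region, amount in clean:
    totals[region] = totals.get(region, 0.0) + amount
conn.executemany("INSERT INTO warehouse_sales VALUES (?, ?)", list(totals.items()))
conn.commit()

loaded = conn.execute("SELECT region, total FROM warehouse_sales").fetchall()
errors = conn.execute("SELECT COUNT(*) FROM load_errors").fetchone()[0]
```

Keeping rejected rows in their own table, with a reason, means a failed load can be diagnosed and replayed instead of silently losing data.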

This cycle is repeated every time a load is performed. The following figure shows the essential architectural components required for ongoing maintenance of the data warehouse.

[Figure: ongoing maintenance components. Labels recovered from the figure: DW, OLAP, Adhoc SQL, User View, Queries/reports, Data mining.]

7. What is Data Warehouse Architecture?

Data Warehouse Architecture is a description of the components and services of the warehouse, how they fit together and how they will grow. These descriptions should contain enough information to allow a skilled professional to implement the architecture.

Architecture provides the mechanism to achieve enterprise integration to support business. It provides an organizing framework that will improve data sharing between agencies, and in the long run allow for faster development, reuse and consistent data between warehouse projects. Most importantly, this architecture is an evolutionary process. The architecture as defined here was initially developed as a place to start. The first enterprise warehouse projects will be based on this architecture. Increments of additional agency projects will cause this architecture to evolve. As technology changes and improves, that too will most likely require us to make adjustments to this architecture. This incremental development of both the architecture and the warehouse offers an opportunity to learn and to minimize the impact of mistakes.

7.1 Components

The architecture is made up of a number of interconnected parts called components or layers, which are as follows:

o  Operational Database / External Database Layer

o  Information Access Layer

o  Data Access Layer

o  Data Directory (Metadata) Layer

o  Process Management Layer

o  Application Messaging Layer

o  Data Warehouse Layer

o  Data Staging Layer


[A] Operational Database / External Database Layer

Operational systems process data to support critical operational needs. To do that, operational databases have historically been created to provide an efficient processing structure for a relatively small number of well-defined business transactions. However, because of the limited focus of operational systems, the databases designed to support them make it difficult to access the data for other management or informational purposes. This difficulty in accessing operational data is amplified by the fact that many operational systems are often 10 to 15 years old. The age of some of these systems means that the data access technology available to obtain operational data is itself dated.

Clearly, the goal of data warehousing is to free the information that is locked up in the operational databases and to mix it with information from other, often external, sources of data. Increasingly, large organizations are acquiring additional data from outside databases. This information includes demographic, econometric, competitive and purchasing trends. The so-called "information superhighway" is providing access to more data resources every day.

[B] Information Access Layer

The Information Access layer of the Data Warehouse Architecture is the layer that the end-user deals with directly. In particular, it represents the tools that the end-user normally uses day to day, e.g., Excel, Lotus 1-2-3, Focus, Access, SAS, etc. This layer also includes the hardware and software involved in displaying and printing reports, spreadsheets, graphs and charts for analysis and presentation. Over the past two decades, the Information Access layer has expanded enormously, especially as end-users have moved to PCs and PC/LANs.

Today, more and more sophisticated tools exist on the desktop for manipulating, analyzing and presenting data; however, there are significant problems in making the raw data contained in operational systems available easily and seamlessly to end-user tools. One of the keys to this is to find a common data language that can be used throughout the enterprise.

[C] Data Access Layer

The Data Access Layer of the Data Warehouse Architecture allows the Information Access Layer to talk to the Operational Layer. In the network world today, the common data language that has emerged is SQL. Originally, SQL was developed by IBM as a query language, but over the last twenty years it has become the de facto standard for data interchange.

One of the key breakthroughs of the last few years has been the development of a series of data access "filters" such as EDA/SQL that make it possible for SQL to access nearly all DBMSs and data file systems, relational or nonrelational. These filters make it possible for state-of-the-art Information Access tools to access data stored on database management systems that are twenty years old.

The Data Access Layer not only spans different DBMSs and file systems on the same hardware, it spans manufacturers and network protocols as well. One of the keys to a Data Warehousing strategy is to provide end-users with "universal data access". Universal data access means that, theoretically at least, end-users, regardless of location or Information Access tool, should be able to access any or all of the data in the enterprise that is necessary for them to do their job.

The Data Access Layer, then, is responsible for interfacing between Information Access tools and Operational Databases. In some cases, this is all that certain end-users need. However, in general, organizations are developing a much more sophisticated scheme to support Data Warehousing.
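The idea of a data access "filter" can be illustrated with a small Python sketch that makes a nonrelational flat file queryable through SQL. It is not how EDA/SQL or any specific product works; it simply loads hypothetical flat-file records into an in-memory SQLite table, and the file contents, the table name `legacy_orders`, and the helper name are assumptions.

```python
import csv
import io
import sqlite3

# Hypothetical nonrelational source: a flat file exported by a legacy system.
LEGACY_FLAT_FILE = "order_id,customer,amount\n1,Acme,250\n2,Bolt,75\n"


def expose_flat_file_as_table(conn, text):
    """Act as a simple access 'filter': make flat-file records queryable with SQL."""
    rows = [
        (int(r["order_id"]), r["customer"], float(r["amount"]))
        for r in csv.DictReader(io.StringIO(text))
    ]
    conn.execute(
        "CREATE TABLE legacy_orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO legacy_orders VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
expose_flat_file_as_table(conn, LEGACY_FLAT_FILE)

# An information access tool now needs only SQL, not the file layout.
big_orders = conn.execute(
    "SELECT customer, amount FROM legacy_orders WHERE amount > ? ORDER BY amount",
    (100,),
).fetchall()
```

The querying side never sees the file format, which is the point of the layer: the tool speaks SQL and the filter deals with where and how the data is stored.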

[D] Data Directory (Metadata) Layer

In order to provide for universal data access, it is absolutely necessary to maintain some form of data directory or repository of meta-data information. Meta-data is the data about data within the enterprise. Record descriptions in a COBOL program are meta-data. So are DIMENSION statements in a FORTRAN program, or SQL CREATE statements. The information in an ERA diagram is also meta-data.

In order to have a fully functional warehouse, it is necessary to have a variety of meta-data available: data about the end-user views of data and data about the operational databases. Ideally, end-users should be able to access data from the data warehouse (or from the operational databases) without having to know where that data resides or the form in which it is stored.
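A data directory of this kind can be pictured as a lookup from business terms to physical locations. The Python sketch below is a deliberately small illustration: the `DataItem` fields, the two entries, and the `locate` helper are hypothetical and stand in for whatever repository a real warehouse would use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataItem:
    """One directory entry: business meaning plus physical location."""

    business_name: str
    description: str
    source: str      # where the data physically resides
    column: str      # physical column name
    data_type: str


# Hypothetical directory entries; all names are illustrative only.
DIRECTORY = {
    item.business_name: item
    for item in [
        DataItem("Customer Name", "Legal name of the customer",
                 "warehouse.customer_dim", "cust_nm", "TEXT"),
        DataItem("Order Total", "Order value including tax",
                 "ops.orders", "ord_tot_amt", "DECIMAL(12,2)"),
    ]
}


def locate(business_name: str) -> str:
    """Resolve a business term to its physical location via the directory."""
    item = DIRECTORY[business_name]
    return f"{item.source}.{item.column}"
```

An end-user tool that asks for "Order Total" can call `locate` and never needs to know that the value lives in a column named `ord_tot_amt` in an operational table.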

[E] Process Management Layer

The Process Management Layer is involved in scheduling the various tasks that must be accomplished to build and maintain the data warehouse and data directory information.

The Process Management Layer can be thought of as the scheduler or the high-level job control for the many processes (procedures) that must occur to keep the Data Warehouse up-to-date.
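One way to picture this job control role is as dependency ordering between warehouse jobs. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the job names and their dependencies are hypothetical, and a real scheduler would also handle timing, retries, and failures.

```python
from graphlib import TopologicalSorter

# Hypothetical warehouse jobs mapped to the jobs they depend on.
JOB_DEPENDENCIES = {
    "extract": set(),
    "stage": {"extract"},
    "scrub": {"stage"},
    "summarize": {"scrub"},
    "load_warehouse": {"summarize"},
    "refresh_directory": {"load_warehouse"},
    "propagate_marts": {"load_warehouse"},
}


def plan_run(dependencies):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


run_order = plan_run(JOB_DEPENDENCIES)
```

The resulting order always runs a job after everything it depends on; jobs with no ordering between them, such as the last two here, may appear in either order.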

[F] Application Messaging Layer

The Application Messaging Layer has to do with transporting information around the enterprise computing network. Application Messaging is also referred to as "middleware", but it can involve more than just networking protocols. Application Messaging, for example, can be used to isolate applications, operational or informational, from the exact data format on either end. Application Messaging can also be used to collect transactions or messages and deliver them to a certain location at a certain time.

Application Messaging is the transport system underlying the Data Warehouse.
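The two roles described above, isolating applications from each other's data formats and collecting messages for later delivery, can be sketched in Python. This is a toy model under stated assumptions: an in-process `queue.Queue` stands in for the real transport, and the topic name, envelope layout, and function names are hypothetical.

```python
import json
import queue

# An in-process queue stands in for the messaging transport (an assumption).
channel = queue.Queue()


def publish(topic: str, payload: dict) -> None:
    """Wrap the payload in a neutral envelope so the sender's format stays private."""
    channel.put(json.dumps({"topic": topic, "payload": payload}))


def deliver_all(handlers: dict) -> int:
    """Drain the collected messages and hand each to the handler for its topic."""
    delivered = 0
    while not channel.empty():
        message = json.loads(channel.get())
        handlers[message["topic"]](message["payload"])
        delivered += 1
    return delivered


received = []
publish("order.created", {"order_id": 1, "amount": 250})
publish("order.created", {"order_id": 2, "amount": 75})
count = deliver_all({"order.created": received.append})
```

Messages accumulate in the channel until `deliver_all` is called, which corresponds to collecting transactions and delivering them to a chosen destination at a chosen time.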

[G] Data Warehouse (Physical) Layer

The (core) Data Warehouse is where the actual data used primarily for informational purposes resides. In some cases, one can think of the Data Warehouse simply as a logical or virtual view of data. In many instances, the data warehouse may not actually involve storing data.

In a Physical Data Warehouse, copies, in some cases many copies, of operational and/or external data are actually stored in a form that is easy to access and is highly flexible. Increasingly, Data Warehouses are stored on client/server platforms, but they are often stored on mainframes as well.

[H] Data Staging Layer

The final component of the Data Warehouse Architecture is Data Staging. Data Staging is also called copy management or replication management, but in fact, it includes all of the processes necessary to select, edit, summarize, combine and load data warehouse and information access data from operational and/or external databases.

Data Staging often involves complex programming, but increasingly data warehousing tools are being created that help in this process. Data Staging may also involve data quality analysis programs and filters that identify patterns and data structures within existing operational data.

8. Different possible wrong architectures

In the previous section you saw the different components of data warehouse architecture. When designing an architecture, care should be taken that it is not faulty. In this section you will see different types of possible architectures that are wrong. Many Data Warehouse projects fail due to the selection of an architecture that is incapable of meeting business requirements.

A desire to build a Data Warehouse quickly and cheaply often leads to selection of a wrong architecture. The following architectures are generally considered to be wrong:

o  “Virtual” Data Warehouse

o  “Data Mart in a Box” 

8.1 Virtual Data Warehouse

In this architecture there is no Data Warehouse database. The business analysts access operational databases using simple OLAP front-end tools. This architecture is popular because it requires minimal investment in additional hardware and software. No extra IT personnel are required, and there is no extracting, cleaning and loading burden. The front-end data access and analysis tools simplify access to legacy database systems on mainframes, and allow multidimensional queries on views and drill-down operations on operational data. The following figure depicts this architecture:

Following are some of the limitations of a “virtual” data warehouse:

1.  As no true data warehouse database is built, there is no:

o  Historical data

o  Summarized and aggregated data

o  Central meta data repository with enterprise-wide definitions of the business data semantics

o  Cleaning and transforming of operational data to suit the decision-making processes

2.  A “virtual” data warehouse can be considered only a very short-term, temporary solution to the problem.
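The lack of historical data can be demonstrated with a short Python sketch using the built-in `sqlite3` module. The `accounts` table and the `balance_by_region` view are hypothetical; the view plays the part of the "virtual" warehouse sitting directly on operational data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical operational table: holds only the current state.
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, region TEXT, balance REAL);
    INSERT INTO accounts VALUES (1, 'north', 100.0), (2, 'north', 50.0);

    -- The "virtual" warehouse is just a view over operational data.
    CREATE VIEW balance_by_region AS
        SELECT region, SUM(balance) AS total FROM accounts GROUP BY region;
    """
)

before = conn.execute("SELECT total FROM balance_by_region").fetchone()[0]

# A routine operational update overwrites the old value in place.
conn.execute("UPDATE accounts SET balance = 10.0 WHERE account_id = 1")

after = conn.execute("SELECT total FROM balance_by_region").fetchone()[0]
```

After the update, the view can only report the new total. The earlier figure is gone unless something outside the operational system captured it, which is exactly the history a physical warehouse would keep.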


8.2 Data Mart in a box Architecture

This is a packaged product that builds a Data Warehouse database from various data sources and provides access to that database through user-friendly data access and analysis tools. It also builds a local meta data repository with data definitions in business terms. The following figure depicts this architecture:

Following are some of the advantages and disadvantages of the above architecture:

o  The data mart in a box architecture eliminates the interference of OLAP operations with OLTP.

o  But it retains some of the old problems and introduces some new ones:

*  This architecture tends to proliferate in an uncontrolled manner, leading to multiple, non-integrated, independent, local data marts purchased from different vendors.

*  Lack of support for common business rules, semantics, and data definitions across business areas (although every data mart maintains its own meta data repository).

*  Population of data marts with “dirty” source data.

The dirty data problem is as follows:

o  Data stored in the legacy databases has a high percentage of missing, erroneous, or inconsistent data values. Examples of “dirty” data are multiple attribute values in one field, one attribute value spread across two or more fields, different spellings for the same attribute value, inconsistent names for legal entities, and incorrect use of codes across records.


o  Up to 20% of fields can contain such “dirty” data.
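A few of the dirty data patterns listed above can be detected with simple checks. The Python sketch below is illustrative only: the sample records, the use of "/" as a sign of multiple values in one field, and the name normalization rule are assumptions, and real data quality tools apply far richer rules.

```python
import re

# Hypothetical legacy records showing some of the problems listed above.
records = [
    {"company": "Acme Ltd", "phone": "555-0100"},
    {"company": "ACME Ltd.", "phone": "555-0100 / 555-0101"},
    {"company": "Bolt Inc", "phone": ""},
]


def normalize_name(name: str) -> str:
    """Collapse case and punctuation so spelling variants compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def find_issues(rows):
    """Return (row index, issue) pairs for a few simple dirty-data checks."""
    issues = []
    seen_names = {}
    for i, row in enumerate(rows):
        if not row["phone"]:
            issues.append((i, "missing value"))
        elif "/" in row["phone"]:
            issues.append((i, "multiple values in one field"))
        key = normalize_name(row["company"])
        if key in seen_names and seen_names[key] != row["company"]:
            issues.append((i, "different spelling of the same value"))
        seen_names.setdefault(key, row["company"])
    return issues


issues = find_issues(records)
# Share of rows with at least one issue in this tiny, made-up sample.
dirty_share = len({i for i, _ in issues}) / len(records)
```

Running checks like these in the staging area, before data reaches a data mart, is one way to avoid populating it with dirty source data.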

To sum up, the benefits of having a data warehouse architecture are as follows:

o  Provides an organizing framework - the architecture draws the lines on the map in terms of what the individual components are, how they fit together, who owns what parts, and priorities.

o  Improved flexibility and maintenance - allows you to quickly add new data sources, interface standards allow plug and play, and the model and meta data allow impact analysis and single-point changes.

o  Faster development and reuse - warehouse developers are better able to understand the data warehouse process, database contents, and business rules more quickly.

o  Management and communications tool - define and communicate direction and scope to set expectations, identify roles and responsibilities, and communicate requirements to vendors.

o  Coordinate parallel efforts - multiple, relatively independent efforts have a chance to converge successfully. Also, data marts without architecture become the stovepipes of tomorrow.
