8/7/2019 DATA WARE HOUSE in brief
http://slidepdf.com/reader/full/data-ware-house-in-brief 1/27

What is a Data Warehouse?
Shipra Varshney
Lecture - MBA

What Is a Data Warehouse?
- Nobody can agree
- So I'm not actually going to define a DW
- Don't feel cheated, though
- By the end of this talk, you'll
  - Understand key concepts that underlie all warehouse implementations ("talk the talk")
  - Understand the various components out of which DW architects construct real-world data warehouses
  - Understand what a data warehouse project looks like

Why Are Schools Setting Up Data Warehouses?
A data warehouse makes it easier to:
- Optimize classroom, computer lab usage
- Refine admissions ratings systems
- Forecast future demand for courses, majors
- Tie private spreadsheet data into central repositories
- Correlate admissions and IR data with outcomes such as:
  - GPAs
  - Placement rates
  - Happiness, as measured by alumni surveys
- Notify advisors when extra help may be needed based on
  - Admissions data (student vitals; SAT, etc.)
  - Special events: A-student suddenly gets a C in his/her major
  - Slower trends: Student's GPA falls for > 2 semesters/terms
- (Many other examples could be given!)
Better information = better decisions
- Better admission decisions
- Better retention rates
- More effective fund raising, etc.

Talking The Talk
To think and communicate usefully about data warehouses you'll need to understand a set of common terms and concepts:
- OLTP
- ODS
- OLAP, ROLAP, MOLAP
- ETL
- Star schema
- Conformed dimension
- Data mart
- Cube
- Metadata
Even if you're not an IT person, pay heed:
- You'll have to communicate with IT people
- More importantly:
  - Evidence shows that IT will only build a successful warehouse if you are intimately involved!

OLTP
OLTP = online transaction processing
The process of moving data around to handle day-to-day affairs
- Scheduling classes
- Registering students
- Tracking benefits
- Recording payments, etc.
Systems supporting this kind of activity are called transactional systems

Transactional Systems
Transactional systems are optimized primarily for the here and now
- Can support many simultaneous users
- Can support heavy read/write access
- Allow for constant change
- Are big, ugly, and often don't give people the data they want
  - As a result a lot of data ends up in shadow databases
  - Some ends up locked away in private spreadsheets
Transactional systems don't record all previous data states
Lots of data gets thrown away or archived, e.g.:
- Admissions data
- Enrollment data
- Asset tracking data ("How many computers did we support each year, from 1996 to 2006, and where do we expect to be in 2010?")

Simple Transactional Database
- Map of Microsoft Windows Update Service (WUS) back-end database
  - Diagrammed using Sybase PowerDesigner
- Each green box is a database "table"
- Arrows are "joins" or foreign keys
- This is simple for an OLTP back end

ODS
ODS = operational data store
ODSs were an early workaround to the "reporting problem"
To create an ODS you
- Build a separate/simplified version of an OLTP system
- Periodically copy data into it from the live OLTP system
- Hook it to operational reporting tools
An ODS can be an integration point or real-time "reporting database" for an operational system
It's not enough for full enterprise-level, cross-database analytical processing
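The three ODS steps can be sketched in a few lines. This is only an illustration, assuming two SQLite databases standing in for the live OLTP system and the ODS; every table and column name here (`student`, `registration`, `enrollment_report`, and so on) is hypothetical, and a real refresh would usually be incremental rather than a full reload.

```python
import sqlite3

# Hypothetical live OLTP system: normalized registration tables.
oltp = sqlite3.connect(":memory:")
oltp.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE course (course_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE registration (student_id INTEGER, course_id INTEGER, term TEXT);
    INSERT INTO student VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO course VALUES (10, 'Statistics'), (20, 'Databases');
    INSERT INTO registration VALUES (1, 10, '2006FA'), (1, 20, '2006FA'), (2, 20, '2006FA');
""")

# Hypothetical ODS: one flat table that reporting tools can read directly.
ods = sqlite3.connect(":memory:")
ods.execute("CREATE TABLE enrollment_report (student TEXT, course TEXT, term TEXT)")


def refresh_ods(source, target):
    """Full refresh: re-read the live system and overwrite the ODS copy."""
    rows = source.execute("""
        SELECT s.name, c.title, r.term
        FROM registration r
        JOIN student s ON s.student_id = r.student_id
        JOIN course c ON c.course_id = r.course_id
    """).fetchall()
    with target:  # commit on success, roll back on error
        target.execute("DELETE FROM enrollment_report")
        target.executemany("INSERT INTO enrollment_report VALUES (?, ?, ?)", rows)
    return len(rows)


copied = refresh_ods(oltp, ods)

# An operational report runs against the ODS, not the live system.
per_course = ods.execute(
    "SELECT course, COUNT(*) FROM enrollment_report GROUP BY course ORDER BY course"
).fetchall()
print(copied, per_course)
```

The report query touches only the flat ODS table, so it needs no knowledge of the OLTP joins and puts no load on the live system.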

OLAP
OLAP = online analytical processing
OLAP is the process of creating and summarizing historical, multidimensional data
- To help users understand the data better
- Provide a basis for informed decisions
- Allow users to manipulate and explore data themselves, easily and intuitively
More than just "reporting"
Reporting is just one (static) product of OLAP

OLAP Support Databases
OLAP systems require support databases
These databases typically
- Support fewer simultaneous users than OLTP back ends
- Are structured simply; i.e., denormalized
- Can grow large
  - Hold snapshots of data in OLTP systems
  - Provide history/time depth to our analyses
- Are optimized for read (not write) access
- Are updated via periodic batch (e.g., nightly) ETL processes
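To make the "snapshots" and "history/time depth" points concrete, here is a minimal sketch of a periodic snapshot. It assumes a single SQLite database holds both a current-state table and a history table; the names `student_current`, `student_snapshot`, and `take_snapshot` are hypothetical and chosen only for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical OLTP table: holds only the current state of each student.
db.execute("CREATE TABLE student_current (student_id INTEGER PRIMARY KEY, major TEXT)")
# Hypothetical support table: one row per student per snapshot date.
db.execute("CREATE TABLE student_snapshot (snapshot_date TEXT, student_id INTEGER, major TEXT)")


def take_snapshot(conn, snapshot_date):
    """Append the current OLTP state to the history table (the batch step)."""
    with conn:
        conn.execute(
            "INSERT INTO student_snapshot "
            "SELECT ?, student_id, major FROM student_current",
            (snapshot_date,),
        )


db.execute("INSERT INTO student_current VALUES (1, 'Physics')")
take_snapshot(db, "2006-09-01")

# The transactional system overwrites the old value; the earlier state is gone there.
db.execute("UPDATE student_current SET major = 'History' WHERE student_id = 1")
take_snapshot(db, "2007-02-01")

history = db.execute(
    "SELECT snapshot_date, major FROM student_snapshot "
    "WHERE student_id = 1 ORDER BY snapshot_date"
).fetchall()
print(history)
```

The history table only ever receives appended rows and is read far more often than it is written, which is why such databases can be tuned for read access.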

ETL Processes
ETL = extract, transform, load
- Extract data from various sources
- Transform and clean the data from those sources
- Load the data into databases used for analysis and reporting
ETL processes are coded in various ways
- By hand in SQL, UniBASIC, etc.
- Using more general programming languages
- In semi-automated fashion using specialized ETL tools like Cognos Decision Stream
Most institutions do hand ETL; but note well:
- Hand ETL is slow
- Requires specialized knowledge
- Becomes extremely difficult to maintain as code accumulates and databases/personnel change!
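A hand-coded ETL job in a general programming language might look roughly like the sketch below. It assumes a CSV export as the source and SQLite as the target; the sample data, the `student_gpa` table, and the cleaning rules (trim and title-case the major, drop rows with no GPA) are all made up for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical source extract: a CSV export with inconsistent formatting.
SOURCE_CSV = """student_id,major,gpa
1, physics ,3.70
2,HISTORY,
3,Physics,3.10
"""


def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize labels, convert types, drop rows without a GPA."""
    cleaned = []
    for row in rows:
        gpa = row["gpa"].strip()
        if not gpa:
            continue  # assumed cleaning rule; a real job might log or repair instead
        cleaned.append((int(row["student_id"]), row["major"].strip().title(), float(gpa)))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into the reporting table."""
    with conn:
        conn.executemany("INSERT INTO student_gpa VALUES (?, ?, ?)", rows)


warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE student_gpa (student_id INTEGER, major TEXT, gpa REAL)")
load(warehouse, transform(extract(SOURCE_CSV)))

loaded = warehouse.execute("SELECT * FROM student_gpa ORDER BY student_id").fetchall()
print(loaded)
```

Even in this toy job the cleaning rules live inside the code, which hints at why hand ETL gets hard to maintain as sources and rules multiply.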

Where Does the Data Go?
What sort of a database do the ETL processes dump data into?
Typically, into very simple table structures
These table structures are:
- Denormalized
- Minimally branched/hierarchized
- Structured into star schemas

So What Are Star Schemas?
Star schemas are collections of data arranged into star-like patterns
- They have fact tables in the middle, which contain amounts, measures (like counts, dollar amounts, GPAs)
- Dimension tables around the outside, which contain labels and classifications (like names, geocodes, majors)
- For faster processing, aggregate fact tables are sometimes also used (e.g., counts pre-averaged for an entire term)
Star schemas should
- Have descriptive column/field labels
- Be easy for users to understand
- Perform well on queries
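As a rough illustration of the fact/dimension/aggregate split, the sketch below builds a tiny star in SQLite. The table names (`dim_student`, `dim_term`, `fact_term_gpa`, `agg_major_term_gpa`) and the sample rows are assumptions made for this example, not a schema from any real warehouse.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension tables: labels and classifications.
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, student_name TEXT, major TEXT);
    CREATE TABLE dim_term (term_key INTEGER PRIMARY KEY, term_name TEXT, academic_year INTEGER);

    -- Fact table: measures, keyed by the surrounding dimensions.
    CREATE TABLE fact_term_gpa (
        student_key INTEGER REFERENCES dim_student (student_key),
        term_key INTEGER REFERENCES dim_term (term_key),
        term_gpa REAL,
        credits_earned INTEGER
    );

    INSERT INTO dim_student VALUES
        (1, 'Ada', 'Physics'), (2, 'Grace', 'Physics'), (3, 'Linus', 'History');
    INSERT INTO dim_term VALUES (1, 'Fall 2005', 2005), (2, 'Spring 2006', 2005);
    INSERT INTO fact_term_gpa VALUES
        (1, 1, 4.0, 16), (2, 1, 3.0, 12), (3, 1, 2.0, 12),
        (1, 2, 3.0, 16), (2, 2, 3.0, 12), (3, 2, 4.0, 16);

    -- Aggregate fact table: measures pre-averaged per major and term.
    CREATE TABLE agg_major_term_gpa AS
        SELECT s.major, f.term_key, AVG(f.term_gpa) AS avg_gpa, COUNT(*) AS student_count
        FROM fact_term_gpa f
        JOIN dim_student s ON s.student_key = f.student_key
        GROUP BY s.major, f.term_key;
""")

# A typical star query: join the fact table to its dimensions, group by labels.
by_major = db.execute("""
    SELECT s.major, t.term_name, AVG(f.term_gpa)
    FROM fact_term_gpa f
    JOIN dim_student s ON s.student_key = f.student_key
    JOIN dim_term t ON t.term_key = f.term_key
    GROUP BY s.major, t.term_key, t.term_name
    ORDER BY s.major, t.term_key
""").fetchall()
print(by_major)
```

Every query against the star has the same shape (fact table joined outward to its dimensions), which is a large part of what makes it easy for users and tools to work with.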

A Very Simple Star Schema
Data Center UPS Power Output
Dimensions:
- Phase
- Time
- Date
Facts:
- Volts
- Amps
- Etc.

A More Complex Star Schema
Freshman survey data (HERI/CIRP)
Dimensions:
- Questions
- Survey years
- Data about test takers
Facts:
- Answer (text)
- Answer (raw)
- Count (1)
Oops
- Not a star
- Snowflaked!
Oops, answers should have been placed in their own dimension (creating a "factless fact table"). I'll demo a better version of this star later!
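What the corrected design might look like can be sketched as follows. This is not the author's later demo; it is a simplified guess that omits the test-taker dimension, and the table names, question text, and sample rows are hypothetical. The point is that once answers move into their own dimension, the fact table holds only keys, and counts come from counting rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_question (question_key INTEGER PRIMARY KEY, question_text TEXT);
    CREATE TABLE dim_answer (answer_key INTEGER PRIMARY KEY, answer_text TEXT, answer_raw INTEGER);
    CREATE TABLE dim_survey_year (year_key INTEGER PRIMARY KEY, survey_year INTEGER);

    -- Factless fact table: only dimension keys, no measure columns.
    -- Each row records the event "a respondent gave this answer to this question".
    CREATE TABLE fact_response (question_key INTEGER, answer_key INTEGER, year_key INTEGER);

    INSERT INTO dim_question VALUES (1, 'Plan to attend graduate school?');
    INSERT INTO dim_answer VALUES (1, 'Yes', 1), (2, 'No', 0);
    INSERT INTO dim_survey_year VALUES (1, 2005), (2, 2006);
    INSERT INTO fact_response VALUES (1, 1, 1), (1, 2, 1), (1, 1, 2), (1, 1, 2), (1, 2, 2);
""")

# The measure is derived by counting fact rows per combination of labels.
counts = db.execute("""
    SELECT y.survey_year, a.answer_text, COUNT(*)
    FROM fact_response f
    JOIN dim_answer a ON a.answer_key = f.answer_key
    JOIN dim_survey_year y ON y.year_key = f.year_key
    GROUP BY y.survey_year, a.answer_text
    ORDER BY y.survey_year, a.answer_text
""").fetchall()
print(counts)
```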

Data Marts
One definition:
- One or more star schemas that present data on a single or related set of business processes
Data marts should not be built in isolation
They need to be connected via dimensional tables that are
- The same or subsets of each other
- Hierarchized the same way internally
So, e.g., if I construct data marts for...
- GPA trends, student major trends, enrollments
- Freshman survey data, senior survey data, etc.
...I connect these marts via a conformed student dimension
- Makes correlation of data across star schemas intuitive
- Makes it easier for OLAP tools to use the data
- Allows nonspecialists to do much of the work
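Here is a minimal sketch of how a conformed student dimension lets two marts be correlated. Both fact tables point at the same `dim_student` table; all table names, columns, and values are hypothetical. Each star is summarized separately by the shared attribute and the summaries are then lined up, a pattern often called "drilling across".

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One student dimension shared (conformed) across both marts.
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, major TEXT);

    -- Mart 1: GPA star.  Mart 2: freshman survey star.
    CREATE TABLE fact_gpa (student_key INTEGER, cumulative_gpa REAL);
    CREATE TABLE fact_freshman_survey (student_key INTEGER, satisfaction_score INTEGER);

    INSERT INTO dim_student VALUES (1, 'Physics'), (2, 'Physics'), (3, 'History');
    INSERT INTO fact_gpa VALUES (1, 3.5), (2, 2.5), (3, 3.0);
    INSERT INTO fact_freshman_survey VALUES (1, 4), (2, 2), (3, 5);
""")

# Drill across: summarize each star by the shared dimension attribute,
# then line the summaries up on that attribute.
rows = db.execute("""
    WITH gpa AS (
        SELECT s.major, AVG(f.cumulative_gpa) AS avg_gpa
        FROM fact_gpa f JOIN dim_student s ON s.student_key = f.student_key
        GROUP BY s.major
    ),
    survey AS (
        SELECT s.major, AVG(f.satisfaction_score) AS avg_satisfaction
        FROM fact_freshman_survey f JOIN dim_student s ON s.student_key = f.student_key
        GROUP BY s.major
    )
    SELECT gpa.major, gpa.avg_gpa, survey.avg_satisfaction
    FROM gpa JOIN survey ON survey.major = gpa.major
    ORDER BY gpa.major
""").fetchall()
print(rows)
```

If each mart had its own student table with different keys or different major codes, the final join would need a hand-built mapping, which is the kind of work a conformed dimension avoids.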

Simple Data Mart Example
UPS
- Battery star (by battery)
  - Run-time
  - % charged
  - Current
- Input star (by phase)
  - Voltage
  - Current
- Output star (by phase)
  - Voltage
  - Current
- Sensor star (by sensor)
  - Temp
  - Humidity
Note conformed date, time dimensions!

ROLAP, MOLAP
ROLAP = OLAP via direct relational query
- E.g., against a (materialized) view
- Against star schemas in a warehouse
MOLAP = OLAP via multidimensional database (MDB)
- MDB is a special kind of database
- Treats data kind of like a big, fast spreadsheet
- MDBs typically draw data in from a data warehouse
  - Built to work best with star schemas
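The difference between the two styles can be sketched in miniature. In this toy example, the ROLAP side runs a relational query each time a question is asked (against an ordinary SQLite view, since SQLite has no materialized views), while the MOLAP side is imitated by a Python dictionary that precomputes every cell, including "ALL" roll-ups. A real MDB is far more sophisticated; the names `fact_enrollment`, `v_enrollment`, `rolap_total`, and `build_cube` are hypothetical.

```python
import sqlite3
from collections import defaultdict
from itertools import product

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE fact_enrollment (major TEXT, term TEXT, headcount INTEGER);
    INSERT INTO fact_enrollment VALUES
        ('Physics', 'Fall', 40), ('Physics', 'Spring', 35),
        ('History', 'Fall', 60), ('History', 'Spring', 55);
    CREATE VIEW v_enrollment AS SELECT major, term, headcount FROM fact_enrollment;
""")


def rolap_total(conn, major):
    """ROLAP style: answer each question with a relational query at request time."""
    return conn.execute(
        "SELECT SUM(headcount) FROM v_enrollment WHERE major = ?", (major,)
    ).fetchone()[0]


def build_cube(conn):
    """MOLAP style: precompute every (major, term) cell, including 'ALL' roll-ups."""
    cube = defaultdict(int)
    query = "SELECT major, term, headcount FROM v_enrollment"
    for major, term, headcount in conn.execute(query):
        # Each fact row contributes to its own cell and to every roll-up above it.
        for cell in product((major, "ALL"), (term, "ALL")):
            cube[cell] += headcount
    return dict(cube)


cube = build_cube(db)
print(rolap_total(db, "Physics"), cube[("Physics", "ALL")])
print(cube[("ALL", "Fall")], cube[("ALL", "ALL")])
```

The trade-off shown here is the usual one: the cube answers lookups instantly but must be rebuilt when the warehouse changes, while the relational query always sees current data but does the aggregation work on every request.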

Metadata
Metadata = data about data
In a data warehousing context it can mean many things
- Information on data in source OLTP systems
- Information on ETL jobs and what they do to the data
- Information on data in marts/star schemas
- Documentation in OLAP tools on the data they manipulate
Many institutions make metadata available via data malls or warehouse portals, e.g.:
- University of New Mexico
- UC Davis
- Rensselaer Polytechnic Institute
- University of Illinois
Good ETL tools automate the setup of malls/portals!
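One small piece of this, a data dictionary for the marts, can be sketched by reading the database's own catalog. The example assumes SQLite, whose `sqlite_master` table and `PRAGMA table_info` expose table and column definitions; the `DESCRIPTIONS` mapping and the table names are hypothetical stand-ins for business documentation that someone would have to maintain.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_student (student_key INTEGER PRIMARY KEY, major TEXT);
    CREATE TABLE fact_gpa (student_key INTEGER, term_gpa REAL);
""")

# Hypothetical business descriptions, maintained by hand alongside the schema.
DESCRIPTIONS = {
    ("dim_student", "major"): "Declared major as of the latest load",
    ("fact_gpa", "term_gpa"): "Grade point average for a single term",
}


def data_dictionary(conn):
    """Combine technical metadata from the catalog with the descriptions above."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )]
    entries = []
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
        # Table names come from the catalog itself, so quoting them here is safe.
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        for _cid, column, col_type, *_rest in columns:
            description = DESCRIPTIONS.get((table, column), "")
            entries.append((table, column, col_type, description))
    return entries


for entry in data_dictionary(db):
    print(entry)
```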

The Data Warehouse
OK now we're experts in terms like OLTP, OLAP, star schema, metadata, etc.
Let's use some of these terms to describe how a DW works:
- Provides ample metadata: data about the data
- Utilizes easy-to-understand column/field names
- Feeds multidimensional databases (MDBs)
- Is updated via periodic (mainly nightly) ETL jobs
- Presents data in a simplified, denormalized form
- Utilizes star-like fact/dimension table schemas
- Encompasses multiple, smaller data "marts"
- Supports OLAP tools (Access/Excel, Safari, Cognos BI)
- Derives data from (multiple) back-end OLTP systems
- Houses historical data, and can grow very big

A Data Warehouse is Not...
Vendor and consultant proclamations aside, a data warehouse is not:
- A project
  - With a specific end date
- A product you buy from a vendor
  - Like an ODS (such as SCT's)
  - A canned "warehouse" supplied by iStrategy
  - Cognos ReportNet
- A database schema or instance
  - Like Oracle
  - SQL Server
- A cut-down version of your live transactional database

Kimball & Caserta's Definition
According to Ralph Kimball and Joe Caserta, a data warehouse is:
  A system that extracts, cleans, conforms, and delivers source data into a dimensional data store and then supports and implements querying and analysis for the purpose of decision making.
Another def.: The union of all the enterprise's data marts
Aside: The Kimball model is not without some critics:
- E.g., Bill Inmon

Example Data Warehouse (1)
This one is RPI's
5 parts:
- Sources
- ETL stuff
- DW proper
- Cubes etc.
- OLAP apps

Implementing a Data Warehouse
In many organizations IT people want to huddle and work out a warehousing plan, but in fact
- The purpose of a DW is decision support
- The primary audience of a DW is therefore College decision makers
- It is College decision makers therefore who must determine
  - Scope
  - Priority
  - Resources
Decision makers can't make these determinations without an understanding of data warehouses
It is therefore imperative that key decision makers first be educated about data warehouses
- Once this occurs, it is possible to
  - Elicit requirements (a critical step that's often skipped)
  - Determine priorities/scope
  - Formulate a budget
  - Create a plan and timeline, with real milestones and deliverables!

What Takes Up the Most Time?
You may be surprised to learn what DW step takes the most time
Try guessing which:
- Hardware
- Physical database setup
- Database design
- ETL
- OLAP setup
Acc. to Kimball & Caserta, ETL will eat up 70% of the time. Other analysts give estimates ranging from 50% to 80%.
The most often underestimated part of the warehouse project!
[Chart: time taken by each DW step (Hardware, Database, ETL, Schemas, OLAP tools)]

Conclusion
Information is held in transactional systems
- But transactional systems are complex
- They don't talk to each other well; each is a silo
- They require specially trained people to report off of
For normal people to explore institutional data, data in transactional systems needs to be
- Renormalized as star schemas
- Moved to a system optimized for analysis
- Merged into a unified whole in a data warehouse
Note: This process must be led by "customers"
- Yes, IT people must build the infrastructure
- But IT people aren't the main customers
So who are the customers?
- Admissions officers trying to make good admission decisions
- Student counselors trying to find/help students at risk
- Development officers raising funds that support the College
- Alumni affairs people trying to manage volunteers
- Faculty deans trying to right-size departments
- IT people managing software/hardware assets, etc.