DATA WAREHOUSING AND DATA MINING M.Mageshwari,Lecturer M.S.P.V.L Polytechnic College

  • DATA WAREHOUSING AND DATA MINING. M. Mageshwari, Lecturer, M.S.P.V.L Polytechnic College

  • Course Overview. The course: what and how

    0. Introduction; I. Data Warehousing; II. Decision Support and OLAP; III. Data Mining; IV. Looking Ahead

    Demos and Labs

  • 0. Introduction. Data warehousing, OLAP and data mining: what and why (now)? Relation to OLTP. A case study.

    demos, labs

  • A producer wants to know ...

  • Data, data everywhere, yet ...

    I can't find the data I need: data is scattered over the network; many versions, subtle differences.

    I can't get the data I need: need an expert to get the data.

    I can't understand the data I found: available data poorly documented.

    I can't use the data I found: results are unexpected; data needs to be transformed from one form to another.

  • What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context.

  • What are the users saying ... Data should be integrated across the enterprise. Summary data has a real value to the organization. Historical data holds the key to understanding data over time. What-if capabilities are required.

  • What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference.

  • Evolution. 60s: Batch reports: hard to find and analyze information; inflexible and expensive, reprogram for every new request.

    70s: Terminal-based DSS (decision support systems) and EIS (executive information systems): still inflexible, not integrated with desktop tools.

  • Data Warehouse Structure. Time is part of the key of each table.

    base customer (1985-87): custid, from date, to date, name, phone, dob

    base customer (1988-90): custid, from date, to date, name, credit rating, employer

    customer activity (1986-89): monthly summary

    customer activity detail (1987-89): custid, activity date, amount, clerk id, order no

    customer activity detail (1990-91): custid, activity date, amount, line item no, order no

  • Definition of DSS. A decision support system (DSS) is defined as a system that helps decision makers at various levels to take decisions.

    This system uses data, analytical models and user-friendly software for taking decisions.

  • Definition of EIS. An executive information system (EIS) is defined as a system that helps high-level executives to take policy decisions.

    This system uses higher-level data, analytical models and user-friendly software for taking decisions.

  • Evolution. 80s: Desktop data access and analysis tools: query tools, spreadsheets, GUIs; easier to use, but only access operational databases.

    90s: Data warehousing with integrated OLAP (online analytical processing) engines and tools.

  • Data Warehousing: it is a process. A technique for assembling and managing data from various sources for the purpose of answering business questions, thus enabling decisions that were not previously possible. A decision support database maintained separately from the organization's operational database.

  • Characteristics of a Data Warehouse. A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data that is used primarily in organizational decision making.

  • Subject-oriented

    A data warehouse is organized around the major subjects of the organization, such as customer, supplier, product and sales.

    It provides a simple and concise view of a particular subject by excluding data that are not useful to the decision support process.

  • Integrated

    A data warehouse is constructed by integrating multiple sources of data such as relational databases, flat files and on-line transaction records. Data cleaning and data integration techniques are applied to ensure consistency in naming conventions, encoding structures, attributes and so on.

  • Time-variant

    A data warehouse maintains records of both historical and current data, so it can provide information from a historical perspective.

  • Non-volatile

    Once data is loaded into the data warehouse, it is not modified or deleted in normal operation; the warehouse is only loaded with new data and read.

  • Explorers, Farmers and Tourists. Explorers: seek out the unknown and previously unsuspected rewards hiding in the detailed data. Farmers: harvest information from known access paths. Tourists: browse information harvested by farmers.

  • Application-Orientation vs. Subject-Orientation

  • Functioning of data warehousing (diagram): data source, cleaning, transformation, data warehouse, new updates.

  • Collection of data. Data warehousing collects data from various data sources such as relational databases, flat files and on-line records. The collected data are stored in the database inside the warehouse. The type of data collection used depends on the architecture of the warehouse.

  • Integration. Each data source uses a different schema. The data warehouse takes data from these sources and converts it into a common integrated schema.

  • Star Schema. A single fact table and, for each dimension, one dimension table. Does not capture hierarchies directly.

    Dimension tables: time, prod, cust, city. Fact table: date, custno, prodno, cityname, ...
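A minimal sketch of the star schema above, using Python's built-in `sqlite3` module. The table names follow the slide (time, prod, cust, city, fact); the `_dim` suffixes, the extra columns (`month`, `year`, `prodname`, `custname`, `region`, `amount`) and all sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One table per dimension (column details are hypothetical).
CREATE TABLE time_dim (date TEXT PRIMARY KEY, month TEXT, year INTEGER);
CREATE TABLE prod_dim (prodno INTEGER PRIMARY KEY, prodname TEXT);
CREATE TABLE cust_dim (custno INTEGER PRIMARY KEY, custname TEXT);
-- region is a plain column of city_dim: the city -> region hierarchy
-- is not modelled as a separate table in a star schema.
CREATE TABLE city_dim (cityname TEXT PRIMARY KEY, region TEXT);

-- A single fact table whose key columns point at the dimensions.
CREATE TABLE fact (
    date     TEXT    REFERENCES time_dim(date),
    custno   INTEGER REFERENCES cust_dim(custno),
    prodno   INTEGER REFERENCES prod_dim(prodno),
    cityname TEXT    REFERENCES city_dim(cityname),
    amount   REAL
);

INSERT INTO time_dim VALUES ('2024-01-05', 'Jan', 2024), ('2024-02-10', 'Feb', 2024);
INSERT INTO prod_dim VALUES (1, 'Juice'), (2, 'Cola');
INSERT INTO cust_dim VALUES (10, 'Asha'), (11, 'Ravi');
INSERT INTO city_dim VALUES ('Chennai', 'South'), ('Madurai', 'South'), ('Delhi', 'North');
INSERT INTO fact VALUES
    ('2024-01-05', 10, 1, 'Chennai', 100.0),
    ('2024-01-05', 11, 2, 'Madurai', 50.0),
    ('2024-02-10', 10, 2, 'Delhi', 25.0);
""")

# A typical star join: the fact table joined to one dimension table.
star_totals = conn.execute("""
    SELECT c.region, SUM(f.amount)
    FROM fact AS f
    JOIN city_dim AS c ON f.cityname = c.cityname
    GROUP BY c.region
    ORDER BY c.region
""").fetchall()
print(star_totals)
```

Every analytical query follows the same shape: start from the fact table and join only the dimension tables the question needs.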

  • Snowflake Schema. Represents the dimensional hierarchy directly by normalizing tables. Easy to maintain and saves storage.

    Dimension tables: time, prod, cust, city, plus region. Fact table: date, custno, prodno, cityname, ...
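As a sketch of what normalizing a dimension can look like, the example below moves region into its own table so that the city to region hierarchy is represented directly. It assumes the region table hangs off the city dimension; the `region_id` key, the `amount` column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The hierarchy city -> region is stored as two normalized tables.
CREATE TABLE region_dim (region_id INTEGER PRIMARY KEY, regionname TEXT);
CREATE TABLE city_dim (
    cityname  TEXT PRIMARY KEY,
    region_id INTEGER REFERENCES region_dim(region_id)
);
CREATE TABLE fact (
    date TEXT, custno INTEGER, prodno INTEGER,
    cityname TEXT REFERENCES city_dim(cityname),
    amount REAL
);

INSERT INTO region_dim VALUES (1, 'South'), (2, 'North');
INSERT INTO city_dim VALUES ('Chennai', 1), ('Madurai', 1), ('Delhi', 2);
INSERT INTO fact VALUES
    ('2024-01-05', 10, 1, 'Chennai', 100.0),
    ('2024-01-05', 11, 2, 'Madurai', 50.0),
    ('2024-02-10', 10, 2, 'Delhi', 25.0);
""")

# Reaching the region level now costs one extra join,
# but each region name is stored only once.
snowflake_totals = conn.execute("""
    SELECT r.regionname, SUM(f.amount)
    FROM fact AS f
    JOIN city_dim AS c ON f.cityname = c.cityname
    JOIN region_dim AS r ON c.region_id = r.region_id
    GROUP BY r.regionname
    ORDER BY r.regionname
""").fetchall()
print(snowflake_totals)
```

The trade-off is visible in the query: less repeated data in the dimension tables, one more join per hierarchy level.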

  • Data transformation and cleaning. The task of correcting and preparing the data is called data cleaning.

    When a data source delivers data into the database of the data warehouse, the data should be corrected first.

  • Update of data. Updates to tables at the data sources must be sent to the data warehouse.

    If the tables in the data warehouse are the same as those at the sources, the update is easy.

  • Summarizing data. The raw data generated by transactions may be too large to store online. Therefore, summaries of the transactions can be stored for easy querying.
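One way to picture summarizing is to collapse individual transactions into one row per customer and month. The sketch below assumes transactions are simple `(custid, date, amount)` tuples with ISO dates; the sample values are made up.

```python
from collections import defaultdict

# Hypothetical raw transactions: (custid, ISO date, amount).
transactions = [
    ("C1", "2024-01-03", 40.0),
    ("C1", "2024-01-17", 60.0),
    ("C1", "2024-02-02", 10.0),
    ("C2", "2024-01-09", 25.0),
]

def monthly_summary(rows):
    """Collapse raw transactions into one summary row per (custid, month)."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for custid, date, amount in rows:
        month = date[:7]  # "YYYY-MM" prefix of the ISO date
        entry = totals[(custid, month)]
        entry["count"] += 1
        entry["amount"] += amount
    return dict(totals)

summary = monthly_summary(transactions)
for key in sorted(summary):
    print(key, summary[key])
```

Queries about monthly activity can then read the small summary instead of scanning every transaction.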

  • Data Warehouse for Decision Support & OLAP. Putting information technology to work to help the knowledge worker make faster and better decisions. Which of my customers are most likely to go to the competition? What product promotions have the biggest impact on revenue? How did the share price of software companies correlate with profits over the last 10 years?

  • Decision Support. Used to manage and control the business. Data is historical or point-in-time. Optimized for inquiry rather than update. Use of the system is loosely defined and can be ad hoc. Used by managers and end users to understand the business and make judgments.

  • OLAP (Online Analytical Processing). A data warehouse stores data, but OLAP transforms the warehouse data into specific, meaningful information. OLAP therefore provides a user-friendly environment for interactive data analysis.

  • OLAP (diagram): data warehouse, OLAP server, front-end tool and user, with arrows labelled request, SQL, result set and result.

  • OLAP operations on multidimensional data: roll-up (group); drill-down (less aggregation, more detail); slice and dice (piece); pivot (rotate).
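The four operations can be sketched on a tiny cube held in a Python dictionary. The representation (cells keyed by a `(product, city, month)` tuple), the function names and the sample values are assumptions for illustration, not how any particular OLAP server stores data.

```python
from collections import defaultdict

DIMS = ("product", "city", "month")

# Hypothetical cube: one sales value per (product, city, month) cell.
cells = {
    ("Juice", "NY", "Jan"): 10,
    ("Juice", "LA", "Jan"): 20,
    ("Cola", "NY", "Jan"): 30,
    ("Cola", "NY", "Feb"): 40,
}

def roll_up(cube, keep):
    """Group: aggregate away every dimension not listed in keep."""
    idx = [DIMS.index(d) for d in keep]
    out = defaultdict(int)
    for key, value in cube.items():
        out[tuple(key[i] for i in idx)] += value
    return dict(out)

def slice_cube(cube, dim, value):
    """Slice: fix one dimension to a single value and drop that dimension."""
    i = DIMS.index(dim)
    return {k[:i] + k[i + 1:]: v for k, v in cube.items() if k[i] == value}

def dice(cube, **allowed):
    """Dice: keep the sub-cube whose coordinates lie in the allowed sets."""
    return {
        k: v for k, v in cube.items()
        if all(k[DIMS.index(d)] in vals for d, vals in allowed.items())
    }

def pivot(table):
    """Rotate: swap the two axes of a 2-D table keyed by (row, column)."""
    return {(c, r): v for (r, c), v in table.items()}

by_product = roll_up(cells, ["product"])                # coarse view
by_product_city = roll_up(cells, ["product", "city"])   # drill-down: one more dimension
january = slice_cube(cells, "month", "Jan")
cola_cells = dice(cells, product={"Cola"}, month={"Jan", "Feb"})
january_rotated = pivot(january)
print(by_product, by_product_city, january, cola_cells, january_rotated, sep="\n")
```

Drill-down needs no function of its own here: it is the move from `by_product` back to the finer `by_product_city` view.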

  • Types of OLAP: MOLAP (Multidimensional OLAP); ROLAP (Relational OLAP).

  • Multi-dimensional Data. "Hey ... I sold $100M worth of goods." Dimensions: Product, Region, Time. Hierarchical summarization paths:

    Product: Industry, Category, Product

    Region: Country, Region, City, Office

    Time: Year, Quarter, Month or Week, Day

  • Data Warehouse Architecture

  • Architecture of data warehousing (diagram): external data, data acquisition, data manager, warehouse data, data dictionary, information directory, middleware, design, management, data access.

  • Architecture of *

  • Design Component. The data warehouse designer designs the database of the data warehouse, and the warehouse administrator manages the data warehouse. The designer and administrator use the design component to design and store data.

  • Types of design. Bottom-up design: business value can be returned as soon as the first data marts are created. Top-down design: atomic data, that is, data at the lowest level of detail, are stored in the data warehouse. Hybrid design.

  • Hybrid design. Hybrid methodologies have evolved to take advantage of the fast turn-around time of bottom-up design and the enterprise-wide data consistency of top-down design.

  • Data Manager Component. The database in the data warehouse uses the data manager component for managing and accessing the data stored in the data warehouse. Types: RDBMS; MDBMS.

  • Management Component. Administering data acquisition operations. Managing backup copies of the data. Recovering lost data. Providing security for the data stored in the data warehouse. Authorizing access to the data stored in the data warehouse.

  • Data Acquisition Component. This component acquires data from various sources by using data acquisition applications. The data acquisition applications are based on rules that are defined by the data warehouse developers.

  • Operations performed during data clean-up: restructuring the records and fields of the database tables; removing irrelevant and redundant data; obtaining and adding missing data; verifying the integrity and consistency of the data.

  • Operations performed on the data for enhancement: decoding and translating the values in fields; summarizing data; calculating derived values.

  • Information Directory Component. This component helps end users to know the details of the data stored in the data warehouse. This is done with the help of data about the data, called metadata. Types of metadata: technical data; business data.

  • Middleware Component. This component connects to the local databases. An analytical server is used to analyze multidimensional data. Intelligent data warehousing middleware controls access to the warehouse database.

  • Data mart. A data mart is a database that contains the data needed by a small group of users for their own department's needs. Types: dependent data mart; independent data mart.

  • Difference between data warehouse and data mart

    Data warehouse: supports the entire information requirement of an organization. Data mart: supports the information requirement of a department in an organization.

    Data warehouse: large data model, wider implementation, large data and more users. Data mart: small data model, shorter implementation, less data and fewer users.

    A data mart is therefore useful for small organizations with very few departments, while data warehousing is suitable for supporting an entire corporate environment. If you listen to some vendors, you may be left thinking that building data warehouses is a waste of time. Data mart vendors that tell you this are looking out for their own best interests.

  • Advantages of data marts. Since each department has its own data mart, the department can summarize, sort, select and structure its own data without it being confused with any other department's data. The department can do whatever DSS processing it wants. The processing cost and storage are less than for the data warehouse. The department can select software for its data mart that is powerful enough to fit its needs.

  • Data warehousing life cycle (diagram): design, prototype, deploy, operate, enhance.

  • Data Modeling (Multi-dimensional Database). "Hey ... I sold $100M worth of goods." Dimensions: Product, Region, Period. Hierarchical summarization paths:

    Product: Industry, Category, Product

    Region: Country, Region, City, Office

    Period: Year, Quarter, Month or Week, Day

  • Building a data warehouse. The builder must forecast how users will use the warehouse. The design should support accessing data with any meaningful values of the attributes. To build a good data warehouse, the data acquisition process must follow these steps:

    Extract the data from multiple heterogeneous sources.

    Format the data for consistency within the warehouse.

    Clean the data to ensure validity.

    Convert the data from a relational, object-oriented or hierarchical model to a multidimensional model.

    Load the data into the warehouse. Good monitoring tools are necessary to recover from an incorrect load.
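The acquisition steps listed for building a warehouse (extract, format, clean, convert, load) can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: the two in-memory sources, their field names, the validity rules and the dictionary used as the "warehouse".

```python
import csv
import io

# Two hypothetical heterogeneous sources with different field names.
CSV_SOURCE = "custid,city,amount\n1,chennai,100\n2,,50\n3,Delhi,abc\n"
APP_SOURCE = [{"customer": 4, "town": "Delhi", "amt": "75"}]
FIELD_MAP = {
    "csv": {"custid": "custid", "city": "city", "amount": "amount"},
    "app": {"custid": "customer", "city": "town", "amount": "amt"},
}

def extract():
    """Step 1: read every source, tagging each row with its origin."""
    csv_rows = csv.DictReader(io.StringIO(CSV_SOURCE))
    return [("csv", r) for r in csv_rows] + [("app", r) for r in APP_SOURCE]

def format_row(source, row):
    """Step 2: one field naming and one spelling convention for all sources."""
    names = FIELD_MAP[source]
    return {
        "custid": int(row[names["custid"]]),
        "city": str(row[names["city"]]).strip().title(),
        "amount": str(row[names["amount"]]).strip(),
    }

def is_valid(row):
    """Step 3: reject rows with a missing city or a non-numeric amount."""
    try:
        float(row["amount"])
    except ValueError:
        return False
    return bool(row["city"])

def to_cells(rows):
    """Step 4: convert flat records into cells keyed by (custid, city)."""
    cells = {}
    for row in rows:
        key = (row["custid"], row["city"])
        cells[key] = cells.get(key, 0.0) + float(row["amount"])
    return cells

def load(warehouse, cells):
    """Step 5: load the cells and report how many were written."""
    warehouse.update(cells)
    return len(cells)

formatted = [format_row(source, row) for source, row in extract()]
valid = [row for row in formatted if is_valid(row)]
rejected = len(formatted) - len(valid)  # a simple number to monitor per load
warehouse = {}
loaded = load(warehouse, to_cells(valid))
print(f"loaded={loaded} rejected={rejected}")
```

Counting rejected rows on every run is a crude stand-in for the monitoring tools the last step calls for.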

  • Data warehouse and views. A data warehouse is permanent storage of data in multidimensional tables. Views are created temporarily, when needed, from the data warehouse. They are used for decision support systems.

  • Difference between data warehouse and views

    Data warehouse: permanent storage of data. Views: created from warehouse data when needed and not permanent.

    Data warehouse: multidimensional. Views: relational.

    Data warehouse: can be indexed to maximize performance. Views: cannot be indexed.

    Data warehouse: provides specific support for a functionality. Views: cannot give specific support for a functionality.

    Data warehouse: provides a large amount of data. Views: created by extracting the minimum data from the data warehouse.

  • Data warehouse future. New techniques must be introduced in data cleaning, indexing and partitioning. The manual operations involved in data acquisition, management, data quality and performance maximization must be automated. Proper business rules must be developed and incorporated into the warehouse creation and maintenance process.

  • Data Mining. Data mining is sorting through data to identify patterns and establish relationships.

  • Data Mining (cont.)

  • Data Mining works with Warehouse Data. Data warehousing provides the enterprise with a memory.

    Data mining provides the enterprise with intelligence.

  • Data Mining Motivation. "The key in business is to know something that nobody else knows." Aristotle Onassis

    "To understand is to perceive patterns." Sir Isaiah Berlin

  • Application Areas (industry: application). Finance: credit card analysis. Insurance: claims, fraud analysis. Telecommunication: call record analysis. Consumer goods: promotion analysis. Data service providers: value-added data. Utilities: power usage analysis.

  • Data Mining in Use. The US Government uses data mining to track fraud. A supermarket becomes an information broker. Basketball teams use it to track game strategy. Cross-selling. Warranty claims routing. Holding on to good customers. Weeding out bad customers.

  • What is data mining technology? The process of extracting or finding hidden knowledge in a large database is called data mining.

    Example of turning data into information: from the data "age 21" we can infer that the person is a major (an adult).

  • Data Mining Technology (diagram): databases, data warehouse and flat files; cleaning and integration; selection and transformation; data mining; patterns; knowledge.

  • The various steps. Data cleaning: remove noise and inconsistent data. Data integration: data from multiple sources are combined. Data selection: relevant data are retrieved from the database for analysis.

  • Data transformation: the selected data are prepared for mining by performing aggregation operations. Data mining: intelligent methods are applied to extract data patterns. Pattern evaluation: identify the needed patterns. Knowledge presentation: present the mined knowledge to the user.
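The seven steps above can be lined up as one small pipeline. In this sketch the "intelligent method" is deliberately simple (counting item pairs that are bought together), and the two store sources, the baskets and the support threshold are all made up for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase baskets from two sources, with some noise.
store_a = [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", None]]
store_b = [["Milk", "eggs"], ["milk", "bread"]]

def clean(baskets):
    """Data cleaning: drop missing items and normalise spelling."""
    return [[item.lower() for item in basket if item] for basket in baskets]

def integrate(*sources):
    """Data integration: combine the baskets of all sources."""
    return [basket for source in sources for basket in source]

def select(baskets, min_items=2):
    """Data selection: only baskets with two or more items can hold a pair."""
    return [basket for basket in baskets if len(basket) >= min_items]

def transform(baskets):
    """Data transformation: sorted, de-duplicated baskets give canonical pairs."""
    return [sorted(set(basket)) for basket in baskets]

def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(basket, 2))
    return counts

def evaluate(counts, min_support):
    """Pattern evaluation: keep only pairs seen at least min_support times."""
    return {pair: n for pair, n in counts.items() if n >= min_support}

def present(patterns):
    """Knowledge presentation: readable sentences for the user."""
    return [f"{a} + {b}: bought together {n} times"
            for (a, b), n in sorted(patterns.items())]

baskets = transform(select(integrate(clean(store_a), clean(store_b))))
patterns = evaluate(mine(baskets), min_support=2)
report = present(patterns)
print("\n".join(report))
```

Real mining methods are far richer, but the order of the stages stays the same.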

  • Loading the Warehouse. Cleaning the data before it is loaded.

  • Data Integration Across Sources (diagram). Sources: trust, credit card, savings, loans. Problems: same data, different name; different data, same name; data found here, nowhere else; different keys, same data.

  • Data Transformation Example (applications A to D feeding the data warehouse)

    Field: appl A - balance; appl B - bal; appl C - currbal; appl D - balcurr

    Unit: appl A - pipeline - cm; appl B - pipeline - in; appl C - pipeline - feet; appl D - pipeline - yds

    Encoding: appl A - m,f; appl B - 1,0; appl C - x,y; appl D - male, female
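The transformation example can be sketched as three lookup tables, one per kind of mismatch (field name, unit, encoding). The warehouse conventions chosen here (a `balance` field, centimetres, `m`/`f`) are assumptions, as is the reading that `1` and `x` stand for male; the slide text does not state either.

```python
# Field: each application stores the balance under a different name.
FIELD_NAMES = {"A": "balance", "B": "bal", "C": "currbal", "D": "balcurr"}

# Unit: factor that converts each application's pipeline length to cm.
TO_CM = {"A": 1.0, "B": 2.54, "C": 30.48, "D": 91.44}  # cm, in, feet, yds

# Encoding: each application's gender codes mapped to m/f.
# Which code means male for applications B and C is assumed.
GENDER = {
    "A": {"m": "m", "f": "f"},
    "B": {1: "m", 0: "f"},
    "C": {"x": "m", "y": "f"},
    "D": {"male": "m", "female": "f"},
}

def to_warehouse(app, record):
    """Translate one application record into the common warehouse form."""
    return {
        "balance": record[FIELD_NAMES[app]],
        "pipeline_cm": record["pipeline"] * TO_CM[app],
        "gender": GENDER[app][record["gender"]],
    }

row_b = to_warehouse("B", {"bal": 10.0, "pipeline": 2, "gender": 1})
row_d = to_warehouse("D", {"balcurr": 7.5, "pipeline": 1, "gender": "female"})
print(row_b)
print(row_d)
```

Keeping the mappings in data rather than in code means a fifth application only needs three new table entries.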

  • Structuring/Modeling Issues

  • Data Warehouse vs. Data Marts

  • From the Data Warehouse to Data Marts

  • Data Warehouse and Data Marts (diagram): OLAP data mart; lightly summarized; departmentally structured; organizationally structured; atomic; detailed data warehouse data.

  • Characteristics of the Departmental Data Mart: OLAP; small; flexible; customized by department; source is the departmentally structured data warehouse.

  • Techniques for Creating Departmental Data Marts: OLAP; subset; summarized; superset; indexed; arrayed. Departments shown: sales, mktg., finance.

  • Data Mart Centric (diagram): data sources, data marts, data warehouse.

  • True Warehouse (diagram): data sources, data warehouse, data marts.

  • II. On-Line Analytical Processing (OLAP)Making Decision Support Possible

  • What Is OLAP? Online Analytical Processing: coined by E. F. Codd in a 1994 paper contracted by Arbor Software. Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System. OLAP = multidimensional database. MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express). ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent).

  • The OLAP Market. Rapid growth in the enterprise market: 1995: $700 million; 1997: $2.1 billion. Significant consolidation activity among major DBMS vendors: 10/94: Sybase acquires ExpressWay; 7/95: Oracle acquires Express; 11/95: Informix acquires Metacube; 1/97: Arbor partners up with IBM; 10/96: Microsoft acquires Panorama. Result: OLAP shifted from a small vertical niche to a mainstream DBMS category.

  • Strengths of OLAP. It is a powerful visualization paradigm. It provides fast, interactive response times. It is good for analyzing time series. It can be useful for finding some clusters and outliers. Many vendors offer OLAP tools.

  • OLAP Is FASMI: Fast, Analysis, Shared, Multidimensional, Information.

  • Data Cube Lattice. Cube lattice: ABC; AB, AC, BC; A, B, C; none. Can materialize some group-bys and compute others on demand. Question: which group-bys to materialize? Question: what indices to create? Question: how to organize data (chunks, etc.)?
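The lattice for three dimensions A, B and C can be enumerated with `itertools.combinations`, and a coarser group-by can be computed on demand from any finer one that has been materialized. The sample cells below are made up, and the helper names are hypothetical.

```python
from itertools import combinations

def cube_lattice(dims):
    """List every group-by of the cube, from all dimensions down to none."""
    return [combo
            for size in range(len(dims), -1, -1)
            for combo in combinations(dims, size)]

def group_by(cells, dims, keep):
    """Aggregate cells keyed by dims down to the dimensions in keep."""
    idx = [dims.index(d) for d in keep]
    out = {}
    for key, value in cells.items():
        sub = tuple(key[i] for i in idx)
        out[sub] = out.get(sub, 0) + value
    return out

lattice = cube_lattice(("A", "B", "C"))

# Hypothetical base data at the finest level, ABC.
abc = {
    ("a1", "b1", "c1"): 1,
    ("a1", "b1", "c2"): 2,
    ("a1", "b2", "c1"): 3,
    ("a2", "b1", "c1"): 4,
}

# Materialize AB once, then answer A on demand from the smaller AB table.
ab = group_by(abc, ("A", "B", "C"), ("A", "B"))
a_from_ab = group_by(ab, ("A", "B"), ("A",))
a_from_abc = group_by(abc, ("A", "B", "C"), ("A",))
print(lattice)
print(ab, a_from_ab)
```

Because sums can be re-aggregated, `a_from_ab` equals `a_from_abc`; this is what makes the choice of which group-bys to materialize a cost question rather than a correctness question.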

  • Visualizing Neighbors is simpler (figure: sales shown as a grid of Store 1-8 by Month, Apr through Mar, next to the same data as a flat table with columns Month, Store and Sales).

  • A Visual Operation: Pivot (Rotate) (figure: a cube of sales values such as 10, 47, 30 and 12, with dimensions Product (Juice, Cola, Milk, Cream), Region (NY, LA, SF) and Date (3/1, 3/2, 3/3, 3/4; Month)).

  • Slicing and Dicing (figure). Product: Household, Telecomm, Video, Audio. Sales channel: Retail, Direct, Special. Regions: India, Far East, Europe. Highlighted: the Telecomm slice.

  • Roll-up and Drill Down

    Sales channel; Region; Country; State; Location address; Sales representative

  • Nature of OLAP Analysis. Aggregation (total sales, percent-to-total). Comparison (budget vs. expenses). Ranking (top 10, quartile analysis). Access to detailed and aggregate data. Complex criteria specification. Visualization.
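Three of the analysis types named above (aggregation as percent-to-total, comparison of budget against actuals, and ranking) reduce to a few lines of arithmetic. The figures and function names below are invented for illustration.

```python
# Hypothetical sales, budget and actual-expense figures per product.
sales = {"Juice": 50.0, "Cola": 30.0, "Milk": 20.0}
budget = {"Juice": 40.0, "Cola": 35.0, "Milk": 20.0}
expenses = {"Juice": 45.0, "Cola": 30.0, "Milk": 20.0}

def percent_to_total(values):
    """Aggregation: each member's share of the grand total, in percent."""
    total = sum(values.values())
    return {name: 100.0 * value / total for name, value in values.items()}

def compare(planned, actual):
    """Comparison: actual minus planned, per member."""
    return {name: actual[name] - planned[name] for name in planned}

def top_n(values, n):
    """Ranking: the n largest members, biggest first."""
    return sorted(values.items(), key=lambda item: item[1], reverse=True)[:n]

shares = percent_to_total(sales)
over_budget = compare(budget, expenses)
leaders = top_n(sales, 2)
print(shares, over_budget, leaders, sep="\n")
```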

  • Organizationally Structured Data. Different departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data. Departments: marketing, manufacturing, sales, finance.

  • Multidimensional Spreadsheets. Analysts need spreadsheets that support: pivot tables (cross-tabs); drill-down and roll-up; slice and dice; sort; selections; derived attributes. Popular in the retail domain.

  • OLAP Operations: single cell; multiple cells; slice; dice; roll up; drill down.

  • Relational OLAP: 3 Tier DSS. Store atomic data in an industry-standard RDBMS. Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality. Obtain multi-dimensional reports from the DSS client.
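A rough sketch of the ROLAP idea: the engine turns a multidimensional request (which dimensions to group by, which measure to sum) into SQL that runs against atomic data in a relational database. The `sales` table, its columns and the `build_query` helper are hypothetical; `sqlite3` stands in for the industry-standard RDBMS.

```python
import sqlite3

DIMENSIONS = {"product", "city", "month"}
MEASURES = {"amount"}

def build_query(group_by, measure):
    """Translate a roll-up request into a GROUP BY statement.

    Names are checked against a fixed whitelist because identifiers
    cannot be passed as bound SQL parameters.
    """
    if not group_by or set(group_by) - DIMENSIONS or measure not in MEASURES:
        raise ValueError("unknown dimension or measure")
    cols = ", ".join(group_by)
    return (f"SELECT {cols}, SUM({measure}) FROM sales "
            f"GROUP BY {cols} ORDER BY {cols}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, city TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [("Juice", "NY", "Jan", 10), ("Juice", "LA", "Jan", 20), ("Cola", "NY", "Feb", 30)],
)

sql = build_query(["product"], "amount")
by_product = conn.execute(sql).fetchall()
by_city_month = conn.execute(build_query(["city", "month"], "amount")).fetchall()
print(sql)
print(by_product, by_city_month)
```

The client only ever names dimensions and measures; all SQL stays inside the middle tier.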

  • MD-OLAP: 2 Tier DSS. Layers: database layer and application logic layer (MDDB engine); presentation layer (decision support client). Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, and obtain OLAP functionality via proprietary algorithms running against this data. Obtain multi-dimensional reports from the DSS client.

  • MSPVL Polytechnic College, Pavoorchatram