~ Arvind Pandi Dorai, Lecturer, Computer Dept., KJSIEIT

Data Warehousing & Mining


Description: I've covered the basics of data warehousing and mining in this PPT.


  • ~ Arvind Pandi Dorai, Lecturer, Computer Dept., KJSIEIT

  • Chapter 1: Introduction
    Need of a Data Warehouse: In the 1960s, computer systems were used to maintain business data.

    As enterprises grew larger, hundreds of computer applications were needed to support business processes.

    In the 1990s, as businesses grew more complex, corporations spread globally & competition intensified, business executives became desperate for information to stay competitive & improve the bottom line.

    Companies need information to formulate business strategies, establish goals, set objectives & monitor results.

  • Data Warehouse
    Definition: A data warehouse is a relational DB that maintains huge volumes of historical data so as to support strategic analysis & decision making.

    To take a strategic decision we need strong analysis, & for strong analysis we need historical data. Since ERP systems do not maintain historical data, the DW came into the picture.

  • Data Warehouse Features

    Subject-oriented - subject-specific data marts.

    Integrated - data integrated into a single uniform format.

    Time-variant - the DW maintains data over a wide range of time.

    Non-volatile - data is never deleted & rarely updated.

  • Data Warehouse Objects
    Dimension Tables: dimension table key (primary key), wide, textual attributes, denormalised, support drill-down & roll-up, multiple hierarchies.

    Fact Tables: foreign keys to the dimension tables, deep, numeric facts, transaction-level data, aggregate data.

  • Star Schema
    A large, central fact table and one table for each dimension. Every fact points to one tuple in each of the dimensions and has additional attributes. Does not capture hierarchies directly. A de-normalized system. Easy to understand, easy to define hierarchies, reduces the no. of joins.

  • Star Schema layout

  • Star Schema Example
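    A minimal sketch of a comparable star schema in SQLite (via Python's sqlite3); the table and column names follow the Sales / Store / Product / Period tables used in the fact-constellation slides further on, and everything else (column types, the sample roll-up query) is an illustrative assumption:

      import sqlite3

      # A star schema: one central fact table whose foreign keys point to
      # de-normalized dimension tables.
      conn = sqlite3.connect(":memory:")
      conn.executescript("""
      CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name TEXT, city TEXT, state TEXT, region TEXT);
      CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, product_desc TEXT);
      CREATE TABLE period_dim  (period_key  INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER, month INTEGER);
      CREATE TABLE sales_fact (
          store_key   INTEGER REFERENCES store_dim(store_key),
          product_key INTEGER REFERENCES product_dim(product_key),
          period_key  INTEGER REFERENCES period_dim(period_key),
          units       INTEGER,
          price       REAL
      );
      """)

      # A typical star join: roll sales up by region and year.
      query = """
      SELECT s.region, p.year, SUM(f.units * f.price) AS revenue
      FROM sales_fact f
      JOIN store_dim  s ON f.store_key  = s.store_key
      JOIN period_dim p ON f.period_key = p.period_key
      GROUP BY s.region, p.year;
      """
      print(conn.execute(query).fetchall())   # [] until fact rows are loaded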

  • Snowflake Schema
    A variant of the star schema model: a single large, central fact table and one or more tables for each dimension. Dimension tables are normalized, i.e. the dimension table data is split into additional tables. The process of making a snowflake schema is called snowflaking. Drawbacks: time-consuming joins, slow report generation.

  • Snowflake Schema Layout

  • Fact Constellation
    Multiple fact tables share dimension tables. This schema is viewed as a collection of stars, hence it is also called a galaxy schema, fact constellation or family of stars. Sophisticated applications require such a schema.

  • Fact Constellation example
    Sales Fact Table: Store Key, Product Key, Period Key, Units, Price

    Shipping Fact Table: Shipper Key, Store Key, Product Key, Period Key, Units, Price

    Store Dimension: Store Key, Store Name, City, State, Region

    Product Dimension: Product Key, Product Desc

  • Chapter 2: Metadata
    Metadata: data about data.
    Types of metadata: Operational Metadata, Extraction & Transformation Metadata, End-User Metadata.

  • Information Package
    An Information Package (IP) gives special significance to the dimension hierarchies in the business dimensions & the key facts in the fact table.

  • Chapter 3: DW Architecture

  • DW Architecture
    Data Acquisition: Data Extraction, Data Transformation, Data Staging.

    Data Storage: Data Loading, Data Aggregation.

    Information Delivery: Reports, OLAP, Data Mining.

  • Data Acquisition
    Data Extraction: immediate data extraction, deferred data extraction.

    Data Transformation: splitting of cells, merging of cells, decoding of fields, de-duplication, date-time format conversion, computed or derived fields.

    Data Staging.

  • Data Storage
    Data Loading: initial loading, incremental loading.

    Data Aggregation: based on fact tables, based on aggregate tables.

  • Information Delivery
    Reports: aggregate data.

    OLAP: multidimensional analysis.

    Data Mining: extracting knowledge from the database.

  • Chapter 4: Principles of Dimensional Modeling
    Dimensional Modeling: a logical design technique to structure (arrange) the business dimensions & the fact tables. DM is a technique to prepare a star schema. Provides the best data access. The fact table interacts with each & every business dimension. Supports drill-down & roll-up.

  • Fully Additive Facts: when the values of an attribute can be summed up by simple addition to provide meaningful data, they are known as fully additive facts (e.g. units sold).
    Semi-Additive Facts: when simply summing up the values of an attribute does not provide meaningful data, but performing some other mathematical operation on them does, they are known as semi-additive facts.
    Factless Fact Table: a fact table in which numeric facts are absent.

  • Chapter 5: Information Access & Delivery

    OLAP is a technique that allows the user to view aggregate data across measurements along with a set of related dimensions. OLAP supports multidimensional analysis because the data is stored in a multidimensional array.

  • OLAP Operations (see the diagrams & the pandas sketch below)
    Slice: filtering the OLAP cube by fixing one attribute value.
    Dice: selecting a sub-cube by viewing two (or more) attributes.
    Drill-down: detailing or expanding attribute values.
    Roll-up: aggregating or compressing attribute values.
    Rotate (pivot): rotating the cube to view different dimensions.

  • OLAP Operations: Slice and Dice (cube diagram with dimensions Time, Region, Product; the slice Product = iPod leaves a Time × Region plane)

  • OLAP Operations: Drill-Down (cube with dimensions Time, Region, Product; drilling down the product hierarchy: Category e.g. Music Player → Sub-Category e.g. MP3 → Product e.g. iPod)

  • OLAP Operations: Roll-Up (the same hierarchy traversed upward: Product e.g. iPod → Sub-Category e.g. MP3 → Category e.g. Music Player)

  • OLAP Operations: Pivot (the Time × Region × Product cube rotated to Region × Time × Product)
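    The operations above can be mimicked on a flat table with pandas; a minimal sketch, assuming a hypothetical sales DataFrame with Time, Region, Product and Units columns (the values are invented for illustration):

      import pandas as pd

      # Hypothetical sales data; column names and values are illustrative.
      sales = pd.DataFrame({
          "Time":    ["Q1", "Q1", "Q2", "Q2"],
          "Region":  ["East", "West", "East", "West"],
          "Product": ["iPod", "iPod", "MP3-X", "iPod"],
          "Units":   [10, 7, 5, 12],
      })

      # Slice: fix one value of the Product dimension.
      ipod_slice = sales[sales["Product"] == "iPod"]

      # Dice: select values on two dimensions.
      dice = sales[(sales["Product"] == "iPod") & (sales["Region"] == "East")]

      # Roll-up: aggregate Units up to Region level (dropping Time detail).
      rollup = sales.groupby("Region")["Units"].sum()

      # Pivot (rotate): view Time x Region instead of Region x Time.
      pivot = sales.pivot_table(index="Time", columns="Region",
                                values="Units", aggfunc="sum")

      print(ipod_slice, dice, rollup, pivot, sep="\n\n")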

  • OLAP Server
    An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate on multi-dimensional data structures. The OLAP servers available are the MOLAP server, the ROLAP server & the HOLAP server.

  • Chapter 6: Implementation & Maintenance
    Implementation:
    Monitoring: sending data from sources.
    Integrating: loading, cleansing, ...
    Processing: query processing, indexing, ...
    Managing: metadata, design, ...

  • Maintenance
    Maintenance is an issue for materialized views: recomputation vs. incremental updating.

  • Views and Materialized Views
    View: a derived relation defined in terms of base (stored) relations.
    Materialized view: a view can be materialized by storing the tuples of the view in the database. Index structures can be built on the materialized view.

  • Overview
    Extracting knowledge, performing analysis, using DM algorithms.

  • Knowledge Discovery in Databases (KDD)

  • Steps in the KDD Process
    Data Cleaning → Data Integration → Data Selection → Data Transformation → Data Mining → Pattern Evaluation → Knowledge Presentation

  • Architecture of DM

  • DM Algorithms
    Association: relationships between item sets; used in market basket analysis. E.g. Apriori & FP-Tree.
    Classification: classify each item into predefined groups. E.g. Naive Bayes & ID3.
    Clustering: items are divided into dynamically generated groups. E.g. K-means & K-medoids.

  • Example: Market Basket Data
    Items frequently purchased together: Computer → Printer.
    Uses: placement, advertising, sales, coupons.
    Objective: increase sales and reduce costs.
    Also called Market Basket Analysis or Shopping Cart Analysis.

  • Transaction Data: Supermarket Data
    Market basket transactions:
    t1: {bread, cheese, milk}
    t2: {apple, jam, salt, ice-cream}
    ...
    tn: {biscuit, jam, milk}
    Concepts:
    An item: an item/article in a basket.
    I: the set of all items sold in the store.
    A transaction: the items purchased in a basket; it may have a TID (transaction ID).
    A transactional dataset: a set of transactions.

  • Association Rule Definitions
    Association Rule (AR): an implication X ⇒ Y, where X, Y ⊆ I and X ∩ Y = ∅.

    Support of AR X ⇒ Y (s): the percentage of transactions that contain X ∪ Y.

    Confidence of AR X ⇒ Y (α): the ratio of the number of transactions that contain X ∪ Y to the number that contain X.

  • Association Rule Problem
    Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn}, where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence.

    Also known as Link Analysis.

  • Association Rule Mining Task
    Given a set of transactions T, the goal of association rule mining is to find all rules having support ≥ the minsup threshold and confidence ≥ the minconf threshold.
    Brute-force approach: list all possible association rules, compute the support and confidence for each rule, and prune the rules that fail the minsup and minconf thresholds.

  • Example
    Transaction data:
    t1: Butter, Cocoa, Milk
    t2: Butter, Cheese
    t3: Cheese, Boots
    t4: Butter, Cocoa, Cheese
    t5: Butter, Cocoa, Clothes, Cheese, Milk
    t6: Cocoa, Clothes, Milk
    t7: Cocoa, Milk, Clothes
    Assume: minsup = 30%, minconf = 80%.
    An example frequent itemset: {Cocoa, Clothes, Milk} [sup = 3/7].
    Association rules from the itemset:
    Clothes ⇒ Milk, Cocoa [sup = 3/7, conf = 3/3]
    Clothes, Cocoa ⇒ Milk [sup = 3/7, conf = 3/3]
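    A minimal Python sketch that recomputes the support and confidence figures above from the same seven transactions (the helper names are mine):

      # The seven transactions from the example above.
      transactions = [
          {"Butter", "Cocoa", "Milk"},
          {"Butter", "Cheese"},
          {"Cheese", "Boots"},
          {"Butter", "Cocoa", "Cheese"},
          {"Butter", "Cocoa", "Clothes", "Cheese", "Milk"},
          {"Cocoa", "Clothes", "Milk"},
          {"Cocoa", "Milk", "Clothes"},
      ]

      def support(itemset):
          """Fraction of transactions containing every item in `itemset`."""
          itemset = set(itemset)
          return sum(itemset <= t for t in transactions) / len(transactions)

      def confidence(lhs, rhs):
          """support(lhs ∪ rhs) / support(lhs)."""
          return support(set(lhs) | set(rhs)) / support(lhs)

      print(support({"Cocoa", "Clothes", "Milk"}))        # 3/7 ≈ 0.43
      print(confidence({"Clothes"}, {"Milk", "Cocoa"}))   # 3/3 = 1.0
      print(confidence({"Clothes", "Cocoa"}, {"Milk"}))   # 1.0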

  • Mining Association Rules
    Two-step approach:
    1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup.
    2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
    Frequent itemset generation is still computationally expensive.

  • Step 1: Generate Candidate & Frequent Itemsets

    Let k = 1. Generate frequent itemsets of length 1. Repeat until no new frequent itemsets are identified:
    Generate length-(k+1) candidate itemsets from the length-k frequent itemsets.
    Prune candidate itemsets containing subsets of length k that are infrequent.
    Count the support of each candidate by scanning the DB.
    Eliminate candidates that are infrequent, leaving only those that are frequent.
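    A minimal, un-optimized sketch of this generate-and-prune loop (it assumes the `transactions` list from the earlier support/confidence sketch):

      from itertools import combinations

      def apriori(transactions, minsup):
          """Return all frequent itemsets (as frozensets) with support >= minsup."""
          n = len(transactions)
          items = {i for t in transactions for i in t}
          # k = 1: frequent single items.
          freq = {frozenset([i]) for i in items
                  if sum(i in t for t in transactions) / n >= minsup}
          all_frequent, k = set(freq), 1
          while freq:
              # Generate length-(k+1) candidates by joining frequent k-itemsets.
              candidates = {a | b for a in freq for b in freq if len(a | b) == k + 1}
              # Prune candidates that have an infrequent k-subset.
              candidates = {c for c in candidates
                            if all(frozenset(s) in freq for s in combinations(c, k))}
              # Count support by scanning the DB; keep only the frequent ones.
              freq = {c for c in candidates
                      if sum(c <= t for t in transactions) / n >= minsup}
              all_frequent |= freq
              k += 1
          return all_frequent

      # e.g. apriori(transactions, 0.3) recovers {Cocoa, Clothes, Milk} among others.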

  • Apriori Algorithm Example

  • Step 2: Generating Rules from Frequent Itemsets
    Frequent itemsets ≠ association rules; one more step is needed to generate association rules.
    For each frequent itemset X, for each proper nonempty subset A of X, let B = X − A.
    A ⇒ B is an association rule if confidence(A ⇒ B) ≥ minconf, where
    support(A ⇒ B) = support(A ∪ B) = support(X)
    confidence(A ⇒ B) = support(A ∪ B) / support(A)

  • Generating Rules: An Example
    Suppose {2,3,4} is frequent, with sup = 50%.
    Proper nonempty subsets: {2,3}, {2,4}, {3,4}, {2}, {3}, {4}, with sup = 50%, 50%, 75%, 75%, 75%, 75% respectively.
    These generate the association rules:
    {2,3} ⇒ {4}, confidence = 100%
    {2,4} ⇒ {3}, confidence = 100%
    {3,4} ⇒ {2}, confidence = 67%
    {2} ⇒ {3,4}, confidence = 67%
    {3} ⇒ {2,4}, confidence = 67%
    {4} ⇒ {2,3}, confidence = 67%
    All rules have support = 50%.
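    A minimal sketch of this subset-enumeration step; the support values are hard-coded from the example above rather than computed from data:

      from itertools import combinations

      # Supports taken from the example above (as fractions).
      sup = {
          frozenset({2, 3, 4}): 0.50,
          frozenset({2, 3}): 0.50, frozenset({2, 4}): 0.50, frozenset({3, 4}): 0.75,
          frozenset({2}): 0.75, frozenset({3}): 0.75, frozenset({4}): 0.75,
      }

      def rules_from_itemset(X, minconf):
          """Yield (A, B, confidence) for every rule A => B with B = X - A."""
          X = frozenset(X)
          for r in range(1, len(X)):                     # proper nonempty subsets
              for A in map(frozenset, combinations(X, r)):
                  conf = sup[X] / sup[A]                 # support(X) / support(A)
                  if conf >= minconf:
                      yield set(A), set(X - A), conf

      for A, B, conf in rules_from_itemset({2, 3, 4}, minconf=0.0):
          print(f"{A} => {B}, confidence = {conf:.0%}")  # matches the six rules above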

  • Rule Generation
    Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f ⇒ L − f satisfies the minimum confidence requirement.
    If {A,B,C,D} is a frequent itemset, the candidate rules are:
    ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A, A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC,
    AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD, BD ⇒ AC, CD ⇒ AB
    If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L ⇒ ∅ and ∅ ⇒ L).

  • Generating Rules
    To recap: in order to obtain A ⇒ B, we need support(A ∪ B) and support(A).

    All the information required for the confidence computation has already been recorded during itemset generation; there is no need to look at the data T any more.

    This step is not as time-consuming as frequent itemset generation.

  • Rule Generation
    How can rules be generated efficiently from frequent itemsets?
    In general, confidence does not have an anti-monotone property: c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D).
    But the confidence of rules generated from the same itemset does have an anti-monotone property, e.g. for L = {A,B,C,D}: c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD).

  • Apriori Advantages/Disadvantages
    Advantages: uses the large-itemset property; easily parallelized; easy to implement.
    Disadvantages: assumes the transaction database is memory resident; requires up to m database scans.

  • Mining Frequent Patterns Without Candidate Generation
    Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining; avoids costly database scans.
    Develop an efficient FP-tree-based frequent pattern mining method: a divide-and-conquer methodology that decomposes mining tasks into smaller ones and avoids candidate generation (sub-database tests only!).

  • Construct the FP-tree from a Transaction DB (min_support = 0.5)
    TID | Items bought | (L-order) frequent items
    100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
    200 | {a, b, c, f, l, m, o} | {f, c, a, b, m}
    300 | {b, f, h, j, o} | {f, b}
    400 | {b, c, k, s, p} | {c, b, p}
    500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}
    Steps:
    1. Scan the DB once and find the frequent 1-itemsets (single item patterns).
    2. Order the frequent items in frequency-descending order (the L-order).
    3. Scan the DB again and construct the FP-tree.
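    A minimal sketch of this two-scan construction; the node class and field names are my own, header-table node-links are plain Python lists, and ties in the L-order are broken alphabetically (which may differ from the slide's ordering):

      from collections import Counter, defaultdict

      class FPNode:
          def __init__(self, item, parent):
              self.item, self.parent = item, parent
              self.count, self.children = 0, {}

      def build_fp_tree(transactions, min_support):
          """Two scans: count items, then insert frequency-ordered transactions."""
          n = len(transactions)
          counts = Counter(i for t in transactions for i in t)
          # Keep items meeting min_support; L-order = descending frequency.
          frequent = {i for i, c in counts.items() if c / n >= min_support}
          order = lambda item: (-counts[item], item)

          root = FPNode(None, None)
          header = defaultdict(list)          # item -> list of nodes (node-links)
          for t in transactions:
              node = root
              for item in sorted((i for i in t if i in frequent), key=order):
                  if item not in node.children:     # share common prefixes
                      node.children[item] = FPNode(item, node)
                      header[item].append(node.children[item])
                  node = node.children[item]
                  node.count += 1
          return root, header

      # Transactions from the slide's example (TIDs 100-500).
      db = [set("facdgimp"), set("abcflmo"), set("bfhjo"), set("bcksp"), set("afcelpmn")]
      tree, header = build_fp_tree(db, min_support=0.5)
      print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
      # totals: f:4, c:4, a:3, b:3, m:3, p:3 (the frequent items from the slide)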

  • Benefits of the FP-tree Structure
    Completeness: it never breaks a long pattern of any transaction and preserves complete information for frequent pattern mining.
    Compactness: it reduces irrelevant information (infrequent items are gone); the frequency-descending ordering means more frequent items are more likely to be shared; the tree is never larger than the original database (not counting node-links and counts).

  • Mining Frequent Patterns Using the FP-tree
    General idea (divide-and-conquer): recursively grow frequent pattern paths using the FP-tree.
    Method: for each item, construct its conditional pattern base and then its conditional FP-tree. Repeat the process on each newly created conditional FP-tree until the resulting FP-tree is empty, or it contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern).

  • Major Steps to Mine the FP-tree
    1. Construct the conditional pattern base for each node in the FP-tree.
    2. Construct the conditional FP-tree from each conditional pattern base.
    3. Recursively mine the conditional FP-trees and grow the frequent patterns obtained so far.
    4. If the conditional FP-tree contains a single path, simply enumerate all the patterns.

  • Step 1: FP-tree to Conditional Pattern Base
    Starting at the frequent-item header table of the FP-tree, traverse the FP-tree by following the node-links of each frequent item, and accumulate all the transformed prefix paths of that item to form its conditional pattern base.
    Conditional pattern bases:
    item | cond. pattern base
    c | f:3
    a | fc:3
    b | fca:1, f:1, c:1
    m | fca:2, fcab:1
    p | fcam:2, cb:1

  • Step 2: Construct the Conditional FP-tree for Each Pattern Base
    Accumulate the count for each item in the base, then construct the FP-tree for the frequent items of the pattern base.
    m-conditional pattern base: fca:2, fcab:1.
    All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.

  • Mining Frequent Patterns by Creating Conditional Pattern-Bases

  • Step 3: Recursively Mine the Conditional FP-tree
    Conditional pattern base of am: (fc:3).
    Conditional pattern base of cm: (f:3), giving the cm-conditional FP-tree {} → f:3.
    Conditional pattern base of cam: (f:3), giving the cam-conditional FP-tree {} → f:3.

  • Single FP-tree Path Generation
    Suppose an FP-tree T has a single path P. The complete set of frequent patterns of T can be generated by enumerating all the combinations of the sub-paths of P.
    m-conditional FP-tree: {} → f:3 → c:3 → a:3.
    All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.

  • Classification
    Given old data about customers and payments, predict a new applicant's loan eligibility.
    (Diagram: attributes of previous customers, namely Age, Salary, Profession, Location and Customer type, train a classifier such as a decision tree with tests like Salary > 5 K and Prof. = Exec; new applicants' data is then labelled good/bad.)

  • Overview of Naive Bayes
    The goal of Naive Bayes is to work out whether a new example is in a class, given that it has a certain combination of attribute values. We work out the likelihood of the example being in each class given the evidence (its attribute values), and take the highest likelihood as the classification.

    Bayes Rule (E = the evidence, i.e. the event that has occurred; H = the hypothesis): P[H|E] = P[E|H] · P[H] / P[E]

    P[H] is called the prior probability (of the hypothesis). P[H|E] is called the posterior probability (of the hypothesis given the evidence).
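    A minimal sketch of this "highest posterior wins" idea for categorical attributes; the tiny loan-style training set and its attribute values are invented for illustration, and no smoothing is applied:

      from collections import Counter

      def naive_bayes(train, query):
          """train: list of (attribute_dict, class_label); query: attribute_dict.
          Returns the class maximizing P(class) * prod P(attr = value | class)."""
          classes = Counter(label for _, label in train)
          scores = {}
          for c, n_c in classes.items():
              score = n_c / len(train)                  # prior P[H]
              for attr, value in query.items():         # evidence likelihood P[E|H]
                  matches = sum(1 for x, label in train
                                if label == c and x.get(attr) == value)
                  score *= matches / n_c
              scores[c] = score                         # proportional to posterior P[H|E]
          return max(scores, key=scores.get)

      # Hypothetical training examples (not from the slides).
      train = [
          ({"salary": "high", "prof": "exec"},  "good"),
          ({"salary": "high", "prof": "clerk"}, "good"),
          ({"salary": "low",  "prof": "clerk"}, "bad"),
          ({"salary": "low",  "prof": "exec"},  "bad"),
      ]
      print(naive_bayes(train, {"salary": "high", "prof": "exec"}))   # -> 'good'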

  • ID3 (Decision Tree Algorithm)
    ID3 was the first proper decision tree algorithm to use this mechanism.

    Building a decision tree with the ID3 algorithm:
    1. Select the attribute with the most gain.
    2. Create the subsets for each value of the attribute.
    3. For each subset, if not all the elements of the subset belong to the same class, repeat steps 1-3 for the subset.
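    The "gain" in step 1 is the information gain; the slides do not give the formula, so the sketch below uses the standard entropy-based version on a hypothetical categorical dataset:

      import math
      from collections import Counter

      def entropy(labels):
          """H = -sum p * log2(p) over the class distribution."""
          total = len(labels)
          return -sum((c / total) * math.log2(c / total)
                      for c in Counter(labels).values())

      def information_gain(examples, attr, target):
          """Gain(A) = H(target) - sum over values v of |S_v|/|S| * H(target | A = v)."""
          labels = [e[target] for e in examples]
          gain = entropy(labels)
          for value in {e[attr] for e in examples}:
              subset = [e[target] for e in examples if e[attr] == value]
              gain -= (len(subset) / len(examples)) * entropy(subset)
          return gain

      # Hypothetical examples; ID3 would pick the attribute with the highest gain.
      data = [
          {"salary": "high", "prof": "exec",  "class": "good"},
          {"salary": "high", "prof": "clerk", "class": "good"},
          {"salary": "low",  "prof": "exec",  "class": "bad"},
          {"salary": "low",  "prof": "clerk", "class": "bad"},
      ]
      print(information_gain(data, "salary", "class"))   # 1.0 (perfect split)
      print(information_gain(data, "prof", "class"))     # 0.0 (no information)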

  • ID3 (Decision Tree Algorithm)
    Function DecisionTreeLearner(Examples, Target_Class, Attributes):
        create a Root node for the tree
        if all Examples are positive, return the single-node tree Root, with label = Yes
        if all Examples are negative, return the single-node tree Root, with label = No
        if the Attributes list is empty, return the single-node tree Root, with label = most common value of Target_Class in Examples
        else
            A = the attribute from Attributes with the highest information gain with respect to Examples
            make A the decision attribute for Root
            for each possible value v of A:
                add a new tree branch below Root, corresponding to the test A = v
                let Examples_v be the subset of Examples that have value v for attribute A
                if Examples_v is empty then
                    add a leaf node below this new branch with label = most common value of Target_Class in Examples
                else
                    add the subtree DecisionTreeLearner(Examples_v, Target_Class, Attributes - {A})
                end if
            end for
        return Root

  • Decision Trees (Summary)
    Advantages of ID3: automatically creates knowledge from data; can discover new knowledge (watch out for counter-intuitive rules); avoids the knowledge acquisition bottleneck; identifies the most discriminating attribute first; trees can be converted to rules.
    Disadvantages of ID3: several identical examples have the same effect as a single example; trees can become large and difficult to understand; cannot deal with contradictory examples; examines attributes individually and does not consider the effects of inter-attribute relationships.

  • Clustering
    Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters.

    Cluster analysis: grouping a set of data objects into clusters.

    Clustering is unsupervised classification: there are no predefined classes.

    Typical applications: as a stand-alone tool to get insight into the data distribution; as a preprocessing step for other algorithms.

  • Partitional Clustering
    Non-hierarchical: creates the clusters in one step, as opposed to several steps. Since only one set of clusters is output, the user normally has to input the desired number of clusters, k. Usually deals with static sets.

  • K-Means
    An initial set of clusters is randomly chosen. Iteratively, items are moved among the clusters until the desired set is reached, so that a high degree of similarity among the elements in a cluster is obtained.
    Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim).

  • K-Means Example
    Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2.
    Randomly assign means: m1 = 3, m2 = 4.
    K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16.
    K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18.
    K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6.
    K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25.
    Stop, as the clusters with these means are the same.
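    A minimal 1-D sketch that reproduces the iterations above; the data and the initial means are taken from the example, everything else (function name, convergence test) is my own:

      def kmeans_1d(points, means, max_iter=100):
          """Assign each point to the nearest mean, recompute the means, repeat."""
          clusters = []
          for _ in range(max_iter):
              clusters = [[] for _ in means]
              for p in points:
                  nearest = min(range(len(means)), key=lambda i: abs(p - means[i]))
                  clusters[nearest].append(p)
              new_means = [sum(c) / len(c) for c in clusters]   # assumes no empty cluster
              if new_means == means:        # converged: the clusters no longer change
                  break
              means = new_means
          return clusters, means

      points = [2, 4, 10, 12, 3, 20, 30, 11, 25]
      clusters, means = kmeans_1d(points, means=[3, 4])
      print(clusters, means)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]] [7.0, 25.0]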

  • Hierarchical Clustering
    Clusters are created in levels, actually creating sets of clusters at each level.
    Agglomerative (bottom-up): initially each item is in its own cluster; iteratively, clusters are merged together.
    Divisive (top-down): initially all items are in one cluster; large clusters are successively divided.

  • Hierarchical Clustering
    Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.

  • The K-Medoids Clustering Method
    Find representative objects, called medoids, in clusters; each cluster is represented by one item, the medoid.
    PAM (Partitioning Around Medoids) starts from an initial set of k randomly chosen medoids and iteratively replaces one of the medoids by one of the non-medoids if doing so improves the total distance of the resulting clustering.
    Handles outliers well; the ordering of the input does not impact the results.
    PAM works effectively for small data sets, but does not scale well to large data sets.

  • PAM (Partitioning Around Medoids)
    PAM uses real objects to represent the clusters.
    1. Select k representative objects arbitrarily.
    2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih.
    3. For each pair of i and h, if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object.
    4. Repeat steps 2-3 until there is no change.
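    A minimal 1-D sketch of the swap loop above; the cost function (sum of distances to the nearest medoid), the sample data and the helper names are illustrative assumptions:

      def total_cost(points, medoids):
          """Sum of each point's distance to its nearest medoid."""
          return sum(min(abs(p - m) for m in medoids) for p in points)

      def pam(points, k):
          medoids = list(points[:k])            # step 1: arbitrary initial medoids
          improved = True
          while improved:                       # step 4: repeat until no change
              improved = False
              for i in list(medoids):           # steps 2-3: try every (i, h) swap
                  for h in points:
                      if h in medoids:
                          continue
                      candidate = [h if m == i else m for m in medoids]
                      # TCih < 0  <=>  the swap lowers the total cost
                      if total_cost(points, candidate) < total_cost(points, medoids):
                          medoids, improved = candidate, True
          # Assign each non-selected object to the most similar medoid.
          clusters = {m: [p for p in points
                          if min(medoids, key=lambda x: abs(p - x)) == m]
                      for m in medoids}
          return medoids, clusters

      print(pam([2, 4, 10, 12, 3, 20, 30, 11, 25], k=2))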

  • PAM

  • Web Mining

  • Crawlers
    A robot (spider) traverses the hypertext structure of the Web, collecting information from the visited pages; crawlers are used to construct indexes for search engines.
    Traditional crawler: visits the entire Web and replaces the index.
    Periodic crawler: visits portions of the Web and updates a subset of the index.
    Incremental crawler: selectively searches the Web and incrementally modifies the index.
    Focused crawler: visits pages related to a particular subject.

  • Web Usage Mining
    Performs mining on Web usage data, or Web logs. A Web log is a listing of page reference data, also called a click stream. It can be seen from either the server perspective (better Web site design) or the client perspective (prefetching of Web pages, etc.).

  • Web Usage Mining Applications
    Personalization; improving the structure of a site's Web pages; aiding in caching and prediction of future page references; improving the design of individual pages; improving the effectiveness of e-commerce (sales and advertising).

  • Web Usage Mining Activities
    Preprocessing the Web log: cleanse (remove extraneous information); sessionize (a session is the sequence of pages referenced by one user at a sitting). A sketch of sessionizing follows below.
    Pattern Discovery: count the patterns that occur in sessions; a pattern is a sequence of page references in a session. Similar to association rules, with transaction = session and itemset = pattern (or subset), except that order is important.
    Pattern Analysis.
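    A minimal sketch of the sessionize step, assuming a hypothetical log of (user, timestamp, page) records and a 30-minute inactivity timeout (the timeout value and record layout are my assumptions, not from the slides):

      from collections import defaultdict

      def sessionize(log, timeout=30 * 60):
          """Split each user's page references into sessions whenever the gap
          between consecutive requests exceeds `timeout` seconds."""
          by_user = defaultdict(list)
          for user, ts, page in sorted(log, key=lambda r: (r[0], r[1])):
              by_user[user].append((ts, page))

          sessions = []
          for user, hits in by_user.items():
              current = [hits[0][1]]
              for (prev_ts, _), (ts, page) in zip(hits, hits[1:]):
                  if ts - prev_ts > timeout:        # new sitting starts here
                      sessions.append((user, current))
                      current = []
                  current.append(page)
              sessions.append((user, current))
          return sessions

      # Hypothetical click-stream records: (user, unix_time, page).
      log = [("u1", 0, "/home"), ("u1", 120, "/prod"), ("u1", 4000, "/home"),
             ("u2", 50, "/home")]
      print(sessionize(log))
      # [('u1', ['/home', '/prod']), ('u1', ['/home']), ('u2', ['/home'])]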

  • Web Structure Mining
    Mine the structure (links, graph) of the Web. Techniques: PageRank, CLEVER.

    Creates a model of the Web's organization. May be combined with content mining to retrieve important pages more effectively.

  • Web as a Graph
    Web pages as the nodes of a graph; links as directed edges.
    (Diagram: an example graph over the pages www.uta.edu, www.google.com and "my page", with directed links among them.)

  • Link Structure of the Web
    Forward links (out-edges) and backward links (in-edges).
    Approximation of importance/quality: a page may be of high quality if it is referred to by many other pages, and by pages of high quality.

  • PageRank
    Used by Google to prioritize the pages returned from a search by looking at the Web structure.
    The importance of a page is calculated based on the number of pages which point to it (backlinks). Weighting is used to give more importance to backlinks coming from important pages.
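    A minimal power-iteration sketch of this idea; the damping factor of 0.85 and the tiny example graph are standard illustrative assumptions, not taken from the slides:

      def pagerank(links, damping=0.85, iterations=50):
          """links: {page: [pages it links to]}. Returns {page: rank}.
          Each page shares its rank equally among its out-links; the damping
          factor models following a link vs. jumping to a random page."""
          pages = set(links) | {q for targets in links.values() for q in targets}
          n = len(pages)
          rank = {p: 1.0 / n for p in pages}
          for _ in range(iterations):
              new_rank = {p: (1 - damping) / n for p in pages}
              for p, targets in links.items():
                  if not targets:
                      continue                  # dangling page: rank not redistributed
                  share = damping * rank[p] / len(targets)
                  for q in targets:
                      new_rank[q] += share
              rank = new_rank
          return rank

      # Hypothetical three-page graph.
      links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
      print(pagerank(links))   # C, with backlinks from both A and B, ranks highest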

  • HITS Algorithm
    Used to find good-quality authoritative pages and hub pages.
    Authoritative page: a page pointed to by many other pages.
    Hub page: a page which points to authoritative pages.

  • HITS Algorithm
    Step 1: Generate the root set.
    Step 2: Generate the base set.
    Step 3: Build the graph.
    Step 4: Retain external links & eliminate internal links.
    Step 5: Calculate the authority & hub scores.
    Step 6: Generate the result.
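    A minimal sketch of Step 5, the mutually reinforcing authority/hub iteration, on a hypothetical link graph; Steps 1-4 are assumed to have already produced the base-set graph `links`:

      import math

      def hits(links, iterations=50):
          """links: {page: [pages it links to]} for the base-set graph.
          Authority score = sum of the hub scores of pages linking to the page.
          Hub score       = sum of the authority scores of pages it links to."""
          pages = set(links) | {q for targets in links.values() for q in targets}
          auth = {p: 1.0 for p in pages}
          hub = {p: 1.0 for p in pages}
          for _ in range(iterations):
              auth = {p: sum(hub[q] for q, ts in links.items() if p in ts) for p in pages}
              hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
              # Normalize so the scores do not blow up.
              a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
              h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
              auth = {p: v / a_norm for p, v in auth.items()}
              hub = {p: v / h_norm for p, v in hub.items()}
          return auth, hub

      # Hypothetical base-set graph: H1 and H2 act as hubs pointing at A.
      links = {"H1": ["A", "B"], "H2": ["A"], "A": [], "B": []}
      auth, hub = hits(links)
      print(max(auth, key=auth.get), max(hub, key=hub.get))   # 'A' 'H1'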

    ~ APD (Reference: Han and Kamber, 2001)