Data Mining Concepts
Advanced Database Management Systems
What is data mining?
• Data mining refers to the discovery of new information, in the form of patterns or rules, from vast amounts of data.
• It should be carried out efficiently on large files or databases.
• However, it is not always well integrated with a DBMS.
• Types of knowledge discovered during data mining:
– Deductive knowledge: deduces new information by applying pre-specified logical rules of deduction to the given data.
– Inductive knowledge: discovers new rules and patterns from the supplied data.
• Knowledge can be represented as:
– Unstructured: represented by rules or propositional logic.
– Structured: represented in decision trees, semantic networks, or neural networks.
• The result of mining may be the discovery of new information such as:
– Association rules
– Sequential patterns
– Classification trees
• Goals of data mining:
– Prediction
– Optimization
– Classification
– Identification
Association Rules
• Market-basket model, support, and confidence
• Apriori algorithm
• Sampling algorithm
• Frequent-pattern tree algorithm
• Partition algorithm
Market-Basket Model, Support, and Confidence
• A major technology in data mining involves the discovery of association rules.
• An association rule has the form X ⇒ Y, where X = {x1, x2, …, xn} and Y = {y1, y2, …, ym} are sets of items.
• For a rule LHS ⇒ RHS, the itemset is LHS U RHS.
• An association rule should satisfy two interest measures: support and confidence.
• Support of the rule LHS ⇒ RHS:
– refers to how frequently the itemset LHS U RHS occurs in the database;
– is the percentage of transactions that contain all of the items in the itemset;
– is sometimes called the "prevalence" of the rule.
• Confidence of the rule LHS ⇒ RHS:
– is computed as support(LHS U RHS) / support(LHS);
– is the probability that the items in RHS will be purchased given that the items in LHS are purchased;
– is also called the "strength" of the rule.
• Generate all rules that exceed user-specified support and confidence thresholds by:
(a) generating all itemsets with support > threshold (called "large" itemsets);
(b) for each large itemset X, generating rules of minimum confidence: for each subset Y of X, let Z = X − Y; if support(X)/support(Z) > min_confidence, then Z ⇒ Y is a valid rule.
• Problem: combinatorial explosion in the number of itemsets.
• To contain the combinatorial explosion, two properties are used:
– downward closure: every subset of a large itemset must also be large, i.e., each subset of a large itemset meets the minimum required support;
– anti-monotonicity: every superset of a small itemset is also small, i.e., it does not have enough support.
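As a concrete illustration, support and confidence can be computed directly from a transaction list. The sketch below uses the same toy transactions as the Apriori example that follows; item names are illustrative.

```python
# Toy transaction database: each transaction is a set of items bought.
transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(lhs, rhs, transactions):
    """support(LHS U RHS) / support(LHS)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"milk", "juice"}, transactions))       # 0.5
print(confidence({"milk"}, {"juice"}, transactions))  # 0.5 / 0.75 = 0.666...
```

The rule milk ⇒ juice thus has support 0.5 and confidence about 0.67 on this data.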
Apriori Algorithm
Algorithm for finding large itemsets
• Let the minimum support threshold be 0.5.
• Transactions:
Transaction-ID   Items bought
101    milk, bread, cookies, juice
792    milk, juice
1130   milk, eggs
1735   bread, cookies, coffee
• The candidate 1-itemsets are C1 = {milk, bread, juice, cookies, eggs, coffee}, with respective supports {0.75, 0.5, 0.5, 0.5, 0.25, 0.25}.
• The frequent 1-itemsets are L1 = {milk, bread, juice, cookies}, since each has support >= 0.5.
• The candidate 2-itemsets are C2 = {{milk, bread}, {milk, juice}, {milk, cookies}, {bread, juice}, {bread, cookies}, {juice, cookies}}, with supports {0.25, 0.5, 0.25, 0.25, 0.5, 0.25}.
• So the frequent 2-itemsets are L2 = {{milk, juice}, {bread, cookies}}, each with support >= 0.5.
• Next, construct candidate 3-itemsets by adding additional items to the sets in L2.
• For example, consider {milk, juice, bread}: {milk, bread} is not a frequent 2-itemset in L2, so by the downward closure property {milk, juice, bread} cannot be a frequent 3-itemset.
• Here all 3-extensions fail, so the process terminates.
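The level-wise search with downward-closure pruning described above can be written as a short Apriori-style routine. This is a minimal sketch over the four toy transactions, not an optimized implementation.

```python
from itertools import combinations

transactions = [
    {"milk", "bread", "cookies", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]
min_support = 0.5
n = len(transactions)

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / n

# L1: frequent 1-itemsets.
items = sorted({i for t in transactions for i in t})
Lk = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

frequent = list(Lk)
k = 2
while Lk:
    # Candidate generation: unions of (k-1)-itemsets that have size k,
    # pruned by downward closure (all (k-1)-subsets must be frequent).
    prev = set(Lk)
    candidates = {a | b for a in Lk for b in Lk if len(a | b) == k}
    candidates = [c for c in candidates
                  if all(frozenset(s) in prev for s in combinations(c, k - 1))]
    Lk = [c for c in candidates if support(c) >= min_support]
    frequent.extend(Lk)
    k += 1

print([set(s) for s in frequent])
```

On this data the loop stops after k = 2, yielding the four frequent 1-itemsets plus {milk, juice} and {bread, cookies}, matching the worked example above.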
Sampling Algorithm (for very large databases)
• The sampling algorithm selects a sample of the database transactions, small enough to fit in main memory, and then determines the frequent itemsets from that sample.
• If these frequent itemsets form a superset of the frequent itemsets for the entire database, then the real frequent itemsets can be determined by scanning the remainder of the database to compute the exact support values for the superset itemsets.
• A superset of the frequent itemsets can be found from the sample by applying the Apriori algorithm with a lowered minimum support.
• In some cases frequent itemsets may still be missed, so the concept of the negative border is used to decide whether any were missed.
• The basic idea is that the negative border of a set of frequent itemsets contains the closest itemsets that could also be frequent.
• The negative border, for a set of frequent itemsets S over a set of items I, is the set of minimal itemsets in PowerSet(I) that are not in S (every proper subset of such an itemset is in S).
• Consider the set of items I = {A, B, C, D, E}.
• Let the combined frequent itemsets of size 1 to 3 be S = {A, B, C, D, AB, AC, BC, AD, CD, ABC}.
• Then the negative border is {E, BD, ACD}:
– {E} is the only 1-itemset not contained in S;
– {BD} is the only 2-itemset not in S whose 1-itemset subsets are all in S;
– {ACD} is the only 3-itemset not in S whose 2-itemset subsets are all in S.
• Scan the remaining database to find the support of the negative border. If an itemset X in the negative border turns out to be frequent, then a superset of X could also be frequent; this is determined by a second pass over the database.
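The negative border for this example can be checked with a small brute-force routine. This is a sketch for verification only; a real implementation would not enumerate the whole power set.

```python
from itertools import combinations

I = {"A", "B", "C", "D", "E"}
# frozenset("AB") == frozenset({"A", "B"}) since items are single letters.
S = {frozenset(x) for x in
     ["A", "B", "C", "D", "AB", "AC", "BC", "AD", "CD", "ABC"]}

def negative_border(items, S, max_size):
    """Minimal itemsets not in S whose proper subsets are all in S."""
    border = set()
    for k in range(1, max_size + 1):
        for combo in combinations(sorted(items), k):
            cand = frozenset(combo)
            if cand in S:
                continue
            # For 1-itemsets the only proper subset is the empty set,
            # which is trivially frequent.
            subsets_ok = (all(frozenset(sub) in S
                              for sub in combinations(combo, k - 1))
                          if k > 1 else True)
            if subsets_ok:
                border.add(cand)
    return border

print(negative_border(I, S, 3))  # {E}, {B, D}, {A, C, D}
```

The output matches the slide's answer: {E}, {BD}, and {ACD}.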
Frequent-Pattern Tree Algorithm
• Improves on the Apriori algorithm by reducing the number of candidate itemsets that must be generated and tested for frequency.
• First produces a compressed version of the database in the form of a frequent-pattern (FP) tree.
• The FP-tree stores relevant itemset information and allows efficient discovery of frequent itemsets.
• Divide-and-conquer strategy: mining is decomposed into a set of smaller tasks, each operating on a smaller, conditional FP-tree that is a subset of the original tree.
• The database is first scanned and the frequent 1-itemsets, with their supports, are computed.
• Here, support is defined as the count of transactions containing the item, rather than the fraction of transactions that contain it as in Apriori.
• To construct the FP-tree from the transaction table, for a minimum support of, say, 2:
– The frequent 1-itemsets are stored in non-increasing order of support: {(milk, 3), (bread, 2), (cookies, 2), (juice, 2)}.
– For each transaction, construct the sorted list of its frequent items, expand the tree as needed, and update the frequent-item index table:
First transaction, sorted: T = {milk, bread, cookies, juice}
Second transaction: {milk, juice}
Third transaction: {milk} (since eggs is not a frequent item)
Fourth transaction: {bread, cookies}
The resulting FP-tree is as follows (minimum support = 2); it represents the original transactions in a compressed format:

NULL
├─ milk:3
│   ├─ bread:1 ─ cookies:1 ─ juice:1
│   └─ juice:1
└─ bread:1 ─ cookies:1

Frequent-item header table (each entry links to that item's nodes in the tree):
item      support
milk      3
bread     2
cookies   2
juice     2
• Given the FP-tree and a minimum support, min_support, the FP-growth algorithm is used to find the frequent itemsets. It is initially called as FP-growth(original_tree, null):

procedure FP-growth(tree, s)
if tree contains a single path P then
  for each combination, b, of the nodes in the path do
    generate pattern (b U s) with support = minimum support of the nodes in b;
else
  for each item, I, in reverse order of the frequent-item list do
  begin
    generate pattern b = (I U s) with support = I.support;
    construct the conditional pattern base for b by following the node links in the FP-tree;
      // example, for b = juice: (milk, bread, cookies) and (milk)
    construct b's conditional FP-tree, beta_tree, keeping only items with support >= min_support;
      // example, for b = juice: beta_tree has only milk:2 as a node,
      // since cookies and bread have support 1 < 2
    if beta_tree is not empty then
      recursively call FP-growth(beta_tree, b);
  end;
• Result of the FP-growth algorithm for a minimum support of 2: the frequent itemsets are {(milk:3), (bread:2), (cookies:2), (juice:2), (milk, juice:2), (bread, cookies:2)}.
Partition Algorithm
• If a database has a small number of potentially frequent itemsets, their supports can all be found in one scan by using a partitioning technique.
• Partitioning divides the database into non-overlapping subsets, each small enough to fit in main memory.
• Each partition is read only once in each pass.
• The support threshold used within a partition is adjusted to the partition's size, so it differs from the original value.
• Global candidate large itemsets, identified as large in at least one partition in pass 1, are verified in pass 2 with their support measured over the entire database.
• At the end, all globally large itemsets are identified.
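A minimal sketch of the two-pass partition approach, on a hypothetical four-transaction database split into two partitions; brute-force local mining stands in for running Apriori on each partition.

```python
from itertools import combinations

# Hypothetical transactions, split into two partitions that would each
# fit in main memory.
transactions = [
    {"milk", "bread"}, {"milk", "juice"},     # partition 1
    {"milk", "bread"}, {"bread", "cookies"},  # partition 2
]
min_support = 0.5  # fractional threshold, applied per partition and globally

def frequent_itemsets(txns, min_sup, max_size=2):
    """Brute-force local frequent itemsets (stand-in for Apriori)."""
    items = sorted({i for t in txns for i in t})
    out = set()
    for k in range(1, max_size + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if sum(1 for t in txns if s <= t) / len(txns) >= min_sup:
                out.add(s)
    return out

# Pass 1: local frequent itemsets per partition; their union forms the
# global candidate large itemsets.
parts = [transactions[:2], transactions[2:]]
candidates = set().union(*(frequent_itemsets(p, min_support) for p in parts))

# Pass 2: verify every candidate against the whole database.
n = len(transactions)
globally_frequent = {c for c in candidates
                     if sum(1 for t in transactions if c <= t) / n >= min_support}
print(globally_frequent)
```

The key property this relies on is that any globally large itemset must be locally large in at least one partition, so the pass-1 union cannot miss a true frequent itemset.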
Classification
• Classification is the process of learning a model that describes different classes of data.
• The classes are predetermined.
• The model is typically in the form of a decision tree or a set of rules.
• The decision tree is constructed from a training data set.
Algorithm for decision tree induction

INPUT: set of training records R1, R2, …, Rm and set of attributes A1, A2, …, An

Procedure Build_tree(Records, Attributes);
BEGIN
  create a node N;
  if all Records belong to the same class C then
    return N as a leaf node with class label C;
  if Attributes is empty then
    return N as a leaf node with class label C, such that the majority of Records belong to it;
  select the attribute Ai (with the highest information gain) from Attributes;
  label node N with Ai;
  for each known value vj of Ai do
  begin
    add a branch from node N for the condition Ai = vj;
    Sj := subset of Records where Ai = vj;
    if Sj is empty then
      add a leaf L with class label C, such that the majority of Records belong to it, and return L
    else
      add the node returned by Build_tree(Sj, Attributes − {Ai});
  end;
END;
• E.g., customers who apply for a credit card may be classified as "poor risk", "fair risk", or "good risk".
• The rule "if the customer is married and salary >= 50K, then good risk" describes the class "good risk".
[Figure: decision tree for the credit-risk example. Internal nodes test salary (<20K, 20K–50K, >=50K), married (yes/no), acct balance (<5K, >=5K), and age (<25, >=25); the leaves are labeled poor risk, fair risk, or good risk.]
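The classifier the tree describes can be encoded as a rule function. Only the "married and salary >= 50K ⇒ good risk" rule is stated in the text; the remaining branches below are hypothetical, using the thresholds that appear in the figure.

```python
def credit_risk(married: bool, salary: float, acct_balance: float, age: int) -> str:
    # Rule stated in the text: married AND salary >= 50K => good risk.
    if married and salary >= 50_000:
        return "good risk"
    # Hypothetical branches, illustrating the figure's thresholds
    # (20K/50K salary, 5K account balance, age 25).
    if salary < 20_000:
        return "poor risk"
    if acct_balance >= 5_000 and age >= 25:
        return "fair risk"
    return "poor risk"

print(credit_risk(True, 60_000, 1_000, 30))  # good risk
```

In practice such a function would be generated from the induced tree rather than written by hand; each root-to-leaf path becomes one if-chain.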
Clustering
• The goal of clustering is to place records into groups such that records in a group are similar to each other and dissimilar to records in other groups.
• The groups are usually disjoint.
• An important facet of clustering is the similarity function that is used.
• If the data are numeric, a similarity function based on distance is typically used.
K-means clustering algorithm
Input: a database D of m records r1, r2, …, rm and a desired number of clusters k

Begin
  randomly choose k records as the centroids for the k clusters;
  Repeat
    assign each record ri to the cluster whose centroid (mean) is closest to ri among the k clusters;
    recalculate the centroid (mean) of each cluster based on the records assigned to it;
  until no change;
End;
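The pseudocode above maps directly onto a few lines of Python. This is a plain sketch for numeric records represented as tuples; the dataset and seed are illustrative.

```python
import random

def kmeans(records, k, max_iter=100, seed=0):
    """Plain k-means, following the pseudocode: assign, recompute, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(records, k)  # k records as initial centroids
    assignment = [None] * len(records)
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2
                                  for x, m in zip(r, centroids[c])))
            for r in records
        ]
        if new_assignment == assignment:  # "until no change"
            break
        assignment = new_assignment
        # Update step: centroid = mean of the records assigned to the cluster.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(xs) / len(members)
                                     for xs in zip(*members))
    return assignment, centroids

records = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
labels, centers = kmeans(records, k=2)
print(labels)  # the two left points share one label, the two right points the other
```

On well-separated data like this, the algorithm converges in a couple of iterations; in general, k-means is sensitive to the random choice of initial centroids.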
Approaches to Other Data Mining Problems
• Discovery of sequential patterns
• Discovery of patterns in time series
• Regression
• Neural Networks
• Genetic Algorithms
Applications of Data Mining
• Marketing: analysis of consumer behaviour based on buying patterns.
• Finance: analysis of the creditworthiness of clients; analysis of investments such as stocks, bonds, and mutual funds.
• Manufacturing: optimization of resources such as machines, manpower, and materials.
• Health care: discovering patterns in radiological images; analyzing side effects of drugs.
What is Data Warehousing?
• A data warehouse is a collection of information together with a supporting system.
• It is designed precisely to support efficient extraction, processing, and presentation for analytic and decision-making purposes.
Characteristics of Data Warehouse
• Multidimensional conceptual view
• Generic dimensionality
• Unlimited dimensions and aggregation levels
• Unrestricted cross-dimensional operations
• Client-server architecture
• Multi-user support
• Accessibility
• Transparency
• Intuitive data manipulation
• Consistent reporting performance
• Flexible reporting
Data modeling for data warehouses
• Multidimensional models populate data in multidimensional matrices called data cubes.
– Hierarchical views:
• Roll-up display
• Drill-down display
– Tables:
• Dimension table – tuples of dimension attributes
• Fact table – measured attributes
– Schemas:
• Star schema
• Snowflake schema
Building Data Warehouse
• A data warehouse specifically supports ad-hoc querying.
• Factors:
– Data are extracted from multiple, heterogeneous sources
– Formatted for consistency within the warehouse
– Cleaned to ensure validity
– Fitted into the data model
– Loaded into the warehouse
Functionality of Data Warehouse
• Roll-up
• Drill-down
• Pivot
• Slice and dice
• Sorting
• Selection
• Derived attributes
Difficulties of Implementing Data Warehouses
• Operational issues with data warehousing:
– Construction
– Administration
– Quality control
Data Mining versus Data Warehousing
• A data warehouse supports decision making with data.
• Data mining over a data warehouse helps with certain kinds of decisions.
• To make data mining efficient, the data warehouse should hold a summarized collection of data; data mining then extracts meaningful new patterns by processing and querying the data in the warehouse.
• Hence, for large databases running into terabytes, successful application of data mining depends on the construction of a data warehouse.
Overview
• Introduction, definitions, and terminology
• Characteristics of data warehouses
• Data modeling for data warehouses
• Building a data warehouse
• Typical functionality of a data warehouse
• Data warehouse vs. views
• Problems and open issues in data warehouses
• Questions / comments
Data Warehouse
A data warehouse is a collection of information together with a supporting system.
Data warehouses are mainly intended for decision-support applications.
They provide access to data for complex analysis, knowledge discovery, and decision making.
They support efficient extraction, processing, and presentation for analytic and decision-making purposes.
Characteristics of Data Warehouse
A data warehouse is a store for integrated data from multiple sources, processed for storage in a multidimensional model.
Information in a data warehouse changes less often and may be regarded as non-real-time, with periodic updates.
Warehouse updates are handled by the warehouse's acquisition component, which provides all required preprocessing.
Overview: Conceptual Structure of a Data Warehouse

[Figure: conceptual structure of a data warehouse. Databases and other data inputs pass through cleaning and reformatting into the data warehouse, which also holds metadata; back flushing returns cleaned data to the sources; updates and new data flow into the warehouse; the warehouse in turn feeds OLAP, DSS/EIS, and data mining applications.]
OLAP (online analytical processing): a term used to describe the analysis of complex data from the data warehouse.
DSS (decision-support system), also known as EIS (executive information system): supports an organization's leading decision makers with higher-level data for complex and important decisions.
Data mining: the process of searching data for unanticipated new knowledge.
• Multidimensional conceptual view
• Unlimited dimensions and aggregation levels
• Client-server architecture
• Multi-user support
• Accessibility
• Transparency
• Intuitive data manipulation
• Unrestricted cross-dimensional operations
• Flexible reporting
Data warehouses encompass large volumes of data; this scale issue is addressed by offering warehouses at three levels of scope:
• Enterprise-wide warehouses: huge projects requiring massive investments of time and resources.
• Virtual data warehouses: provide views of operational databases that are materialized for efficient access.
• Data marts: targeted to a subset of the organization, such as a department, and more tightly focused.
Data Modeling for Data Warehouses
Data can be populated in multidimensional matrices called data cubes.
Query processing in the multidimensional model can be much faster than in the relational data model.
Changing from one dimensional hierarchy to another is easily accomplished in a data cube by a technique called pivoting: the cube can be thought of as rotating to show a different orientation of its axes.
Multidimensional models lend themselves readily to hierarchical views in what is known as roll-up display and drill-down display.
A roll-up display moves up the hierarchy, grouping into larger units along a dimension.
A drill-down display gives a finer-grained view.
A multidimensional storage model involves two types of tables: the dimension table and the fact table.
A dimension table consists of tuples of attributes of the dimension.
A fact table can be thought of as having tuples, one per recorded fact.
The fact table contains some measured or observed variable(s) and identifies them with pointers to the dimension tables.
The fact table contains the data, and the dimensions identify each tuple in that data.
Two common multidimensional schemas are the star schema and the snowflake schema.
A star schema consists of a fact table with a single table for each dimension.
In a snowflake schema, the dimension tables of a star schema are organized into a hierarchy by normalizing them.
A fact constellation is a set of fact tables that share some dimension tables.
[Figure: star schema. A central fact table, BUSINESS RESULTS (PRODUCT, QUARTER, REGION, SALES REVENUE), is linked to three dimension tables: PRODUCT (PROD. NO., PROD. NAME, PROD. DESCR., PROD. STYLE, PROD. LINE), FISCAL QUARTER (QTR, YEAR, BEG DATE, END DATE), and REGION (SUBREGION).]

[Figure: snowflake schema. The same fact table, BUSINESS RESULTS, with its dimension tables normalized into a hierarchy: the PRODUCT dimension (PROD. NO., PROD. NAME) points to PNAME (PROD. NAME, PROD. DESCR.) and PLINE (PROD. LINE NO., PROD. LINE NAME) tables, and the FISCAL QUARTER dimension points to a dates table (BEG DATE, END DATE).]
Join indexes relate the values of a dimension of a star schema to rows in the fact table.
Data warehouse storage can facilitate access to summary data. There are two approaches:
• smaller tables holding summary data, such as quarterly sales or revenue by product line;
• encoding of the aggregation level into existing tables.
Acquisition of Data
• Extracted from multiple, heterogeneous sources
• Formatted for consistency within the warehouse
• Cleaned to ensure validity
– Back flushing: the process of returning cleaned data to the source
• Fitted into the data model of the warehouse
• Loaded into the warehouse

Data Storage
• Storing the data according to the data model of the warehouse
• Creating and maintaining required data structures
• Creating and maintaining appropriate access paths
• Providing for time-variant data as new data are added
• Supporting the updating of warehouse data
• Refreshing the data
• Purging the data
Design Considerations
• Usage projections
• The fit of the data model
• Characteristics of available sources
• Design of the metadata component
• Modular component design
• Design for manageability and change
• Considerations of distributed and parallel architecture
Preprogrammed Functions
• Roll-up: data is summarized with increasing generalization
• Drill-down: increasing levels of detail are revealed (the complement of roll-up)
• Pivot: cross-tabulation or rotation is performed
• Slice and dice: projection operations are performed on the dimensions
• Sorting: data is sorted by ordinal value
• Selection: data is available by value or range
• Derived (computed) attributes: attributes are computed by operations on stored and derived values
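Roll-up and slice-and-dice can be illustrated on a tiny fact table. The rows, dimension names, and roll_up helper below are hypothetical, meant only to show the aggregation pattern behind these functions.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, quarter, product, revenue).
facts = [
    ("East", "Q1", "shoes", 100.0),
    ("East", "Q1", "hats",   40.0),
    ("East", "Q2", "shoes",  80.0),
    ("West", "Q1", "shoes",  60.0),
]

def roll_up(facts, dims):
    """Aggregate revenue up to the given dimensions, dropping the rest."""
    out = defaultdict(float)
    for region, quarter, product, revenue in facts:
        row = {"region": region, "quarter": quarter, "product": product}
        key = tuple(row[d] for d in dims)
        out[key] += revenue
    return dict(out)

# Roll-up: from (region, quarter, product) detail to revenue by region.
print(roll_up(facts, ["region"]))  # {('East',): 220.0, ('West',): 60.0}

# Slice: fix quarter = "Q1"; then dice/pivot by (product, region).
q1 = [f for f in facts if f[1] == "Q1"]
print(roll_up(q1, ["product", "region"]))
```

Drill-down is the inverse direction: reintroducing a dropped dimension (e.g., going back from revenue by region to revenue by region and quarter) recovers the finer-grained view.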
Other Functions
• Efficient query processing
• Structured queries
• Ad hoc queries
• Data mining
• Materialized views
• Enhanced spreadsheet functionality
Data Warehouses vs. Views
• Data warehouses exist as persistent storage, whereas views are materialized on demand.
• Data warehouses are usually multidimensional rather than relational; views of a relational database are relational.
• Data warehouses can be indexed to optimize performance; views cannot be indexed independently of the underlying databases.
• Data warehouses provide specific support of functionality; views cannot.
• Data warehouses provide large amounts of integrated and often temporal data, generally more than is contained in one database, whereas views are an extract of a database.
Implementation Difficulties
• Project management: design, construction, implementation
• Administration: quality control of data, managing the data warehouse
Open Issues
• Data cleaning
• Indexing
• Partitioning
• Views
• Incorporation of domain and business rules into the warehouse creation and maintenance process, making it more intelligent, relevant, and self-governing