Upload
tommy96
View
216
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Citation preview
Database and data warehousing◦ Chapter 4 pages 167-177
Data mining and OLAP◦ Chapter 8 pages 321-325
In managing information, physical deals with the structure of information as it resides on various storage media.
Logical deals with how knowledge workers view their information needs, and includes such terms as:◦CHARACTER - our smallest unit of
information.◦FIELD - group of related characters.◦RECORD - group of related fields.◦FILE - group of related records.◦DATABASE - group of logically
associated files.◦DATA WAREHOUSE - information from
many databases.
DATA DICTIONARY - contains the logical structure of information in a database.◦Definitions of all fields, records, and tables◦Relationships between tables◦Who is responsible for maintaining data in
the database◦Descriptions of who is authorized to access
different parts of the database Data dictionary contains meta data (data
about the data)
Sample Data Dictionary Report
Definition- a database that stores current and historical data designed to support business analysis activities and decision-making tasks of managers; typically a relational database model is used. The data warehouse uses special software (tools) to assist managers extract information.
Benefits improved access improved information isolation from operational systems tools permit advanced data analysis
Users and data marts
Extraction phase – create files on the computer that will store the data warehouse and move transaction data to this machine; data may come from many sources or parts of the organization
Transformation phase – cleanse and standardize the data. Why is this necessary?
Load phase – transfer the data from the transformation phase into the data warehouse
The ETL process becomes automated to make regular transfers of transaction data into the data warehouse
Operational Data Data is on many
systems Current operational
data Inconsistent data
definitions Functionally
organized data Data are constantly
changing Support OLTP
Warehouse Data Integrated in one
enterprise-wide system
Recent and historical data
Consistent data definitions
Data are organized around business entities
Data are stabilized Support OLAP
• Data mining (knowledge discovery in databases): Extraction of interesting (non-trivial, previously
unknown and potentially useful) information or patterns from data in large databases
• Similar terms Exploratory data analysis Data driven discovery Deductive learning Knowledge extraction
A computer-based data analysis process
• Utilizes historical organizational data, typically in a data warehouse
• Uses a variety of data analysis, modeling, and visualization techniques
• One use is to discover previously unknown patterns or potential relationships in the data: undirected
• Also used to make predictions, verify assumptions, or otherwise provide useful information: directed
• Allows businesses to make proactive, knowledge-driven decisions
• Prediction Use some variables to predict unknown or
future values of other variables• Description
Find human-interpretable patterns that describe the data
• Given a collection of data Each record contains a set of attributes, one of the
attributes is the class variable Usually, the given data set is divided into training and
test sets, with training set used to build the model and test set used to validate it
• Find a model for the class attribute as a function of the values of other attributes based on the training set Previously unseen data (test set) are used to
determine the accuracy of the model
Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
Approach:• Use the data for a similar product introduced before• We know which customers decided to buy and which decided
otherwise; this {buy, don’t buy} decision is the class variable• Collect various demographic, lifestyle, and company-
interaction related information about all such customers• Use this information as input attributes to create a classifier
model
Goal: Predict fraudulent cases in credit card transactions
Approach:• Use past credit card transactions and the information on its
account-holder as attributes When does a customer buy, what does he buy,
how often he pays on time, etc• Determine whether past transactions were fraud or fair
transactions; this is the class variable• Create a model for the class of the transactions
Goal: To predict whether a customer is likely to be lost to a competitor
Approach:• Use detailed record of transactions with each of the past
and present customers, to find attributes How often the customer calls, where he calls,
what time-of-the day he calls most, his financial status, marital status, etc
• Label the customers as still with the company or left the company (churned)
• Create a model for churn
Clustering concerns segmenting a diverse population into several homogeneous subgroups or clusters. Clustering differs from classification in that there are no predefined classifications.
• Data points in one cluster are more similar to one another
• Data points in separate clusters are less similar to one another
Similarity Measures:• Euclidean distance if attributes are continuous• Other problem-specific measures
Goal: Subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mixApproach:
• Collect different attributes of customers based on their geographical and lifestyle related information
• Find clusters of similar customers• Measure the clustering quality by observing buying
patterns of customers in same cluster vs. those from different clusters
Goal: To find groups of documents that are similar to each other based on the important terms appearing in them
Approach: To identify frequently occurring terms in each document; form a similarity measure based on the frequencies of different terms; use it to cluster
Gain: Retrieval can utilize the clusters to relate a new document or search term to clustered documents
Given a set of records, each of which contains some number of items from a specific collection, produce dependency rules which will predict occurrence of an item based on occurrences of other items:
TID Items
1 Bread, Coke, Milk
2 Beer, Bread
3 Beer, Coke, Diaper, Milk
4 Beer, Bread, Diaper, Milk
5 Coke, Diaper, Milk
Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
Rules Discovered: {Milk} --> {Coke} {Diaper, Milk} --> {Beer}
Goal: To identify items that are bought together by sufficiently many customers
Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items
A classic case: If a customer buys diapers on Friday evening, then he is very likely to buy beer
So, don’t be surprised if you find six-packs stacked next to diapers!
• Privacy • • Profiling
• Unauthorized Use
• Big Brother
Customer loyalty cards have multiple uses, but one use is to collect data for the data warehouse
Examples◦Grocery stores◦Web sites◦Harrah’s◦Store related credit cards
Assurance of a steady flow of data
Multidimensional data analysis (or OLAP) enables users to view data using various dimensions, measures and time frames (i. e., OLAP)◦dimensions: products, business units,
country, industry (e.g., categories)◦measures: money, unit sales, head
count, variances◦ time: daily, weekly, monthly, quarterly,
yearly) This type of analysis also provides the
ability to view data in different ways (tables, charts, 3-D, geographically)
OLAP tools provide for this Pivot tables in Excel or Access
Pro
pert
y
City
Time
Three dimensional revenue model
14670
15056
15888
14555
16004
14578
15500
15890Fla
tH
ouse
Glasgow
London
Aberdeen
Q1 Q2 Q3 Q4
Property Type City Time
Total Revenue
Flat Glasgow Q1 15056
House Glasgow Q1 14670
Flat Glasgow Q2 14555
House Glasgow Q2 15888
Flat Glasgow Q3 14578
House Glasgow Q3 16004
Flat Glasgow Q4 15890
House Glasgow Q4 15500
Flat London Q1 19678
House London Q1 23877
Flat London Q2 19567
House London Q2 28677
• A common operation is to aggregate a measure over one or more dimensionsFind the total revenueFind the total revenue for each cityFind the top property-type for the 3rd
quarter based on total revenue across all cities
• Roll-up: Aggregating data across different dimension levelsExample: given revenue by city, we can roll-up to get
total revenue by state
• Drill-down: The inverse of roll-up: disaggregating dataExample: given total revenue by state, we can drill-
down to get revenue by city
Glasgow London Aberdeen
Q1 29726 43555 53210
Q2 30443 48244 34567
Q3 30582 56222 45677
Q4 31390 45632 50056Glasgow
London Aberdeen
Jan 9035 21005 5216
Feb 10788 14799 14944
Mar 9903 7751 33050
Apr 11273 10573 21884
May
9005 16896 8573
Jun 10265 20775 4110
Rollup
Drill Down
• Slicing & Dicing: Selecting data within dimension categoriesExample: given revenue by city for the entire year, we
can extract the revenue for a given city for a given quarter
• Rotating: Reorienting the presentation of the data cubeExample: given a cube that presents revenue by city &
property type for each quarter, we can change the presentation to present the revenue by property type & quarter for each city
Example: OLAP Usage at an Automobile Dealership
The StoryAn automobile dealership manager wants to improve business activity. Therefore she wants to view sales figures from different perspectives.
A QuestionWhat is the sales volume for a specific model and colors, for a specific salesperson ?
The Data NeedsSales by model
Sales by salesperson
Sales by color
Example: The Multi-dimensional Data Model Used
Sales Volume
Blue Red White
Van
Coupe
Sedan
Miller
Clyde
Smith
COLOR
SALESPERSON
MOD E L
Example: OLAP “Slicing & Dicing“ – Selecting Categories
Sales Volumes
Blue
Red White
Van
Coupe
Sedan
Miller
Clyde
Smith
COLOR
SALESPERSON
MOD E L
• Color: Blue and White
Clyde
Blue
White
Coupe
“Sliced & Diced“ Data
Choose a range out of each dimension: • Model: Coupe only
• Salesperson: Clyde only
Example: OLAP “Rotation“ – Changing the Presentation of the Cube
Sales Volume
Blue Red White
Van
Coupe
Sedan
COLOR
MOD E L
View of the Account Manager
Rotate the data cube by 90°
SALESPERSON
Sales Volume
Miller
Smith Clyde
Van
Coupe
Sedan
MOD E L
View of the Product Manager
Different Users will require different views of the multidimensional cube
Example: OLAP Drill-Down and Roll-UpData can be disaggregated and aggregated along a dimension according to the natural hierarchy
Drill-Down
Roll-Up
State
Region
Salesperson Miller Smith
Clyde Lucas Gleason
Atlanta Athens
Georgia
Sales Volume by Organization Dimension- three level hierarchy -
Primarily used to exploit data warehouses Provides extremely fast response View combinations of two dimensions Enable drilling down (start with broad info
and get more specific) Produces results as counts or percentages Conversion of tables to charts/graphs Usually requires a tailored-made relational
database OLAP applications are widely used by mid-
level and upper level managers A form of business intelligence software
Go to www.fedscope.opm.gov ◦Under data cubes on entry page click on
employment◦Demonstrate drill down and adding charts◦Data for this example comes from the
Central Personnel Data File (CPDF) of the federal government
◦The OLAP tool used to build this site is from a company named Cognos (PowerPlay)
OLAP tools based on Excel◦ http://www.cubularity.com
Linkage between elements◦ spreadsheet - between cells in same table◦ DBMS - between elements in different tables
Orientation◦ spreadsheet is toward calculations◦ DBMS is tilted toward organization and
linkage of data elements in different tables Capabilities
◦ DBMS has extensive querying and reporting power
◦ spreadsheet is limited Memory requirements
◦ entire spreadsheet table must be in memory◦ not true for the database table