CHAPTER 5
BUSINESS INTELLIGENCE: DATA WAREHOUSING, DATA ACQUISITION, DATA MINING, BUSINESS ANALYTICS, AND VISUALIZATION
Data, Information, Knowledge
Data Collection, Problems, and Quality
Data collection can be done manually or by instruments and sensors.
Data collection methods include surveys (using questionnaires), observations (using video cameras), and soliciting information from experts (e.g., using interviews). In addition, sensors and scanners are used for automatic data collection.
Suggest a reliable method of data collection that could be used to identify a customer's buying patterns.
Data Collection, Problems, and Quality (con.)
Data problems: the major DSS data problems are summarized in the following table, along with some possible solutions.
Data Collection, Problems, and Quality (con.)
Data quality determines the usefulness of data as well as the quality of the decisions based on them.
Data quality problems are divided into the following four categories and dimensions:
  Contextual data quality
  Intrinsic data quality
  Accessibility data quality
  Representation data quality
Data quality is often neglected or casually handled.
Problems are exposed when data is summarized.
Data Integrity
Data integrity assures the accuracy and consistency of data.
One of the major issues of data quality (DQ) is data integrity.
Data integrity issues:
  Uniformity
  Version
  Completeness check
  Conformity check
  Genealogy or drill-down
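Three of the integrity checks listed above can be sketched in a few lines of Python. The record layout, field names, and rules are hypothetical and chosen only for illustration; in particular, reading "uniformity" as "values fall in a consistent unit range" is an assumption, not something the slides define.

```python
import re

# Hypothetical customer records; field names and values are invented.
records = [
    {"id": 1, "email": "ann@example.com", "height_cm": 170.0},
    {"id": 2, "email": "", "height_cm": 182.5},
    {"id": 3, "email": "not-an-email", "height_cm": 66.0},  # likely inches
]

REQUIRED = ("id", "email", "height_cm")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def completeness_errors(rec):
    """Completeness check: every required field is present and non-empty."""
    return [f for f in REQUIRED if rec.get(f) in (None, "")]


def conformity_errors(rec):
    """Conformity check: non-empty values match the expected format."""
    email = rec.get("email")
    return ["email"] if email and not EMAIL_RE.match(email) else []


def uniformity_errors(rec, low=100.0, high=250.0):
    """Uniformity check (assumed meaning): height is plausibly in centimetres."""
    h = rec.get("height_cm")
    return ["height_cm"] if h is not None and not (low <= h <= high) else []


# One report entry per record id, listing the fields that failed each check.
report = {
    rec["id"]: {
        "completeness": completeness_errors(rec),
        "conformity": conformity_errors(rec),
        "uniformity": uniformity_errors(rec),
    }
    for rec in records
}
```

Record 2 fails only the completeness check (empty email), while record 3 fails conformity and uniformity. Keeping each check as a separate function makes it easy to add a version or genealogy check later.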
Data Access and Integration
Recognize what to access
Integrate disparate and heterogeneous databases to develop enterprise-wide systems
XML is becoming the standard language for database integration and data transfer
Database Management Systems
A software program for managing a database
Manages data (i.e., update, delete, insert, sort, manipulate, and retrieve data)
Generates reports
Provides better data security
Combines with a modeling language for construction of DSS
Database Models
Relational
  Flat, two-dimensional tables with multiple access queries
  Simple for the user to learn and easily expanded or altered
  Can be accessed in a number of formats not anticipated at the time of the initial design and development of the database
  Can support large amounts of data
Hierarchical
  Top down, like a tree
  Fields have only one “parent”; each “parent” can have multiple “children”
  Quick and useful mainly in transaction processing
Network
  Relationships created through linked lists, using pointers
  “Children” can have multiple “parents”
  Can save storage space through the sharing of some items
Database Models (con.)
Object oriented
  Data analyzed at conceptual level
  Inheritance, abstraction, encapsulation
Multimedia based
  Multiple data formats like JPEG, GIF, bitmap, PNG, sound, video, virtual reality
  Requires specific hardware for full feature availability
Document based
  Document storage and management
Intelligent
  Intelligent agents and ANN
  Inference engines
Data Warehouse
© 2005 Prentice Hall, Decision Support Systems and Intelligent Systems, 7th Edition, Turban, Aronson, and Liang
A data warehouse is a comprehensive database that supports all decision analysis required by an organization by providing summarized and detailed information.
It has access to all information relevant to the organization, which may come from many different sources, both internal and external.
Data Warehouse (con.)
Data extraction: get data from sources
Data cleaning: detect errors in the data and rectify them when possible
Data transformation: convert data from host format to warehouse format, check integrity
Load: sort, summarize, consolidate, compute views, and build indices and partitions
Refresh: propagate the updates from the data sources to the warehouse
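The extraction, cleaning, transformation, and load steps can be sketched with Python's standard `sqlite3` module. This is a minimal illustration under stated assumptions: the source rows, the table names `sales` and `sales_by_region`, and the cleaning rule (drop rows whose amount cannot be parsed) are all invented for the example.

```python
import sqlite3

# Hypothetical source rows in "host" format; names and values are invented.
SOURCE_ROWS = [
    {"sale_date": "2014-11-01", "region": " north ", "amount": "120.50"},
    {"sale_date": "2014-11-01", "region": "NORTH", "amount": "79.50"},
    {"sale_date": "2014-11-02", "region": "south", "amount": "n/a"},  # dirty
]


def extract():
    """Extraction: get data from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)


def clean(rows):
    """Cleaning: detect errors and drop rows that cannot be rectified."""
    good = []
    for row in rows:
        try:
            float(row["amount"])
        except ValueError:
            continue  # amount is not recoverable, so skip the row
        good.append(row)
    return good


def transform(rows):
    """Transformation: convert host format to the warehouse format."""
    return [
        (r["sale_date"], r["region"].strip().lower(), float(r["amount"]))
        for r in rows
    ]


def load(conn, rows):
    """Load: store the rows, build an index, and compute a summary table."""
    conn.execute("CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(clean(extract())))
summary = conn.execute("SELECT region, total FROM sales_by_region").fetchall()
```

The two "north" rows are standardized and consolidated into one summary row, and the unparseable "south" row is dropped during cleaning. Propagating later source updates would mean rerunning the pipeline on the changed rows, which this sketch does not attempt.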
Data warehouse characteristics
  Subject oriented
  Data from both internal and external sources is presented
  Scrubbed so that data from heterogeneous sources are standardized
  Time-variant
  Nonvolatile
  Read only
  Not normalized; may be redundant
  Metadata included
Characteristics of Data Warehouses - Subject Oriented
Organized around major subjects, such as product and sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision process.
Characteristics of Data Warehouses - Integrated
Constructed by integrating multiple, heterogeneous data sources.
Data cleaning and data integration techniques are applied.
Ensures consistency in:
  naming conventions (e.g., LastName in DB1 and FamilyName in DB2 have the same meaning)
  encoding structures (e.g., attribute User_Id is a long int in DB1 and a string in DB2)
  attribute measures (e.g., cm vs. inch)
  …
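The three kinds of inconsistency named above (naming, encoding, and units of measure) can be reconciled with small per-source mapping functions. The sketch below is illustrative only: the row values, the `height_cm` and `height_inch` fields, and the choice of string keys and centimetres as the warehouse standard are assumptions.

```python
# Hypothetical rows from two source databases; all values are invented.
db1_rows = [{"User_Id": 1001, "LastName": "Smith", "height_cm": 180.0}]
db2_rows = [{"User_Id": "1002", "FamilyName": "Jones", "height_inch": 70.0}]

CM_PER_INCH = 2.54


def from_db1(row):
    """Map a DB1 row to the (assumed) warehouse format."""
    return {
        "user_id": str(row["User_Id"]),  # encoding: long int -> string
        "last_name": row["LastName"],
        "height_cm": row["height_cm"],
    }


def from_db2(row):
    """Map a DB2 row to the (assumed) warehouse format."""
    return {
        "user_id": row["User_Id"],  # already a string
        "last_name": row["FamilyName"],  # naming: FamilyName -> last_name
        "height_cm": round(row["height_inch"] * CM_PER_INCH, 1),  # inch -> cm
    }


# After mapping, rows from both sources share one schema and one unit.
integrated = [from_db1(r) for r in db1_rows] + [from_db2(r) for r in db2_rows]
```

Keeping one mapping function per source confines each source's quirks to a single place, so adding a third database means adding one more function rather than touching the warehouse schema.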
Characteristics of Data Warehouses - Time Variant
Data warehouse data provide information from a historical perspective (e.g., past 5-10 years).
Every piece of data in the data warehouse contains an element of time.
Characteristics of Data Warehouses - Nonvolatile
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms.
Requires only two operations in data accessing: initial loading of data and querying.
Characteristics of Data Warehouses - Metadata Included
Metadata refers to data about data.
The primary purpose of metadata should be to provide context to the data; that is, enriching information leading to knowledge.
Plays a vital role in explaining how, why, and where data can be found, retrieved, stored, and used efficiently in an information system.
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration:
  Build wrappers/mediators on top of heterogeneous databases
  Query-driven approach
  A query posed to a client site is transformed into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set
Data warehouse: update-driven
  Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
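The difference between the query-driven and update-driven approaches can be shown with a toy example. Everything here is hypothetical: the two in-memory "sources", their schemas, and the function names `mediator_query` and `build_warehouse` exist only for this sketch.

```python
# Two hypothetical heterogeneous sources with different schemas.
SOURCE_A = [{"cust": "ann", "total": 10.0}, {"cust": "bob", "total": 5.0}]
SOURCE_B = [("ann", 7.5)]


def wrap_a(name):
    """Wrapper for source A: translate the global query to A's schema."""
    return sum(r["total"] for r in SOURCE_A if r["cust"] == name)


def wrap_b(name):
    """Wrapper for source B: translate the global query to B's schema."""
    return sum(total for cust, total in SOURCE_B if cust == name)


def mediator_query(name):
    """Query-driven: hit every source at query time and merge the answers."""
    return wrap_a(name) + wrap_b(name)


def build_warehouse():
    """Update-driven: integrate all sources in advance into one store."""
    warehouse = {}
    for r in SOURCE_A:
        warehouse[r["cust"]] = warehouse.get(r["cust"], 0.0) + r["total"]
    for cust, total in SOURCE_B:
        warehouse[cust] = warehouse.get(cust, 0.0) + total
    return warehouse


warehouse = build_warehouse()
ann_live = mediator_query("ann")      # computed from the sources on demand
ann_stored = warehouse["ann"]         # read directly from the warehouse

# A later change in a source is visible to the mediator immediately, but the
# warehouse only reflects it after the next refresh.
SOURCE_B.append(("bob", 2.0))
bob_live = mediator_query("bob")
bob_stored = warehouse["bob"]
```

The warehouse answers without touching the sources, at the cost of being out of date until it is refreshed; the mediator is always current but pays the translation and integration cost on every query.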
Data Warehouse vs. Operational Databases
DW: Large amount of data from multiple sources, which may include different DB models or files acquired from independent systems and platforms.
Traditional DB: A transactional database (relational, object-oriented).
DW: Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. Optimized for retrieval.
Traditional DB: Focuses on daily operations or transaction processing. Optimized for routine transaction processing.
DW: Provides information from a historical perspective (e.g., past 5-10 years).
Traditional DB: Current-value data.
DW: Nonvolatile.
Traditional DB: Transactions are the agent of change to the database.
DW: Supports DSS, data mining, and OLAP.
Traditional DB: Supports OLTP.
From Tables to Data Cubes
A data warehouse is based on a multidimensional data model, which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions:
  Dimension tables contain descriptions of the subject of the business, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
  The fact table contains factual or quantitative data.
  The fact table also contains measures (such as dollars_sold) and keys to each of the related dimension tables.
From Tables to Data Cubes (cont.)
Relational representation of pivot table
From Tables to Data Cubes (cont.)
2-D view of sales cross-tabulation (pivot table)
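How a relational representation turns into a 2-D cross-tabulation can be sketched in Python. The figures are the regional sales values that appear in the OLAP slice slide of this chapter; the function name `cross_tab` and the tuple layout of the rows are assumptions made for illustration.

```python
from collections import defaultdict

# Relational representation: one row per (region, product line) combination.
# Values are the regional figures from the OLAP slice slide of this chapter.
rows = [
    ("Asia", "Food Line", 59728.0),
    ("Asia", "Outdoor Line", 151174.0),
    ("Europe", "Food Line", 97580.5),
    ("Europe", "Outdoor Line", 213304.0),
    ("North America", "Food Line", 144421.5),
    ("North America", "Outdoor Line", 326273.0),
]


def cross_tab(rows):
    """Pivot (region, line, amount) rows into a 2-D table with totals."""
    table = defaultdict(lambda: defaultdict(float))
    for region, line, amount in rows:
        table[region][line] += amount
        table[region]["CATEGORY_total"] += amount          # row total
        table["REGION_total"][line] += amount              # column total
        table["REGION_total"]["CATEGORY_total"] += amount  # grand total
    return {region: dict(cells) for region, cells in table.items()}


pivot = cross_tab(rows)
for region, cells in pivot.items():
    print(region, cells)
```

Each relational row contributes to exactly four cells: its own cell, its row total, its column total, and the grand total. The resulting totals match the REGION_total and CATEGORY_total figures on the slice slide.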
From Tables to Data Cubes (cont.)
Conceptual Modeling of Data Warehouses
Modeling data warehouses: dimensions and measures
Star schema: a fact table in the middle connected to a set of dimension tables.
Snowflake schema: a refinement of the star schema in which some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellations: multiple fact tables share dimension tables, viewed as a collection of stars.
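A star schema can be sketched with Python's standard `sqlite3` module. The dimension attributes (item_name, brand, type; day, month, quarter, year) and the measure dollars_sold follow the names used earlier in this chapter; the table names, surrogate keys, and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables describe the subjects of the business.
    CREATE TABLE item_dim (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT
    );
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
        quarter INTEGER, year INTEGER
    );
    -- The fact table holds the measure plus a key to each dimension table.
    CREATE TABLE sales_fact (
        item_key INTEGER REFERENCES item_dim (item_key),
        time_key INTEGER REFERENCES time_dim (time_key),
        dollars_sold REAL
    );
    -- Invented sample rows.
    INSERT INTO item_dim VALUES (1, 'tent', 'BrandA', 'outdoor'),
                                (2, 'granola', 'BrandB', 'food');
    INSERT INTO time_dim VALUES (1, 5, 1, 1, 2014), (2, 10, 4, 2, 2014);
    INSERT INTO sales_fact VALUES (1, 1, 100.0), (1, 2, 150.0),
                                  (2, 1, 20.0), (2, 1, 30.0);
    """
)

# A typical star join: aggregate the measure by attributes of two dimensions.
totals = conn.execute(
    """
    SELECT i.type, t.quarter, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN item_dim AS i ON i.item_key = f.item_key
    JOIN time_dim AS t ON t.time_key = f.time_key
    GROUP BY i.type, t.quarter
    ORDER BY i.type, t.quarter
    """
).fetchall()
```

In a snowflake variant, an attribute such as `type` would move into its own small table referenced from `item_dim`; in a fact constellation, a second fact table (for example, shipments) would reuse the same `item_dim` and `time_dim`.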
Example of Star Schema
Example of Snowflake Schema
Example of Fact Constellation
Multidimensional Data
Dimensions are: product, month, region
Measure is sales_amount
Data Marts
A data mart is a subset of the data warehouse, typically consisting of a single subject area.
Dependent
  Created from the warehouse
  Replicated
  Functional subset of the warehouse
Independent
  Scaled-down, less expensive version of a data warehouse
  Designed for a department or strategic business unit (SBU)
  Organization may have multiple data marts
  Difficult to integrate
OLAP
Online analytical processing (OLAP) refers to a variety of activities usually performed by end users in online systems.
There is no agreement on which activities are considered OLAP. However, an OLAP tool typically includes such activities as:
  Requesting ad hoc reports and graphs
  Conducting statistical analysis
  Modeling and visualization capabilities
  Building DSS
OLAP Tools
Known as business intelligence, business analytics, decision support, data access, and database front ends
OLAP vs. OLTP tools
Codd’s 12 rules for OLAP tools:
  Multidimensional conceptual view
  Transparency
  Accessibility
  Consistent reporting performance
  Client-server architecture
  Generic dimensionality
  Dynamic sparse matrix handling
  Multi-user support
  Unrestricted cross-dimensional operations
  Intuitive data manipulation
  Flexible reporting
  Unlimited dimensions and aggregation levels
OLTP vs. OLAP
Feature | OLTP (Online Transaction Processing) | OLAP (Online Analytical Processing)
User | Anyone | Decision makers, analysts
Function | Day-to-day operations | Decision support
DB design | Application-oriented (E-R based) | Subject-oriented (star, snowflake)
Data | Current | Historical
View | Detailed | Summarized
Access | Read/write | Read mostly
# Records accessed | Tens | Millions
# Users | Thousands | Hundreds
DB size | 100 MB-GB | 100 GB-TB
Typical OLAP Operations
Roll up (drill up): summarize data by climbing up a hierarchy or by dimension reduction
Drill down (roll down): the reverse of roll up; move from a higher-level summary to a lower-level summary or detailed data, or introduce new dimensions
Slice and dice: project and select
Slice
  Performs a selection on one dimension of the given cube, resulting in a sub-cube.
  Reduces the dimensionality of the cube.
Dice
  Refers to a range select condition on one dimension, or to a select condition on more than one dimension.
  Reduces the number of member values of one or more dimensions.
Pivot (rotate): reorient the cube for visualization, e.g., turning a 3-D cube into a series of 2-D planes.
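Roll up, slice, and dice can be sketched over a tiny cube held in a Python dictionary. The sales figures are the country-level values from the roll-up and dice slides of this chapter; the dictionary layout and the function names `roll_up`, `slice_cube`, and `dice` are assumptions made for illustration, not the API of any OLAP product.

```python
from collections import defaultdict

# Cube cells keyed by (country, product line); values from this chapter's
# roll-up and dice slides.
cube = {
    ("Canada", "Food Line"): 29116.5,
    ("Canada", "Outdoor Line"): 69310.0,
    ("Mexico", "Food Line"): 12743.5,
    ("Mexico", "Outdoor Line"): 24284.0,
    ("United States", "Food Line"): 102561.5,
    ("United States", "Outdoor Line"): 232679.0,
}

# Geography hierarchy: country -> region.
REGION_OF = {
    "Canada": "North America",
    "Mexico": "North America",
    "United States": "North America",
}


def roll_up(cells):
    """Roll up: climb the geography hierarchy from country to region."""
    out = defaultdict(float)
    for (country, line), value in cells.items():
        out[(REGION_OF[country], line)] += value
    return dict(out)


def slice_cube(cells, line):
    """Slice: fix one member of the product-line dimension, dropping it."""
    return {country: v for (country, ln), v in cells.items() if ln == line}


def dice(cells, countries, lines):
    """Dice: select members on more than one dimension, keeping a sub-cube."""
    return {
        key: v for key, v in cells.items()
        if key[0] in countries and key[1] in lines
    }


by_region = roll_up(cube)
food_only = slice_cube(cube, "Food Line")
sub_cube = dice(cube, {"Mexico", "United States"}, {"Food Line", "Outdoor Line"})
```

Rolling up reproduces the North America totals shown on the roll-up slide, and the dice keeps only the Mexico and United States cells. Drill down is the inverse of `roll_up` and needs the detailed cells to still be available, which is why the sketch keeps the country-level cube as the base data.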
OLAP - Roll up (drill-up)
Country | Food Line | Outdoor Line | CATEGORY_total
Canada | 29,116.5 | 69,310 | 98,426.5
Mexico | 12,743.5 | 24,284 | 37,027.5
United States | 102,561.5 | 232,679 | 335,240.5
Roll-Up
Region | Food Line | Outdoor Line | CATEGORY_total
North America | 144,421.5 | 326,273 | 470,694.5
OLAP - Drill down (roll down)
Region | Food Line | Outdoor Line | CATEGORY_total
Asia | 59,728 | 151,174 | 210,902
Drill-Down
Country | Food Line | Outdoor Line | CATEGORY_total
Malaysia | 618 | 9,418 | 10,036
China | 33,198.5 | 74,165 | 107,363.5
India | 6,918 | 0 | 6,918
Japan | 13,871.5 | 34,965 | 48,836.5
Singapore | 5,122 | 32,626 | 37,748
Belgium | 7,797.5 | 21,125 | 28,922.5
OLAP - Slice
Region | Food Line | Outdoor Line | CATEGORY_total
Asia | 59,728 | 151,174 | 210,902
Europe | 97,580.5 | 213,304 | 310,884.5
North America | 144,421.5 | 326,273 | 470,694.5
REGION_total | 301,730 | 690,751 | 992,481
Slice
Region | Food Line | Outdoor Line | CATEGORY_total
North America | 144,421.5 | 326,273 | 470,694.5
OLAP - Dice
Country | Food Line | Outdoor Line | CATEGORY_total
Canada | 29,116.5 | 69,310 | 98,426.5
Mexico | 12,743.5 | 24,284 | 37,027.5
United States | 102,561.5 | 232,679 | 335,240.5
Dice
Country | Food Line | Outdoor Line
Mexico | 12,743.5 | 24,284
United States | 102,561.5 | 232,679
Data Mining
A process that uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases
Automatic and quick data analysis
Data mining includes tasks/activities known as:
  Knowledge extraction
  Data archaeology
  Data exploration
  Data pattern processing
  Data dredging
  Information harvesting
How Data Mining Works
Three types of methods are used to identify patterns in data:
  Simple models (SQL-based query, OLAP, human judgment)
  Intermediate models (regression, decision trees, clustering)
  Complex models (neural networks, other rule induction)
Data mining application classes:
  Classification
  Clustering
  Association
  Sequencing
  Regression
  Forecasting
  Others
Hypothesis-Driven vs. Discovery-Driven Data Mining
Hypothesis-driven data mining begins with a proposition by the user, who then seeks to validate the truthfulness of the proposition. For example, a marketing manager may begin with the proposition, "Are DVD player sales related to sales of television sets?"
Discovery-driven data mining finds patterns, associations, and relationships among the data. It can uncover facts that were previously unknown.
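The two styles can be contrasted with a small association example. The market baskets below are invented for illustration, and support, confidence, and lift are used here as one common way to measure association; the slides do not prescribe a particular measure.

```python
from itertools import combinations

# Hypothetical market baskets; invented for illustration only.
baskets = [
    {"tv", "dvd_player"},
    {"tv", "dvd_player", "cables"},
    {"tv"},
    {"dvd_player"},
    {"cables"},
    {"tv", "cables"},
    {"tv", "dvd_player"},
    {"batteries"},
]


def support(itemset):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


def lift(antecedent, consequent):
    """Values above 1 suggest the items sell together more than by chance."""
    return confidence(antecedent, consequent) / support(consequent)


# Hypothesis-driven: test the manager's specific proposition.
tv_dvd_confidence = confidence({"tv"}, {"dvd_player"})
tv_dvd_lift = lift({"tv"}, {"dvd_player"})

# Discovery-driven: scan all item pairs and keep the frequent ones.
items = sorted(set().union(*baskets))
frequent_pairs = {
    pair: support(set(pair))
    for pair in combinations(items, 2)
    if support(set(pair)) >= 0.25
}
```

On this toy data the hypothesis check gives a lift above 1 for TV and DVD player sales, while the discovery scan also surfaces the TV and cables pair, which nobody asked about in advance.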
Tools and Techniques
Data mining tools and techniques:
  Statistical methods (association, regression, and clustering)
  Decision trees (classification, clustering)
  Case-based reasoning (pattern detection)
  Neural computing (pattern detection)
  Intelligent agents
  Genetic algorithms
Text Mining
Text mining is the application of data mining to nonstructured or less structured text files.
It helps the organization to:
  Find the "hidden" content of documents, including additional useful relationships.
  Relate documents across previously unnoticed divisions; for example, discover that customers in two different product divisions have the same characteristics.
  Group documents by common themes; for example, all the customers of an insurance firm who have similar complaints and cancel their policies.
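Grouping documents by common themes can be approximated, in its simplest form, by comparing the sets of words they share. The sketch below uses Jaccard similarity over hand-made complaint texts; the documents, the stop-word list, and the 0.3 threshold are all assumptions chosen for illustration, and real text mining tools use far richer techniques.

```python
import re
from itertools import combinations

# Hypothetical complaint texts; invented for illustration only.
docs = {
    "c1": "Claim payment was delayed again, so I am cancelling my policy",
    "c2": "Payment of my claim was delayed for weeks",
    "c3": "The agent was rude on the phone",
}

# A tiny, hand-picked stop-word list (an assumption of this sketch).
STOP_WORDS = {"a", "am", "for", "i", "my", "of", "on", "so", "the", "was"}


def terms(text):
    """Lowercase the text and keep the distinct non-stop words."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS}


def jaccard(a, b):
    """Shared terms divided by all terms; 0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


term_sets = {doc_id: terms(text) for doc_id, text in docs.items()}

# Pairs of documents whose vocabulary overlaps enough to suggest a theme.
similar = [
    (x, y)
    for x, y in combinations(sorted(term_sets), 2)
    if jaccard(term_sets[x], term_sets[y]) >= 0.3
]
```

Here the two delayed-claim complaints are paired, and the complaint about the agent stands alone. Word overlap is a crude proxy for theme, but it shows the basic idea of relating documents by content rather than by where they are filed.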
Multidimensionality
It is an efficient way to organize data in different ways for analysis and presentation.
Its major advantage is that the data are organized according to managers' needs, not analysts'.
Three factors are considered in multidimensionality: dimensions, measures, and time. Here are some examples:
  Dimensions: products, salespeople, market segments, business units, geographic locations, distribution channels, countries, industries
  Measures: money, sales volume, head count, inventory profit, actual vs. forecasted
  Time: daily, weekly, monthly, quarterly, yearly
Data Visualization
Technologies supporting visualization and interpretation:
  Digital imaging, GIS, GUI, tables, multidimensions, graphs, VR, 3D, animation
Identify relationships and trends
Data manipulation allows a real-time look at performance data
Multidimensionality
Multidimensionality has some limitations:
  The multidimensional database can take up significantly more computer storage.
  Multidimensional products cost significantly more.
  Database loading consumes system resources and time, depending on data volume and number of dimensions.
  Interfaces and maintenance are more complex than in relational databases.
Geographic Information System (GIS)
Computerized system for managing and manipulating data with digitized maps
Geographically oriented
Geographic spreadsheet for models
Software allows web access to maps
Used for modeling and simulations
GIS (con.)