136
Business Intelligence

Business Intelligence Session

  • Upload
    sahil

  • View
    38

  • Download
    4

Embed Size (px)

DESCRIPTION

What is Business Intelligence and its uses as a Business Manager

Citation preview

Page 1: Business Intelligence Session

Business Intelligence

Page 2: Business Intelligence Session

What is Business Intelligence (BI)

• Business Intelligence (BI) refers to skills, processes, technologies, applications and practices used to support decision making.

• Systems that provide directed background data and reporting tools to support and improve the decision-making process.

• A popularized, umbrella term used to describe a set of concepts and methods to improve business decision making by using fact-based support systems. The term is sometimes used interchangeably with briefing books and executive information systems.

• Business Intelligence is a broad category of applications and technologies for gathering, storing, analyzing, and providing access to data to help clients make better business decisions.

• A system that collects, integrates, analyses and presents business information to support better business decision making.

• Business Intelligence is an environment in which business users receive information that is reliable, secure, consistent, understandable, easily manipulated and timely...facilitating more informed decision making

Page 3: Business Intelligence Session

What is Business Intelligence (BI)

Improving organizations by providing business insights to all employees leading to better, faster, more relevant decisions

Page 4: Business Intelligence Session

What is Business Intelligence (BI)

Business Intelligence enables the business to make intelligent, fact-based decision

Aggregate Data

Database, Data Mart, Data Warehouse, ETL Tools,

Integration Tools

Present Data

EnrichData

Inform a Decision

Reporting Tools, Dashboards, Static

Reports, Mobile Reporting, OLAP Cubes

Add Context to Create Information, Descriptive Statistics, Benchmarks, Variance to Plan or LY

Decisions are Fact-based and Data-driven

Page 5: Business Intelligence Session

Content

The business determines the “what”, BI enables the “how”

Performance

Minimize report creation and collection times (near zero)

Usability

Delivery Method Push vs Pull

Medium Excel, PDF, Dashboard, Cube, Mobile Device

Enhance Digestion Fewer clicks

Tell a Story Trend, Context, Related Metrics, Multiple Views

CPU – Content, Performance, Usability

Page 6: Business Intelligence Session

Why is Business Intelligence So Important?

Time

With Business Intelligence, we can get data to you in a timely manner.

Making Business Decisions is a Balance

Data Opinion

(aka Best Professional Judgment)

In the absence of data, business decisions are often made by the HiPPO.

Page 7: Business Intelligence Session

Core Capabilities of BI

Page 8: Business Intelligence Session

Why do companies need BI?

Tactical / Strategic BI

What’s the best that can happen?

What will happen next?

What if these trends continue?

Why is this happening?

What actions are needed?

Where exactly is the problem?

How many, how often, where?

What happened?

Sophistication of Intelligence

Operational BI

Optimization

Predictive Modeling

Forecasting/extrapolation

Statistical analysis

Alerts

Query/drill down

Ad hoc reports

Standard reportsCo

mp

etit

ive

Ad

van

tag

e

Page 9: Business Intelligence Session

Benefits of Business Intelligence

• Improve Management Processes– planning, controlling, measuring and/or changing resulting in increased revenues and reduced costs

• Improve Operational Processes– fraud detection, order processing, purchasing.. resulting in increased revenues and reduced costs

• Predict the Future

Page 10: Business Intelligence Session

BI Golden Rules

• Data Quality & Accuracy

• Data Consistency

• Data Timeliness

“Get the right information to the right people at the right time”

Page 11: Business Intelligence Session

Gartner BI Maturity Model

Page 12: Business Intelligence Session

Business Intelligence

Reporting

Business Intelligence

Data Mining

Knowledge Management

Expert Systems

Page 13: Business Intelligence Session

Reporting ToolsSoftware that reads data

Processes data by:FilteringSortingGroupingSimple calculations

Produces formatted summaries of data

Page 14: Business Intelligence Session

OLAPOn-Line Analytical Processing

Consolidation

Drill-Down

Slicing and Dicing

Page 15: Business Intelligence Session

Drill Down

Page 16: Business Intelligence Session

Slice and Dice

Page 17: Business Intelligence Session

OLAP ExampleReport A

Report B

Page 18: Business Intelligence Session

Data WarehouseLarge Database

Subject-Oriented

Integrated

Time-Variant

Nonvolatile

User-Friendly Interface

Page 19: Business Intelligence Session

Data Warehouse System

Oper-ational DB

OtherDB

ExternalDB

DataWare-house

Reporting

Data Mining

KM

Expert

Page 20: Business Intelligence Session

Data MartsMini data warehouses

Hold subsets of data from the data warehouse

Data focuses on a specific aspect of the company

Page 21: Business Intelligence Session

Data Mining ToolsSoftware that searches through data

Uses complex statistical calculations

OutputsTrendsPatternsCorrelationsExceptions

Page 22: Business Intelligence Session

Knowledge Management• Process

• Creating Value from Intellectual Property

• Sharing Knowledge with Others

Page 23: Business Intelligence Session

Expert SystemsKnowledge captured as rules.

Page 24: Business Intelligence Session

BI ChallengesPoor data quality

Context

User resistance

Page 25: Business Intelligence Session

Database Management System (DMBS)

Page 26: Business Intelligence Session

Database Management System (DMBS)

DBMS contains information about a particular enterprise• Collection of interrelated data• Set of programs to access the data • An environment that is both convenient and efficient to use

Database Applications:• Banking: all transactions• Airlines: reservations, schedules• Universities: registration, grades• Sales: customers, products, purchases• Online retailers: order tracking, customized recommendations• Manufacturing: production, inventory, orders, supply chain• Human resources: employee records, salaries, tax deductions

Databases touch all aspects of our lives

Page 27: Business Intelligence Session

Purpose of Database Systems

In the early days, database applications were built directly on top of file systemsDrawbacks of using file systems to store data:

Data redundancy and inconsistencyMultiple file formats, duplication of information in different files

Difficulty in accessing data Need to write a new program to carry out each new task

Data isolation — multiple files and formatsIntegrity problems

Integrity constraints (e.g. account balance > 0) become “buried” in program code rather than being stated explicitlyHard to add new constraints or change existing ones

Page 28: Business Intelligence Session

Purpose of Database Systems (contd.)

Drawbacks of using file systems (cont.) Atomicity of updates

Failures may leave database in an inconsistent state with partial updates carried outExample: Transfer of funds from one account to another should either complete or not happen at all

Concurrent access by multiple usersConcurrent accessed needed for performanceUncontrolled concurrent accesses can lead to inconsistencies

Example: Two people reading a balance and updating it at the same time

Security problemsHard to provide user access to some, but not all, data

Database systems offer solutions to all the above problems

Page 29: Business Intelligence Session

View of DataView of DataAn architecture for a database system

View 1

Physicallevel

Logical level

View 2 View n…

View level

Page 30: Business Intelligence Session

Levels of Abstraction

Physical level: describes how a record (e.g. customer) is stored.Logical level: describes data stored in database, and the relationships among the data.

type customer = record name: string; street: string; city: integer;end;

View level: application programs hide details of data types. Views can also hide information (e.g. salary) for security purposes.

Page 31: Business Intelligence Session

Instances and SchemasSimilar to types and variables in programming languages

Schema – the logical structure of the database Example: The database consists of information about a set of customers and accounts and the relationship between them)Analogous to type information of a variable in a programPhysical schema: database design at the physical levelLogical schema: database design at the logical level

Instance – the actual content of the database at a particular point in time Analogous to the value of a variable

Physical Data Independence – the ability to modify the physical schema without changing the logical schema

Applications depend on the logical schemaIn general, the interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.

Page 32: Business Intelligence Session

Data IndependenceAbility to modify a schema definition in one level without affecting a schema definition in the other levels.The interfaces between the various levels and components should be well defined so that changes in some parts do not seriously influence others.Two levels of data independence

Physical data independenceLogical data independence

Page 33: Business Intelligence Session

Data ModelsA collection of tools for describing:

DataData relationshipsData semanticsData constraints

Object-based logical modelsEntity-relationship modelObject-oriented modelSemantic modelFunctional model

Record-based logical modelsRelational model (e.g., SQL/DS, DB2)Network modelHierarchical model (e.g., IMS)

Page 34: Business Intelligence Session

The Entity-Relationship Model

Models an enterprise as a collection of entities and relationships

Entity: a “thing” or “object” in the enterprise that is distinguishable from other objects

Described by a set of attributesRelationship: an association among several entities

Represented diagrammatically by an entity-relationship diagram:

Page 35: Business Intelligence Session

Relational Model

Example of tabular data in the relational model:

name ssn street city account-numberJohnson 192-83-7465 Alma Palo Alto A-101Smith 019-28-3746 North Rye A-215Johnson 192-83-7465 Alma Palo Alto A-201Jones 321-12-3123 Main Harrison A-217Smith 019-28-3746 North Rye A-201

account-number balanceA-101 500A-201 900A-215 700A-217 750

Page 36: Business Intelligence Session

Data Definition Language (DDL)Specification notation for defining the database schemaDDL compiler generates a set of tables stored in a data dictionaryData dictionary contains metadata (data about data)Data storage and definition language – special type of DDL in which the storage structure and access methods used by the database system are specified

Page 37: Business Intelligence Session

Data Manipulation Language (DML)

Language for accessing and manipulating the data organized by the appropriate data modelTwo classes of languages

Procedural – user specifies what data is required and how to get those dataNonprocedural – user specifies what data is required without specifying how to get those data

Page 38: Business Intelligence Session

Transaction ManagementA transaction is a collection of operations that performs a single logical function in a database application.Transaction-management component ensures that the database remains in a consistent (correct) state despite system failures (e.g. power failures and operating system crashes) and transaction failures.Concurrency-control manager controls the interaction among the concurrent transactions, to ensure the consistency of the database.

Page 39: Business Intelligence Session

Storage ManagementStorage manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.The storage manager is responsible to the following tasks:

Interaction with the file manager Efficient storing, retrieving and updating of data

Issues:Storage accessFile organizationIndexing and hashing

Page 40: Business Intelligence Session

Database AdministratorCoordinates all the activities of the database system; the database administrator has a good understanding of the enterprise’s information resources and needs:

Database administrator’s duties include:Schema definitionStorage structure and access method definitionSchema and physical organization modificationGranting user authority to access the databaseSpecifying integrity constraintsActing as liaison with usersMonitoring performance and responding to changes in requirements

Page 41: Business Intelligence Session

Database Users

Users are differentiated by the way they expect to interact with the system.Application programmers: interact with system through DML calls.Specialized users: write specialized database applications that do not fit into the traditional data processing framework Sophisticated users: form requests in a database query language.Naive users: invoke one of the permanent application programs that have been written previously

Page 42: Business Intelligence Session

Overall System Structure

naïve users application sophisticated database (tellers, agents, etc) programmers users administrator

application interface

Application program

query database scheme

users

application program object code query evaluation

engine

Embedded DML precompiler

DML compiler

DDL interpreter

transaction manager

buffer manager

query processor

storage manager

database- management system

File manager

indices Statistical data

Data files Data dictionary

disk storage

Page 43: Business Intelligence Session

SQL (Structured Query Language)

SQL: widely used non-procedural languageExample: Find the name of the customer with customer-id 192-83-7465

select customer.customer_namefrom customerwhere customer.customer_id = ‘192-83-7465’

Example: Find the balances of all accounts held by the customer with customer-id 192-83-7465

select account.balancefrom depositor, accountwhere depositor.customer_id = ‘192-83-7465’ and

depositor.account_number = account.account_number

Application programs generally access databases through one ofLanguage extensions to allow embedded SQLApplication program interface (e.g., ODBC/JDBC) which allow SQL queries to be sent to a database

Page 44: Business Intelligence Session

Database Design

The process of designing the general structure of the database:

Logical Design – Deciding on the database schema. Database design requires that we find a “good” collection of relation schemas.

Business decision – What attributes should we record in the database?Computer Science decision – What relation schemas should we have and how should the attributes be distributed among the various relation schemas?

Physical Design – Deciding on the physical layout of the database

Page 45: Business Intelligence Session

Data Warehousing

Page 46: Business Intelligence Session

In the Beginning, life was simple…

Page 47: Business Intelligence Session

But…

Page 48: Business Intelligence Session

Our information needs…

Page 49: Business Intelligence Session

Kept growing. (The Spider web)

SOURCE: William H. Inmon

Page 50: Business Intelligence Session

Purpose

To explore and discuss the purpose and principles of

data warehousing.

Page 51: Business Intelligence Session

So What Is a Data Warehouse?

A data warehouse is the data repository of an enterprise. It is generally used for research and decision support.

By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.

OLTP systems are usually designed independently of each other and it is difficult for them to share information.

Page 52: Business Intelligence Session

Why Do We Need Data Warehouses?

Consolidation of information resourcesImproved query performanceSeparate research and decision support functions from the operational systemsFoundation for data mining, data visualization, advanced reporting and OLAP tools

Page 53: Business Intelligence Session

What Is a Data Warehouse Used for?

Knowledge discoveryMaking consolidated reportsFinding relationships and correlationsData miningExamples

Banks identifying credit risksInsurance companies searching for fraudMedical research

Page 54: Business Intelligence Session

How Do Data Warehouses Differ From Operational Systems?

Data warehouse Operational systemSubject oriented Transaction oriented

Large (hundreds of GB up to several TB)

Small (MB up to several GB)

Historic data Current data

De-normalized table structure (few tables, many columns per table)

Normalized table structure (many tables, few columns per table)

Batch updates Continuous updates

Usually very complex queries Simple to complex queries

Page 55: Business Intelligence Session

Design Differences

Star Schema

Data WarehouseOperational System

ER Diagram

Page 56: Business Intelligence Session

Supporting a Complete Solution

Operational System-Data Entry

Data Warehouse-Data Retrieval

Page 57: Business Intelligence Session

What Is a Data Warehouse?In many organizations, we want a central “store” of all of our entities, concepts, metadata, and historical information

For doing data validation, complex mining, analysis, prediction, …This is the data warehouse

To this point we’ve focused on scenarios where the data “lives” in the sources – here we may have a “master” version (and archival version) in a central database

For performance reasons, availability reasons, archival reasons, …

Page 58: Business Intelligence Session

Master Data ManagementOne of the “modern” uses of the data warehouse is not only to support analytics but to serve as a reference to all of the entities in the organization

A cleaned, validated repository of what we know… which can be linked to by data sources… which may help with data cleaning

… and which may be the basis of data governance (processes by which data is created and modified in a systematic way, e.g., to comply with gov’t regulations)

There is an emerging field called master data management out the process of creating these

Page 59: Business Intelligence Session

Data Warehouses, Data Marts, and Operational Data Stores

Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts.Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group.Operational Data Store (ODS) – A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated.

Page 60: Business Intelligence Session

Building a Data Warehouse

AnalysisDesign

Import dataInstall front-end tools

Test and deploy

Data Warehouse Lifecycle

Page 61: Business Intelligence Session

Stage 1: Analysis

Identify:Target Questions

Data needsTimeliness of data

Granularity

Create an enterprise-level data dictionaryDimensional analysisIdentify facts and dimensions

Analysis– Design– Import data– Install front-end tools– Test and deploy

Page 62: Business Intelligence Session

Stage 2: Design

Star schemaData TransformationAggregates Pre-calculated ValuesHW/SW Architecture

– Analysis

Design– Import data– Install front-end tools– Test and deploy

Dimensional Modeling

Page 63: Business Intelligence Session

Dimensional Modeling

Fact Table – The primary table in a dimensional model that is meant to contain measurements of the business.Dimension Table – One of a set of companion tables to a fact table. Most dimension tables contain many textual attributes that are the basis for constraining and grouping within data warehouse queries.

Page 64: Business Intelligence Session

Stage 3: Import Data

Identify data sourcesExtract the needed data from existing systems to a data staging areaTransform and Clean the data

Resolve data type conflictsResolve naming and key conflictsRemove, correct, or flag bad dataConform Dimensions

Load the data into the warehouse

– Analysis– Design

Import data– Install front-end tools– Test and deploy

Page 65: Business Intelligence Session

Importing Data Into the Warehouse

OLTP 1

OLTP 2

OLTP 3

Data Staging Area DataWarehouse

Operational Systems(source systems)

Page 66: Business Intelligence Session

Stage 4: Install Front-end Tools

Reporting toolsData mining toolsGISEtc.

– Analysis– Design– Import data

Install front-end tools– Test and deploy

Page 67: Business Intelligence Session

Stage 5: Test and Deploy

Usability testsSoftware installationUser trainingPerformance tweaking based on usage

– Analysis– Design– Import data– Install front-end tools

Test and deploy

Page 68: Business Intelligence Session

On-Line Analytical Processing (OLAP)

Making Decision Support Possible

Page 69: Business Intelligence Session

69

Strengths of OLAP

• It is a powerful visualization paradigm

• It provides fast, interactive response times

• It is good for analyzing time series

• It can be useful to find some clusters and outliers

• Many vendors offer OLAP tools

Page 70: Business Intelligence Session

70

Month1 2 3 4 76 5

Pro

du

ct

Toothpaste

JuiceColaMilk

Cream

Soap

Regio

n

WS

N

Dimensions: Product, Region, TimeHierarchical summarization paths

Product Region TimeIndustry Country Year

Category Region Quarter

Product City Month Week

Office Day

Multi-dimensional Data

• “Hey…I sold $100M worth of goods”

Page 71: Business Intelligence Session

71

Data Cube Lattice• Cube lattice

• ABC AB AC BC A B C none

• Can materialize some groupbys, compute others on demand• Question: which groupbys to materialze?• Question: what indices to create• Question: how to organize data (chunks, etc)

Page 72: Business Intelligence Session

72

Visualizing Neighbors is simpler

1 2 3 4 5 6 7 8AprMayJunJulAugSepOctNovDecJanFebMar

Month Store SalesApr 1Apr 2Apr 3Apr 4Apr 5Apr 6Apr 7Apr 8May 1May 2May 3May 4May 5May 6May 7May 8Jun 1Jun 2

Page 73: Business Intelligence Session

73

A Visual Operation: Pivot (Rotate)

10

47

30

12

Juice

Cola

Milk

Cream

NYLA

SF

3/1 3/2 3/3 3/4

Date

Month

Reg

ion

Product

Page 74: Business Intelligence Session

74

“Slicing and Dicing”

Product

Sales Channel

Regio

ns

Retail Direct Special

Household

Telecomm

Video

Audio IndiaFar East

Europe

The Telecomm Slice

Page 75: Business Intelligence Session

75

Roll-up and Drill Down

• Sales Channel• Region• Country• State • Location Address• Sales Representative

Roll

Up

Higher Level ofAggregation

Low-levelDetails

Drill-D

ow

n

Page 76: Business Intelligence Session

76

Nature of OLAP Analysis• Aggregation -- (total sales,

percent-to-total)• Comparison -- Budget vs. Expenses• Ranking -- Top 10, quartile analysis• Access to detailed and aggregate

data• Complex criteria specification• Visualization

Page 77: Business Intelligence Session

77

Organizationally Structured Data

• Different Departments look at the same detailed data in different ways. Without the detailed, organizationally structured data as a foundation, there is no reconcilability of data

marketing

manufacturing

sales

finance

Page 78: Business Intelligence Session

78

Multidimensional Spreadsheets

• Analysts need spreadsheets that support

• pivot tables (cross-tabs)• drill-down and roll-up• slice and dice• sort• selections• derived attributes

• Popular in retail domain

Page 79: Business Intelligence Session

79

OLAP - Data Cube

• Idea: analysts need to group data in many different ways• eg. Sales(region, product, prodtype, prodstyle, date, saleamount)• saleamount is a measure attribute, rest are dimension attributes• groupby every subset of the other attributes

• materialize (precompute and store) groupbys to give online response• Also: hierarchies on attributes: date -> weekday,

date -> month -> quarter -> year

Page 80: Business Intelligence Session

80

Relational OLAP: 3 Tier DSS

Data Warehouse ROLAP Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in industry standard RDBMS.

Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.

Obtain multi-dimensional reports from the DSS Client.

Page 81: Business Intelligence Session

81

MD-OLAP: 2 Tier DSSMDDB Engine MDDB Engine Decision Support Client

Database Layer Application Logic Layer Presentation Layer

Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.

Obtain multi-dimensional reports from the DSS Client.

Page 82: Business Intelligence Session

OLAPNeed for More Intensive Decision Support4 Main Characteristics

Multidimensional data analysisAdvanced Database SupportEasy-to-use end-user interfacesSupport Client/Server architecture

Page 83: Business Intelligence Session

Multidimensional Data Analysis Techniques

Advanced Data Presentation Functions3-D graphics, Pivot Tables, Crosstabs, etc.Compatible with Spreadsheets & Statistical packagesAdvanced data aggregations, consolidation and classification across time dimensionsAdvanced computational functionsAdvanced data modeling functions

Page 84: Business Intelligence Session

Advanced Database Support

Advanced Data Access FeaturesAccess to many kinds of DBMS’s, flat files, and internal and external data sourcesAccess to aggregated data warehouse dataAdvanced data navigation (drill-downs and roll-ups)Ability to map end-user requests to the appropriate data sourceSupport for Very Large Databases

Page 85: Business Intelligence Session

Easy-to-Use End-User Interface

Graphical User InterfacesMuch more useful if access is kept simple

Page 86: Business Intelligence Session

Client/Server ArchitectureFramework for the new systems to be designed, developed and implementedDivide the OLAP system into several components that define its architecture

Same ComputerDistributed among several computer

Page 87: Business Intelligence Session

OLAP Architecture3 Main Modules

GUI (Graphical User Interface)Analytical Processing LogicData-processing Logic

Page 88: Business Intelligence Session

OLAP Client/Server Architecture

Page 89: Business Intelligence Session

Relational OLAPRelational Online Analytical Processing

OLAP functionality using relational database and familiar query tools to store and analyze multidimensional data

Multidimensional data schema supportData access language & query performance for multidimensional dataSupport for Very Large Databases

Page 90: Business Intelligence Session

Multidimensional Data Schema Support

Decision Support Data tends to beNonnormalizedDuplicatedPreaggregated

Star SchemaSpecial Design technique for multidimensional data representationsOptimize data query operations instead of data update operations

Page 91: Business Intelligence Session

Star SchemasData Modeling Technique to map multidimensional decision support data into a relational databaseCurrent Relational modeling techniques do not serve the needs of advanced data requirements

Page 92: Business Intelligence Session

Data Warehouse Architecture

At the top – a centralized database

Generally configured for queries and appends – not transactionsMany indices, materialized views, etc.

Data is loaded and periodically updated via Extract/Transform/Load (ETL) tools

Data Warehouse

ETL ETL ETL ETL

RDBMS1 RDBMS2

HTML1 XML1

ETL pipelineoutputs

ETL

Page 93: Business Intelligence Session

ETL ToolsETL tools are the equivalent of schema mappings in virtual integration, but are more powerfulArbitrary pieces of code to take data from a source, convert it into data for the warehouse:

import filters – read and convert from data sourcesdata transformations – join, aggregate, filter, convert datade-duplication – finds multiple records referring to the same entity, merges themprofiling – builds tables, histograms, etc. to summarize dataquality management – test against master values, known business rules, constraints, etc.

Page 94: Business Intelligence Session

Example ETL Tool Chain

• This is an example for e-commerce loading

• Note multiple stages of filtering (using selection or join-like operations), logging bad records, before we group and load

Invoice line items

SplitDate-time

Filterinvalid

JoinFilterinvalid

Invalid dates/ times

Invaliditems

Itemrecords

Filternon -

match

Invalidcustomers

Group by customer

Customerbalance

Customerrecords

Page 95: Business Intelligence Session

An Introduction to Text Mining

Page 96: Business Intelligence Session

Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events.

— Gareth Herschel, Research Director, Gartner Group

Predictive Analytics: Defined

Predictive analysis:•Leverages an organization’s business knowledge by applying sophisticated analytic techniques to enterprise data•It turns that data into insights that lead to the development of programs to increase revenues, reduce costs, improve processes, and prevent criminal or fraudulent activities•It encourage actions that demonstrably change how people behave as your customers, employees, patients, students, and citizensBottom line: it turns data into effective actions that positively impact your bottom line

Page 97: Business Intelligence Session

Unstructured Data Management

Text Mining is a subset of Unstructured Data Management.

UDM can be broken down into:Content and Document ManagementSearch and Retrieval XML database and toolsCategorization, Classification, and Visualization

Page 98: Business Intelligence Session

Data Warehouse vs. Document Warehouse

Data warehouseWho, what, when, where, how muchInternally focusedOperational information Rarely include external information

Document warehouseWhyMay not be internally focusedMay contain a range of informationOften integrate external information

Page 99: Business Intelligence Session

Document Warehouse Features

There is no single document structure or document typeDocuments are drawn from multiple sourcesEssential features of documents are automatically extracted and explicitly stored in the document warehouseDocument warehouses are designed to integrate semantically related documents

Page 100: Business Intelligence Session

100

Building the Document Warehouse

IdentifySources

RetrieveDocument

TextAnalysis

Pre-processDocument

CompileMetadata

Page 101: Business Intelligence Session

101

Predict, Impact, Deploy

Customer

Data

Attitudes

Actions

Attributes

Business User

Grow

Retain

Fraud

Outcomes

Attract

Data Collection

Text

Surveys

WebChannel

OperationalSystems

Text

Bu

sin

ess

UI

Expert UIExpert UI

Concepts

Concept Maps

Clustering

Categoriza-tion

Trending

Information Extraction

Prediction

NLP

Page 102: Business Intelligence Session

The Building Blocks of Language

MorphologySyntaxSemanticsPhonologyPragmatics

Page 103: Business Intelligence Session

MorphologyUnderstanding words

StemsAffixes

PrefixSuffix

Inflectional elements

Reducing complexity of

analysis

Reduces complexity of

representation

Supports text mining

Noun

PrefixNoun Stem

Suffix

- abledisputein -

Page 104: Business Intelligence Session

SyntaxThe Bank of Canada will curb inflation with higher interest rates

Prepositional phrase

Adjective

Sentence

Noun phrase Verb phrase

NounVerbAux

Noun phrase

NounAdjective

Noun

The Bank ofCanada

inflationcurbwill

Interest rateshigher

with

Page 105: Business Intelligence Session

SemanticsThe meaning of it allApproaches to meaning

Semantic networksDeductive logicRule-based systems

Useful for classification

Page 106: Business Intelligence Session

Problems with NLPLimitations of Natural Language Processing

Correctly identifying the role of noun phrasesRepresenting abstract conceptsClassifying synonymsRepresenting the number of concepts

Page 107: Business Intelligence Session

Problems with NLPLimitations of technology

Language specific designs are requiredClassification speedClassifying hybrid words and sentences

Page 108: Business Intelligence Session

Underlying Technology is Based on Linguistics

The Linguistic Approach:Does not treat a document as a bag of words

Removes ambiguity by extracting structured concepts

Concepts are the DNA of text.

Text is unstructured, ambiguous, and language dependent.

Page 109: Business Intelligence Session

From Text to Concepts

Morphology

Syntax

Semantics StatisticsLinguistic

Terminology

Extractor

ScalableAccurate

Customizable Discovery-Oriented

•Compound words•Proper nouns•Figures•Named entities•Domain specifics

•Speed•Multiple formats•Multiple languages

•SPSS dictionaries•User dictionaries•Extraction rules•Extraction patterns

•Known terms•Unknown terms•New terms

•Trends

Page 110: Business Intelligence Session

“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.”- The Gartner group.

Data Mining

Page 111: Business Intelligence Session

Why data mining?Data Mining software generally employs modeling algorithms

designed to handle non-linearities and unusual patterns in dataAs opposed to classical linear models (e.g., linear regression) that aren’t as

capable

A related issue is ‘noise’ in the data: where, for example, 2 seemingly similar sets of inputs yield a different output

Page 112: Business Intelligence Session

Use the cross industry standard process for data mining (CRISP-DM)

Based on real-world lessons:Focus on business issues

User-centric & interactiveFull processResults are used

A Data Mining Methodology

Page 113: Business Intelligence Session

Data Mining is not…Keep in mind that data mining is not…

“Blind” application of analysis/modeling algorithmsBrute-force crunching of bulk dataBlack box technologyMagic

Page 114: Business Intelligence Session

114

Back to the Process

Text Mining

Page 115: Business Intelligence Session

UnderstandingBusiness Understanding

Determine objectiveAssess situationDetermine data mining goalsProduce project plan

Data UnderstandingCollect initial dataDescribe dataExplore dataVerify data quality

Page 116: Business Intelligence Session

Data PreparationData

Data setData set descriptionSelect dataClean dataConstruct data set / Integrate dataFormat data

TextConcept extractionConcept combinationConcept assessment

Page 117: Business Intelligence Session

ModelingSelect modeling technique

Universe of techniquesAppropriate techniques

DataText

RequirementsConstraintsSelected tools

Generate test designRun model(s)Assess model(s)

Page 118: Business Intelligence Session

EvaluationResults = Models + Findings

Evaluate resultsReview processDetermine next steps

Page 119: Business Intelligence Session

DeploymentPlan deploymentPlan monitoring and maintenanceFinal reportProject review

Page 120: Business Intelligence Session

Unsupervised methods:Group patients by drugs and demographic information and try to find

unusual patients

Supervised methods:Attempt to predict amount due and find sets of cases where the amount

due is very different from the predicted amount

Data Mining Approaches

Page 121: Business Intelligence Session

Copyright 2003-4, SPSS Inc. 121

What Does Data Mining Do?• Data mining uses existing data to:

• Predict• Category membership• Numeric Value• Ie. Credit risk

• Group• Cluster (group) things together based

on their characteristics• Ie. Different types of TV viewers

• Associate• Find events that occur together, or in

a sequence• Ie. Beer and diapers

• Find outliers• Identify cases that don’t follow

expected behavior• Ie. Fraudulent behaviour

Page 122: Business Intelligence Session

Data Mining vs. Text MiningData Mining

• Process directly• Identify causal

relationship• Structured numeric

transaction data residing in rational data warehouse

• Text Mining• Linguistic processing or

natural language processing (NLP)

• Discover heretofore unknown information[2]

• Applications deal with much more diverse and eclectic collections of systems and formats[4]

Page 123: Business Intelligence Session

ConfusionIs text mining the same as information extraction? No!Information Extraction (IE)

Extract facts about pre-specified entities, events or relationships from unrestricted text sources.No novelty: only information is already present is extracted.

Page 124: Business Intelligence Session

Two Foundations

Information Retrieval (IR)

Artificial Intelligence (AI)

Page 125: Business Intelligence Session

Information RetrievalThe science of searching for

• Information in documents• Documents themselves• Metadata which describe documents• Text, sound, images or data, within database:

relational stand-alone database or hypertext networked databases such as the Internet or intranets.

Page 126: Business Intelligence Session

Information RetrievalGerard Salton

Functional overview of IR

Page 127: Business Intelligence Session

Artificial IntelligenceArtificial intelligence (AI) is a branch of computer science and engineering that deals with intelligent behavior, learning, and adaptation in machines.

Page 128: Business Intelligence Session

Natural Language ProcessingNatural language processing (NLP) is a subfield of artificial intelligence and linguistics. It studies the problems of automated generation and understanding of natural human languages. Statistical natural language processing uses stochastic, probabilistic and statistical methods to resolve some of the difficulties : e.g. text segmentation, word sense disambiguation

Page 129: Business Intelligence Session

Data Warehousing, OLAP and data mining: what and why?

Page 130: Business Intelligence Session

130

Data Warehouse Architecture

Data Warehouse Engine

Optimized Loader

ExtractionCleansing

AnalyzeQuery

Metadata Repository

RelationalDatabases

LegacyData

Purchased Data

ERPSystems

Page 131: Business Intelligence Session

Data Warehouse for Decision Support & OLAP

Putting Information technology to help the knowledge worker make faster and better decisions

Which of my customers are most likely to go to the competition?

What product promotions have the biggest impact on revenue?

How did the share price of software companies correlate with profits over last 10 years?

Page 132: Business Intelligence Session

Decision SupportUsed to manage and control business

Data is historical or point-in-time

Optimized for inquiry rather than update

Use of the system is loosely defined and can be ad-hoc

Used by managers and end-users to understand the business and make judgements

Page 133: Business Intelligence Session

Data Mining works with Warehouse Data

Data Warehousing provides the Enterprise with a memory

Data Mining provides the Enterprise with intelligence

Page 134: Business Intelligence Session

We want to know ...Given a database of 100,000 names, which persons are the least likely to default on their credit cards?

Which types of transactions are likely to be fraudulent given the demographics and transactional history of a particular customer?

If I raise the price of my product by Rs. 2, what is the effect on my ROI?

If I offer only 2,500 airline miles as an incentive to purchase rather than 5,000, how many lost responses will result?

If I emphasize ease-of-use of the product as opposed to its technical capabilities, what will be the net effect on my revenues?

Which of my customers are likely to be the most loyal?

Data Mining helps extract such information

Page 135: Business Intelligence Session

What makes data mining possible?

Advances in the following areas are making data mining deployable:• data warehousing • better and more data (i.e., operational, behavioral, and

demographic) • the emergence of easily deployed data mining tools and • the advent of new data mining techniques.

Page 136: Business Intelligence Session

Why Separate Data Warehouse?

Missing data: Decision support requires historical data, which op dbs do not typically maintain.

Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: op dbs, external sources.

Data quality: Different sources typically use inconsistent data representations, codes, and formats which have to be reconciled.