8/13/2019 IS ZC415-L1
BITS Pilani, Pilani Campus
Presentation: N. MEHALA
FACULTY, CS/IS GROUP
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
DATA MINING
2/15/2014 IS ZC415
IS ZC415, Second Semester 2013-14
Text & Reference Books
Text Book
Tan P. N., Steinbach M. & Kumar V., Introduction to Data Mining, Pearson Education, 2006.
Reference Books
Han J. & Kamber M., Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.
Dunham M. H. & Sridhar S., Data Mining: Introductory and Advanced Topics, Pearson Education, 2006.
Agenda
Motivation: Why data mining?
What is data mining?
Data mining: the KDD process
Data mining tasks
Major issues in data mining
Applications
Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
"Every two days now we create as much information as we did from the dawn of civilization up until 2003." (Eric Schmidt)
Data collection and data availability
Automated data collection tools, database systems, Web, computerized society
Major sources of abundant data
Business: Web, e-commerce, transactions, stocks
Science: remote sensing, bioinformatics, scientific simulation
Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!
"Necessity is the mother of invention" (Plato)
Data mining: automated analysis of massive data sets
The Evolution of Data Analysis

Evolutionary Step: Data Collection (1960s)
Business Question: "What was my total revenue in the last five years?"
Enabling Technologies: Computers, tapes, disks
Product Providers: IBM, CDC
Characteristics: Retrospective, static data delivery

Evolutionary Step: Data Access (1980s)
Business Question: "What were unit sales in New England last March?"
Enabling Technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Product Providers: Oracle, Sybase, Informix, IBM, Microsoft
Characteristics: Retrospective, dynamic data delivery at record level

Evolutionary Step: Data Warehousing & Decision Support (1990s)
Business Question: "What were unit sales in New England last March? Drill down to Boston."
Enabling Technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
Product Providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
Characteristics: Retrospective, dynamic data delivery at multiple levels

Evolutionary Step: Data Mining (Emerging Today)
Business Question: "What's likely to happen to Boston unit sales next month? Why?"
Enabling Technologies: Advanced algorithms, multiprocessor computers, massive databases
Product Providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
Characteristics: Prospective, proactive information delivery
Why Mine Data? Commercial Viewpoint
Lots of data is being collected and warehoused
Web data, e-commerce
Purchases at department/grocery stores
Bank/credit card transactions
Computers have become cheaper and more powerful
Competitive pressure is strong
Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
Remote sensors on a satellite
Telescopes scanning the skies
Microarrays generating gene expression data
Scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists in classifying and segmenting data
What is NOT Data Mining?
Looking up a phone number in a phone book
Searching for a keyword on Google
Generating histograms of salaries for different age groups
Issuing an SQL query to a database and reading the reply
Data Mining is NOT
Data Warehousing
(Deductive) query processing
SQL/ Reporting
Software Agents
Expert Systems
Online Analytical Processing (OLAP)
Statistical Analysis Tool
Data visualization
What is Data Mining?
"Discovery of useful summaries of data" (Ullman)
Extracting or "mining" knowledge from large amounts of data
The efficient discovery of previously unknown patterns in large databases
Technology which predicts future trends based on historical data
It helps businesses make proactive and knowledge-driven decisions
Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases
What is Data Mining?
Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information. This process consists of a series of transformation steps, from preprocessing of the data to postprocessing of the data mining results.
Process of Knowledge Discovery in Databases (KDD)
[Figure: Input Data -> Data Preprocessing (normalization, data subsetting) -> Data Mining -> Postprocessing (filtering patterns, visualization, pattern interpretation) -> Information]
Data Mining: A KDD Process
Data mining: the core of the knowledge discovery process.
[Figure: Databases -> Data Cleaning, Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]
Steps of a KDD Process
Learning the application domain:
relevant prior knowledge and goals of
application
Creating a target data set: data selection
Data cleaning and preprocessing: (may
take 60% of effort!)
Data reduction and transformation:
Find useful features, dimensionality/variable
reduction, invariant representation.
Choosing functions of data mining
summarization, classification, regression,
association, clustering.
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation
visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
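The steps above can be sketched end to end on a toy data set. This is only an illustration under stated assumptions: the records, the field names (`age`, `income`, `label`) and all helper names are hypothetical, and the mining step is a plain per-class summarization, chosen only because summarization is one of the functions listed above.

```python
from statistics import mean

# Hypothetical raw records (not from the lecture); None marks a missing value.
RAW = [
    {"age": 25, "income": 30000, "label": "no"},
    {"age": 40, "income": None, "label": "yes"},
    {"age": 35, "income": 60000, "label": "yes"},
    {"age": 50, "income": 90000, "label": "yes"},
    {"age": 23, "income": 20000, "label": "no"},
]


def select(records, fields):
    """Creating a target data set: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def normalize(records, field):
    """Transformation: min-max scale one numeric field to [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a constant column
    return [{**r, field: (r[field] - lo) / span} for r in records]


def summarize(records, field, by):
    """Mining step (summarization): mean of `field` per value of `by`."""
    groups = {}
    for r in records:
        groups.setdefault(r[by], []).append(r[field])
    return {key: mean(vals) for key, vals in groups.items()}


target = select(RAW, ["income", "label"])
cleaned = clean(target)
scaled = normalize(cleaned, "income")
pattern = summarize(scaled, "income", "label")
print(pattern)
```

Each function maps to one step of the list, so a step can be swapped (for example, imputing missing values instead of dropping records) without touching the others.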
Data Mining and Business Intelligence
[Figure: pyramid of layers, listed here from top to bottom; the potential to support business decisions increases towards the top]
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration
OLAP, MDA
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles shown alongside the layers, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]
Data Mining vs. Statistical Analysis
Statistical Analysis:
Ill-suited for nominal and structured data types
Completely data driven: incorporation of domain knowledge not possible
Interpretation of results is difficult and daunting
Requires expert user guidance
Data Mining:
Large data sets
Efficiency of algorithms is important
Scalability of algorithms is important
Real-world data
Lots of missing values
Pre-existing data, not user generated
Data not static, prone to updates
Efficient methods for data retrieval available for use
Data Mining and Data Warehousing
Data Warehouse: a centralized data repository which can be queried for business benefit.
Data Warehousing makes it possible to
extract archived operational data
overcome inconsistencies between different legacy data formats
integrate data throughout an enterprise, regardless of location, format, or communication requirements
incorporate additional or expert information
OLAP: On-line Analytical Processing
Multi-Dimensional Data Model (Data Cube)
Operations:
Roll-up
Drill-down
Slice and dice
Rotate
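The four operations can be illustrated on a tiny fact table. The sketch below is hedged: the sales figures, the dimension names (`region`, `city`, `month`) and the helper functions are made up for illustration and only echo the "unit sales in New England" question from the earlier slide. A real OLAP engine would precompute and index these aggregates.

```python
from collections import defaultdict

# Hypothetical unit-sales facts: (region, city, month, units).
SALES = [
    ("New England", "Boston", "March", 120),
    ("New England", "Boston", "April", 100),
    ("New England", "Hartford", "March", 80),
    ("Mid-Atlantic", "New York", "March", 300),
    ("Mid-Atlantic", "New York", "April", 280),
]
DIMS = {"region": 0, "city": 1, "month": 2}


def aggregate(facts, dims):
    """Sum units grouped by the given list of dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[DIMS[d]] for d in dims)
        totals[key] += fact[3]
    return dict(totals)


def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[DIMS[dim]] == value]


def dice(facts, conditions):
    """Dice: restrict several dimensions to sets of allowed values."""
    return [
        f for f in facts
        if all(f[DIMS[d]] in allowed for d, allowed in conditions.items())
    ]


# Roll-up: from (region, city, month) to the coarser (region, month).
rolled_up = aggregate(SALES, ["region", "month"])
# Drill-down: from region totals to the finer (region, city) level.
drilled_down = aggregate(SALES, ["region", "city"])
# Slice: March only, summarized by region.
march = aggregate(slice_(SALES, "month", "March"), ["region"])
# Dice: a sub-cube for two cities over two months.
sub_cube = aggregate(
    dice(SALES, {"month": {"March", "April"}, "city": {"Boston", "Hartford"}}),
    ["city", "month"],
)
# Rotate (pivot): the same cells with the dimensions in another order.
rotated = aggregate(SALES, ["month", "region"])

print(rolled_up[("New England", "March")], drilled_down[("New England", "Boston")])
```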
DBMS, OLAP, and Data Mining
Task
DBMS: Extraction of detailed and summary data
OLAP: Summaries, trends and forecasts
Data Mining: Knowledge discovery of hidden patterns and insights

Type of result
DBMS: Information
OLAP: Analysis
Data Mining: Insight and Prediction

Method
DBMS: Deduction (ask the question, verify with data)
OLAP: Multidimensional data modeling, aggregation, statistics
Data Mining: Induction (build the model, apply it to new data, get the result)

Example question
DBMS: Who purchased mutual funds in the last 3 years?
OLAP: What is the average income of mutual fund buyers by region by year?
Data Mining: Who will buy a mutual fund in the next 6 months and why?
Example of DBMS, OLAP and
Data Mining: Weather Data
Day outlook temperature humidity windy play
1 sunny 85 85 false no
2 sunny 80 90 true no
3 overcast 83 86 false yes
4 rainy 70 96 false yes
5 rainy 68 80 false yes
6 rainy 65 70 true no
7 overcast 64 65 true yes
8 sunny 72 95 false no
9 sunny 69 70 false yes
10 rainy 75 80 false yes
11 sunny 75 70 true yes
12 overcast 72 90 true yes
13 overcast 81 75 false yes
14 rainy 71 91 true no
DBMS:
By querying a DBMS containing the above table we may answer questions like:
What was the temperature on the sunny days? {85, 80, 72, 69, 75}
On which days was the humidity less than 75? {6, 7, 9, 11}
On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the above two: {11}
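The queries above can be reproduced in a few lines. This sketch uses plain Python comprehensions over the weather table as a stand-in for a DBMS; the `Day` record type and the variable names are illustrative choices, and the SQL in the comments is one plausible phrasing against an assumed table called `weather`.

```python
from collections import namedtuple

Day = namedtuple("Day", "day outlook temperature humidity windy play")

# The 14 rows of the weather table.
WEATHER = [Day(*row) for row in [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]]

# SELECT temperature FROM weather WHERE outlook = 'sunny'
sunny_temps = [d.temperature for d in WEATHER if d.outlook == "sunny"]

# SELECT day FROM weather WHERE humidity < 75
low_humidity = {d.day for d in WEATHER if d.humidity < 75}

# SELECT day FROM weather WHERE temperature > 70
warm = {d.day for d in WEATHER if d.temperature > 70}

# SELECT day FROM weather WHERE temperature > 70 AND humidity < 75
warm_and_dry = warm & low_humidity

print(sunny_temps, sorted(low_humidity), sorted(warm), sorted(warm_and_dry))
```

Every answer is deduced directly from stored rows: the query states the condition and the data either satisfies it or not, which is the "deduction" method attributed to a DBMS.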
OLAP:
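Using OLAP we can create a Multidimensional Model of our data (Data Cube).
For example, using the dimensions time, outlook and play we can create the following model. Each cell gives the number of days with play = yes / play = no; Week 1 covers days 1 to 7, Week 2 covers days 8 to 14, and the top-left cell is the overall total.
9 / 5    sunny   rainy   overcast
Week 1   0 / 2   2 / 1   2 / 0
Week 2   2 / 1   1 / 1   2 / 0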
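The cube above can be rebuilt by counting. This is a minimal sketch, assuming that Week 1 means days 1 to 7 and Week 2 means days 8 to 14 (an assumption that reproduces the counts shown); the `cell` helper is a hypothetical name.

```python
from collections import Counter

# (day, outlook, play) taken from the weather table.
ROWS = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]

# Base cuboid: one count per (week, outlook, play) combination.
cube = Counter()
for day, outlook, play in ROWS:
    week = 1 if day <= 7 else 2  # assumed definition of the time dimension
    cube[(week, outlook, play)] += 1


def cell(week=None, outlook=None):
    """Return (yes, no) counts; None aggregates over that dimension."""
    yes = no = 0
    for (w, o, p), n in cube.items():
        if week in (None, w) and outlook in (None, o):
            if p == "yes":
                yes += n
            else:
                no += n
    return yes, no


for week in (1, 2):
    print("Week", week, [cell(week, o) for o in ("sunny", "rainy", "overcast")])
print("Total", cell())
```

Calling `cell(outlook="sunny")` rolls the time dimension up and returns the sunny totals over both weeks.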
Data Mining:
Using the ID3 algorithm we can produce the following decision tree:
outlook = sunny
    humidity = high: no
    humidity = normal: yes
outlook = overcast: yes
outlook = rainy
    windy = true: no
    windy = false: yes
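ID3 chooses the attribute to split on by information gain. The sketch below is not a full ID3 implementation: it only computes the gain of the two nominal attributes in the table (outlook and windy) to show why outlook makes a good root, and then encodes the tree above as a function. The cut-off `high_above=75` for "high" humidity is an assumption, since the slides do not say how humidity was discretized.

```python
from collections import Counter
from math import log2

# (outlook, humidity, windy, play) from the weather table.
ROWS = [
    ("sunny", 85, False, "no"), ("sunny", 90, True, "no"),
    ("overcast", 86, False, "yes"), ("rainy", 96, False, "yes"),
    ("rainy", 80, False, "yes"), ("rainy", 70, True, "no"),
    ("overcast", 65, True, "yes"), ("sunny", 95, False, "no"),
    ("sunny", 70, False, "yes"), ("rainy", 80, False, "yes"),
    ("sunny", 70, True, "yes"), ("overcast", 90, True, "yes"),
    ("overcast", 75, False, "yes"), ("rainy", 91, True, "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy of play minus the weighted entropy after splitting on column."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gain_outlook = information_gain(ROWS, 0)
gain_windy = information_gain(ROWS, 2)


def predict(outlook, humidity, windy, high_above=75):
    """The decision tree above; high_above is an assumed humidity cut-off."""
    if outlook == "overcast":
        return "yes"
    if outlook == "sunny":
        return "no" if humidity > high_above else "yes"
    return "no" if windy else "yes"  # outlook == "rainy"


tree_is_consistent = all(predict(o, h, w) == play for o, h, w, play in ROWS)
print(round(gain_outlook, 3), round(gain_windy, 3), tree_is_consistent)
```

Under that cut-off the tree classifies all 14 rows of the table correctly, which is the "build the model, apply it to data" step that distinguishes mining from querying.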
Major Issues in Data Mining
Mining methodology and user interaction
Mining different kinds of knowledge in databases
Interactive mining of knowledge at multiple levels of abstraction
Incorporation of background knowledge
Data mining query languages and ad-hoc data mining
Expression and visualization of data mining results
Handling noise and incomplete data
Pattern evaluation: the interestingness problem
Performance and scalability
Efficiency and scalability of data mining algorithms
Parallel, distributed and incremental mining methods
Issues relating to the diversity of data types
Handling relational and complex types of data
Mining information from heterogeneous databases and global
information systems (WWW)
Issues related to applications and social impacts
Application of discovered knowledge
Domain-specific data mining tools
Intelligent query answering
Process control and decision making
Integration of the discovered knowledge with existing knowledge:
A knowledge fusion problem
Protection of data security, integrity, and privacy
Data Mining: Classification Schemes
Decisions in data mining
Kinds of databases to be mined
Kinds of knowledge to be discovered
Kinds of techniques utilized
Kinds of applications adapted
Decisions in Data Mining
Databases to be mined
Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.
Knowledge to be mined
Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
Multiple/integrated functions and mining at multiple levels
Techniques utilized
Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.
Applications adapted
Retail, telecommunication, banking, fraud analysis, DNA mining,
stock market analysis, Web mining, Weblog analysis, etc.
Data Mining Tasks
Data mining tasks are generally divided into two categories:
1. Predictive tasks
2. Descriptive tasks
Predictive Tasks
Objective: Predict the value of a specific attribute (the target or dependent variable) based on the values of other attributes (the explanatory variables).
Example 1: Judge whether a patient has a specific disease based on his/her medical test results.
Predictive Model
Example 2: Credit card company
Every purchase is placed in 1 of 4 classes
Authorize
Ask for further identification before authorizing
Do not authorize
Do not authorize but contact police
Two functions of Data Mining
Examine historical data to determine how the data
fit into 4 classes
Apply the model to each new purchase
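The two functions above (learn from history, then apply to new purchases) can be sketched with a deliberately simple classifier. Everything concrete here is an assumption made for illustration: the two features (purchase amount and distance from home), the historical records, and the choice of a 1-nearest-neighbour rule, which stands in for whatever model a real credit card company would use.

```python
from math import dist

# Hypothetical labelled history: (amount in dollars, km from home) -> class.
HISTORY = [
    ((20.0, 1.0), "authorize"),
    ((45.0, 3.0), "authorize"),
    ((400.0, 50.0), "ask for further identification"),
    ((900.0, 20.0), "ask for further identification"),
    ((2500.0, 800.0), "do not authorize"),
    ((5000.0, 6000.0), "do not authorize but contact police"),
]


def fit(history):
    """Examine historical data; for 1-nearest-neighbour this only stores it."""
    return list(history)


def predict(model, purchase):
    """Apply the model: return the class of the closest historical purchase."""
    _, label = min(model, key=lambda example: dist(example[0], purchase))
    return label


model = fit(HISTORY)
for purchase in [(30.0, 2.0), (500.0, 40.0), (4800.0, 5500.0)]:
    print(purchase, "->", predict(model, purchase))
```

The features are left unscaled to keep the sketch short; in practice the amount and distance would be normalized so that neither dominates the distance computation.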
Data Mining tasks
Figure taken from M. H. Dunham's book on Data Mining
[Figure: taxonomy of data mining tasks]
Predictive: Classification, Regression, Time Series Analysis, Prediction
Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
Summary
Conferences and Journals on Data Mining
KDD Conferences
ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
SIAM Data Mining Conf. (SDM)
(IEEE) Int. Conf. on Data Mining (ICDM)
Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Other related conferences
ACM SIGMOD
VLDB
(IEEE) ICDE
WWW, SIGIR
ICML, CVPR, NIPS
Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. on Knowledge and Data Eng. (TKDE)
KDD Explorations
ACM Trans. on KDD
Where to Find References?
DBLP, CiteSeer, Google
Data mining and KDD
Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Database systems
Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
AI & Machine Learning
Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
Web and IR
Conferences: SIGIR, WWW, CIKM, etc.
Journals: WWW: Internet and Web Information Systems
Statistics
Conferences: Joint Stat. Meeting, etc.
Journals: Annals of Statistics, etc.
Visualization
Conference proceedings: CHI, ACM-SIGGraph, etc.
Journals: IEEE Trans. Visualization and Computer Graphics, etc.
Recommended Reference Books
S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
B. Liu. Web Data Mining. Springer, 2006
T. M. Mitchell. Machine Learning. McGraw Hill, 1997
G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005
Web Resources
Textbook's website:
http://www-users.cs.umn.edu/~kumar/dmbook/
PowerPoint slides for the textbook are available via anonymous ftp at:
ftp://ftp.aw.com/cseng/authors/tan
Other Resources:
http://web.ccsu.edu/datamining/resources.html