IS ZC415-L1

  • 8/13/2019 IS ZC415-L1

    1/41

    BITS Pilani, Pilani Campus

    Presentation by N. MEHALA

    FACULTY, CS/IS GROUP


    BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

    DATA MINING

    IS ZC415, Second Semester 2013-14

    2/15/2014


    Text & Reference Books

    Text Book
        Tan P. N., Steinbach M. & Kumar V., Introduction to Data Mining, Pearson Education, 2006.

    Reference Books
        Han J. & Kamber M., Data Mining: Concepts and Techniques, Second Edition, Morgan Kaufmann Publishers, 2006.
        Dunham M. H. & Sridhar S., Data Mining: Introductory and Advanced Topics, Pearson Education, 2006.


    Agenda

    Motivation: Why data mining?

    What is data mining?

    Data mining: the KDD process

    Data mining tasks

    Major issues in data mining

    Applications


    Why Data Mining?

    The explosive growth of data: from terabytes to petabytes
        "Every two days now we create as much information as we did from the dawn of civilization up until 2003." (Eric Schmidt)
        Data collection and data availability: automated data collection tools, database systems, the Web, a computerized society

    Major sources of abundant data
        Business: Web, e-commerce, transactions, stocks
        Science: remote sensing, bioinformatics, scientific simulation
        Society and everyone: news, digital cameras, YouTube

    We are drowning in data, but starving for knowledge!

    "Necessity is the mother of invention" (Plato)

    Data mining: automated analysis of massive data sets


    The Evolution of Data Analysis

    Evolutionary Step | Business Question | Enabling Technologies | Product Providers | Characteristics
    Data Collection (1960s) | "What was my total revenue in the last five years?" | Computers, tapes, disks | IBM, CDC | Retrospective, static data delivery
    Data Access (1980s) | "What were unit sales in New England last March?" | Relational databases (RDBMS), Structured Query Language (SQL), ODBC | Oracle, Sybase, Informix, IBM, Microsoft | Retrospective, dynamic data delivery at record level
    Data Warehousing & Decision Support (1990s) | "What were unit sales in New England last March? Drill down to Boston." | On-line analytic processing (OLAP), multidimensional databases, data warehouses | SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR | Retrospective, dynamic data delivery at multiple levels
    Data Mining (Emerging Today) | "What's likely to happen to Boston unit sales next month? Why?" | Advanced algorithms, multiprocessor computers, massive databases | SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups | Prospective, proactive information delivery


    Why Mine Data? Commercial Viewpoint

    Lots of data is being collected and warehoused
        Web data, e-commerce
        Purchases at department/grocery stores
        Bank/credit card transactions

    Computers have become cheaper and more powerful

    Competitive pressure is strong
        Provide better, customized services for an edge (e.g. in Customer Relationship Management)


    Why Mine Data? Scientific Viewpoint

    Data is collected and stored at enormous speeds (GB/hour)
        Remote sensors on a satellite
        Telescopes scanning the skies
        Microarrays generating gene expression data
        Scientific simulations generating terabytes of data

    Traditional techniques are infeasible for raw data

    Data mining may help scientists in classifying and segmenting data


    What is NOT Data Mining?

    Searching for a phone number in a phone book

    Searching for a keyword on Google

    Generating histograms of salaries for different age groups

    Issuing an SQL query to a database and reading the reply


    Data Mining is NOT

    Data Warehousing

    (Deductive) query processing

    SQL/ Reporting

    Software Agents

    Expert Systems

    Online Analytical Processing (OLAP)

    Statistical Analysis Tool

    Data visualization


    What is Data Mining?

    "Discovery of useful summaries of data" (Ullman)

    Extracting or "mining" knowledge from large amounts of data

    The efficient discovery of previously unknown patterns in large databases

    Technology which predicts future trends based on historical data

    It helps businesses make proactive and knowledge-driven decisions


    Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from data in large databases


    Data mining is an integral part of knowledge discovery in databases (KDD), which is the overall process of converting raw data into useful information. This process consists of a series of transformation steps, from preprocessing to postprocessing of data mining results.


    Process of Knowledge Discovery in Databases (KDD)

    [Figure: Input data -> Data Preprocessing -> Data Mining -> Postprocessing -> Information]
        Data Preprocessing: normalization, data subsetting
        Postprocessing: filtering patterns, visualization, pattern interpretation


    Data Mining: A KDD Process

    Data mining: the core of the knowledge discovery process.

    [Figure: Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]


    Steps of a KDD Process

    Learning the application domain: relevant prior knowledge and goals of the application

    Creating a target data set: data selection

    Data cleaning and preprocessing (may take 60% of the effort!)

    Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation


    Choosing the functions of data mining: summarization, classification, regression, association, clustering

    Choosing the mining algorithm(s)

    Data mining: search for patterns of interest

    Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.

    Use of discovered knowledge
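
    The KDD steps listed above can be sketched as a chain of small functions. This is only an illustrative sketch: the function names, the toy records (including the record with a missing value) and the `min_gap` threshold are invented for this example and do not come from the lecture.

```python
# Toy KDD pipeline: selection -> cleaning -> transformation -> mining -> evaluation.
# All records and helper names below are hypothetical, invented for illustration.

def select(records, fields):
    """Create the target data set: keep only the task-relevant fields."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

def min_max_normalize(records, field):
    """Transformation: rescale a numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine_class_means(records, field, label):
    """Mining (summarization): mean of `field` for each value of `label`."""
    groups = {}
    for r in records:
        groups.setdefault(r[label], []).append(r[field])
    return {k: sum(v) / len(v) for k, v in groups.items()}

def evaluate(pattern, min_gap):
    """Pattern evaluation: keep the pattern only if the class means differ enough."""
    means = sorted(pattern.values())
    return (means[-1] - means[0]) >= min_gap

raw = [
    {"id": 1, "temperature": 85, "play": "no", "note": "a"},
    {"id": 2, "temperature": 64, "play": "yes", "note": "b"},
    {"id": 3, "temperature": None, "play": "yes", "note": "c"},  # missing value
    {"id": 4, "temperature": 71, "play": "no", "note": "d"},
    {"id": 5, "temperature": 68, "play": "yes", "note": "e"},
]

target = select(raw, ["temperature", "play"])            # data selection
cleaned = clean(target)                                  # data cleaning
normalized = min_max_normalize(cleaned, "temperature")   # transformation
pattern = mine_class_means(normalized, "temperature", "play")  # mining
interesting = evaluate(pattern, min_gap=0.2)             # pattern evaluation
print(pattern, interesting)
```

    Each stage takes the previous stage's output, which mirrors the way the KDD process feeds cleaned and transformed data into the mining step and only then judges the resulting patterns.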


    Data Mining and Business Intelligence

    [Figure: pyramid of layers; the potential to support business decisions increases from bottom to top]
        Making Decisions
        Data Presentation: Visualization Techniques
        Data Mining: Information Discovery
        Data Exploration: Statistical Analysis, Querying and Reporting
        Data Warehouses / Data Marts: OLAP, MDA
        Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
    Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA


    Data Mining: Confluence of Multiple Disciplines

    [Figure: Data Mining at the center, drawing on Database Technology, Statistics, Machine Learning, Visualization, Information Science, and Other Disciplines]


    Data Mining vs. Statistical Analysis

    Statistical Analysis:
        Ill-suited for nominal and structured data types
        Completely data driven: incorporation of domain knowledge not possible
        Interpretation of results is difficult and daunting
        Requires expert user guidance

    Data Mining:
        Large data sets
        Efficiency of algorithms is important
        Scalability of algorithms is important
        Real-world data
        Lots of missing values
        Pre-existing data, not user generated
        Data not static, prone to updates
        Efficient methods for data retrieval available for use


    Data Mining and Data Warehousing

    Data Warehouse: a centralized data repository which can be queried for business benefit.

    Data Warehousing makes it possible to
        extract archived operational data
        overcome inconsistencies between different legacy data formats
        integrate data throughout an enterprise, regardless of location, format, or communication requirements
        incorporate additional or expert information

    OLAP: On-line Analytical Processing
        Multi-Dimensional Data Model (Data Cube)
        Operations: roll-up, drill-down, slice and dice, rotate


    DBMS, OLAP, and Data Mining

    Aspect | DBMS | OLAP | Data Mining
    Task | Extraction of detailed and summary data | Summaries, trends and forecasts | Knowledge discovery of hidden patterns and insights
    Type of result | Information | Analysis | Insight and prediction
    Method | Deduction (ask the question, verify with data) | Multidimensional data modeling, aggregation, statistics | Induction (build the model, apply it to new data, get the result)
    Example question | Who purchased mutual funds in the last 3 years? | What is the average income of mutual fund buyers by region by year? | Who will buy a mutual fund in the next 6 months and why?


    Example of DBMS, OLAP and Data Mining: Weather Data

    Day  outlook   temperature  humidity  windy  play
    1    sunny     85           85        false  no
    2    sunny     80           90        true   no
    3    overcast  83           86        false  yes
    4    rainy     70           96        false  yes
    5    rainy     68           80        false  yes
    6    rainy     65           70        true   no
    7    overcast  64           65        true   yes
    8    sunny     72           95        false  no
    9    sunny     69           70        false  yes
    10   rainy     75           80        false  yes
    11   sunny     75           70        true   yes
    12   overcast  72           90        true   yes
    13   overcast  81           75        false  yes
    14   rainy     71           91        true   no

    DBMS:


    By querying a DBMS containing the above table we may answer questions like:
        What was the temperature on the sunny days? {85, 80, 72, 69, 75}
        On which days was the humidity less than 75? {6, 7, 9, 11}
        On which days was the temperature greater than 70? {1, 2, 3, 8, 10, 11, 12, 13, 14}
        On which days was the temperature greater than 70 and the humidity less than 75? The intersection of the above two: {11}
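
    The four questions above can be reproduced with ordinary SQL. The sketch below loads the weather table into an in-memory SQLite database through Python's standard `sqlite3` module; the table name `weather` and the helper `column` are choices made for this example, not part of the lecture.

```python
import sqlite3

# Weather table from the slide: (day, outlook, temperature, humidity, windy, play)
WEATHER = [
    (1, "sunny", 85, 85, False, "no"),
    (2, "sunny", 80, 90, True, "no"),
    (3, "overcast", 83, 86, False, "yes"),
    (4, "rainy", 70, 96, False, "yes"),
    (5, "rainy", 68, 80, False, "yes"),
    (6, "rainy", 65, 70, True, "no"),
    (7, "overcast", 64, 65, True, "yes"),
    (8, "sunny", 72, 95, False, "no"),
    (9, "sunny", 69, 70, False, "yes"),
    (10, "rainy", 75, 80, False, "yes"),
    (11, "sunny", 75, 70, True, "yes"),
    (12, "overcast", 72, 90, True, "yes"),
    (13, "overcast", 81, 75, False, "yes"),
    (14, "rainy", 71, 91, True, "no"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (day INTEGER, outlook TEXT, temperature INTEGER, "
    "humidity INTEGER, windy INTEGER, play TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?, ?, ?)", WEATHER)

def column(sql):
    """Run a query and return the first column of every result row as a list."""
    return [row[0] for row in conn.execute(sql)]

# Deduction: each question is stated up front and answered directly from the data.
sunny_temps = column(
    "SELECT temperature FROM weather WHERE outlook = 'sunny' ORDER BY day")
low_humidity = column("SELECT day FROM weather WHERE humidity < 75 ORDER BY day")
warm = column("SELECT day FROM weather WHERE temperature > 70 ORDER BY day")
warm_and_dry = column(
    "SELECT day FROM weather WHERE temperature > 70 AND humidity < 75 ORDER BY day")

print(sunny_temps, low_humidity, warm, warm_and_dry)
```

    Every result is a plain lookup over stored rows. Nothing is learned or predicted, which is why query processing alone is not counted as data mining.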


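    OLAP:

    Using OLAP we can create a multidimensional model of our data (data cube). For example, using the dimensions time, outlook and play we can create the following model. Each cell gives the number of days with play = yes / play = no; Week 1 covers days 1 to 7, Week 2 covers days 8 to 14, and the top-left cell (9 / 5) is the total over all 14 days.

    9 / 5    sunny   rainy   overcast
    Week 1   0 / 2   2 / 1   2 / 0
    Week 2   2 / 1   1 / 1   2 / 0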
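
    A minimal sketch of how such a cube could be built and rolled up in plain Python. It assumes Week 1 means days 1 to 7 and Week 2 means days 8 to 14 (which reproduces the counts in the table above); the names `week`, `cube` and `cell` are invented for this example.

```python
from collections import Counter

# (day, outlook, play) taken from the weather table
WEATHER = [
    (1, "sunny", "no"), (2, "sunny", "no"), (3, "overcast", "yes"),
    (4, "rainy", "yes"), (5, "rainy", "yes"), (6, "rainy", "no"),
    (7, "overcast", "yes"), (8, "sunny", "no"), (9, "sunny", "yes"),
    (10, "rainy", "yes"), (11, "sunny", "yes"), (12, "overcast", "yes"),
    (13, "overcast", "yes"), (14, "rainy", "no"),
]

def week(day):
    """Time dimension. Assumption: days 1-7 form week 1, days 8-14 form week 2."""
    return 1 if day <= 7 else 2

# The cube stores one count per (week, outlook, play) combination.
cube = Counter((week(day), outlook, play) for day, outlook, play in WEATHER)

def cell(week_no=None, outlook=None):
    """Return (yes, no) counts; None rolls up (aggregates) over that dimension."""
    yes = no = 0
    for (w, o, p), n in cube.items():
        if week_no not in (None, w) or outlook not in (None, o):
            continue
        if p == "yes":
            yes += n
        else:
            no += n
    return yes, no

print(cell(1, "sunny"))          # a single cell of the cube
print(cell(week_no=1))           # roll-up over outlook
print(cell(outlook="overcast"))  # roll-up over time
print(cell())                    # grand total
```

    Fixing one dimension corresponds to a slice, while passing `None` for a dimension corresponds to a roll-up over it.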


    Data Mining:

    Using the ID3 algorithm we can produce the following decision tree:

        outlook = sunny
            humidity = high: no
            humidity = normal: yes
        outlook = overcast: yes
        outlook = rainy
            windy = true: no
            windy = false: yes
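
    ID3 chooses the attribute to split on by information gain. The sketch below computes the gain of three attributes on the weather table and checks that the tree above classifies all 14 days correctly. The slides do not say how the numeric humidity column was turned into high/normal, so the cut-off used here (humidity above 80 counts as high) is an assumption; temperature is left out because it does not appear in the tree.

```python
from collections import Counter
from math import log2

# (outlook, humidity, windy, play) from the weather table
DATA = [
    ("sunny", 85, False, "no"), ("sunny", 90, True, "no"),
    ("overcast", 86, False, "yes"), ("rainy", 96, False, "yes"),
    ("rainy", 80, False, "yes"), ("rainy", 70, True, "no"),
    ("overcast", 65, True, "yes"), ("sunny", 95, False, "no"),
    ("sunny", 70, False, "yes"), ("rainy", 80, False, "yes"),
    ("sunny", 70, True, "yes"), ("overcast", 90, True, "yes"),
    ("overcast", 75, False, "yes"), ("rainy", 91, True, "no"),
]

HIGH_HUMIDITY = 80  # assumption: humidity above this value is treated as "high"

rows = [
    {"outlook": o, "humidity": "high" if h > HIGH_HUMIDITY else "normal",
     "windy": w, "play": p}
    for o, h, w, p in DATA
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute):
    """Entropy of the class minus the weighted entropy after splitting."""
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r["play"] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r["play"] for r in rows]) - remainder

gains = {a: information_gain(rows, a) for a in ("outlook", "humidity", "windy")}
best = max(gains, key=gains.get)  # attribute ID3 would place at the root

def tree(row):
    """The decision tree from the slide, written as nested conditions."""
    if row["outlook"] == "sunny":
        return "no" if row["humidity"] == "high" else "yes"
    if row["outlook"] == "overcast":
        return "yes"
    return "no" if row["windy"] else "yes"  # outlook = rainy

correct = sum(tree(r) == r["play"] for r in rows)
print(gains, best, correct)
```

    Under this discretization, outlook has the largest gain, which is consistent with it sitting at the root of the tree.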


    Major Issues in Data Mining

    Mining methodology and user interaction
        Mining different kinds of knowledge in databases
        Interactive mining of knowledge at multiple levels of abstraction
        Incorporation of background knowledge
        Data mining query languages and ad-hoc data mining
        Expression and visualization of data mining results
        Handling noise and incomplete data
        Pattern evaluation: the interestingness problem

    Performance and scalability
        Efficiency and scalability of data mining algorithms
        Parallel, distributed and incremental mining methods


    Issues relating to the diversity of data types
        Handling relational and complex types of data
        Mining information from heterogeneous databases and global information systems (WWW)

    Issues related to applications and social impacts
        Application of discovered knowledge
            Domain-specific data mining tools
            Intelligent query answering
            Process control and decision making
        Integration of the discovered knowledge with existing knowledge: a knowledge fusion problem
        Protection of data security, integrity, and privacy


    Data Mining: Classification Schemes

    Decisions in data mining

    Kinds of databases to be mined

    Kinds of knowledge to be discovered

    Kinds of techniques utilized

    Kinds of applications adapted


    Decisions in Data Mining

    Databases to be mined
        Relational, transactional, object-oriented, object-relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW, etc.

    Knowledge to be mined
        Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.
        Multiple/integrated functions and mining at multiple levels

    Techniques utilized
        Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural network, etc.

    Applications adapted
        Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.


    Data Mining Tasks

    Data mining tasks are generally divided into two categories:
        1. Predictive tasks
        2. Descriptive tasks


    Predictive Tasks

    Objective: predict the value of a specific attribute (the target or dependent variable) based on the values of other attributes (the explanatory variables).

    Example 1: Judge whether a patient has a specific disease based on his/her medical test results.


    Predictive Model

    Example 2: Credit card company
        Every purchase is placed in 1 of 4 classes
            Authorize
            Ask for further identification before authorizing
            Do not authorize
            Do not authorize but contact police
        Two functions of data mining
            Examine historical data to determine how the data fit into the 4 classes
            Apply the model to each new purchase
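
    The same two-step pattern (build a model from historical data, then apply it to each new record) can be sketched in a few lines. The lecture gives no purchase records, so the sketch reuses the outlook and play columns of the weather table from the earlier slides as the "historical data". The single-attribute majority rule used here is a deliberately simple stand-in rather than the method a card issuer would use, and the names `build_model`, `apply_model` and the `default` fallback are invented for this example.

```python
from collections import Counter, defaultdict

# Historical data: (explanatory attribute, class label) = (outlook, play)
HISTORY = [
    ("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
    ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
    ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("rainy", "no"),
]

def build_model(history):
    """Step 1: examine historical data and learn the majority class per value."""
    counts = defaultdict(Counter)
    for value, label in history:
        counts[value][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

def apply_model(model, value, default="yes"):
    """Step 2: classify a new record; `default` covers values never seen before."""
    return model.get(value, default)

model = build_model(HISTORY)
new_prediction = apply_model(model, "sunny")  # a new, unlabeled record
training_hits = sum(apply_model(model, v) == label for v, label in HISTORY)
print(model, new_prediction, training_hits)
```

    In the credit card setting, the class label would be one of the four authorization classes and the explanatory attributes would describe the purchase, but the split into a model-building step and a model-application step stays the same.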


    Data Mining tasks

    [Figure taken from M. H. Dunham's book on data mining]
        Data Mining
            Predictive: Classification, Regression, Time Series Analysis, Prediction
            Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery


    Summary


    Conferences and Journals on Data Mining

    KDD Conferences
        ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
        SIAM Data Mining Conf. (SDM)
        (IEEE) Int. Conf. on Data Mining (ICDM)
        Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
        Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

    Other related conferences
        ACM SIGMOD
        VLDB
        (IEEE) ICDE
        WWW, SIGIR
        ICML, CVPR, NIPS

    Journals
        Data Mining and Knowledge Discovery (DAMI or DMKD)
        IEEE Trans. on Knowledge and Data Eng. (TKDE)
        KDD Explorations
        ACM Trans. on KDD


    Where to Find References?

    DBLP, CiteSeer, Google

    Data mining and KDD
        Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
        Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD

    Database systems
        Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
        Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.

    AI & Machine Learning
        Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
        Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.

    Web and IR
        Conferences: SIGIR, WWW, CIKM, etc.
        Journals: WWW: Internet and Web Information Systems

    Statistics
        Conferences: Joint Stat. Meeting, etc.
        Journals: Annals of Statistics, etc.

    Visualization
        Conference proceedings: CHI, ACM-SIGGraph, etc.
        Journals: IEEE Trans. Visualization and Computer Graphics, etc.


    Recommended Reference Books

    S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
    R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
    T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
    U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
    U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
    J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
    D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
    T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
    B. Liu. Web Data Mining. Springer, 2006
    T. M. Mitchell. Machine Learning. McGraw Hill, 1997
    G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
    S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
    I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005


    Web Resources

    Textbook's website: http://www-users.cs.umn.edu/~kumar/dmbook/

    PowerPoints for the textbook are available via anonymous ftp at: ftp://ftp.aw.com/cseng/authors/tan

    Other resources: http://web.ccsu.edu/datamining/resources.html