Data Mining - biomisa.orgbiomisa.org/wp-content/uploads/2019/10/Lect-1-DM.pdf · • The Elements...

Data Mining

Lecture # 1: Introduction & Fundamentals

Intro & Affiliations

Area of research: Analysis of medical images/signals using image/signal processing and machine learning techniques

www.biomisa.org/usman

www.biomisa.org

www.risetech.pk

www.albasr.com

www.ekko.pk

Reference Material

Text Book:

Data Mining --- Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 3rd Edition. (ISBN: 1-55860-489-8)

Ref Books:

• Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley

• Principles of Data Mining, by Hand, Mannila, and Smyth, MIT Press, 2001. (ISBN: 0-262-08290-X)

• The Elements of Statistical Learning --- Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman, Springer, 2001. (ISBN: 0-387-95284-5)

• Mining the Web --- Discovering Knowledge from Hypertext Data, by Chakrabarti, Morgan Kaufmann, 2003. (ISBN: 1-55860-754-4)

Reference Material II

• Software:

– Weka: Data Mining Software in Java, by University of Waikato, New Zealand

– RapidMiner

– GeNIe & SMILE, developed at the Decision Systems Laboratory, University of Pittsburgh

– bnlearn: an R package for Bayesian network learning and inference

– . . .

• Website:

– http://www.kdnuggets.com/

– ….

Topics

• Scope: Data Mining

• Topics:

– Introduction to Data Mining

– Data Understanding

– Data Preprocessing

– Data Warehousing

– Data Cube Technology

– Mining Frequent Patterns

– Advanced Pattern Mining

– Classification

– Advanced Classification Methods

– Clustering

– Outlier Detection

Grading

• Assignments 10%

• Quizzes 10%

• Project 10%

• Mid-Term Exam 30%

• Final Exam 40%

Assignment and Project

• Assignments

– No assignments will be accepted after the due date.

– Programming assignments should be well documented.

– Students are “not” allowed to “copy” each other’s work. Any such work will be marked zero.

– No tolerance for cheating. If you are not able to explain your assignment, it will be considered cheating.

• Projects

– Applying data mining techniques to solve actual problems.

DATA MINING

Definition

“Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.”

Valid: The patterns hold in general.

Novel: We did not know the pattern beforehand.

Useful: We can devise actions from the patterns.

Understandable: We can interpret and comprehend the patterns.

Alternative names

– Knowledge discovery (mining) in databases (KDD)

– Knowledge extraction

– Knowledge engineering

– Data Science

– Data/pattern analysis

– Data archeology

– Data dredging

– Information harvesting

– Business intelligence

– etc.

We will return to the actual topic in two minutes. In the meantime, we are going to play a quick game.

I am going to show you some problems which were shown to pigeons!

Let us see if you are as smart as a pigeon!

Pigeon Problem 1

[Figure: examples of class A and examples of class B; each example is a pair of bars]

What class is this object?

What about this one, A or B?

This is a B!

Here is the rule: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 2

[Figure: examples of class A and examples of class B]

Oh! This one's hard!

Even I know this one.

So this one is an A.

The rule is as follows: if the two bars are equal sizes, it is an A; otherwise it is a B.

Pigeon Problem 3

[Figure: examples of class A and examples of class B]

This one is really hard! What is this, A or B?

It is a B!

The rule is as follows: if the sum of the two bars is less than or equal to 10, it is an A; otherwise it is a B.

Pigeon Problem 1

[Figure: plot of the class A and class B examples; axis label "Right Bar", ticks 1 to 10]

Here is the rule again: if the left bar is smaller than the right bar, it is an A; otherwise it is a B.

Pigeon Problem 2

[Figure: plot of the class A and class B examples; axis label "Right Bar", ticks 1 to 10]

Let me look it up… here it is. The rule is: if the two bars are equal sizes, it is an A; otherwise it is a B.

Pigeon Problem 3

[Figure: plot of the class A and class B examples; axis label "Right Bar", ticks 10 to 100]

The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A; otherwise it is a B.
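
The three pigeon rules can be written down directly as tiny classifiers. The following is only a minimal sketch in Python; the function names and the sample bar heights are illustrative assumptions and do not come from the slides. For non-negative bar heights, the rule "square of the sum is at most 100" gives the same answer as "sum is at most 10".

```python
def pigeon_rule_1(left, right):
    """Problem 1: class A if the left bar is smaller than the right bar."""
    return "A" if left < right else "B"


def pigeon_rule_2(left, right):
    """Problem 2: class A if the two bars have equal size."""
    return "A" if left == right else "B"


def pigeon_rule_3(left, right):
    """Problem 3: class A if the square of the sum of the bars is at most 100."""
    return "A" if (left + right) ** 2 <= 100 else "B"


# Hypothetical bar heights, chosen only to exercise each rule.
print(pigeon_rule_1(3, 5))  # left < right
print(pigeon_rule_2(4, 4))  # equal bars
print(pigeon_rule_3(6, 6))  # sum is 12, square is 144
```

In the game the rule is hidden and has to be guessed from the labeled examples; writing the rule by hand, as here, is the part a learning algorithm is meant to automate.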

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused

– Web data, e-commerce

– purchases at department/grocery stores

– Bank/Credit Card transactions

A Single View to the Customer

[Figure: the customer at the center, surrounded by data sources: Social Media, Gaming, Entertain, Banking/Finance, Our Known History, Purchase]

Variety (Complexity)

• Relational Data (Tables/Transaction/Legacy Data)

• Text Data (Web)

• Semi-structured Data (XML)

• Graph Data

– Social Network, Semantic Web (RDF), …

• Streaming Data – You can only scan the data once

• A single application can be generating/collecting many types of data

• Big Public Data (online, weather, finance, etc.)

To extract knowledge, all these types of data need to be linked together

Evolution of Sciences

• Before 1600, empirical science

• 1600-1950s, theoretical science

– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

• 1950s-1990s, computational science

– Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)

– Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

• 1990-now, data science

– The flood of data from new scientific instruments and simulations

– The ability to economically store and manage petabytes of data online

– The Internet and computing Grid that makes all these archives universally accessible

– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Evolution of Database Technology

What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain locations (O’Brien, O’Rourke, O’Reilly… in Boston area)

– Identify customers with similar buying habits

– Find all credit applicants who are poor credit risks.

What is not Data Mining?

– Look up phone number in phone directory

– Identify customers who have purchased more than $10,000 in the last month.

– Find all credit applicants with last name of Smith.
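
The contrast can be made concrete on a toy table. A lookup retrieves rows that match a condition stated in advance, while mining has to surface a regularity nobody wrote down. The sketch below is only an illustration: the records are made up, and the `lift` helper (observed co-occurrence divided by what independence would predict) is one possible measure chosen for this example, not something taken from the slides.

```python
from collections import Counter

# Hypothetical (surname, city) records, invented for illustration.
people = [
    ("O'Brien", "Boston"), ("O'Reilly", "Boston"), ("O'Brien", "Boston"),
    ("Smith", "Boston"), ("Smith", "Denver"), ("Garcia", "Denver"),
    ("Smith", "Denver"), ("O'Brien", "Denver"),
]

# Not data mining: a plain lookup with a condition known beforehand.
smiths = [p for p in people if p[0] == "Smith"]


def lift(records):
    """For each (surname, city) pair, compare how often it occurs with how
    often it would occur if surname and city were independent."""
    n = len(records)
    pair_counts = Counter(records)
    name_counts = Counter(name for name, _ in records)
    city_counts = Counter(city for _, city in records)
    return {
        (name, city): (k / n) / ((name_counts[name] / n) * (city_counts[city] / n))
        for (name, city), k in pair_counts.items()
    }


# Closer to data mining: which surnames are over-represented in which city?
scores = lift(people)
over_represented = sorted(pair for pair, s in scores.items() if s > 1.0)
print(over_represented)
```

A lift above 1 means the pair shows up more often than chance alone would suggest; with realistic data one would also require a minimum number of occurrences before trusting such a pattern.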

Knowledge Discovery (KDD) Process

• This is a view from typical database systems and data warehousing communities

• Data mining plays an essential role in the knowledge discovery process

[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

A Mining Framework

• Mining usually involves

– Data cleaning

– Data integration from multiple sources

– Warehousing the data

– Data cube construction

– Data selection for data mining

– Data mining

– Presentation of the mining results

– Patterns and knowledge to be used or stored into knowledge-base
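
The steps of this framework can be chained as plain functions. The sketch below is a toy illustration under stated assumptions: the record layout, the function names, the choice of "item pairs bought together" as the mining task, and the 0.5 support threshold are all hypothetical. Warehousing and data cube construction are left out for brevity.

```python
from collections import Counter
from itertools import combinations


def clean(records):
    """Data cleaning: drop records with a missing customer or an empty basket."""
    return [r for r in records if r.get("customer") is not None and r.get("items")]


def integrate(*sources):
    """Data integration: merge several sources into one list of records."""
    merged = []
    for source in sources:
        merged.extend(source)
    return merged


def select(records):
    """Data selection: keep only the task-relevant field (the basket)."""
    return [frozenset(r["items"]) for r in records]


def mine(baskets):
    """Data mining: count how often each pair of items is bought together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support=0.5):
    """Pattern evaluation: keep pairs whose support reaches the threshold."""
    return {
        pair: count / n_baskets
        for pair, count in pair_counts.items()
        if count / n_baskets >= min_support
    }


# Two hypothetical sources with some dirty records.
store = [
    {"customer": 1, "items": ["bread", "milk"]},
    {"customer": 2, "items": ["bread", "milk", "eggs"]},
    {"customer": None, "items": ["milk"]},
]
web = [
    {"customer": 3, "items": ["eggs", "milk"]},
    {"customer": 4, "items": []},
]

baskets = select(integrate(clean(store), clean(web)))
patterns = evaluate(mine(baskets), len(baskets))
print(patterns)
```

Keeping each step a separate function mirrors the framework: any stage can be replaced (for example, a different mining algorithm) without touching the others.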

Data Mining in Business Intelligence

[Figure: pyramid of layers, listed here from top to bottom, with increasing potential to support business decisions toward the top; roles labeled alongside: End User, Business Analyst]

• Decision Making

• Data Presentation: Visualization Techniques

• Data Mining: Information Discovery

• Data Exploration: Statistical Summary, Querying, and Reporting

• Data Preprocessing/Integration, Data Warehouses

• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Mining vs. Data Exploration

• Business intelligence view

– Warehouse, data cube, reporting but not much mining

• Business objects vs. data mining tools

• Supply chain example: tools

• Data presentation

• Exploration

KDD Process: A Typical View from ML and Statistics

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]

• This is a view from typical machine learning and statistics communities

• Data pre-processing: data integration, normalization, feature selection, dimension reduction

• Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …

• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
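
Normalization is one of the pre-processing steps named in this view. The slides do not say which scheme is meant, so the sketch below shows two common ones, min-max scaling and z-score standardization, as an assumption. The function names are illustrative, and both functions assume the input values are not all identical.

```python
from statistics import mean, pstdev


def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]  # assumes hi > lo


def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]  # assumes sigma > 0


heights_cm = [150.0, 160.0, 170.0, 180.0]  # hypothetical feature values
print(min_max(heights_cm))
print(z_score(heights_cm))
```

Either scheme puts features measured in different units on a comparable scale, which matters for any later step that relies on distances between records.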

Example: Medical Data Mining

• Health care & medical data mining – often adopted such a view in statistics and machine learning

• Preprocessing of the data (including feature extraction and dimension reduction)

• Classification and/or clustering processes

• Post-processing for presentation

Origins of Data Mining

• Draws ideas from: machine learning/AI, statistics, and database systems

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Disciplines, Visualization]

What is Machine Learning?

• Machine Learning: the study of algorithms that

– improve their performance

– at some task

– with experience

• Optimize a performance criterion using example data or past experience.

• Role of Statistics: Inference from a sample

• Role of Computer science: Efficient algorithms to

– Solve the optimization problem

– Represent and evaluate the model for inference

Machine Learning

• According to Herbert Simon, learning is “any change in a system that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population.” [G. F. Luger and W. A. Stubblefield, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, The Benjamin/Cummings Publishing Company, Inc., 1989.]

Why “Learn”?

• Machine learning is programming computers to optimize a performance criterion using example data or past experience.

• Learning is used when:

– Human expertise does not exist (navigating on Mars)

– Humans are unable to explain their expertise (speech recognition)

– Solution changes in time (routing on a computer network)

– Solution needs to be adapted to particular cases (user biometrics)
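
"Optimizing a performance criterion using example data" can be illustrated with the simplest possible case: fitting a straight line so that the mean squared error on the examples is as small as possible. This is a minimal sketch, not a method prescribed by the slides; the data points and function names are made up, and the closed-form least-squares solution is used here purely for brevity.

```python
def mse(w, b, xs, ys):
    """Performance criterion: mean squared error of the line y = w*x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Least-squares fit: the (w, b) that minimizes the mean squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = sxy / sxx  # assumes the x values are not all equal
    b = mean_y - w * mean_x
    return w, b


# Hypothetical example data (the "experience").
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w, b = fit_line(xs, ys)
print(mse(0.0, 0.0, xs, ys))  # criterion before learning (arbitrary guess)
print(mse(w, b, xs, ys))      # criterion after fitting to the examples
```

The parameters are not written by the programmer; they are computed from the examples, and the criterion measures how well they fit.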

The machine learning pipeline

ML Methods

• Supervised Learning

– Classification

– Regression/Prediction

• Unsupervised Learning

• Association Analysis
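
The difference between supervised and unsupervised learning is whether class labels are available during training. The sketch below contrasts the two on one-dimensional data. It is an illustration only: a nearest-centroid classifier stands in for supervised learning and k-means with k = 2 for unsupervised learning, neither of which the slides single out, and the data and names are hypothetical.

```python
def nearest_centroid_fit(xs, labels):
    """Supervised: use the given labels to compute one centroid per class."""
    sums, counts = {}, {}
    for x, label in zip(xs, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def two_means(xs, iterations=10):
    """Unsupervised: no labels; alternate assignment and centroid update
    (k-means with k = 2). Assumes xs holds at least two distinct values."""
    c0, c1 = min(xs), max(xs)
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1


xs = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = nearest_centroid_fit(xs, labels)     # needs labels
print(nearest_centroid_predict(centroids, 1.5))
print(two_means(xs))                             # finds the groups without labels
```

On this data both approaches recover the same two groups; the supervised one can also name them, because the names came with the training examples.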

Predicting house prices

Sentiment analysis

Document retrieval

Product recommendation

Visual product recommender

Model Choice

– What type of classifier shall we use? How shall we select its parameters? Is there a best classifier...?

– How do we train...? How do we adjust the parameters of the model (classifier) we picked so that the model fits the data?

Features

• Features: a set of variables believed to carry discriminating and characterizing information about the objects under consideration

• Feature vector: A collection of d features, ordered in some meaningful way into a d-dimensional column vector, that represents the signature of the object to be identified.

• Feature space: The d-dimensional space in which the feature vectors lie. A d-dimensional vector in a d-dimensional space constitutes a point in that space.
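
Since each object becomes a point in the d-dimensional feature space, similarity between objects can be measured as distance between points. The sketch below uses Euclidean distance and a nearest-neighbor lookup; both are assumptions made for illustration (the slides do not fix a distance measure), and the feature values and labels are made up.

```python
import math


def euclidean(u, v):
    """Distance between two feature vectors of the same dimension d."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_label(query, labeled_vectors):
    """Return the label of the stored feature vector closest to the query."""
    return min(labeled_vectors, key=lambda pair: euclidean(query, pair[0]))[1]


# Hypothetical objects described by d = 3 features each.
samples = [
    ((1.0, 1.0, 1.0), "A"),
    ((5.0, 5.0, 5.0), "B"),
]

print(euclidean((0.0, 0.0, 0.0), (1.0, 2.0, 2.0)))
print(nearest_label((1.5, 1.0, 0.5), samples))
```

The order of the features must be the same in every vector; otherwise the per-coordinate differences compare unrelated quantities.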

Features

[Figure: feature space (3D)]

Features

• Feature Choice

– Good Features

• Ideally, for a given group of patterns coming from the same class, feature values should all be similar

• For patterns coming from different classes, the feature values should be different.

– Bad Features

• irrelevant, noisy, outlier?
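
The description of a good feature (similar values within a class, different values across classes) can be turned into a number. One common way to do this, which is not mentioned in the slides and is used here only as an illustration, is a Fisher-style score: the squared difference of the class means divided by the sum of the class variances. The feature values below are made up.

```python
from statistics import mean, pvariance


def fisher_score(values_a, values_b):
    """Between-class separation divided by within-class spread for a single
    feature. Assumes at least one class has non-zero variance."""
    separation = (mean(values_a) - mean(values_b)) ** 2
    spread = pvariance(values_a) + pvariance(values_b)
    return separation / spread


# Hypothetical feature values for objects of class A and class B.
good_a, good_b = [1.0, 1.1, 0.9], [3.0, 3.1, 2.9]  # tight and far apart
bad_a, bad_b = [1.0, 3.0, 2.0], [2.0, 1.0, 3.0]    # overlapping classes

print(fisher_score(good_a, good_b))
print(fisher_score(bad_a, bad_b))
```

A high score means the feature alone separates the two classes well; a score near zero means the class means coincide relative to the spread, so the feature carries little discriminating information by itself.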

Features

[Figure: examples of “good” and “bad” features; panels: linear separability, non-linear separability, highly correlated features, multi-modal]

Readings from Book (3rd Edn.)

• Chapter – 1

Acknowledgments

• Lecture slides are adapted from Data Mining: Concepts and Techniques by Han, Kamber and Pei https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm

• Lecture slides are adapted from lectures of Dr. Aman Ullah, SS CASE IT, Islamabad

• Lecture series https://www.youtube.com/watch?v=h-q582wpb4Q&list=PLYwpaL_SFmcChP0xiW3KK9elNuhfCLVVi

• Lecture series https://www.youtube.com/watch?v=wAbyG4M2gns&t=1751s

• http://www.cs.uoi.gr/~tsap/teaching/2012f-cs059/slides-en.html
