Data Mining - biomisa.orgbiomisa.org/wp-content/uploads/2019/10/Lect-1-DM.pdf · • The Elements...


Data Mining

Lecture # 1: Introduction & Fundamentals


Intro & Affiliations

Area of research: Analysis of medical images/signals using image/signal processing and machine learning techniques

www.biomisa.org/usman

www.biomisa.org

www.risetech.pk

www.albasr.com

www.ekko.pk

Reference Material

Text Book:

Data Mining --- Concepts and Techniques, by Han and Kamber, Morgan Kaufmann, 3rd Edition. (ISBN:1-55860-489-8)

Ref Books:

• Introduction to Data Mining, by Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Addison Wesley

• Principles of Data Mining, by Hand, Mannila, and Smyth, MIT Press, 2001. (ISBN:0-262-08290-X)

• The Elements of Statistical Learning --- Data Mining, Inference, and Prediction, by Hastie, Tibshirani, and Friedman, Springer, 2001. (ISBN:0-387-95284-5)

• Mining the Web --- Discovering Knowledge from Hypertext Data, by Chakrabarti, Morgan Kaufmann, 2003. (ISBN:1-55860-754-4)

Reference Material II

• Software:

– Weka: Data Mining Software in Java, by University of Waikato, New Zealand

– RapidMiner

– GeNIe & SMILE, developed at the Decision Systems Laboratory, University of Pittsburgh

– bnlearn: an R package for Bayesian network learning and inference

– . . .

• Website:

– http://www.kdnuggets.com/

– ….

Topics

• Scope: Data Mining

• Topics:

– Introduction to Data Mining

– Data Understanding

– Data Preprocessing

– Data Warehousing

– Data Cube Technology

– Mining Frequent Patterns

– Advanced Pattern Mining

– Classification

– Advanced Classification Methods

– Clustering

– Outlier Detection

Grading

• Assignments 10%

• Quizzes 10%

• Project 10%

• Mid-Term Exam 30%

• Final Exam 40%


Assignment and Project

• Assignments

– No assignments will be accepted after the due date.

– Programming assignments should be well documented.

– Students are not allowed to copy each other’s work. Any such work will be marked zero.

– No tolerance for cheating. If you are not able to explain your assignment, it will be considered cheating.

• Projects

– Applying data mining techniques to solve actual problems.

DATA MINING


Definition

“Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.”

Valid: The patterns hold in general.

Novel: We did not know the pattern beforehand.

Useful: We can devise actions from the patterns.

Understandable: We can interpret and comprehend the patterns.

Alternative names

– Knowledge discovery (mining) in databases (KDD)

– Knowledge extraction

– Knowledge engineering

– Data Science

– Data/pattern analysis

– Data archeology

– Data dredging

– Information harvesting

– Business intelligence

– etc.


We will return to the actual topic in two minutes. In the meantime, we are going to play a quick game.

I am going to show you some problems which were shown to pigeons!

Let us see if you are as smart as a pigeon!

Pigeon Problem 1

Examples of class A (left bar, right bar): (3, 4), (1.5, 5), (6, 8), (2.5, 5)

Examples of class B (left bar, right bar): (5, 2.5), (5, 2), (8, 3), (4.5, 3)

What class is the object (8, 1.5)?

What about (4.5, 7), A or B?

(8, 1.5) is a B!

Here is the rule. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.

Pigeon Problem 2

Examples of class A (left bar, right bar): (4, 4), (5, 5), (6, 6), (3, 3)

Examples of class B (left bar, right bar): (5, 2.5), (2, 5), (5, 3), (2.5, 3)

What class is (8, 1.5)? Even I know this one.

What about (7, 7)? Oh! This one's hard!

So (7, 7) is an A.

The rule is as follows: if the two bars are equal sizes, it is an A. Otherwise it is a B.

Pigeon Problem 3

Examples of class A (left bar, right bar): (4, 4), (1, 5), (6, 3), (3, 7)

Examples of class B (left bar, right bar): (5, 6), (7, 5), (4, 8), (7, 7)

This one is really hard! What is (6, 6), A or B?

It is a B!

The rule is as follows: if the sum of the two bars is less than or equal to 10, it is an A. Otherwise it is a B.
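The three rules can be written as small predicates and checked against the example pairs from the slides. This is a minimal sketch; the function names and the (left, right) tuple layout are assumptions made for illustration.

```python
# Each object is a (left_bar, right_bar) pair of bar heights.

def rule_1(left, right):
    """Pigeon Problem 1: A if the left bar is smaller than the right bar."""
    return "A" if left < right else "B"

def rule_2(left, right):
    """Pigeon Problem 2: A if the two bars are equal in size."""
    return "A" if left == right else "B"

def rule_3(left, right):
    """Pigeon Problem 3: A if the sum of the two bars is at most 10."""
    return "A" if left + right <= 10 else "B"

# Labeled examples copied from the slides.
problem_1 = {"A": [(3, 4), (1.5, 5), (6, 8), (2.5, 5)],
             "B": [(5, 2.5), (5, 2), (8, 3), (4.5, 3)]}
problem_2 = {"A": [(4, 4), (5, 5), (6, 6), (3, 3)],
             "B": [(5, 2.5), (2, 5), (5, 3), (2.5, 3)]}
problem_3 = {"A": [(4, 4), (1, 5), (6, 3), (3, 7)],
             "B": [(5, 6), (7, 5), (4, 8), (7, 7)]}

def is_consistent(rule, examples):
    """True if the rule reproduces the label of every example."""
    return all(rule(left, right) == label
               for label, pairs in examples.items()
               for (left, right) in pairs)

print(is_consistent(rule_1, problem_1))            # True
print(rule_1(8, 1.5), rule_2(7, 7), rule_3(6, 6))  # B A B
```

The last line reproduces the answers given on the slides for the three query objects.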

Pigeon Problem 1

Here is the rule again. If the left bar is smaller than the right bar, it is an A, otherwise it is a B.

[Figure: scatter plot of the class A and class B examples, Left Bar (vertical axis, 1 to 10) against Right Bar (horizontal axis, 1 to 10)]

Pigeon Problem 2

[Figure: scatter plot of the class A and class B examples, Left Bar (vertical axis, 1 to 10) against Right Bar (horizontal axis, 1 to 10)]

Let me look it up… here it is: the rule is, if the two bars are equal sizes, it is an A. Otherwise it is a B.

Pigeon Problem 3

[Figure: scatter plot of the class A and class B examples, Left Bar (vertical axis, ticks 10 to 100) against Right Bar (horizontal axis, ticks 10 to 100)]

The rule again: if the square of the sum of the two bars is less than or equal to 100, it is an A. Otherwise it is a B. Because bar heights are non-negative, this is the same condition as the sum being less than or equal to 10.
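In a data mining setting the rule is not handed over; it has to be found from the labeled examples. The sketch below learns a threshold on the sum of the two bars from the Problem 3 examples. Choosing the sum as the feature and placing the threshold midway between the classes are assumptions for illustration, not a method taken from the slides.

```python
# Problem 3 examples from the slides, as (left_bar, right_bar) pairs.
class_a = [(4, 4), (1, 5), (6, 3), (3, 7)]
class_b = [(5, 6), (7, 5), (4, 8), (7, 7)]

def learn_sum_threshold(a_examples, b_examples):
    """Place a threshold midway between the largest class-A sum
    and the smallest class-B sum (assumes A sums lie below B sums)."""
    max_a = max(left + right for left, right in a_examples)
    min_b = min(left + right for left, right in b_examples)
    if max_a >= min_b:
        raise ValueError("classes are not separable by a threshold on the sum")
    return (max_a + min_b) / 2

def classify(pair, threshold):
    """Label a (left, right) pair using the learned threshold on the sum."""
    left, right = pair
    return "A" if left + right <= threshold else "B"

threshold = learn_sum_threshold(class_a, class_b)
print(threshold)                    # 10.5
print(classify((6, 6), threshold))  # B
```

The learned boundary, left + right = 10.5, is close to the stated rule but not identical to it: any threshold from 10 up to (but not including) 11 explains these eight examples equally well.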

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused

– Web data, e-commerce

– purchases at department/grocery stores

– Bank/Credit Card transactions


A Single View to the Customer

[Figure: the Customer at the center, linked to data from Social Media, Gaming, Entertain, Banking, Finance, Our Known History, and Purchase]

Variety (Complexity)

• Relational Data (Tables/Transaction/Legacy Data)

• Text Data (Web)

• Semi-structured Data (XML)

• Graph Data

– Social Network, Semantic Web (RDF), …

• Streaming Data

– You can only scan the data once

• A single application can be generating/collecting many types of data

• Big Public Data (online, weather, finance, etc.)

To extract knowledge, all these types of data need to be linked together.

Evolution of Sciences

• Before 1600, empirical science

• 1600-1950s, theoretical science

– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.

• 1950s-1990s, computational science

– Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)

– Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.

• 1990-now, data science

– The flood of data from new scientific instruments and simulations

– The ability to economically store and manage petabytes of data online

– The Internet and computing Grid that makes all these archives universally accessible

– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!

Evolution of Database Technology


What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain locations (O’Brien, O’Rourke, O’Reilly… in Boston area)

– Identify customers with similar buying habits

– Find all credit applicants who are poor credit risks.

What is not Data Mining?

– Look up phone number in phone directory

– Identify customers who have purchased more than $10,000 in the last month.

– Find all credit applicants with last name of Smith.
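The difference can be made concrete on a toy purchase table. The first function is a plain query: the condition is fixed in advance. The second searches for a pattern that was not specified beforehand, here the item pairs bought together most often. The records, field layout, and function names are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records: (customer, monthly_total, items_bought).
purchases = [
    ("ann",   12000, {"laptop", "mouse", "bag"}),
    ("bob",     300, {"bread", "milk"}),
    ("carol",   450, {"bread", "milk", "eggs"}),
    ("dave",  15000, {"laptop", "mouse"}),
    ("erin",    200, {"milk", "eggs"}),
]

def big_spenders(records, limit=10000):
    """Not data mining: a lookup with a condition known beforehand."""
    return [name for name, total, _ in records if total > limit]

def most_common_pairs(records):
    """Data mining flavor: discover which item pairs co-occur most often."""
    counts = Counter()
    for _, _, items in records:
        counts.update(combinations(sorted(items), 2))
    top = max(counts.values())
    return sorted(pair for pair, n in counts.items() if n == top)

print(big_spenders(purchases))
print(most_common_pairs(purchases))
```

The query returns exactly what was asked for, while the pair search returns groupings (such as bread with milk) that nobody named in advance.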


Knowledge Discovery (KDD) Process

• This is a view from typical database systems and data warehousing communities

• Data mining plays an essential role in the knowledge discovery process

[Figure: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

A Mining Framework

• Mining usually involves

– Data cleaning

– Data integration from multiple sources

– Warehousing the data

– Data cube construction

– Data selection for data mining

– Data mining

– Presentation of the mining results

– Patterns and knowledge to be used or stored into knowledge-base
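The steps listed above can be sketched as a chain of small functions. Everything here is a toy stand-in under stated assumptions: the two record sources, the field names, and the use of a per-city average as the "mining" step are hypothetical, and data cube construction is left out.

```python
# Two hypothetical sources with inconsistent formatting and a missing value.
store_sales = [{"city": "Lahore", "amount": 120.0},
               {"city": "lahore ", "amount": None},
               {"city": "Karachi", "amount": 80.0}]
web_sales = [{"city": "Karachi", "amount": 100.0},
             {"city": "Islamabad", "amount": 60.0}]

def clean(records):
    """Data cleaning: drop rows with missing amounts, normalize city names."""
    return [{"city": r["city"].strip().title(), "amount": r["amount"]}
            for r in records if r["amount"] is not None]

def integrate(*sources):
    """Data integration: merge several cleaned sources into one collection."""
    return [row for source in sources for row in source]

def select(records, min_amount):
    """Data selection: keep only the task-relevant rows."""
    return [r for r in records if r["amount"] >= min_amount]

def mine(records):
    """Stand-in for the mining step: average sale amount per city."""
    amounts = {}
    for r in records:
        amounts.setdefault(r["city"], []).append(r["amount"])
    return {city: sum(v) / len(v) for city, v in amounts.items()}

def present(patterns):
    """Presentation: cities ordered by average amount, largest first."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)

warehouse = integrate(clean(store_sales), clean(web_sales))
knowledge = present(mine(select(warehouse, min_amount=70.0)))
print(knowledge)
```

In a real system each function would be replaced by a far heavier component (an ETL job, a warehouse, a mining algorithm), but the order of the stages is the point of the sketch.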


Data Mining in Business Intelligence

[Figure: pyramid of layers, with increasing potential to support business decisions toward the top. Roles shown alongside, from top to bottom: End User, Business Analyst, Data Analyst, DBA]

Layers, from top to bottom:

• Decision Making

• Data Presentation: Visualization Techniques

• Data Mining: Information Discovery

• Data Exploration: Statistical Summary, Querying, and Reporting

• Data Preprocessing/Integration, Data Warehouses

• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

Mining vs. Data Exploration

• Business intelligence view

– Warehouse, data cube, reporting but not much mining

• Business objects vs. data mining tools

• Supply chain example: tools

• Data presentation

• Exploration


KDD Process: A Typical View from ML and Statistics

• This is a view from typical machine learning and statistics communities

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]

Data Pre-Processing:

– Data integration

– Normalization

– Feature selection

– Dimension reduction

Data Mining:

– Pattern discovery

– Association & correlation

– Classification

– Clustering

– Outlier analysis

– … … … …

Post-Processing:

– Pattern evaluation

– Pattern selection

– Pattern interpretation

– Pattern visualization
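Two of the pre-processing steps named above, normalization and feature selection, can be sketched in a few lines. Min-max scaling and a variance threshold are common choices but are assumptions here; the slides do not prescribe a specific method, and the data is invented.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: no spread to rescale
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def variance(column):
    """Population variance of a list of numbers."""
    mean = sum(column) / len(column)
    return sum((x - mean) ** 2 for x in column) / len(column)

def select_features(columns, min_variance):
    """Keep the names of features whose normalized variance exceeds a threshold."""
    return [name for name, col in columns.items()
            if variance(min_max_normalize(col)) > min_variance]

# Hypothetical feature columns for four objects.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "weight_kg": [50.0, 65.0, 80.0, 95.0],
    "constant":  [1.0, 1.0, 1.0, 1.0],
}

print(min_max_normalize(features["height_cm"]))
print(select_features(features, min_variance=0.01))
```

Normalizing before computing the variance keeps features measured in large units from looking more informative than they are; the constant column is dropped because it cannot help tell objects apart.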


Example: Medical Data Mining

• Health care & medical data mining often adopts this view from statistics and machine learning

• Preprocessing of the data (including feature extraction and dimension reduction)

• Classification and/or clustering processes

• Post-processing for presentation

Origins of Data Mining

• Draws ideas from machine learning/AI, statistics, database systems, etc.

[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]

What is Machine Learning?

• Machine Learning

– Study of algorithms that improve their performance at some task with experience

• Optimize a performance criterion using example data or past experience.

• Role of Statistics: Inference from a sample

• Role of Computer science: Efficient algorithms to

– Solve the optimization problem

– Represent and evaluate the model for inference

Machine Learning

• According to Herbert Simon, learning is “Any change in a System that allows it to perform better the second time on repetition of the same task or on another task drawn from the same population.” [G. F. Luger and W. A. Stubblefield, Artificial Intelligence: Structures and Strategies for Complex Problem Solving, The Benjamin/Cummings Publishing Company, Inc. 1989.]

Why “Learn”?

• Machine learning is programming computers to optimize a performance criterion using example data or past experience.

• Learning is used when:

– Human expertise does not exist (navigating on Mars)

– Humans are unable to explain their expertise (speech recognition)

– Solution changes in time (routing on a computer network)

– Solution needs to be adapted to particular cases (user biometrics)

The machine learning pipeline

ML Methods

• Supervised Learning

– Classification

– Regression/Prediction

• Unsupervised Learning

• Association Analysis

Predicting house prices

Sentiment analysis

Document retrieval

Product recommendation

Visual Product recommender

Model Choice

– What type of classifier shall we use? How shall we select its parameters? Is there a best classifier...?

– How do we train...? How do we adjust the parameters of the model (classifier) we picked so that the model fits the data?
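One concrete answer to "how do we adjust the parameters so that the model fits the data" is the perceptron update rule for a linear classifier. The sketch below trains it on the Pigeon Problem 1 examples. The choice of a perceptron, the learning rate, and the epoch limit are assumptions for illustration; other classifiers would be equally valid.

```python
# Training data: ((left_bar, right_bar), label), +1 for class A, -1 for class B.
data = [((3, 4), 1), ((1.5, 5), 1), ((6, 8), 1), ((2.5, 5), 1),
        ((5, 2.5), -1), ((5, 2), -1), ((8, 3), -1), ((4.5, 3), -1)]

def predict(weights, bias, x):
    """Linear decision: +1 if w . x + b is positive, otherwise -1."""
    score = weights[0] * x[0] + weights[1] * x[1] + bias
    return 1 if score > 0 else -1

def train_perceptron(samples, learning_rate=1.0, max_epochs=1000):
    """Adjust the parameters only on misclassified samples until none remain."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            if predict(weights, bias, x) != label:
                # Move the decision boundary toward the correct side for x.
                weights[0] += learning_rate * label * x[0]
                weights[1] += learning_rate * label * x[1]
                bias += learning_rate * label
                mistakes += 1
        if mistakes == 0:      # every training example is classified correctly
            break
    return weights, bias

weights, bias = train_perceptron(data)
print(weights, bias)
print(all(predict(weights, bias, x) == y for x, y in data))
```

Because the two classes in this problem can be separated by a straight line, the updates stop after a finite number of mistakes. On data that is not linearly separable the loop would simply run until the epoch limit.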

Features

• Features: a set of variables believed to carry discriminating and characterizing information about the objects under consideration

• Feature vector: A collection of d features, ordered in some meaningful way into a d-dimensional column vector, that represents the signature of the object to be identified.

• Feature space: The d-dimensional space in which the feature vectors lie. A d-dimensional vector in a d-dimensional space constitutes a point in that space.
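As a small illustration, a feature vector can be stored as a tuple of d numbers, and distance in feature space then measures how similar two objects are. The three features used here (left bar, right bar, their sum) and the use of Euclidean distance are assumptions made for this sketch.

```python
import math

def feature_vector(left_bar, right_bar):
    """Map an object to a d = 3 dimensional feature vector."""
    return (left_bar, right_bar, left_bar + right_bar)

def euclidean_distance(u, v):
    """Distance between two points in the same d-dimensional feature space."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

p = feature_vector(3, 4)     # (3, 4, 7)
q = feature_vector(6, 8)     # (6, 8, 14)
print(len(p))                # 3, the dimension d of the feature space
print(euclidean_distance(p, q))
```

Each object becomes a single point in the 3-dimensional space, so comparing objects reduces to comparing points.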

Features

Feature space (3D)

Features

• Feature Choice

– Good Features

• Ideally, for a given group of patterns coming from the same class, feature values should all be similar

• For patterns coming from different classes, the feature values should be different.

– Bad Features

• irrelevant, noisy, outlier?
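One way to make "similar within a class, different between classes" measurable is a Fisher-style score: the squared distance between the class means divided by the sum of the class variances. This particular score and the sample values are assumptions for illustration; the slides do not name a specific criterion.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

def variance(values):
    """Population variance of a list of numbers."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def separation_score(class_a_values, class_b_values):
    """Larger when class means are far apart and within-class spread is small."""
    between = (mean(class_a_values) - mean(class_b_values)) ** 2
    within = variance(class_a_values) + variance(class_b_values)
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

# Hypothetical values of two features, measured on objects of two classes.
good_a, good_b = [1.0, 1.1, 0.9], [5.0, 5.1, 4.9]   # tight and well separated
bad_a, bad_b = [1.0, 5.0, 3.0], [2.0, 4.0, 3.0]     # overlapping, same mean

print(separation_score(good_a, good_b))
print(separation_score(bad_a, bad_b))
```

The first feature scores far higher than the second, which matches the intuition on the slide: its values cluster tightly within each class and differ between classes.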

Features

[Figure: example scatter plots of “good” and “bad” features, with panels titled Linear separability, Non-linear separability, Highly correlated features, and Multi-modal]

Readings from Book (3rd Edn.)

• Chapter – 1

Acknowledgments

• Lecture slides are adapted from Data Mining: Concepts and Techniques by Han, Kamber and Pei https://hanj.cs.illinois.edu/bk3/bk3_slidesindex.htm

• Lecture slides are adapted from lectures of Dr. Aman Ullah, SS CASE IT, Islamabad

• Lecture series https://www.youtube.com/watch?v=h-q582wpb4Q&list=PLYwpaL_SFmcChP0xiW3KK9elNuhfCLVVi

• Lecture series https://www.youtube.com/watch?v=wAbyG4M2gns&t=1751s

• http://www.cs.uoi.gr/~tsap/teaching/2012f-cs059/slides-en.html
