Data Mining Techniques and Applications, 1st edition, Hongbo Du, ISBN 978-1-84480-891-5, © 2010 Cengage Learning


Chapter One

Introduction


Chapter Overview

• Roles of data, information and knowledge
• Background of data mining
• What is data mining?
• Main data mining objectives
• Data mining and other related disciplines
• Current state of data mining
• Promises and challenges
• A brief preview of the data mining tool Weka


Data, Information and Knowledge

• Data (D)
– Isolated factual recordings of separate objects and events
– Enables the recording of seen events
• Information (I)
– Facts in a meaningful context, represented by relationships between isolated data items
– Enables responding to seen events
• Knowledge (K)
– Verified information that is accommodated into the business process
– Enables the anticipation of unseen events

[Figure: diagram relating D, I and K]


Data Mining: The Background

• Computerisation of operations in commercial, governmental and scientific organisations has resulted in large volumes of operational data, e.g.

– Itemised telephone bills
– Bank statements
– Supermarket transactions
– Share prices
– Scientific experimental data sets
– Published web pages
– CCTV video footage
– ……


Data Mining: The Background

• Facts:
– Storing the data is an operational necessity
– Storing the data has become easy and affordable
– Data acquisition is fully or partially automatic and fast
• Consequences:
– The speed of data comprehension does not match the speed of data acquisition
– Many commercial database management systems (DBMSs) are not equipped with data comprehension and analysis tools
– We may be data rich, but information poor


Data Mining: The Background

• An intriguing quote:

“I know half the money I spend on advertising is wasted, but I can never find out which half!”

Attributed to Lord Leverhulme, founder of Lever Brothers (a forerunner of Unilever)


Data Mining: What it is

• Useful information; leading to a course of action or an understanding of data

• Non-trivial implicit information; not the raw data, nor the result of a simple data summary

• Real-life databases; not laboratory-generated data sets

• Efficient novel discovery methods; expected to be scaled up and applied to large databases

Knowledge discovery in databases (KDD) refers to the efficient process of searching through large volumes of raw data in databases to find potentially useful information that is implicitly embedded in the data. Data Mining is an integral step of KDD that discovers hidden patterns from an input data set.


Data Mining: Useful Information

Example 1 (A well-known example, not a joke):

Customers who purchase beer are also likely (say 90%) to purchase nappies.

Example 2 (May already be in practical use in credit card applications):

If 20,000 ≤ Customer’s Salary ≤ 40,000 pounds and Customer has a house, then Customer is a safe customer.
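The "say 90%" figure in Example 1 is the kind of number usually reported as the confidence of an association rule. The sketch below shows one common way to compute support and confidence for the rule beer → nappies. The baskets and the helper name `rule_stats` are invented for illustration and are not taken from the book.

```python
from fractions import Fraction

# Hypothetical shopping baskets, made up purely for illustration.
baskets = [
    {"beer", "nappies", "crisps"},
    {"beer", "nappies"},
    {"beer", "bread"},
    {"milk", "nappies"},
    {"beer", "nappies", "milk"},
]

def rule_stats(baskets, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    n_total = len(baskets)
    # Baskets containing every item of the antecedent.
    n_antecedent = sum(1 for b in baskets if antecedent <= b)
    # Baskets containing both the antecedent and the consequent.
    n_both = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = Fraction(n_both, n_total)
    confidence = Fraction(n_both, n_antecedent) if n_antecedent else Fraction(0)
    return support, confidence

support, confidence = rule_stats(baskets, {"beer"}, {"nappies"})
print(f"support = {float(support):.2f}, confidence = {float(confidence):.2f}")
```

With these made-up baskets, three of the four beer baskets also contain nappies, so the confidence is 0.75 rather than the 90% quoted on the slide.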


Data Mining: Non-trivial Information

• Putting the “search for information” into a spectrum, from the low end of sophistication to the high end of sophistication:

Data retrieval (low end)
• Retrieval of stored data
• Trivial data aggregation
• Written in standard SQL

Online analytic processing
• Interactive reporting on stored data
• Summarisation and drilling along different attributes
• Written in extended SQL

Data mining (high end)
• Discovery of hidden and embedded patterns
• Discovery algorithms
• Written in a programming language, probably with the assistance of SQL
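To make the low end of the spectrum concrete, here is a small sketch using Python's built-in `sqlite3` module. The `sales` table and its rows are assumptions made up for this example. Both queries are plain SQL: the first retrieves stored rows and the second performs a trivial aggregation. Neither discovers anything that is not already explicit in the stored data, which is what separates them from data mining.

```python
import sqlite3

# In-memory database with a hypothetical sales table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (customer TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("ann", "beer", 4.0),
        ("ann", "nappies", 9.0),
        ("bob", "beer", 4.0),
        ("bob", "bread", 1.5),
        ("cat", "nappies", 9.0),
    ],
)

# Data retrieval: return rows exactly as they were stored.
retrieved = conn.execute(
    "SELECT item, amount FROM sales WHERE customer = ? ORDER BY item", ("ann",)
).fetchall()

# Trivial aggregation: count and total per item.
per_item = conn.execute(
    "SELECT item, COUNT(*), SUM(amount) FROM sales GROUP BY item ORDER BY item"
).fetchall()

print(retrieved)
print(per_item)
conn.close()
```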


Data Mining: Real-life Databases

• Characteristics of a real-life database

– The size may be extremely large

– The dimensionality can be very high

– Attributes can be of different data types

– Data quality can be very poor

– Data may exist in pieces, isolated in different systems

– Value distribution can be extremely skewed

– Database content can be dynamic and evolving

– Data may lack traditional record-based structure

– Data are available on secondary storage media


Data Mining: Efficient Algorithms

• Discovering interesting patterns supported by given facts can be computationally hard because many discoveries are combinatorial problems. Trivial algorithms may take too long.

• A discovery algorithm is considered efficient if its execution time and memory requirement are comparable to those of sorting algorithms; otherwise, it is unlikely to scale up well enough to cope with data sets of large sizes.

• Efficient discovery algorithms may be hard to find. Using advanced hardware, optimising the implementation of the algorithms and developing approximate solutions can be viable alternative options.
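The gap between a combinatorial search and a sorting-like cost can be seen with simple arithmetic. The sketch below assumes, purely as an illustration, a brute-force search that examines every non-empty combination of n items, and compares that count with an n·log2(n) yardstick for comparison-based sorting. The function names are hypothetical.

```python
import math

def candidate_itemsets(n_items):
    """Number of non-empty subsets of n_items distinct items."""
    return 2 ** n_items - 1

def sort_yardstick(n_records):
    """Rough n * log2(n) cost used as a yardstick for sorting."""
    return n_records * math.log2(n_records)

# Print n, the number of candidate subsets, and the sorting yardstick.
for n in (10, 20, 40):
    print(n, candidate_itemsets(n), round(sort_yardstick(n)))
```

The subset count doubles with every extra item, whereas the sorting yardstick grows only slightly faster than linearly, which is why brute-force enumeration does not scale.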


Data Mining Objectives

• Classification
– Using existing data to form a classification model and then using the model to assign an appropriate class label for a data record (e.g. safe vs. risky customers)
• Estimation
– Similar to classification but to assign a value to an output variable of a data record (e.g. estimated house value)
• Prediction
– Similar to classification and estimation, but more concerned with future outcome of the output (e.g. tomorrow’s weather)
• Description
– General description of data characteristics (e.g. customer profile)
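The difference between classification and estimation lies in the type of output: a class label versus a numeric value. The sketch below illustrates this with a deliberately simple 1-nearest-neighbour lookup, chosen here only for brevity and not presented as the book's method. The records, the distance function and the credit-limit column are all invented for illustration.

```python
# Hypothetical historical records:
# (salary in pounds, owns_house, class label, credit limit in pounds)
records = [
    (25_000, True, "safe", 3_000),
    (38_000, True, "safe", 5_000),
    (18_000, False, "risky", 500),
    (45_000, False, "risky", 1_000),
]

def nearest(salary, owns_house):
    """Return the stored record most similar to the query."""
    def distance(rec):
        # Salary gap in thousands, plus a fixed penalty if house ownership differs.
        return abs(rec[0] - salary) / 1_000 + (0 if rec[1] == owns_house else 10)
    return min(records, key=distance)

def classify(salary, owns_house):
    """Classification: the output is a class label."""
    return nearest(salary, owns_house)[2]

def estimate(salary, owns_house):
    """Estimation: the output is a numeric value."""
    return nearest(salary, owns_house)[3]

print(classify(30_000, True), estimate(30_000, True))
```

Both functions use the same historical data and the same similarity lookup; only the kind of answer they return differs.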


Data Mining & Other Disciplines

[Figure: data mining shown drawing on three related disciplines]

• Machine learning (artificial intelligence): inductive & deductive learning methods
• Statistics: data analysis theories, methods and measures
• Database management: fast storage structures & retrieval operations


Data Mining: Current State

• Many data mining algorithms have been developed or adapted

• Many data mining software tools have been built and are in use

• A cross-industry methodology (CRISP-DM) has been formed

• Besides general solutions, more application-oriented data mining solutions are being developed

• More and more organisations are either doing their own data mining or hiring consultants to do the job

• Data mining has been extended to web mining and text mining


Data Mining: Current State

• Some nuisances
– Mining cookies
– Spyware and miningware
– Intrusion into privacy
• Some serious problems
– “Big Brother is watching”
– Unfair advantages in trading practice, e.g. high-frequency trading (HFT)
– Abuse of personal data
– Ethical concerns


Data Mining: Promises

• Areas of data mining application:
– Finance and insurance
– Marketing and sales
– Medicine
– Agriculture
– Society, politics and economics
– Science
– Engineering
– Law enforcement
– Military and intelligence (classified)


Data Mining: Challenges Faced

• Some difficult problems to solve
– Extremely large data sets
– Extremely high dimensionalities (curse of dimensionality)
– Combinatorial problems and fast algorithms
– Meaningful evaluation of the patterns
– Discovery of changing and evolving patterns
– Integration of data mining techniques
– Comprehensibility of patterns
– Data pre-processing
– Mining non-standard complex data such as multimedia materials
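One way to see why high dimensionality is a challenge is to count how thinly a fixed number of records is spread as attributes are added. The sketch below assumes, for illustration only, that each attribute is divided into 10 intervals and that one million records are available. The function names are hypothetical.

```python
def cells(bins_per_attribute, n_attributes):
    """Number of grid cells when every attribute is split into equal bins."""
    return bins_per_attribute ** n_attributes

def records_per_cell(n_records, bins_per_attribute, n_attributes):
    """Average number of records falling into each grid cell."""
    return n_records / cells(bins_per_attribute, n_attributes)

# Print the dimensionality, the cell count and the average records per cell.
for d in (1, 3, 6, 9):
    print(d, cells(10, d), records_per_cell(1_000_000, 10, d))
```

With six attributes the million records already average one per cell, and with nine attributes almost every cell is empty, so patterns have very little data to support them.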


Weka: A Brief Introduction

• Overview
– Java tool set developed at the University of Waikato (NZ)
– Free to download and widely used
– A wide range of learning and data pre-processing methods and algorithms, with a Java API
– Offers a GUI (Explorer) and a command-line (Simple CLI) interface to the tools
– Experimenter module to assist the evaluation of classification techniques
– KnowledgeFlow module to enable batch-processing style discovery and incremental mining
– Some visualisation facilities


Weka: A Brief Introduction

• Weka Explorer
– For investigative, interactive data mining with small data sets
– Preprocess, Classify, Cluster, Associate, Select Attributes and Visualise pages


Weka: A Brief Introduction

• Weka Simple CLI
– Weka facilities as Java classes
– Calling the Java functions as commands


Weka: A Brief Introduction

• Weka Experimenter
– Comparing performances of different classification solutions on a collection of data sets


Weka: A Brief Introduction

• Weka KnowledgeFlow
– Setting up a flow of knowledge discovery in a diagram
– Overview of the entire discovery project


Chapter Summary

• Importance of data in operation and importance of information and knowledge in decision-making
• Data rich does not mean information rich
• Data mining: automatic or semi-automatic data understanding and decision support
• To classify, to estimate, to predict and to describe
• Data mining closely relates to databases, statistics and machine learning
• Data mining: from technology towards application
• A lot of potential uses and a lot of challenges to face
• Weka: an excellent tool to support teaching data mining


References

Read Chapter 1 of Data Mining Techniques and Applications

Useful further references

• Han & Kamber, Chapter 1
• Berry & Linoff, Chapter 1 (business-like)
• Kdnuggets: http://www.kdnuggets.com/