Data Mining: Concepts and Techniques
Chapters 1 and 2
Slides related to: Data Mining: Concepts and Techniques
Introduction and Data Preprocessing
Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Data Mining: Concepts and Techniques ©2006 Jiawei Han and Micheline Kamber…732A31/material/fo-intro-1.pdf ·  · 2012-01-17— Introduction and Data preprocessing — Jiawei Han



Page 2: Why Data Mining?

- The explosive growth of data: from terabytes to petabytes
  - Data collection and data availability
    - Automated data collection tools, database systems, Web, computerized society
  - Major sources of abundant data
    - Business: Web, e-commerce, transactions, stocks
    - Science: remote sensing, bioinformatics, scientific simulation
    - Society and everyone: news, digital cameras, YouTube
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining is the automated analysis of massive data sets

Page 3: Ex. 1: Market Analysis and Management

- Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of "model" customers who share the same characteristics: interest, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
- Cross-market analysis: find associations/co-relations between product sales, and predict based on such association
- Customer profiling: what types of customers buy what products (clustering or classification)
- Customer requirement analysis
  - Identify the best products for different groups of customers
  - Predict what factors will attract new customers
- Provision of summary information
  - Multidimensional summary reports
  - Statistical summary information (data central tendency and variation)

Page 4: Ex. 2: Corporate Analysis & Risk Management

- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and a class-based pricing procedure
  - Set pricing strategy in a highly competitive market

Page 5: Ex. 3: Fraud Detection & Mining Unusual Patterns

- Approaches: clustering and model construction for frauds, outlier analysis
- Applications: health care, retail, credit card service, telecomm.
  - Auto insurance: ring of collisions
  - Money laundering: suspicious monetary transactions
  - Medical insurance
    - Professional patients, ring of doctors, and ring of references
    - Unnecessary or correlated screening tests
  - Telecommunications: phone-call fraud
    - Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
  - Retail industry
    - Analysts estimate that 38% of retail shrink is due to dishonest employees
  - Anti-terrorism

Page 6: Evolution of Database Technology

- 1960s:
  - Data collection, database creation, IMS and network DBMS
- 1970s:
  - Relational data model, relational DBMS implementation
- 1980s:
  - Advanced data models (extended-relational, OO, deductive, etc.)
  - Application-oriented DBMS (spatial, temporal, multimedia, etc.)
- 1990s:
  - Data mining, data warehousing, multimedia databases, and Web databases
- 2000s:
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

Page 7: What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems

Page 8: Knowledge Discovery (KDD) Process

- Data mining: core of the knowledge discovery process

[Figure: the KDD process. Stage labels: Databases; Data Cleaning; Data Integration; Data Warehouse; Selection and transformation; Task-relevant Data; Data Mining; Pattern evaluation and presentation]

Page 9: Why Data Preprocessing?

- Data in the real world is dirty
  - Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation=" "
  - Noisy: containing errors or outliers
    - e.g., Salary="-10"
  - Inconsistent: containing discrepancies in codes or names
    - e.g., Age="42" Birthdate="03/07/1997"
    - e.g., was rating "1,2,3", now rating "A, B, C"
    - e.g., discrepancy between duplicate records
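The three kinds of dirty data listed on this slide (incomplete, noisy, inconsistent) can each be tested for mechanically. The following is a minimal sketch in plain Python, not a method taken from the slides. It assumes each record is a dictionary; the field names, the helper `find_problems`, the reading of 03/07/1997 as 3 July 1997, and the reference date are all assumptions made for illustration.

```python
from datetime import date

def find_problems(record, today):
    """Return a list of data-quality problems found in one record (a dict)."""
    problems = []
    # Incomplete: the attribute value is missing or blank.
    if not (record.get("occupation") or "").strip():
        problems.append("incomplete: occupation is empty")
    # Noisy: a value that cannot be right, such as a negative salary.
    if record.get("salary") is not None and record["salary"] < 0:
        problems.append("noisy: salary is negative")
    # Inconsistent: the stored age disagrees with the age derived from the birthdate.
    born = record.get("birthdate")
    if born is not None and record.get("age") is not None:
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        derived_age = today.year - born.year - (0 if had_birthday else 1)
        if derived_age != record["age"]:
            problems.append("inconsistent: age does not match birthdate")
    return problems

# Hypothetical record combining the slide's three examples.
record = {"occupation": " ", "salary": -10, "age": 42,
          "birthdate": date(1997, 7, 3)}   # assumed day/month order
for problem in find_problems(record, today=date(2012, 1, 17)):  # assumed reference date
    print(problem)
```

Each check is independent, so a record can be flagged for several problems at once, as this one is.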

Page 10: Why Is Data Dirty?

- Incomplete data may come from
  - "Not applicable" data value when collected
  - Different considerations between the time when the data was collected and when it is analyzed
  - Human/hardware/software problems
- Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modify some linked data)
- Duplicate records also need data cleaning
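Two of the causes named on this slide, functional dependency violations and duplicate records, are straightforward to detect. Here is a minimal sketch, assuming rows are dictionaries. The column names (`zip`, `city`), the toy rows, and the functions `fd_violations` and `duplicates` are illustrative assumptions and do not come from the slides.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the lhs values that map to more than one rhs value,
    i.e. the values that violate the functional dependency lhs -> rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

def duplicates(rows):
    """Return the rows that repeat an earlier row exactly."""
    seen, repeated = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            repeated.append(row)
        seen.add(key)
    return repeated

# Made-up rows, for illustration only.
rows = [
    {"zip": "61801", "city": "Urbana"},
    {"zip": "61801", "city": "Urbana"},      # exact duplicate of the first row
    {"zip": "61801", "city": "Champaign"},   # breaks the dependency zip -> city
    {"zip": "60601", "city": "Chicago"},
]
print(fd_violations(rows, "zip", "city"))
print(duplicates(rows))
```

Exact matching only catches verbatim duplicates. Near-duplicates (different spelling, extra whitespace) would need normalization or fuzzy matching first, which this sketch leaves out.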

Page 11: Why Is Data Preprocessing Important?

- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
    - e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - Data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Page 12: Forms of Data Preprocessing

[Figure: forms of data preprocessing]

Page 13: Architecture: Typical Data Mining System

[Figure: architecture of a typical data mining system. Component labels: Graphical User Interface; Pattern Evaluation; Data Mining Engine; Knowledge-Base; Database or Data Warehouse Server; data cleaning, integration, and selection; Database; Data Warehouse; World-Wide Web; Other Info Repositories]

Page 14: Why Not Traditional Data Analysis?

- Tremendous amount of data
  - Algorithms must be highly scalable to handle large amounts of data
- High dimensionality of data
  - A micro-array may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structure data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
- New and sophisticated applications

Page 15: Data Mining: Classification Schemes

- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

Page 16: Data Mining: on what kinds of data?

- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Object-relational databases
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Spatial data and spatiotemporal data
  - Text databases and multimedia databases
  - Data streams and sensor data
  - The World-Wide Web
  - Heterogeneous databases and legacy databases

Page 17: Data Mining: what kinds of patterns?

- Concept/class description:
  - Characterization: summarizing the data of the class under study in general terms
    - E.g., characteristics of customers spending more than 10000 SEK per year
  - Discrimination: comparing the target class with other (contrasting) classes
    - E.g., compare the characteristics of products that had a sales increase to products that had a sales decrease last year

Page 18: Data Mining: what kinds of patterns?

- Frequent patterns, association, correlations
  - Frequent itemset
  - Frequent sequential pattern
  - Frequent structured pattern
  - E.g., buy(X, "Diaper") => buy(X, "Beer") [support=0.5%, confidence=75%]
    - confidence: if X buys a diaper, then there is a 75% chance that X buys beer
    - support: of all transactions under consideration, 0.5% showed that diaper and beer were bought together
  - E.g., Age(X, "20..29") and income(X, "20k..29k") => buys(X, "cd-player") [support=2%, confidence=60%]
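The support and confidence figures attached to the diaper/beer rule follow directly from the two definitions on this slide. Here is a minimal sketch in plain Python; the toy baskets are made up and the function names `support` and `confidence` are assumptions chosen for illustration. Support is the fraction of transactions containing all items of the rule, and confidence is the fraction of antecedent transactions that also contain the consequent.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent => consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)

# Made-up toy baskets, for illustration only.
baskets = [
    {"diaper", "beer", "milk"},
    {"diaper", "beer"},
    {"diaper", "beer", "bread"},
    {"diaper", "milk"},
    {"bread", "milk"},
    {"beer"},
    {"milk"},
    {"bread"},
]
print(support(baskets, {"diaper", "beer"}))       # 3 of 8 baskets -> 0.375
print(confidence(baskets, {"diaper"}, {"beer"}))  # 3 of 4 diaper baskets -> 0.75
```

Note that `confidence` divides by the support of the antecedent, so it is only defined when the antecedent occurs in at least one transaction.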

Page 19: Data Mining: what kinds of patterns?

- Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction. The derived model is based on analyzing training data, that is, data whose class labels are known.
  - E.g., classify countries based on (climate), or classify cars based on (gas mileage)
  - Predict some unknown or missing numerical values
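To illustrate learning from training data whose class labels are known, here is a minimal k-nearest-neighbour sketch in plain Python. It shows one simple classifier and is not presented as the method the slides prescribe. The cars, their mileage figures, the class labels and the helper `knn_predict` are all made up for the example.

```python
from collections import Counter

def knn_predict(training, query, k=3):
    """Predict the label of query from the k closest labeled examples.

    training is a list of (feature_vector, label) pairs.
    """
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    nearest = sorted(training, key=lambda pair: squared_distance(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training data: (miles per gallon,) -> class label.
cars = [
    ((12.0,), "low mileage"),
    ((15.0,), "low mileage"),
    ((18.0,), "low mileage"),
    ((30.0,), "high mileage"),
    ((35.0,), "high mileage"),
    ((40.0,), "high mileage"),
]
print(knn_predict(cars, (33.0,)))  # the three nearest cars get 35, 30 and 40 mpg
print(knn_predict(cars, (14.0,)))  # the three nearest cars get 15, 12 and 18 mpg
```

The "model" here is just the stored training set. Other classifiers, such as decision trees, build an explicit function from the same kind of labeled data.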

Page 20: Data Mining: what kinds of patterns?

- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster customers to find target groups for marketing
  - Maximizing intra-class similarity and minimizing interclass similarity
- Outlier analysis
  - Outlier: data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection, rare events analysis
- Trend and evolution analysis
  - Trend and deviation
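Cluster analysis groups unlabeled data so that members of a group are similar to each other and dissimilar to other groups. As one possible illustration, here is a minimal one-dimensional k-means sketch (Lloyd's algorithm) in plain Python. The spending values, the starting centroids and the function name `kmeans_1d` are assumptions made for the example.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Lloyd's algorithm on a list of numbers, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster;
        # a centroid with no members stays where it is.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Made-up yearly spending per customer.
spending = [900, 1000, 1100, 9000, 10000, 11000]
centroids, clusters = kmeans_1d(spending, centroids=[0, 5000])
print(centroids)
print(clusters)
```

No labels are supplied. The two customer groups emerge from the distances alone, which is the difference from classification.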

Page 21: Are All the "Discovered" Patterns Interesting?

- Data mining may generate thousands of patterns: not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Page 22: Find All and Only Interesting Patterns?

- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: an optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization
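The two approaches on this slide can be contrasted on frequent itemsets. The sketch below assumes, purely for illustration, that "interesting" means "contained in at least `min_count` transactions". The first function enumerates every itemset and filters afterwards. The second grows itemsets level by level and only extends the frequent ones, in the spirit of Apriori-style pruning. The function names and toy transactions are made up.

```python
from itertools import combinations

def count_support(transactions, itemset):
    """Number of transactions containing every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def generate_then_filter(transactions, min_count):
    """Approach 1: enumerate every itemset, then drop the infrequent ones."""
    items = sorted(set().union(*transactions))
    candidates = [frozenset(c) for size in range(1, len(items) + 1)
                  for c in combinations(items, size)]
    frequent = {c for c in candidates
                if count_support(transactions, c) >= min_count}
    return frequent, len(candidates)

def generate_only_frequent(transactions, min_count):
    """Approach 2: grow itemsets level by level, extending only frequent ones."""
    items = sorted(set().union(*transactions))
    level = {frozenset([i]) for i in items}
    frequent, examined = set(), 0
    while level:
        examined += len(level)
        kept = {c for c in level
                if count_support(transactions, c) >= min_count}
        frequent |= kept
        # Next level: unions of two kept sets that are exactly one item larger.
        level = {a | b for a in kept for b in kept if len(a | b) == len(a) + 1}
    return frequent, examined

# Made-up transactions, for illustration only.
transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}, {"a", "b", "c"}]
all_first, tried_first = generate_then_filter(transactions, min_count=3)
pruned, tried_pruned = generate_only_frequent(transactions, min_count=3)
print(len(all_first), tried_first)   # 5 frequent itemsets out of 15 candidates
print(len(pruned), tried_pruned)     # the same 5, after examining 8 candidates
```

Both functions return the same frequent itemsets. The second examines fewer candidates because any superset of an infrequent itemset is itself infrequent and is never generated.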

Page 23: Data Mining: what techniques used?

[Figure: Data Mining shown together with the fields it draws on. Labels: Database Technology; Statistics; Machine Learning; Pattern Recognition; Algorithm; Visualization; Other Disciplines]

Page 24: Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)

- Classification
  - #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
  - #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
  - #3. K Nearest Neighbours (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI 18(6).
  - #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
- Statistical Learning
  - #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
  - #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
- Association Analysis
  - #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
  - #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.

Page 25: The 18 Identified Candidates (II)

- Link Mining
  - #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
  - #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.
- Clustering
  - #11. K-Means: MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
  - #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
- Bagging and Boosting
  - #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.

Page 26: The 18 Identified Candidates (III)

- Sequential Patterns
  - #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
  - #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
- Integrated Mining
  - #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
- Rough Sets
  - #17. Finding reduct: Zdzislaw Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992.
- Graph Mining
  - #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

Page 27: Top-10 Algorithms Finally Selected at ICDM '06

- #1: C4.5 (61 votes)
- #2: K-Means (60 votes)
- #3: SVM (58 votes)
- #4: Apriori (52 votes)
- #5: EM (48 votes)
- #6: PageRank (46 votes)
- #7: AdaBoost (45 votes)
- #7: kNN (45 votes)
- #7: Naive Bayes (45 votes)
- #10: CART (34 votes)

Page 28: A Brief History of Data Mining Society

- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998 and SIGKDD Explorations
- More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
- ACM Transactions on KDD starting in 2007

Page 29: Conferences and Journals on Data Mining

- KDD Conferences
  - ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  - SIAM Data Mining Conf. (SDM)
  - (IEEE) Int. Conf. on Data Mining (ICDM)
  - Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  - Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
- Other related conferences
  - ACM SIGMOD
  - VLDB
  - (IEEE) ICDE
  - WWW, SIGIR
  - ICML, CVPR, NIPS
- Journals
  - Data Mining and Knowledge Discovery (DAMI or DMKD)
  - IEEE Trans. on Knowledge and Data Eng. (TKDE)
  - KDD Explorations
  - ACM Trans. on KDD