Data Mining Lecture 4

Data Mining Lecture 4. Course Syllabus Course topics: Data Management and Data Collection Techniques for Data Mining Applications (Week3-Week4) –Data


Data Mining

Lecture 4


Course Syllabus

• Course topics:
  • Data Management and Data Collection Techniques for Data Mining Applications (Week 3-Week 4)
    – Data Warehouses: gathering raw data from relational databases and transforming it into information
    – Information Extraction and Data Processing Techniques
    – Data Marts: the need for building highly specialized data stores for data mining applications
• Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4, Assignment 1)


Data Pre-processing: Information Extraction and Data Processing Techniques

• Why should we do pre-processing?
• Pre-processing takes 80% of the time
• Real-world data is not perfect (it is dirty)
  – missing values (no data entry)
    • e.g. 35% of Education Field is incomplete
    • e.g. 20% of Birth Date is incomplete
    • e.g. 45% of Work Title is incomplete
    • e.g. 60% of Income is incomplete


Data Pre-processing: Information Extraction and Data Processing Techniques

Section name                             Number of variables
ATM                                      36
Individual loans                         108
Individual insurance                     26
Call center                              30
Cheques                                  69
Debit cards                              52
Demographic data                         54
Economic data                            402
Bill payments                            64
Non-cash loans                           48
Treasury bills and government bonds      64
Internet                                 30
Credit cards                             230
Overdraft deposit account                33
Salary payments                          17
POS                                      30
Repo                                     28
Commercial loans                         68
Commercial insurance                     26
Time deposits                            77
Demand deposits                          318
Investment funds                         106
Other products                           21
Total                                    1,937


Data Pre-processing: Information Extraction and Data Processing Techniques

Variable                      Source          Type
IND_CAROWNERGROUP             computed data   discrete
INDCOMM_COUNTRY_HOUSE         raw data        discrete
INDCOMM_COUNTRY_WORK          raw data        discrete
INDCOMM_COUNTY_HOUSE          raw data        discrete
INDCOMM_COUNTY_WORK           raw data        discrete
INDCOMM_EDUCATIONLEVEL        raw data        discrete
IND_EMPLOYEEFLAG              computed data   discrete boolean
IND_GENDER                    raw data        discrete
INDCOMM_HABITANT_HOUSE        computed data   discrete
INDCOMM_HABITANT_WORK         computed data   discrete
IND_HOUSEHOLDINCOMEGROUP      computed data   discrete
IND_HOUSEHOLDNUMBER           computed data   continuous integer
IND_INCOMEGROUP               computed data   discrete
IND_INTERNETFLAG              computed data   discrete boolean
IND_MARITALSTATUS             raw data        discrete
IND_MOBILEPHONEUSAGEFLAG      computed data   discrete boolean


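Data Pre-processing: Information Extraction and Data Processing Techniques

Table            Field                                              Fill %   Data value          Fill status
MUSTERI_GERCEK   Education level (EGITIM DURUMU)                    44%      very critical       very critical
MUSTERI_GERCEK   Title at workplace (IS YERINDEKI UNVAN)            39%      very critical       very critical
MUSTERI_TUZEL    Partnership type (ORTAKLIK TIPI)                   36%      very critical       very critical
MUSTERI_MUSTERI  Birth date (DOGUMTARIHI)                           12%      very critical       critical
MUSTERI_GERCEK   Occupation code (MESLEK KODU)                      8%       very critical       slightly critical
MUSTERI_GERCEK   Gender (CINSIYET)                                  4%       very critical       slightly critical
MUSTERI_TUZEL    Field of activity (FAALIYET ALANI)                 0%       very critical       full
MUSTERI_TUZEL    Business area (IS SAHASI)                          0%       very critical       full
MUSTERI_MUSTERI  Type (TIP)                                         0%       very critical       full
MUSTERI_GERCEK   Employment status (CALISMA DURUMU)                 41%      critical            very critical
MUSTERI_MUSTERI  Entry channel (GIRIS KANALI)                       36%      critical            very critical
MUSTERI_GERCEK   Marital status (MEDENI DURUMU)                     18%      critical            critical
MUSTERI_MUSTERI  Birth place (DOGUM YERI)                           5%       critical            slightly critical
MUSTERI_TUZEL    Organization type group (KURULUS TIPI GRUBU)       0%       critical            full
MUSTERI_TUZEL    Organization type (KURULUS TIPI)                   0%       critical            full
MUSTERI_GERCEK   Name of last school (SON OKUL ADI)                 99%      slightly critical   very critical
MUSTERI_MUSTERI  Segment (SEGMENT)                                  98%      slightly critical   very critical
MUSTERI_GERCEK   Province of civil registration (NUFUSA KAYITLI IL) 88%      slightly critical   very critical

Note: rows at 0% are marked "full" in the status column, so the percentage appears to be the share of empty values.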


Data Pre-processing: Information Extraction and Data Processing Techniques

– erroneous (noisy) values
  • e.g. Birth Date > current date or Birth Date < 1850 (approx. 10% of the data)
  • e.g. permissible values for Education Field are C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate, but values such as X, Q, Y, T may be seen (approx. 10% of the data)
  • e.g. Income field is negative (approx. 15% of the data)
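The three error examples above can be expressed as simple validation rules. The following is a minimal sketch, not the course's own tooling: the record layout (`birth_date`, `education`, `income`), the helper name `find_errors` and the sample records are assumptions made for illustration.

```python
from datetime import date

# Permissible education codes as listed on the slide.
EDUCATION_CODES = {"C", "U", "H", "D", "M", "S", "P", "I"}

def find_errors(record, today):
    """Return the names of the fields that violate one of the three rules."""
    errors = []
    birth = record.get("birth_date")
    if birth is not None and (birth > today or birth.year < 1850):
        errors.append("birth_date")
    education = record.get("education")
    if education is not None and education not in EDUCATION_CODES:
        errors.append("education")
    income = record.get("income")
    if income is not None and income < 0:
        errors.append("income")
    return errors

today = date(2024, 1, 1)  # fixed reference date so the example is reproducible
records = [  # hypothetical sample records
    {"birth_date": date(1965, 10, 4), "education": "U", "income": 3200},
    {"birth_date": date(1790, 1, 1), "education": "X", "income": -50},
    {"birth_date": date(2090, 5, 5), "education": "C", "income": 1000},
]
flagged = [find_errors(r, today) for r in records]
```

Missing fields (`None`) are skipped here on purpose, since missing values are a separate problem from erroneous ones.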


Data Pre-processing: Information Extraction and Data Processing Techniques

– inconsistent values: discrepancies in codes or names
  • e.g. Birth Date = '01/01/1955' vs. 54 (same information in different forms)
  • e.g. Education Field coded in two ways:
    (C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate)
    (5: college, 3: university, 4: high school, 1: doctorate, 2: master, 6: secondary school, 7: primary school, 8: illiterate)
  • e.g. Income field is continuous (3200 K) or interval based (3000-4000 K)
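A lookup mapping is one way to bring the two education codings into a single form. The sketch below assumes the letter coding is chosen as the target; the function names and the 1000 K interval width are illustrative choices, not something the slides prescribe.

```python
# Numeric coding -> letter coding, following the two code lists on the slide.
NUMERIC_TO_LETTER = {
    "5": "C", "3": "U", "4": "H", "1": "D",
    "2": "M", "6": "S", "7": "P", "8": "I",
}
LETTER_CODES = set(NUMERIC_TO_LETTER.values())

def unify_education(code):
    """Map either coding onto the letter coding; return None if unknown."""
    code = str(code).strip().upper()
    if code in LETTER_CODES:
        return code
    return NUMERIC_TO_LETTER.get(code)

def income_to_interval(income_k, width=1000):
    """Place a continuous income (in K) into a fixed-width interval."""
    low = (income_k // width) * width
    return (low, low + width)

unified = [unify_education(c) for c in ["U", 3, "5", "x", 8]]
interval = income_to_interval(3200)
```

Unknown codes come back as `None` rather than being guessed, so they can be routed to the missing-value handling instead.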


Data Pre-processing: Information Extraction and Data Processing Techniques

– Where may dirtiness come from?

Reasons for missing values:
• different considerations in coding and analyzing (discrepancies over time)
• hardware/software problems
• different sources not aligned with the same data dictionary

[Figure: Source 1, Source 2 and Source 3, each shown with Field 1, Field 2, Field 3]


Data Pre-processing: Information Extraction and Data Processing Techniques

– Where may dirtiness come from?

Reasons for erroneous values:
People give incomplete, approximately correct information. Two records that appear to describe the same person:

Field                          Record 1                          Record 2
First name (AD)                Metin Ü.                          M.Ulku
Birth place (DOĞUM YERİ)       GAZİANTEP                         G.ANTEB
Birth date (DOĞUM TARİHİ)      04.10.1965                        10/04/1965
Surname (SOYAD)                SANRE                             SANER
Address (ADRESİ)               Atatrk Cad. Kemaliye Mah. 25/3    Atatürk Cd. Kemaliye Sok. No.25
Job title (ÇALIŞMA ÜNVANI)     Genel Müdür                       Gen. Müdr.
Workplace (ÇALIŞMA YERİ)       Devlet Su İşleri A.O              G.Antep D.S.İ.
.........                      .........                         .........
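Approximately correct values like these can be compared with a string similarity score instead of exact equality. The sketch below uses `difflib.SequenceMatcher` from the Python standard library; the 0.7 threshold and the comparison surname "YILMAZ" are assumptions chosen only for illustration, and a real matching rule would need tuning on actual data.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

def likely_same(a, b, threshold=0.7):
    """Hypothetical rule: treat two values as matching above an assumed threshold."""
    return similarity(a, b) >= threshold

surname_score = similarity("SANRE", "SANER")       # the two surname spellings
surname_match = likely_same("SANRE", "SANER")
unrelated_match = likely_same("SANRE", "YILMAZ")   # hypothetical unrelated surname
```

Heavily abbreviated values such as "Gen. Müdr." would probably need an abbreviation dictionary in addition to a similarity score.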


Data Pre-processing: Information Extraction and Data Processing Techniques

– Where may dirtiness come from?

Reasons for erroneous values:

People give incomplete, approximately correct information

• Esendere Sk. Aşagidere Cikmazi No:42 D: 14 Levent İst

Asagidere Yokuşu D:14 Esendere Cd. 3.Levent ISTANBUL

Büyükdere Sko. Ihlamur Cad. Ş.Nedim Mha.

İhlamur Sokağı Büyükdere Cd. Şair Nedim Sok.


Data Pre-processing: Information Extraction and Data Processing Techniques

– Where may dirtiness come from?

Reasons for erroneous values:
• insufficient or incapable data collection instruments
  – partial matching,
  – fuzzy understanding,
  – syntactic-semantic enrichment
• a continuous flow of data may cause data entry faults
• errors or disruptions in data transmission


Data Pre-processing: Information Extraction and Data Processing Techniques

– Where may dirtiness come from?

Reasons for inconsistent values:
• insufficient lookup mappings
• incapable transformation infrastructures
• different data sources

Inconsistency is hard to prevent; it needs a highly specialized synchronization and automation infrastructure.

We should also take care of duplicate data (redundancy).


Data Pre-processing: Information Extraction and Data Processing Techniques

– Why is pre-processing so important?
  • Data quality brings successful data mining
  • It is the only way to extract information from data

Major tasks in Data Pre-processing:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization


Data Pre-processing: Information Extraction and Data Processing Techniques

Major tasks in Data Cleaning:
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration


Data Pre-processing: Information Extraction and Data Processing Techniques

How to handle missing data
– simply do not accept it
– fill it in manually
– fill it in automatically:
  » a global constant, e.g. "unknown" (a new class?!)
  » the attribute mean
  » the attribute mean for all samples belonging to the same class: smarter
  » the most probable value: inference-based, such as a Bayesian formula or a decision tree
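The first three automatic strategies can be sketched in a few lines. This is a minimal illustration, assuming missing values are represented as `None`; the income figures and the segment labels "A" and "B" are made-up sample data.

```python
from statistics import mean

def fill_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]

def fill_mean(values):
    """Replace missing values with the mean of the observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def fill_class_mean(values, classes):
    """Replace missing values with the mean of the observed values in the same class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]

income = [3000, None, 5000, 1000, None, 2000]   # hypothetical sample data
segment = ["A", "A", "A", "B", "B", "B"]        # hypothetical class labels
overall = fill_mean(income)
per_class = fill_class_mean(income, segment)
```

In this sample the overall mean fills both gaps with 2750, while the class mean fills them with 4000 and 1500, which shows why the slide calls the per-class variant smarter.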


Data Pre-processing: Information Extraction and Data Processing Techniques

How to handle noisy data
– Binning (discretization) method:
  » first sort the data and partition it into (equi-depth) bins
  » then smooth by bin means, by bin medians, by bin boundaries, etc.
  » use the data distribution and domain knowledge
– Clustering
  » detect and remove outliers
– Combined computer and human inspection
  » detect suspicious values and have a human check them (e.g., deal with possible outliers)
– Regression
  » smooth by fitting the data to regression functions
– Model the data and infer the most probable values (difficult)


Data Pre-processing: Information Extraction and Data Processing Techniques

Binning
• Equal-width (distance) partitioning:
  – divides the range into N intervals of equal size: uniform grid
  – if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
  – the most straightforward, but outliers may dominate the presentation
  – skewed data is not handled well
• Equal-depth (frequency) partitioning:
  – divides the range into N intervals, each containing approximately the same number of samples
  – good data scaling
  – managing categorical attributes can be tricky
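The difference between the two partitioning schemes is easiest to see on one data set. The sketch below is an illustration under simple assumptions: the values are numeric and not all equal, and the function names are made up for this example.

```python
def equal_width_bins(values, n):
    """Assign each value to one of n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The highest value B would land in bin n, so clamp it into the last bin.
        index = min(int((v - a) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Split the sorted values into n bins with (almost) equal counts."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
by_width = equal_width_bins(prices, 3)
by_depth = equal_depth_bins(prices, 3)
```

With three bins, equal width gives bins of 3, 3 and 6 values, while equal depth gives 4 values per bin.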


Data Pre-processing: Information Extraction and Data Processing Techniques

Binning
Sorted data (e.g., by price):
– 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
  – Bin 1: 4, 8, 9, 15
  – Bin 2: 21, 21, 24, 25
  – Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  – Bin 1: 9, 9, 9, 9
  – Bin 2: 23, 23, 23, 23
  – Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  – Bin 1: 4, 4, 4, 15
  – Bin 2: 21, 21, 25, 25
  – Bin 3: 26, 26, 26, 34
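The smoothed values in this example can be reproduced with a short sketch. It assumes the bin means are rounded to whole numbers, as the slide appears to do, and that a value equally close to both boundaries goes to the lower one (no such tie occurs in this data).

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        # Ties go to the lower boundary (an assumption; this data has no ties).
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

The unrounded bin means are 9, 22.75 and 29.25, which round to the 9, 23 and 29 shown above.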


Data Pre-processing: Information Extraction and Data Processing Techniques

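Regression

[Figure: data points in the x-y plane with the fitted line y = x + 1; at X1 the observed value Y1 is shown together with the value Y1' on the line]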
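Smoothing by regression replaces each observed y with the value on a fitted line. The sketch below fits a straight line by ordinary least squares, which is one common choice and not necessarily the method used in the course; the sample points are made up so that the fitted line is exactly y = x + 1.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each observed y with the fitted value on the line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

xs = [1, 2, 3, 4, 5]
ys = [2.5, 2.0, 4.5, 5.0, 6.0]   # hypothetical noisy points around y = x + 1
slope, intercept = fit_line(xs, ys)
smoothed = smooth(xs, ys)
```

Each smoothed value plays the role of Y1' in the figure: the point on the line that replaces the observed Y1.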


Data Pre-processing: Information Extraction and Data Processing Techniques

Clustering


Data Pre-processing: Information Extraction and Data Processing Techniques

How to handle inconsistent data
– systematic conversion, "transformation"
– dynamic and interactive control mechanisms
– redundancy detection and intelligent mapping


Data Pre-processing: Information Extraction and Data Processing Techniques

Transformation
– Smoothing: remove noise from data
– Aggregation: summarization, data cube construction
– Generalization: concept hierarchy climbing
– Normalization: scaled to fall within a small, specified range
  » min-max normalization
  » z-score normalization
  » normalization by decimal scaling
– Attribute/feature construction: new attributes constructed from the given ones


Remember Stats Facts
• Min:
  – What is the big-O cost of finding the min of an n-sized list?
• Max:
  – What is the minimum number of comparisons needed to find the max of an n-sized list?
• Range:
  – What about finding the min and max simultaneously?
• Value types:
  – Cardinal value -> how many, counting numbers
  – Nominal value -> names and identifies something
  – Ordinal value -> order of things, rank, position

\[ \text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median}) \]
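As a hint for the questions above: a single min or max takes n - 1 comparisons, so O(n), while both together can be found with about 3n/2 comparisons by processing the list in pairs. The sketch below is one standard way to do this; the function name and the comparison counter are added only to make the count visible.

```python
def find_min_max(values):
    """Return (minimum, maximum, number of element comparisons used)."""
    n = len(values)
    if n == 0:
        raise ValueError("empty list")
    comparisons = 0
    if n % 2 == 1:
        # Odd length: start from the first element alone.
        lo = hi = values[0]
        start = 1
    else:
        # Even length: one comparison orders the first pair.
        comparisons += 1
        if values[0] < values[1]:
            lo, hi = values[0], values[1]
        else:
            lo, hi = values[1], values[0]
        start = 2
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        # Three comparisons per pair: a vs b, smaller vs lo, larger vs hi.
        comparisons += 3
        if a < b:
            small, large = a, b
        else:
            small, large = b, a
        if small < lo:
            lo = small
        if large > hi:
            hi = large
    return lo, hi, comparisons

result_even = find_min_max([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
result_odd = find_min_max([7, 3, 9, 1, 5])
```

For 12 values this uses 16 comparisons, compared with 22 for two separate passes.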


Transformation
• Min-max normalization: to [new_min_A, new_max_A]

\[ v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A \]

  – Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]

• Z-score normalization (μ: mean, σ: standard deviation):

\[ v' = \frac{v - \mu_A}{\sigma_A} \]

  – Ex. Let μ = 54,000, σ = 16,000. Then

\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

• Normalization by decimal scaling:

\[ v' = \frac{v}{10^{j}} \]

  where j is the smallest integer such that max(|v'|) < 1
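The three normalizations can be checked numerically with a short sketch. The function names are illustrative; the decimal-scaling helper assumes j is non-negative, so data already inside (-1, 1) is left unchanged, and the values -986 and 917 are made-up sample data.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Min-max normalization of v from [min_a, max_a] to [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    """Z-score normalization with mean mu and standard deviation sigma."""
    return (v - mu) / sigma

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 giving max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

income_min_max = min_max(73_600, 12_000, 98_000)
income_z = z_score(73_600, 54_000, 16_000)
scaled, j = decimal_scaling([-986, 917])   # hypothetical sample values
```

The first two results reproduce the slide's examples (0.716 after rounding, and 1.225); for the sample values the decimal scaling uses j = 3.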


Remember Stats Facts
• Mean (algebraic measure) (sample vs. population):

\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

  – Weighted arithmetic mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

  – Trimmed mean: chopping extreme values
• Median: a holistic measure
  – Middle value if there is an odd number of values, or the average of the middle two values otherwise
  – Estimated by interpolation (for grouped data)
• Mode
  – Value that occurs most frequently in the data
  – Unimodal, bimodal, trimodal
  – Empirical formula:

\[ \text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median}) \]
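These measures can be tried out with Python's standard `statistics` module. The sketch below is illustrative only: `weighted_mean` and `trimmed_mean` are hand-written helpers, the trimmed mean drops k values from each end (one of several possible definitions), and the weights are made up.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of w_i."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, k):
    """Mean after dropping the k smallest and k largest values (needs 2k < n)."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
price_mean = mean(prices)
price_median = median(prices)       # even count: average of the middle two values
price_modes = multimode(prices)     # list of the most frequent values
price_trimmed = trimmed_mean(prices, 1)
example_weighted = weighted_mean([1, 2, 3], [1, 1, 2])   # hypothetical weights
```

`multimode` returns a list, so unimodal, bimodal and trimodal data show up as lists of length one, two and three.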


Week 4-End

• Read: course textbook, Chapter 2