Data Mining Lecture 4. Course Syllabus Course topics: Data Management and Data Collection Techniques for Data Mining Applications (Week3-Week4) –Data

Data Mining

Lecture 4

Course Syllabus

• Course topics:
  • Data Management and Data Collection Techniques for Data Mining Applications (Week 3-Week 4)
    – Data Warehouses: gathering raw data from relational databases and transforming it into information
    – Information Extraction and Data Processing Techniques
    – Data Marts: the need to build highly specialized data stores for data mining applications
• Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4, Assignment 1)

Data Pre-processing: Information Extraction and Data Processing Techniques

• Why should we do pre-processing?
• Pre-processing takes 80% of the time
• Real-world data is not perfect (it is dirty)
  – missing values (no data entry)
    • e.g. 35% of the Education field is incomplete
    • e.g. 20% of the Birth Date field is incomplete
    • e.g. 45% of the Work Title field is incomplete
    • e.g. 60% of the Income field is incomplete

Section name (BÖLÜM ADI)                  Number of variables (DEĞİŞKEN ADEDİ)
ATM                                          36
Consumer loans                              108
Individual insurance                         26
Call center                                  30
Cheques                                      69
Debit cards                                  52
Demographic data                             54
Economic data                               402
Bill payments                                64
Non-cash loans                               48
Treasury bills and government bonds          64
Internet                                     30
Credit cards                                230
Overdraft deposit account                    33
Salary payments                              17
POS                                          30
Repo                                         28
Commercial loans                             68
Commercial insurance                         26
Time deposits                                77
Demand deposits                             318
Investment funds                            106
Other products                               21
Total                                     1,937


Variable                     Source      Type
IND_CAROWNERGROUP            computed    discrete
INDCOMM_COUNTRY_HOUSE        raw         discrete
INDCOMM_COUNTRY_WORK         raw         discrete
INDCOMM_COUNTY_HOUSE         raw         discrete
INDCOMM_COUNTY_WORK          raw         discrete
INDCOMM_EDUCATIONLEVEL       raw         discrete
IND_EMPLOYEEFLAG             computed    discrete boolean
IND_GENDER                   raw         discrete
INDCOMM_HABITANT_HOUSE       computed    discrete
INDCOMM_HABITANT_WORK        computed    discrete
IND_HOUSEHOLDINCOMEGROUP     computed    discrete
IND_HOUSEHOLDNUMBER          computed    continuous integer
IND_INCOMEGROUP              computed    discrete
IND_INTERNETFLAG             computed    discrete boolean
IND_MARITALSTATUS            raw         discrete
IND_MOBILEPHONEUSAGEFLAG     computed    discrete boolean


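Column headers translated from Turkish: TABLO = table, SAHA = field, DOLULUK YUZDE = fill percentage, VERİ DEĞERİ = data value, DOLULUK DURUMU = fill status. Rows at 0% carry the status "full", so the percentage appears to be the share of missing values.

Table             Field                                               %     Data value       Fill status
MUSTERI_GERCEK    EGITIM DURUMU (education status)                    44%   very critical    very critical
MUSTERI_GERCEK    IS YERINDEKI UNVAN (job title at workplace)         39%   very critical    very critical
MUSTERI_TUZEL     ORTAKLIK TIPI (partnership type)                    36%   very critical    very critical
MUSTERI_MUSTERI   DOGUMTARIHI (birth date)                            12%   very critical    critical
MUSTERI_GERCEK    MESLEK KODU (occupation code)                        8%   very critical    less critical
MUSTERI_GERCEK    CINSIYET (gender)                                    4%   very critical    less critical
MUSTERI_TUZEL     FAALIYET ALANI (field of activity)                   0%   very critical    full
MUSTERI_TUZEL     IS SAHASI (business area)                            0%   very critical    full
MUSTERI_MUSTERI   TIP (type)                                           0%   very critical    full
MUSTERI_GERCEK    CALISMA DURUMU (employment status)                  41%   critical         very critical
MUSTERI_MUSTERI   GIRIS KANALI (entry channel)                        36%   critical         very critical
MUSTERI_GERCEK    MEDENI DURUMU (marital status)                      18%   critical         critical
MUSTERI_MUSTERI   DOGUM YERI (birthplace)                              5%   critical         less critical
MUSTERI_TUZEL     KURULUS TIPI GRUBU (establishment type group)        0%   critical         full
MUSTERI_TUZEL     KURULUS TIPI (establishment type)                    0%   critical         full
MUSTERI_GERCEK    SON OKUL ADI (name of last school)                  99%   less critical    very critical
MUSTERI_MUSTERI   SEGMENT (segment)                                   98%   less critical    very critical
MUSTERI_GERCEK    NUFUSA KAYITLI IL (province of civil registration)  88%   less critical    very critical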


  – erroneous (noisy) values
    • e.g. Birth Date > current date or Birth Date < 1850 (approx. 10% of the data)
    • e.g. the permissible values of the Education field are C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate, but the values X, Q, Y, T may also be seen (approx. 10% of the data)
    • e.g. the Income field is negative (approx. 15% of the data)
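The three checks above can be written as simple validation rules. The following is a minimal sketch, not the course's own tooling; the record layout (a dict with `birth_date`, `education` and `income` keys) and the function name `find_errors` are assumptions made for illustration.

```python
from datetime import date

# Permissible education codes as listed on the slide.
VALID_EDUCATION = {"C", "U", "H", "D", "M", "S", "P", "I"}


def find_errors(record, today):
    """Return the names of the fields whose values break a rule.

    `record` is a hypothetical dict; missing (None) values are skipped
    because they are a separate problem from erroneous values.
    """
    errors = []
    birth = record.get("birth_date")
    if birth is not None and (birth > today or birth.year < 1850):
        errors.append("birth_date")
    education = record.get("education")
    if education is not None and education not in VALID_EDUCATION:
        errors.append("education")
    income = record.get("income")
    if income is not None and income < 0:
        errors.append("income")
    return errors


record = {"birth_date": date(1849, 5, 1), "education": "X", "income": -3200}
print(find_errors(record, today=date(2010, 1, 1)))
```

Rules of this kind only flag suspicious values; deciding whether to correct, discard or keep them is a separate cleaning step.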


  – inconsistent values: discrepancies in codes or names
    • e.g. Birth Date = '01/01/1955' vs. 54 (the same information, a birth date and an age, in different forms)
    • e.g. the Education field coded in two different ways:
      (C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate)
      (5: college, 3: university, 4: high school, 1: doctorate, 2: master, 6: secondary school, 7: primary school, 8: illiterate)
    • e.g. the Income field stored as a continuous value (3200 K) or as an interval (3000-4000 K)
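A lookup mapping is the usual way to bring two codings of the same field into one form. The sketch below maps the numeric education codes onto the letter codes, using the two code lists from the slide; the function name `unify_education` and the choice of the letter coding as the target are assumptions.

```python
# Numeric coding -> letter coding, both taken from the slide.
NUMERIC_TO_LETTER = {
    "5": "C", "3": "U", "4": "H", "1": "D",
    "2": "M", "6": "S", "7": "P", "8": "I",
}
LETTER_CODES = set(NUMERIC_TO_LETTER.values())


def unify_education(code):
    """Return the letter code for either coding, or None if unknown."""
    code = str(code).strip().upper()
    if code in LETTER_CODES:
        return code
    return NUMERIC_TO_LETTER.get(code)


print([unify_education(c) for c in ["U", 3, "h", "4", "X"]])
```

Values that match neither coding come back as `None`, so they can be treated as erroneous or missing values afterwards.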


  – Where may dirtiness come from?

Reasons for missing values:
• different considerations in coding and analysis (discrepancies over time)
• hardware/software problems
• different sources not aligned with the same data dictionary

[Figure: Field 1, Field 2, Field 3; Source 1, Source 2, Source 3]

Reasons for erroneous values:
• People give incomplete or only approximately correct information

Field                            Entry 1                             Entry 2
AD (first name)                  Metin Ü.                            M.Ulku
DOĞUM YERİ (birthplace)          GAZİANTEP                           G.ANTEB
DOĞUM TARİHİ (birth date)        04.10.1965                          10/04/1965
SOYAD (surname)                  SANRE                               SANER
ADRESİ (address)                 Atatrk Cad. Kemaliye Mah. 25/3      Atatürk Cd. Kemaliye Sok. No.25
ÇALIŞMA ÜNVANI (job title)       Genel Müdür                         Gen. Müdr.
ÇALIŞMA YERİ (workplace)         Devlet Su İşleri A.O                G.Antep D.S.İ.
.........                        .........                           .........
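Two entries like these will never match under an exact string comparison, so a similarity score is needed. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as one possible measure; it is an illustration only, and both the function name `similarity` and any threshold for calling two values "the same" are assumptions to be tuned on real data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio in [0, 1], ignoring letter case."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


# Field values taken from the two entries on the slide.
pairs = [
    ("SANRE", "SANER"),
    ("GAZİANTEP", "G.ANTEB"),
    ("Genel Müdür", "Gen. Müdr."),
]
for left, right in pairs:
    print(left, "|", right, "|", round(similarity(left, right), 2))
```

A character-level ratio handles transposed letters and abbreviations reasonably well, but it cannot recognise that 04.10.1965 and 10/04/1965 may be the same date written in different day/month order; that needs field-specific parsing.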



• People give incomplete or only approximately correct information, for example in addresses:
  – Esendere Sk. Aşagidere Cikmazi No:42 D: 14 Levent İst
  – Asagidere Yokuşu D:14 Esendere Cd. 3.Levent ISTANBUL
  – Büyükdere Sko. Ihlamur Cad. Ş.Nedim Mha.
  – İhlamur Sokağı Büyükdere Cd. Şair Nedim Sok.



Reasons for erroneous values (continued):
• insufficient or incapable data collection instruments
  – partial matching
  – fuzzy understanding
  – syntactic-semantic enrichment
• a continuous flow of data may cause data entry faults
• errors or disruptions in data transmission



Reasons for inconsistent values:
• insufficient lookup mappings
• incapable transformation infrastructures
• different data sources

Inconsistency is hard to prevent; it needs a highly specialized synchronization and automation infrastructure.

We should also take care of duplicate data (redundancy).


  – Why is pre-processing so important?
    • Data quality brings successful data mining
    • It is the only way to extract information from data

Major tasks in data pre-processing:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization




Major tasks in data cleaning:
  – Fill in missing values
  – Identify outliers and smooth out noisy data
  – Correct inconsistent data
  – Resolve redundancy caused by data integration




How to handle missing data
  – simply do not accept it
  – fill it in manually
  – fill it in automatically with:
    » a global constant, e.g. "unknown" (a new class?!)
    » the attribute mean
    » the attribute mean of all samples belonging to the same class (smarter)
    » the most probable value: inference-based, such as a Bayesian formula or a decision tree
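The first three automatic strategies can be sketched in a few lines. This is an illustration under assumptions: missing values are represented as `None`, the function names are hypothetical, and the income figures are made-up sample data, not values from the data mart.

```python
from statistics import mean


def fill_with_constant(values, constant="unknown"):
    """Replace every missing value with one global constant."""
    return [constant if v is None else v for v in values]


def fill_with_mean(values):
    """Replace missing values with the mean of the observed values."""
    observed_mean = mean(v for v in values if v is not None)
    return [observed_mean if v is None else v for v in values]


def fill_with_class_mean(values, classes):
    """Replace missing values with the mean of the sample's own class.

    Assumes every class has at least one observed value.
    """
    by_class = {}
    for v, c in zip(values, classes):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]


incomes = [3000, None, 5000, 1000, None]
education = ["U", "U", "U", "P", "P"]
print(fill_with_constant(incomes))
print(fill_with_mean(incomes))
print(fill_with_class_mean(incomes, education))
```

In this sample the overall mean fills both gaps with 3000, while the class mean fills them with 4000 and 1000, which is why the slide calls the class-based variant smarter.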


How to handle noisy data
  – Binning (discretization) method:
    » first sort the data and partition it into (equi-depth) bins; then smooth by bin means, by bin medians, by bin boundaries, etc.
    » use the data distribution and domain knowledge
  – Clustering
    » detect and remove outliers
  – Combined computer and human inspection
    » detect suspicious values and have a human check them (e.g. to deal with possible outliers)
  – Regression
    » smooth by fitting the data to regression functions
  – Model the data and infer the most probable values (difficult)


Binning
• Equal-width (distance) partitioning:
  – Divides the range into N intervals of equal size: a uniform grid
  – If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B – A)/N
  – The most straightforward method, but outliers may dominate the presentation
  – Skewed data is not handled well
• Equal-depth (frequency) partitioning:
  – Divides the range into N intervals, each containing approximately the same number of samples
  – Good data scaling
  – Managing categorical attributes can be tricky
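The two partitioning schemes can be compared on the same data. The sketch below is a minimal illustration with hypothetical function names; it assumes numeric values with B > A, and it uses the sorted price list from the binning example in this lecture.

```python
def equal_width_bins(values, n):
    """Split the range [A, B] into n intervals of width W = (B - A) / n."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / n  # assumes highest > lowest
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The highest value would fall into bin n, so clamp it to the last bin.
        index = min(int((v - lowest) / width), n - 1)
        bins[index].append(v)
    return bins


def equal_depth_bins(values, n):
    """Split the sorted values into n bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))
print(equal_depth_bins(prices, 3))
```

With three bins the equal-width intervals hold 3, 3 and 6 values, while the equal-depth bins hold 4 values each.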




Binning
Sorted data (e.g. by price): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
  – Bin 1: 4, 8, 9, 15
  – Bin 2: 21, 21, 24, 25
  – Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
  – Bin 1: 9, 9, 9, 9
  – Bin 2: 23, 23, 23, 23
  – Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
  – Bin 1: 4, 4, 4, 15
  – Bin 2: 21, 21, 25, 25
  – Bin 3: 26, 26, 26, 34
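The two smoothing rules in this example can be checked with a short sketch. The function names are hypothetical; rounding the bin mean to an integer and sending ties to the lower boundary are assumptions chosen to match the slide's numbers.

```python
def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lowest, highest = min(b), max(b)
        # Ties go to the lower boundary (an assumption).
        smoothed.append([lowest if v - lowest <= highest - v else highest for v in b])
    return smoothed


bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

The bin means before rounding are 9, 22.75 and 29.25, which is where the slide's 9, 23 and 29 come from.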


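Regression

[Figure: data points in the x-y plane with the fitted line y = x + 1; the observed value Y1 at X1 is replaced by the value Y1' on the line]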
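Smoothing by regression means replacing each observed y with the value predicted by the fitted function. The following sketch fits a straight line by ordinary least squares; the function names and the four sample points are assumptions, with the points chosen so that the fitted line is y = x + 1 as in the figure.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes the x values are not all equal
    return slope, mean_y - slope * mean_x


def smooth_by_regression(xs, ys):
    """Replace each y with the value on the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


xs = [0, 1, 2, 3]
ys = [1.5, 1.5, 2.5, 4.5]  # noisy observations around y = x + 1
print(fit_line(xs, ys))
print(smooth_by_regression(xs, ys))
```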


Clustering


How to handle inconsistent data
  – systematic conversion ("transformation")
  – dynamic and interactive control mechanisms
  – redundancy detection and intelligent mapping


Transformation
  – Smoothing: remove noise from the data
  – Aggregation: summarization, data cube construction
  – Generalization: concept hierarchy climbing
  – Normalization: values scaled to fall within a small, specified range
    » min-max normalization
    » z-score normalization
    » normalization by decimal scaling
  – Attribute/feature construction: new attributes constructed from the given ones



Remember Stats Facts
• Min:
  – What is the big-O cost of finding the minimum of a list of size n?
• Max:
  – What is the minimum number of comparisons needed to find the maximum of a list of size n?
• Range:
  – What about finding the minimum and maximum simultaneously?
• Value types:
  – Cardinal value -> how many; counting numbers
  – Nominal value -> names and identifies something
  – Ordinal value -> order of things; rank, position
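For reference when working through these questions: a single scan finds the minimum (or the maximum) in O(n) time with n - 1 comparisons. Finding both at once does not need 2(n - 1) comparisons; processing the values in pairs needs about 3n/2. The sketch below is one well-known way to do this and counts its own comparisons; the function name `min_max` is an assumption.

```python
def min_max(values):
    """Return (minimum, maximum, comparisons) using pairwise processing."""
    n = len(values)
    if n == 0:
        raise ValueError("min_max() needs at least one value")
    comparisons = 0
    if n % 2:  # odd length: the first value starts as both min and max
        lowest = highest = values[0]
        start = 1
    else:      # even length: one comparison orders the first pair
        comparisons += 1
        if values[0] < values[1]:
            lowest, highest = values[0], values[1]
        else:
            lowest, highest = values[1], values[0]
        start = 2
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        # Three comparisons per pair: a vs b, smaller vs min, larger vs max.
        comparisons += 3
        small, large = (a, b) if a < b else (b, a)
        if small < lowest:
            lowest = small
        if large > highest:
            highest = large
    return lowest, highest, comparisons


print(min_max([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
```

For the 12 values above this takes 16 comparisons instead of the 22 needed by two separate scans.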

Transformation
• Min-max normalization: to [new_min_A, new_max_A]

$$v' = \frac{v - \mathrm{min}_A}{\mathrm{max}_A - \mathrm{min}_A}\,(\mathrm{new\_max}_A - \mathrm{new\_min}_A) + \mathrm{new\_min}_A$$

  – Ex. Let the income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to

$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$

• Z-score normalization (μ: mean, σ: standard deviation):

$$v' = \frac{v - \mu_A}{\sigma_A}$$

  – Ex. Let μ = 54,000, σ = 16,000. Then

$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$

• Normalization by decimal scaling:

$$v' = \frac{v}{10^{j}}$$

  where j is the smallest integer such that Max(|v'|) < 1
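The three normalizations translate directly into code. This is a minimal sketch with hypothetical function names; the income figures are the ones from the examples above, while the values passed to `decimal_scaling` are illustrative and not from the slide.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min


def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer
    such that every scaled absolute value is below 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


print(round(min_max_normalize(73_600, 12_000, 98_000), 3))
print(round(z_score(73_600, 54_000, 16_000), 3))
print(decimal_scaling([-986, 917]))
```

Min-max normalization needs the true minimum and maximum, so a single outlier can squeeze all other values into a narrow band; the z-score is the usual alternative when the extremes are unknown or unreliable.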

Remember Stats Facts
• Mean (algebraic measure) (sample vs. population):

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}$$

  – Weighted arithmetic mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

  – Trimmed mean: chopping extreme values
• Median: a holistic measure
  – Middle value if there is an odd number of values, or the average of the middle two values otherwise
  – Estimated by interpolation (for grouped data)
• Mode
  – Value that occurs most frequently in the data
  – Unimodal, bimodal, trimodal
  – Empirical formula:

$$\mathrm{mean} - \mathrm{mode} = 3 \times (\mathrm{mean} - \mathrm{median})$$
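These measures are easy to try out with the Python standard library. The sketch below is an illustration only: `weighted_mean` and `trimmed_mean` are hypothetical helper names, the price list is the one from the binning example, and the weights are made up.

```python
from statistics import mean, median, multimode


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Mean after chopping the k smallest and k largest values (needs 2k < n)."""
    ordered = sorted(values)
    return mean(ordered[k:len(ordered) - k])


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(round(mean(prices), 2))          # arithmetic mean
print(median(prices))                  # average of the middle two values
print(multimode(prices))               # most frequent value(s)
print(trimmed_mean(prices, 1))         # without the lowest and highest price
print(weighted_mean([1, 2, 3], [1, 1, 2]))
```

`multimode` returns a list, so unimodal, bimodal and trimodal data give one, two or three values respectively.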

Week 4-End

• Reading: course textbook, Chapter 2
