Data Mining
Lecture 4
Course Syllabus
• Course topics:
  • Data Management and Data Collection Techniques for Data Mining Applications (Week 3 to Week 4)
    – Data Warehouses: gathering raw data from relational databases and transforming it into information
    – Information Extraction and Data Processing Techniques
    – Data Marts: the need for building highly specialized data stores for data mining applications
  • Case Study 1: Working with and exploring the properties of the Retail Banking Data Mart (Week 4, Assignment 1)
Data Pre-processing: Information Extraction and Data Processing Techniques
• Why should we do pre-processing?
• Pre-processing takes 80% of the time
• Real-world data is not perfect (it is dirty)
– missing values (no data was entered)
• e.g. 35% of the Education field is incomplete
• e.g. 20% of the Birth Date field is incomplete
• e.g. 45% of the Work Title field is incomplete
• e.g. 60% of the Income field is incomplete
Data Pre-processing: Information Extraction and Data Processing Techniques
Section name | Number of variables
ATM | 36
Personal loans | 108
Personal insurance | 26
Call center | 30
Cheques | 69
Debit cards | 52
Demographic data | 54
Economic data | 402
Bill payments | 64
Non-cash loans | 48
Treasury bills and government bonds | 64
Internet | 30
Credit cards | 230
Overdraft accounts | 33
Salary payments | 17
POS | 30
Repo | 28
Commercial loans | 68
Commercial insurance | 26
Time deposits | 77
Demand deposits | 318
Investment funds | 106
Other products | 21
Total | 1,937
Data Pre-processing: Information Extraction and Data Processing Techniques
Variable | Source | Scale | Data type
IND_CAROWNERGROUP | computed data | discrete |
INDCOMM_COUNTRY_HOUSE | raw data | discrete |
INDCOMM_COUNTRY_WORK | raw data | discrete |
INDCOMM_COUNTY_HOUSE | raw data | discrete |
INDCOMM_COUNTY_WORK | raw data | discrete |
INDCOMM_EDUCATIONLEVEL | raw data | discrete |
IND_EMPLOYEEFLAG | computed data | discrete | boolean
IND_GENDER | raw data | discrete |
INDCOMM_HABITANT_HOUSE | computed data | discrete |
INDCOMM_HABITANT_WORK | computed data | discrete |
IND_HOUSEHOLDINCOMEGROUP | computed data | discrete |
IND_HOUSEHOLDNUMBER | computed data | continuous | integer
IND_INCOMEGROUP | computed data | discrete |
IND_INTERNETFLAG | computed data | discrete | boolean
IND_MARITALSTATUS | raw data | discrete |
IND_MOBILEPHONEUSAGEFLAG | computed data | discrete | boolean
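Data Pre-processing: Information Extraction and Data Processing Techniques
Table | Field | Fill percentage | Data value | Fill status
MUSTERI_GERCEK | EGITIM DURUMU (education status) | 44% | very critical | very critical
MUSTERI_GERCEK | IS YERINDEKI UNVAN (title at workplace) | 39% | very critical | very critical
MUSTERI_TUZEL | ORTAKLIK TIPI (partnership type) | 36% | very critical | very critical
MUSTERI_MUSTERI | DOGUMTARIHI (birth date) | 12% | very critical | critical
MUSTERI_GERCEK | MESLEK KODU (occupation code) | 8% | very critical | slightly critical
MUSTERI_GERCEK | CINSIYET (gender) | 4% | very critical | slightly critical
MUSTERI_TUZEL | FAALIYET ALANI (field of activity) | 0% | very critical | full
MUSTERI_TUZEL | IS SAHASI (business area) | 0% | very critical | full
MUSTERI_MUSTERI | TIP (type) | 0% | very critical | full
MUSTERI_GERCEK | CALISMA DURUMU (employment status) | 41% | critical | very critical
MUSTERI_MUSTERI | GIRIS KANALI (entry channel) | 36% | critical | very critical
MUSTERI_GERCEK | MEDENI DURUMU (marital status) | 18% | critical | critical
MUSTERI_MUSTERI | DOGUM YERI (birth place) | 5% | critical | slightly critical
MUSTERI_TUZEL | KURULUS TIPI GRUBU (organization type group) | 0% | critical | full
MUSTERI_TUZEL | KURULUS TIPI (organization type) | 0% | critical | full
MUSTERI_GERCEK | SON OKUL ADI (name of last school) | 99% | slightly critical | very critical
MUSTERI_MUSTERI | SEGMENT (segment) | 98% | slightly critical | very critical
MUSTERI_GERCEK | NUFUSA KAYITLI IL (province of civil registration) | 88% | slightly critical | very critical
Note: the column is labeled "fill percentage" in the source, but the status column (0% = full, 88% and above = very critical) suggests it is the share of empty values.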
Data Pre-processing: Information Extraction and Data Processing Techniques
– erroneous (noisy) values
• e.g. Birth Date > current date, or Birth Date < 1850 (approx. 10% of the data)
• e.g. the permissible values of the Education field are C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate, but the values X, Q, Y, T may also be seen (approx. 10% of the data)
• e.g. the Income field is negative (approx. 15% of the data)
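The three error rules above can be written as simple per-record checks. The following is a minimal sketch, assuming records are plain dictionaries; the field names `birth_date`, `education` and `income` and the function `find_errors` are hypothetical and not part of the lecture's data mart.

```python
from datetime import date

# Permissible education codes, taken from the slide.
EDUCATION_CODES = {"C", "U", "H", "D", "M", "S", "P", "I"}

def find_errors(record, today=None):
    """Return the list of rule violations for one record (a dict).

    Missing (None) fields are skipped here; they are a separate problem.
    """
    today = today or date.today()
    errors = []
    birth = record.get("birth_date")
    if birth is not None and (birth > today or birth.year < 1850):
        errors.append("birth_date out of range")
    education = record.get("education")
    if education is not None and education not in EDUCATION_CODES:
        errors.append("education code not permissible")
    income = record.get("income")
    if income is not None and income < 0:
        errors.append("income is negative")
    return errors

# Made-up records for illustration only.
records = [
    {"birth_date": date(1965, 10, 4), "education": "U", "income": 3200},
    {"birth_date": date(1800, 1, 1), "education": "X", "income": -50},
]
for record in records:
    print(find_errors(record, today=date(2024, 1, 1)))
```

Checks like these only flag suspicious values; deciding how to repair them is a separate cleaning step.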
Data Pre-processing: Information Extraction and Data Processing Techniques
– inconsistent values: discrepancies in codes or names
• e.g. Birth Date = '01/01/1955' in one place and 54 (an age) in another: the same information in different forms
• e.g. the Education field is coded with letters in one source and with digits in another:
(C: college, U: university, H: high school, D: doctorate, M: master, S: secondary school, P: primary school, I: illiterate)
(5: college, 3: university, 4: high school, 1: doctorate, 2: master, 6: secondary school, 7: primary school, 8: illiterate)
• e.g. the Income field is continuous (3200 K) in one source and interval-based (3000-4000 K) in another
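One way to resolve such coding discrepancies is to pick one scheme as the target and map every source onto it. The sketch below assumes the letter scheme is the target and that income intervals are 1000 K wide; both choices, and the function names, are illustrative assumptions rather than the lecture's method.

```python
# Digit scheme -> letter scheme, following the two code lists on the slide.
NUMERIC_TO_LETTER = {
    "5": "C", "3": "U", "4": "H", "1": "D",
    "2": "M", "6": "S", "7": "P", "8": "I",
}
LETTER_CODES = set(NUMERIC_TO_LETTER.values())

def to_letter_code(value):
    """Map an education value from either scheme to the letter scheme.

    Returns None for values that belong to neither scheme (e.g. "X").
    """
    text = str(value).strip().upper()
    if text in LETTER_CODES:
        return text
    return NUMERIC_TO_LETTER.get(text)

def income_to_interval(income, width=1000):
    """Map a continuous income to the (lower, upper) interval containing it."""
    lower = (income // width) * width
    return (lower, lower + width)

print(to_letter_code(5), to_letter_code(" u "), to_letter_code("X"))
print(income_to_interval(3200))
```

Mapping the continuous income down to intervals loses precision, so the direction of the conversion is a design choice that depends on which sources dominate.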
Data Pre-processing: Information Extraction and Data Processing Techniques
– Where may dirtiness come from?
The reasons for missing values:
• different considerations in coding and analyzing (discrepancies over time)
• hardware/software problems
• different sources not aligned with the same data dictionary
[Figure: Source 1, Source 2 and Source 3, each with its own Field 1, Field 2 and Field 3, next to a fourth set of Field 1, Field 2 and Field 3]
Data Pre-processing: Information Extraction and Data Processing Techniques
– Where may dirtiness come from?
The reasons for erroneous values:
People give incomplete or only approximately correct information
Field | Record 1 | Record 2
AD (first name) | Metin Ü. | M.Ulku
DOĞUM YERİ (birth place) | GAZİANTEP | G.ANTEB
DOĞUM TARİHİ (birth date) | 04.10.1965 | 10/04/1965
SOYAD (surname) | SANRE | SANER
ADRESİ (address) | Atatrk Cad. Kemaliye Mah. 25/3 | Atatürk Cd. Kemaliye Sok. No.25
ÇALIŞMA ÜNVANI (job title) | Genel Müdür | Gen. Müdr.
ÇALIŞMA YERİ (workplace) | Devlet Su İşleri A.O | G.Antep D.S.İ.
......... | ......... | .........
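Records like the two above cannot be linked by exact comparison, so some form of approximate string matching is needed. The sketch below uses `difflib.SequenceMatcher` from the Python standard library as one possible similarity measure; the helper name `similarity` and the selection of fields are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity in [0, 1] between two strings, ignoring case and outer spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()

# Field values copied from the two example records.
record_1 = {"surname": "SANRE", "birth_place": "GAZİANTEP", "job_title": "Genel Müdür"}
record_2 = {"surname": "SANER", "birth_place": "G.ANTEB", "job_title": "Gen. Müdr."}

for field in record_1:
    score = similarity(record_1[field], record_2[field])
    print(f"{field}: {score:.2f}")
```

A character-level score handles transposed letters and abbreviations reasonably well, but it does not know that 04.10.1965 and 10/04/1965 may be the same date in different day/month orders; dates need format-aware parsing.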
Data Pre-processing: Information Extraction and Data Processing Techniques
– Where may dirtiness come from?
The reasons for erroneous values:
People give incomplete or only approximately correct information
• Esendere Sk. Aşagidere Cikmazi No:42 D: 14 Levent İst
• Asagidere Yokuşu D:14 Esendere Cd. 3.Levent ISTANBUL
• Büyükdere Sko. Ihlamur Cad. Ş.Nedim Mha.
• İhlamur Sokağı Büyükdere Cd. Şair Nedim Sok.
Data Pre-processing: Information Extraction and Data Processing Techniques
– Where may dirtiness come from?
The reasons for erroneous values:
• insufficient, incapable data collection instruments
– partial matching
– fuzzy understanding
– syntactic and semantic enrichment
• a continuous flow of data may cause data entry faults
• errors or disruption in data transmission
Data Pre-processing: Information Extraction and Data Processing Techniques
– Where may dirtiness come from?
The reasons for inconsistent values:
• insufficient lookup mappings
• incapable transformation infrastructures
• different data sources
These are hard to prevent; prevention needs a highly specialized synchronization and automation infrastructure.
We should also take care of duplicate data (redundancy).
Data Pre-processing: Information Extraction and Data Processing Techniques
– Why is pre-processing so important?
• Data quality brings successful data mining
• It is the only way to extract information from data
Major tasks in Data Pre-processing:
• Data cleaning
• Data integration
• Data transformation
• Data reduction
• Data discretization
Data Pre-processing: Information Extraction and Data Processing Techniques
Major tasks in Data Cleaning:
– Fill in missing values
– Identify outliers and smooth out noisy data
– Correct inconsistent data
– Resolve redundancy caused by data integration
Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle missing data
– simply do not accept it
– fill it in manually
– fill it in automatically with:
» a global constant, e.g. "unknown" (a new class?!)
» the attribute mean
» the attribute mean for all samples belonging to the same class (smarter)
» the most probable value: inference-based, such as a Bayesian formula or a decision tree
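The "attribute mean for all samples belonging to the same class" option can be sketched as follows. This is an illustrative implementation under stated assumptions: rows are dictionaries, missing values are `None`, at least one value of the attribute is known, and the column names `segment` and `income` are made up.

```python
from collections import defaultdict
from statistics import mean

def fill_with_class_mean(rows, attribute, class_attribute):
    """Fill None values of `attribute` with the mean of that attribute within
    the row's class; fall back to the overall mean if the class has no known value.
    """
    known = [r[attribute] for r in rows if r[attribute] is not None]
    overall = mean(known)
    by_class = defaultdict(list)
    for r in rows:
        if r[attribute] is not None:
            by_class[r[class_attribute]].append(r[attribute])
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input rows are left untouched
        if r[attribute] is None:
            values = by_class.get(r[class_attribute])
            r[attribute] = mean(values) if values else overall
        filled.append(r)
    return filled

# Made-up rows for illustration only.
rows = [
    {"segment": "retail", "income": 3000},
    {"segment": "retail", "income": 5000},
    {"segment": "retail", "income": None},
    {"segment": "corporate", "income": 9000},
    {"segment": "corporate", "income": None},
]
filled = fill_with_class_mean(rows, "income", "segment")
print([r["income"] for r in filled])
```

Filling with the plain attribute mean is the special case where every row is treated as belonging to one class.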
Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle noisy data
– Binning (discretization) method:
» first sort the data and partition it into (equi-depth) bins
» then smooth by bin means, by bin medians, by bin boundaries, etc.
» use the data distribution and domain knowledge
– Clustering
» detect and remove outliers
– Combined computer and human inspection
» detect suspicious values and have a human check them (e.g., deal with possible outliers)
– Regression
» smooth by fitting the data to regression functions
– Model the data and infer the most probable values (difficult)
Data Pre-processing: Information Extraction and Data Processing Techniques
Binning
• Equal-width (distance) partitioning:
– divides the range into N intervals of equal size (a uniform grid)
– if A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B - A)/N
– the most straightforward method, but outliers may dominate the presentation
– skewed data is not handled well
• Equal-depth (frequency) partitioning:
– divides the range into N intervals, each containing approximately the same number of samples
– good data scaling
– managing categorical attributes can be tricky
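The two partitioning schemes can be compared on the same data. The sketch below is one straightforward way to implement them, assuming numeric values that are not all equal (otherwise W would be zero); the function names are illustrative, and the price list is the one used in the lecture's binning example.

```python
def equal_width_bins(values, n):
    """Assign each value to one of n intervals of width W = (B - A) / n."""
    a, b = min(values), max(values)
    width = (b - a) / n
    bins = [[] for _ in range(n)]
    for v in values:
        # The maximum value would fall into bin n, so clamp it into the last bin.
        index = min(int((v - a) / width), n - 1)
        bins[index].append(v)
    return bins

def equal_depth_bins(values, n):
    """Sort the values and cut them into n bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n)
    bins, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(equal_width_bins(prices, 3))  # [[4, 8, 9], [15, 21, 21], [24, 25, 26, 28, 29, 34]]
print(equal_depth_bins(prices, 3))  # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

With W = (34 - 4)/3 = 10, the equal-width bins hold 3, 3 and 6 values, which shows how uneven the counts become when the data is not uniformly spread.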
Data Pre-processing: Information Extraction and Data Processing Techniques
Binning
Sorted data (e.g., by price)
– 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
• Partition into (equi-depth) bins:
– Bin 1: 4, 8, 9, 15
– Bin 2: 21, 21, 24, 25
– Bin 3: 26, 28, 29, 34
• Smoothing by bin means:
– Bin 1: 9, 9, 9, 9
– Bin 2: 23, 23, 23, 23
– Bin 3: 29, 29, 29, 29
• Smoothing by bin boundaries:
– Bin 1: 4, 4, 4, 15
– Bin 2: 21, 21, 25, 25
– Bin 3: 26, 26, 26, 34
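The two smoothing rules in this example can be reproduced with a few lines of code. This is a sketch rather than a reference implementation: rounding the bin mean to the nearest integer matches the numbers on the slide, and sending ties to the lower boundary is an assumption, since the example contains no tie.

```python
def smooth_by_means(bins):
    """Replace every value by its bin mean, rounded to the nearest integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum.

    Ties go to the lower boundary (an assumption).
    """
    smoothed = []
    for b in bins:
        low, high = min(b), max(b)
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

bins = [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```

The unrounded bin means are 9, 22.75 and 29.25, so the slide's 23 and 29 are rounded values.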
Data Pre-processing: Information Extraction and Data Processing Techniques
Regression
[Figure: x-y plot of the regression line y = x + 1, with X1, Y1 and Y1' marked; Y1' is the value on the line at X1]
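Smoothing by regression replaces each observed value by the value the fitted function predicts for it. The sketch below fits a straight line by ordinary least squares, which is one common choice; the data points are made up to lie near y = x + 1, and the function names are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def smooth(xs, ys):
    """Replace each observed y by the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]

# Made-up noisy observations scattered around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(smooth(xs, ys))
```

The fit assumes the x values are not all identical; otherwise the slope is undefined.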
Data Pre-processing: Information Extraction and Data Processing Techniques
Clustering
Data Pre-processing: Information Extraction and Data Processing Techniques
How to handle inconsistent data
– systematic conversion ("transformation")
– dynamic and interactive control mechanisms
– redundancy detection and intelligent mapping
Data Pre-processing: Information Extraction and Data Processing Techniques
Transformation
– Smoothing: remove noise from data
– Aggregation: summarization, data cube construction
– Generalization: concept hierarchy climbing
– Normalization: values scaled to fall within a small, specified range
» min-max normalization
» z-score normalization
» normalization by decimal scaling
– Attribute/feature construction: new attributes constructed from the given ones
Remember Stats Facts
• Min:
– What is the big-O cost of finding the min of an n-sized list?
• Max:
– What is the minimum number of comparisons needed to find the max of an n-sized list?
• Range:
– What about finding the min and max simultaneously?
• Value types:
– Cardinal value -> how many; counting numbers
– Nominal value -> names and identifies something
– Ordinal value -> order of things: rank, position
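• Empirical relation between mean, median and mode: \( \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \)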
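On the min, max and range questions: a single scan finds the min (or the max) of an unsorted list in O(n) time with n - 1 comparisons. Finding both with two separate scans costs 2n - 2 comparisons, but processing the elements in pairs brings this down to roughly 3n/2. The sketch below shows the pairing idea; the function name `min_max` is illustrative.

```python
def min_max(values):
    """Find the min and max together with about 3n/2 comparisons."""
    if not values:
        raise ValueError("values must be non-empty")
    n = len(values)
    if n % 2:
        # Odd length: seed both results with the first element.
        lo = hi = values[0]
        start = 1
    else:
        # Even length: one comparison orders the first pair.
        a, b = values[0], values[1]
        lo, hi = (a, b) if a < b else (b, a)
        start = 2
    for i in range(start, n, 2):
        a, b = values[i], values[i + 1]
        if a < b:            # 1 comparison inside the pair
            small, large = a, b
        else:
            small, large = b, a
        if small < lo:       # 1 comparison against the current min
            lo = small
        if large > hi:       # 1 comparison against the current max
            hi = large
    return lo, hi

print(min_max([24, 8, 34, 4, 21, 15, 9]))  # (4, 34)
```

Each pair costs three comparisons instead of the four needed when every element is checked against both the current min and the current max.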
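Transformation
• Min-max normalization to [new_min_A, new_max_A]:
\[ v' = \frac{v - \mathit{min}_A}{\mathit{max}_A - \mathit{min}_A}\,(\mathit{new\_max}_A - \mathit{new\_min}_A) + \mathit{new\_min}_A \]
– Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then $73,600 is mapped to
\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716 \]
• Z-score normalization (μ_A: mean, σ_A: standard deviation of attribute A):
\[ v' = \frac{v - \mu_A}{\sigma_A} \]
– Ex. Let μ = 54,000 and σ = 16,000. Then $73,600 is mapped to
\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]
• Normalization by decimal scaling:
\[ v' = \frac{v}{10^{j}} \]
where j is the smallest integer such that max(|v'|) < 1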
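The three normalization methods translate directly into code. The sketch below reproduces the two worked income examples; the function names are illustrative, the values passed to `decimal_scale` are made up, and the decimal-scaling helper only searches j >= 0, which is a simplifying assumption.

```python
def min_max_normalize(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scale(values):
    """Divide by 10**j, with j the smallest j >= 0 such that max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(round(min_max_normalize(73_600, 12_000, 98_000), 3))  # 0.716
print(z_score(73_600, 54_000, 16_000))                       # 1.225
print(decimal_scale([-986, 917]))                            # j = 3 for these values
```

Min-max normalization needs max_A > min_A, and a later value outside the original range falls outside the target range; z-score normalization avoids that dependence on the observed extremes.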
Remember Stats Facts
• Mean (algebraic measure) (sample vs. population):
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]
– Weighted arithmetic mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]
– Trimmed mean: chopping extreme values
• Median: a holistic measure
– Middle value if there is an odd number of values, or the average of the middle two values otherwise
– Estimated by interpolation (for grouped data)
• Mode
– Value that occurs most frequently in the data
– Unimodal, bimodal, trimodal
– Empirical formula:
\[ \text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median}) \]
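These measures are easy to compute and compare on a small sample. The sketch below uses Python's standard `statistics` module for the mean, median and mode, and adds illustrative helpers for the weighted and trimmed means; the helper names and the trimming fraction are assumptions, and the sample is the price list from the binning example.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Mean after dropping `fraction` of the values from each end of the sorted data.

    `fraction` must be below 0.5 so that some values remain.
    """
    ordered = sorted(values)
    k = int(len(ordered) * fraction)
    return mean(ordered[k:len(ordered) - k])

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
print(mean(data))                          # 244 / 12, about 20.33
print(median(data))                        # 22.5, the average of 21 and 24
print(multimode(data))                     # [21]
print(trimmed_mean(data, 0.1))             # 20.6, after dropping 4 and 34
print(weighted_mean([1, 2, 3], [1, 1, 2])) # 2.25
```

`multimode` returns every most-frequent value, so a bimodal or trimodal sample yields a list with two or three entries.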
Week 4: End
• Read: Course textbook, Chapter 2