Seoul National University Biointelligence Laboratory
Byoung-Hee Kim
Machine Learning Practice using Weka
2016-06-27
Outline
Practice 1: Getting familiar with Weka
Practice 2: Machine learning with Weka
Reference material: Data Mining & Machine Learning using Weka
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 2
Byoung-Hee Kim
Practice #1
Machine Learning Practice using Weka
Outline
Preprocessing and basic analysis with the Weka Explorer: Filter, Visualize. Dataset: diabetes
Classification with the Weka Explorer. Datasets: Iris, diabetes. Classifiers: decision trees (ID3, J48, SimpleCart), Random Forest
Batch analysis with the Weka Experimenter. Dataset: diabetes. Model performance comparison using the t-test
Guide to the deliverables to submit for the practice session
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 4
Structure of Practice 1
Solving classification problems right away. Problems: iris species identification, diabetes diagnosis. Goal: predictive. Core steps: statistical prediction/modeling, interpretation of results. Tools: Weka Explorer, Experimenter
Preliminary work to understand and solve the classification problem better. Goal: exploratory. Core step: exploratory data analysis. Main tasks:
Data preprocessing (evaluating and selecting attributes by their contribution to classification), data visualization
Tool: Weka Explorer
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 5
Introduction to Weka
Weka: a representative collection of machine learning algorithms and a data mining tool. Main features:
Data pre-processing, feature selection, clustering, visualization, classification, regression, time-series forecasting, association rule learning
Software characteristics: free and open source (GNU General Public License); the basis of major analytics software such as RapidMiner and MOA; implemented in Java, so it runs on many platforms; can be used together with Python, C, R, Matlab, and other major tools
Download: search Google for "Weka" (first result), or go to http://www.cs.waikato.ac.nz/ml/weka/
(c)2008-2016, SNU Biointelligence Lab. 6Weka (bird): http://www.arkive.org/weka/gallirallus-australis/video-au00.html
Top 20 Most Popular Tools for Big Data, Data Mining, and Data Science
(c)2008-2016, SNU Biointelligence Lab. 7Source: http://www.kdnuggets.com/2015/06/data-mining-data-science-tools-associations.html
Red: Free/Open Source tools; Green: Commercial tools; Fuchsia: Hadoop/Big Data tools
The interfaces that make up Weka
(c)2008-2016, SNU Biointelligence Lab. 8
Explorer: run various analysis tasks step by step and inspect the results. Typically the first interface to launch
KnowledgeFlow: compose the main modules of a data-processing pipeline as a visual graph and run experiments
Experimenter: batch processing of classification and regression analyses, with comparative analysis of the results - configure various algorithms and parameter settings - analyze several dataset-algorithm combinations at once - statistically compare the resulting models - run large-scale statistical experiments
Simple CLI: a script console that controls the other interfaces. Every Weka feature can be run from the command line
Other main tools
Workbench: a unified interface that integrates the other interfaces (introduced in 3.8.0)
Example: iris classification
(c)2008-2016, SNU Biointelligence Lab. 9
Iris virginica, Iris versicolor, Iris setosa
Features used for classification
Example: iris classification
Define the features (attributes): sepal length, sepal width, petal length, petal width. Class label: the three iris subspecies setosa, versicolor, and virginica
Collect samples and construct the data: 50 samples collected per subspecies (1935). Data table: 150 samples (instances) x 5 attributes. Sir R. Fisher applied a linear discriminant model to this data in his 1936 paper
Learning: choose a classification algorithm and set its parameters. Practice with three algorithms: neural networks, decision trees, SVM. Use the default parameter settings for each algorithm
Evaluate the learned models and select one: check the various evaluation metrics, then compare and select the learned models (algorithm + parameter setting) based on those metrics
(c)2008-2016, SNU Biointelligence Lab. 10
Practice: the iris dataset
Just open "iris.arff" in the 'data' folder
(c)2008-2016, SNU Biointelligence Lab. 11
Weka data format (.ARFF)
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth real
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…
Data (CSV format)
Header
Note: after creating a CSV file with Excel, you can produce an ARFF file easily just by adding the header
(c)2008-2016, SNU Biointelligence Lab.
Dataset name Attribute name Attribute type
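The same CSV-to-ARFF conversion can be done with the Weka API. A minimal Java sketch (the file names are placeholders; Weka generates the @relation/@attribute header from the data automatically):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load a CSV file created e.g. with Excel (placeholder path)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("iris.csv"));
        Instances data = loader.getDataSet();

        // Save it as ARFF; the header section is generated from the data
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("iris.arff"));
        saver.writeBatch();
    }
}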
ARFF Example
13
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
Slide from Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)(c)2008-2016, SNU Biointelligence Lab.
Exploratory data analysis in the Preprocess tab
Data composition (current relation); removing attributes; basic statistics per attribute (selected attribute); visualizing the class label distribution over all attributes (Visualize All); preprocessing with 'Filter'
(c)2008-2016, SNU Biointelligence Lab. 14
Example of the process of training and evaluating a predictive model
[Pipeline diagram] Data matrix (rows: instances, columns: features) → Dataset preparation/cleaning (fill missing values) → Feature manipulation (normalization, standardization, feature selection, PCA) → Classification/Regression (SVM, decision tree, neural networks) → Evaluation (accuracy, cross-validation, ROC curve, AUC: Area Under ROC Curve)
Compose and test various processes to solve the problem
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 15
Practice: Pima Indian diabetes diagnosis
Description: Pima Indians have the highest prevalence of diabetes in the world. We will build classification models that diagnose whether a patient shows signs of diabetes. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Configuration of the data set: 768 instances, 8 attributes (age, number of times pregnant, results of medical tests/analysis), all numeric (integer or real-valued)
Class label = 1 (positive example), interpreted as "tested positive for diabetes": 268 instances
Class label = 0 (negative example): 500 instances
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 16
DATA PREPROCESSING AND DESCRIPTIVE ANALYSIS WITH WEKA
Part Ⅰ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 17
Descriptive analysis in the Preprocess tab
Data composition (current relation); removing attributes; basic statistics per attribute (selected attribute); visualizing the class label distribution over all attributes (Visualize All); preprocessing with 'Filter': explained in Part V
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 18
Data preprocessing with Weka - Explorer - Preprocess - Filter
Main preprocessing filters to practice: Fill in missing values
using weka.filters.unsupervised.attribute.ReplaceMissingValues. Standardization for all the attributes: transform x into z
using weka.filters.unsupervised.attribute.Standardize. Data reduction using PCA
using weka.filters.unsupervised.attribute.PrincipalComponents; set the maximumAttributes parameter to an arbitrary number between 10 and 50
Check the effect of PCA using 'Visualize - Plot Matrix' (next slide)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 19
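The same three filters can also be applied from Java code. A minimal sketch (the file name and the value 20 are placeholders, and the setMaximumAttributes setter is assumed to correspond to the maximumAttributes option shown in the GUI):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.attribute.Standardize;

public class PreprocessSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 1. Fill in missing values with the attribute means/modes
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        Instances filled = Filter.useFilter(data, rmv);

        // 2. Standardize all numeric attributes to zero mean and unit variance
        Standardize std = new Standardize();
        std.setInputFormat(filled);
        Instances standardized = Filter.useFilter(filled, std);

        // 3. Data reduction with PCA (setter name assumed; the value is a placeholder)
        PrincipalComponents pca = new PrincipalComponents();
        pca.setMaximumAttributes(20);
        pca.setInputFormat(standardized);
        Instances reduced = Filter.useFilter(standardized, pca);

        System.out.println("Attributes after PCA: " + reduced.numAttributes());
    }
}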
Descriptive analysis with Weka - Explorer - Visualize
Check the effect of PCA using 'Visualize - Plot Matrix': compare the two plot matrices, before and after applying PCA, to confirm the effect of PCA
Capture the Plot Matrix screen before PCA (401-dimensional data) and after PCA (data reduced to the number of principal components), then compare and interpret them
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 20
CLASSIFICATION WITH THE WEKA EXPLORER
Part Ⅱ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 21
22
Classification algorithms - Decision Trees
J48 (the Java implementation of C4.5)
The learned model gives the classification rules in the form of a 'tree'. Where to find it in Weka: classifiers - trees - J48
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
23
Comparison: Neural Networks
MLP (Multilayer Perceptron): a representative classification algorithm that is used very widely in practice. Where to find it in Weka: classifiers - functions - MultilayerPerceptron
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Figure from Andrew Ng’s Machine Learning Lecture Notes, on Coursera, 2013-1
Running classification in Weka - a neural network example
24
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Classify' tab
• Click the 'Choose' button
• Select 'weka - functions - MultilayerPerceptron'
• Click 'MultilayerPerceptron'
• Set parameters for MLP
• Set parameters for Test
• Click 'Start' for learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
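The GUI steps above have a direct Java equivalent. A minimal sketch using the Weka API (10-fold cross-validation stands in for the test options; the file name is a placeholder):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpExample {
    public static void main(String[] args) throws Exception {
        // Corresponds to 'Open file' in the Preprocess tab
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Corresponds to choosing functions - MultilayerPerceptron in the Classify tab
        MultilayerPerceptron mlp = new MultilayerPerceptron();

        // Corresponds to setting the test option and clicking 'Start'
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}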
25
Setting the parameters of classification algorithms
Parameter setting is like car tuning: it takes a lot of experience or trial and error. Depending on the parameter settings, the same algorithm can show anywhere from the worst to the best performance
Main parameters of decision trees (J48, SimpleCart in Weka): parameters that directly affect the size of the tree, such as confidenceFactor, pruning, and minNumObj
Main parameter of Random Forest (RandomForest in Weka): numTrees sets the number of trees that take part in learning and prediction. More trees is generally better, but beware of overfitting.
Note: main parameters of neural networks (MultilayerPerceptron in Weka). Structure: hiddenLayers; learning process: learningRate, momentum, trainingTime (epochs), seed
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
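For reference, the tree parameters named above map to setters in the Weka API. A hedged sketch (the concrete values are arbitrary examples, not recommendations; in Weka 3.8 the Random Forest option was renamed to numIterations):

import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class TreeParameterSketch {
    public static void main(String[] args) throws Exception {
        // J48: confidenceFactor and minNumObj directly influence tree size
        J48 j48 = new J48();
        j48.setConfidenceFactor(0.1f);  // smaller value -> more pruning
        j48.setMinNumObj(5);            // minimum number of instances per leaf
        j48.setUnpruned(false);         // keep pruning enabled

        // RandomForest: numTrees controls how many trees vote (Weka 3.7-style setter)
        RandomForest rf = new RandomForest();
        rf.setNumTrees(30);             // example value; more trees is usually better
    }
}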
Test Options and Classifier Output
26
There are various metrics for evaluation
Setting the data set used for evaluation
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Classifier Output
Run information
Classifier model (full training set)
Evaluation results: general summary, detailed accuracy by class, confusion matrix
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 27
The output depends on the classifier
CLASSIFICATION WITH THE WEKA EXPERIMENTER
Part Ⅲ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 28
Using Experimenter in Weka
Tool for ‘Batch’ experiments
29
Click 'New'
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the 'Run' tab and click 'Start'
• If it has finished successfully, click the 'Analyse' tab and see the summary
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Usages of the Experimenter
Model selection for classification/regression - various approaches: repeated training/test set split, repeated cross-validation (c.f. double cross-validation), averaging
Comparison between models / algorithms: paired t-test, on various metrics (accuracy / RMSE / etc.)
Batch and/or distributed processing: load/save experiment settings, http://weka.wikispaces.com/Remote+Experiment, multi-core support (utilize all the cores on a multi-core machine)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 30
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 31
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 32
Experimenter practice
Example of the experiment results screen (select 1-4 in order in the Analyse tab)
(C) 2014-2015, B.-H Kim 33
• Accuracy: select percent_correct • F1-measure: select F_measure • ROC Area: select Area Under ROC
Note: Package Manager
Various features can be added to the Explorer
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 34
Note: Data Visualization
Part Ⅳ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 35
Visualization
Descriptive analysis: scatter plot & correlation analysis among features
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 36
Visualization
Classification: the learned model
tree, graph boundary
Overall performance evaluation of the model: ROC curve (threshold curve), cost curve
Classifier errors
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 37
Example of trees.J48; example of bayes.BayesNet - TAN
Format for reporting the practice results
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 38
Problem and dataset composition (1 slide)
Classification problem (name of the problem and a brief description)
Dataset composition: size
# instances = ? # attributes = ?
List of attributes (briefly state the name and type of each attribute below)
Training/test set composition (choose one of the two options)
Option 1: training set 66%, test set 34% (random seed = *); Option 2: k-fold cross-validation
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 39
Preprocessing and classification algorithms (1 slide)
Preprocessing (briefly state the name and purpose of each applied preprocessing filter; two or more are allowed)
Classification algorithms: decision trees ID3, J48 (the Java version of C4.5), CART (classification and regression tree), Random Forest
Processes tested and key parameter settings. Example 1: discretization - ID3. Example 2: PCA (5) - J48 / CART. Example 3: Random Forest, # trees = 5, 10, 20, 30
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 40
Learning results (1 slide)
For each decision tree, summarize the following: classification performance
Accuracy = ? Precision = ?, Recall = ? ROC Area = ?
Confusion Matrix
Analysis and discussion of the results (briefly)
(Copy and paste one or two learned decision trees here)
(If you chose random forest, do not capture the learned trees; instead, focus the analysis on how performance changes with #trees)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 41
Batch processing with the Experimenter - setup (1 slide)
Problem: Pima Indians diabetes (capture the final experiment-setup screen of the Experimenter and place it here)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 42
Batch processing with the Experimenter - results (1 slide)
Capture the performance comparison table and the index from the Analyse results screen and place them here
Summarize and analyze the results
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 43
Summary, conclusions, impressions, and questions (1 slide)
Summary of the practice
Feel free to add any other impressions or questions on this slide
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 44
Byoung-Hee Kim
Practice #2
Machine Learning Practice using Weka
Structure of Practice 2
Practice with major classifiers: ensemble methods, ANN, SVM
Getting comfortable with Weka: Explorer, meta tools in weka.classifiers, Experimenter
Exercises
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 46
Practice Outline
Ensemble methods: Boosting + decision stump vs. J48 vs. Random Forest; comparing results under different key boosting parameter settings
Practice with major classifiers: parameter settings. ANN: learning rate, momentum, #iterations. SVM: PolyKernel (exponent), RBF kernel (gamma + C). Run and compare TAN/ANN/SVM/kNN with the Experimenter
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 47
Dataset: Spam Classification
Description: Many email services today provide spam filters that are able to classify emails into spam and non-spam email. You will be training a classifier to classify whether a given email, x, is spam or non-spam
Configuration of the data set: 1899 terms to check for spam; all terms are binary, meaning the term either exists or not (1899 binary attributes); binary class label; 4000 emails in the training set; 1000 emails in the test set
(C) 2014, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 48
Preprocessing and Normalization steps which were applied to the dataset
How to set up the training / test set
Open spamTrain.arff via Preprocess - Open file
(C) 2014-2015, B.-H Kim 49
How to set up the training / test set
Open spamTest.arff via Classify - Test options - Supplied test set
(C) 2014-2015, B.-H Kim 50
You can leave this window open, or press Close to close it
How to set up the training / test set
Now, when you choose a classifier and run learning, the model is trained on the training set and the evaluation results on the test set are printed.
(C) 2014-2015, B.-H Kim 51
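The same train/test setup can be reproduced in code: load both ARFF files, build on the training set, and evaluate on the supplied test set. A minimal sketch (J48 is used only as an example classifier):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSetExample {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("spamTrain.arff").getDataSet();
        Instances test  = new DataSource("spamTest.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Train on the training set only
        J48 classifier = new J48();
        classifier.buildClassifier(train);

        // Evaluate on the supplied test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);
        System.out.println(eval.toSummaryString());
    }
}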
Dataset: handwritten digit recognition (MNIST)
Description: The MNIST database of handwritten digits contains digits written by office workers and students. We will build a recognition model based on classifiers with a reduced set of MNIST. http://yann.lecun.com/exdb/mnist/
Configuration of the data set: for our practice we use a 'subset' of the MNIST set. The full MNIST set contains 60,000 training and 10,000 test samples; 500 samples are randomly sampled for this practice
Attributes: pixel values in gray level in a 20x20 image, 400 attributes (real-valued)
Class attribute: 0-9, representing the digits 0 to 9
52(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Practice - Feature Manipulation: apply dimensionality reduction with Attribute Selection or PCA, then try classification with five or more classification models
Summarize the classification results and select one final model based on three or more evaluation metrics
(C) 2014-2015, B.-H Kim 53
Classification after Attribute Selection: choose meta - AttributeSelectedClassifier as the classifier
Choose the base classifier; choose InfoGainAttributeEval or GainRatioAttributeEval as the evaluator, and 'Ranker' as the search method
In Test options, select 'Supplied test set' and load spamTest.arff
Classification after PCA: use meta - FilteredClassifier as the classifier
Choose the base classifier; apply PrincipalComponents as the filter
In Test options, select 'Supplied test set' and load spamTest.arff
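Both setups can also be assembled in code. A sketch of the two meta-classifiers (the number of attributes to keep and the J48 base classifier are placeholder choices):

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class FeatureManipulationSketch {
    public static void main(String[] args) throws Exception {
        // 1. Attribute Selection (InfoGain + Ranker) wrapped around a base classifier
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100);          // placeholder: keep the top 100 attributes
        asc.setEvaluator(new InfoGainAttributeEval());
        asc.setSearch(ranker);
        asc.setClassifier(new J48());

        // 2. PCA via FilteredClassifier: the filter is fit on the training data only
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new PrincipalComponents());
        fc.setClassifier(new J48());
        // Build either meta-classifier on spamTrain.arff and evaluate on spamTest.arff
    }
}

Wrapping the selection or PCA step inside the meta-classifier means the reduction is learned from the training data only, so no information from the supplied test set leaks into it.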
Practice - applying various classification processes
Summary report of the classification experiment setups: report on five or more process configurations
Apply at least two dimensionality reduction algorithms (e.g., InfoGain, PCA) and at least two different classifiers (e.g., J48, NaiveBayes)
Either describe it in text, e.g., Attribute Selection: InfoGain - Classifier: J48,
or draw a diagram as in the figure on the right
Then make a separate section for each [dimensionality reduction - classifier] combination and report: the dimensionality reduction algorithm and its parameter settings, and the classifier algorithm and its parameter settings
Summary of experimental results: for five or more configurations, tabulate the following classification performance metrics
Accuracy, precision, recall, F-measure, ROC Area. Report on the best configuration: state the [dimensionality reduction - classifier] combination and the parameter settings of each
(C) 2014-2015, B.-H Kim 54
Practice - cautions
Caution 1: If a run is interrupted midway, set a larger Java heap memory for running Weka.
In the 'RunWeka.ini' file in the Weka installation folder, edit the 'maxheap' entry with a text editor such as Notepad: maxheap=1024M (or 2048M)
Caution 2: Because the data is large, most tasks take considerable time to run. Start running Weka early. For classification tasks, using the Experimenter is one option.
(C) 2014-2015, B.-H Kim 55
ENSEMBLE LEARNING WITH WEKA
Practice Ⅰ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 56
Boosting: AdaBoost.M1, RealAdaBoost
Boosting in Weka: AdaBoost.M1 for multiclass problems. Note: RealAdaBoost for binary class problems (can be installed additionally in Weka 3.7.9 and later). Several boosting tools are available in the 'meta' category
Major options for boosting. classifier: select a classifier; usually a 'weak learner' is enough. numIterations: usually, the more iterations, the better. shrinkage: 1.0 means no shrinkage in the step size; usually set to around 0.1.
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 57
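A minimal Java sketch of configuring AdaBoostM1 with a decision stump as the weak learner (the values are examples only):

import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;

public class BoostingSketch {
    public static void main(String[] args) throws Exception {
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new DecisionStump());  // a weak learner is usually enough
        boost.setNumIterations(50);                // more iterations usually help
        // boost can now be evaluated like any other classifier,
        // e.g. with Evaluation.crossValidateModel(...)
    }
}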
PRACTICE WITH REPRESENTATIVE CLASSIFIERS
ANN & SVM
Practice Ⅱ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 58
Running classification in Weka - a neural network example
59
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Classify' tab
• Click the 'Choose' button
• Select 'weka - functions - MultilayerPerceptron'
• Click 'MultilayerPerceptron'
• Set parameters for MLP
• Set parameters for Test
• Click 'Start' for learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Artificial neural network (ANN) practice
Explanation of the main ANN parameters (based on functions - MultilayerPerceptron). learningRate -- the amount the weights are updated. momentum -- momentum applied to the weights during updating. hiddenLayers --
defines the hidden layers of the neural network as a comma-separated list of positive whole numbers, one for each hidden layer.
Ex) 3: one hidden layer with 3 hidden nodes. Ex) 5,3: two hidden layers with 5 and 3 hidden nodes, respectively
To have no hidden layers, put a single 0 here. This will only be used if autobuild is set. There are also wildcard values: 'a' = (attribs + classes) / 2, 'i' = attribs, 'o' = classes, 't' = attribs + classes.
trainingTime -- the number of epochs to train through. If the validation set size is non-zero, training can terminate early
Experiments. Practice goal: set extreme values for the main parameters and evaluate performance to see their effects
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 60
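These parameters map to setters on MultilayerPerceptron. A sketch with the values of Setting 3 from the report template later in these slides:

import weka.classifiers.functions.MultilayerPerceptron;

public class MlpParameterSketch {
    public static void main(String[] args) throws Exception {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setLearningRate(0.3);     // amount the weights are updated
        mlp.setMomentum(0.2);         // momentum applied during weight updates
        mlp.setTrainingTime(50);      // number of epochs
        mlp.setHiddenLayers("10,7");  // two hidden layers with 10 and 7 nodes
    }
}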
SVM practice
Explanation of the main SVM parameters (based on functions - SMO). c -- the complexity parameter C. kernel -- the kernel to use.
PolyKernel -- the polynomial kernel: K(x, y) = <x, y>^p or K(x, y) = (<x, y> + 1)^p,
where "exponent" represents p in the equations. RBFKernel -- K(x, y) = e^(-gamma * ||x - y||^2)
gamma (γ) controls the width (range of neighborhood) of the kernel
Experiments. Practice goal: practice with the representative kernels and their key parameter settings. PolyKernel: test several exponents {1, 2, 5}. RBF kernel: "grid search" on C and γ using cross-validation,
with C = {0.1, 1, 10}, γ = {0.1, 1, 10}
Reference A practical guide to SVM classification (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 61
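The kernels and their key parameters can be configured as follows; the nested loop only illustrates the grid over C and gamma (evaluating each combination with cross-validation is omitted):

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;

public class SvmParameterSketch {
    public static void main(String[] args) throws Exception {
        // PolyKernel with exponent p
        SMO poly = new SMO();
        PolyKernel pk = new PolyKernel();
        pk.setExponent(2.0);           // p in K(x, y) = <x, y>^p
        poly.setKernel(pk);

        // Grid over C and gamma for the RBF kernel
        double[] grid = {0.1, 1.0, 10.0};
        for (double c : grid) {
            for (double gamma : grid) {
                SMO smo = new SMO();
                smo.setC(c);
                RBFKernel rbf = new RBFKernel();
                rbf.setGamma(gamma);
                smo.setKernel(rbf);
                // evaluate this (C, gamma) combination with cross-validation here
            }
        }
    }
}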
Reference material
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 62
Using Experimenter in Weka
Tool for ‘Batch’ experiments
63
Click 'New'
• Results Destination: set the log file that records the prediction results
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the 'Run' tab and click 'Start'
• If it has finished successfully, click the 'Analyse' tab, get the experimental logs, and see the summary of the model comparison
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Note: Package Manager
Various features can be added to the Explorer (Weka 3.7 and later)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 64
Note: more tips on 'meta' tools
If the class label is ordinal, look at meta - OrdinalClassClassifier
Ex) Abalone data: classifying age into 10 levels. Ex) HeartDisease data: classifying heart disease into 5 levels
If only outlier detection, or detection of positives, carries a clear meaning, look at meta - OneClassClassifier
If you want to speed up the search for optimal parameters with cross-validation, look at meta - CVParameterSelection
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 65
Format for reporting the practice results
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 66
Practice 1: Ensemble of Decision Trees
Experiment setup (use the default settings for each algorithm). Dataset: MNIST_handwritten_digit_subset_500.arff. Test option: 5-fold cross-validation. Settings 1-1, 1-2: AdaBoostM1 with Decision Stump / J48 as the classifier. Setting 2: J48. Setting 3: Random Forest
Experimental results (report the Weighted Avg.; fill in the table below)
Analysis and discussion (state at least two points)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 67
Setting   Accuracy   F1 Measure   ROC Area
1-1
1-2
2
3
Practice 2: Artificial neural network (ANN) practice
Experiment setup (for each setting, use the defaults except for the specified parameters). Dataset: MNIST_handwritten_digit_subset_500.arff. Test option: 5-fold cross-validation. Classifier: functions - MultilayerPerceptron. Parameters:
Experimental results (report the Weighted Avg.; fill in the table below)
Analysis and discussion (state at least two points)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 68
Setting   Accuracy   F1 Measure   ROC Area
1
2
3

Setting   hiddenLayers   learningRate   momentum   trainingTime
1         10             0.3            0.2        50
2         10             0.9            0.5        50
3         10, 7          0.3            0.2        50
Practice 3: SVM practice
Experiment setup (for each setting, use the defaults except for the parameters under test). Dataset: MNIST_handwritten_digit_subset_500.arff. Test option: 5-fold cross-validation. Classifier: functions - SMO. Setting 1: PolyKernel, testing several exponents {1, 2, 5}. Setting 2: RBF kernel, "grid search" on C and γ using cross-validation,
with C = {0.1, 1, 10}, γ = {0.1, 1, 10}
Experimental results (report the Weighted Avg.; fill in the tables below)
Analysis and discussion (state at least two points)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 69
Setting 1
Setting   Accuracy   F1 Measure   ROC Area
1-1
1-2
1-3

Setting 2 (F1 Measure)
C \ γ    0.1    1.0    10.0
0.1
1.0
10.0
Seoul National University Biointelligence Laboratory
Byoung-Hee Kim
Machine Learning Practice using Weka
Reference: Overview
Outline
Data Mining, Analytics, & Machine Learning
Classification with Weka
The various interfaces of Weka
Clustering with Weka
Data preprocessing and visualization
Additional information on Weka and related software
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 71
DATA MINING, ANALYTICS, &MACHINE LEARNING
Part I
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 72
What Is Data Mining?
Data mining (knowledge discovery from data): extraction of interesting patterns or knowledge from huge amounts of data. 'Interesting' means: non-trivial, implicit, previously unknown, and potentially useful
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"? Simple search and query processing; (deductive) expert systems
73Slide from Lecture Slide of Ch. 1 by J. Han, et al., for Data Mining: Concepts and Techniques(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Fields related to data mining
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
[Diagram: fields related to data mining - Data Mining, Artificial Intelligence (AI), Machine Learning (ML), Deep Learning, Data Science, Information Retrieval (IR), Knowledge Discovery from Data (KDD), Big Data, Analytics, Business Intelligence]
74
Analytics
Analytics (analysis solutions): the process of discovering and communicating meaningful patterns in data. The term refers to the overall methodology that encompasses data analysis. Terms used in different fields:
Text Analytics, Social Analytics, Business Analytics, Predictive Analytics, Advanced Analytics, Cognitive Analytics
Example: Google Analytics, a tool for tracking and analyzing marketing performance based on web logs. Google Partners (from 2013.12)
Google Analytics individual qualification exam (formerly the Google Analytics Individual Qualification (IQ))
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 75
Demand for advanced technologies based on data mining and machine learning has grown rapidly in recent years
Hype Cycle of Emerging Technologies 2010, Gartner
Analytics as Mainstream Technology
76(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Analytics as Mainstream Technology
77(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Machine Learning & Data Mining
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 78Slide from GECCO 2009 Tutorial on ‘Large Scale Data Mining usingGenetics-Based Machine Learning’, by Jaume Bacardit and Xavier Llorà
Human Learning & Machine Learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 79
[Figure: human learning vs. (passive) supervised learning]
• Gather data (data collection)
• Preprocessing, feature selection
• 'Feature - correct label' combinations
• Knowledge becomes available (knowledge accumulation)
• Complexity reduction (refining the complexity of the knowledge representation)
• 'Experience - consequence' combinations, e.g., cumulus clouds in midsummer - rain showers
Scaffolding: tutoring theory to enhance human learning
R. Elwell and R. Polikar, "Incremental learning of concept drift in nonstationary environments," IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517-31, Oct. 2011.
Data Science & Machine Learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 80
The process of data analysis
Define the question; prepare the data (dataset)
Define the ideal data set; determine what data you can access; obtain the data; clean the data
Exploratory data analysis: clustering / data visualization
Statistical prediction/modeling: classification / prediction
Interpret results: various evaluation metrics and methods (evaluation), selection of the learned model (model selection)
Challenge every step and result (challenge results); synthesize/write up the results; create reproducible code
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 81
J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013
Choosing a dataset according to the purpose of the analysis
Descriptive: a whole population
Exploratory: a random sample with many variables measured
Inferential: the right population, randomly sampled
Predictive: a training and a test data set from the same population
Causal: data from a randomized study
Mechanistic: data about all components of the system
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 82
J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013
Example: an automatic photo-based classifier
What should be classified?
Which measurements should the classification be based on?
Collect and analyze the measurement data
'Train' the automatic classifier - test - release (c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 83
[Figure: example classification tasks - fish type (salmon / sea bass / rockfish), gender, emotion. Photo source: http://jamja.tistory.com/1705]
Solution: machine learning - supervised learning - classification
What to classify: the class label
Which measurements to base the classification on: features (attributes), variables
Collect and organize the measurement data: dataset, data matrix
Train the automatic classifier - test - release: training dataset, test dataset, classification model, evaluation & model selection
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 84
[Figure: data matrix layout - features x1, x2, x3, ... and label y form pairs (X, y); rows are instances, columns are features; the data is split into a training set and a test set]
The role of classification methods
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 85
[Figure: decision boundaries in a two-feature (x1, x2) space for binary classification and multi-class classification]
A. Ng, Machine Learning, Lecture at Coursera, 2013
Terminology
Features (attributes, or variables): features are the individual measurable properties of the phenomena being observed. Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification
Training set / test set. Training set: a set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier. Test set: a set of examples used only to assess the performance [generalization] of a fully specified classifier
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 86
CLASSIFICATION WITH WEKA
Part Ⅱ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 87
Introduction to Weka
Weka: a representative collection of machine learning algorithms and a data mining tool. Main features:
Data pre-processing, feature selection, clustering, visualization, classification, regression, time-series forecasting, association rule learning
Software characteristics: free and open source (GNU General Public License); the basis of major analytics software such as RapidMiner and MOA; implemented in Java, so it runs on many platforms
Download: search Google for "Weka" (first result), or go to http://www.cs.waikato.ac.nz/ml/weka/
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 88Weka (bird): http://www.arkive.org/weka/gallirallus-australis/video-au00.html
Components of Data Mining
• Input: concepts, instances, attributes; preparing the input
• Algorithm: inferring rules, statistical modeling, divide-and-conquer, association, linear models, ...
• Output: knowledge representation - tables, trees, rules, clusters, ...
• Credibility: evaluating what's been learned - training and testing, performance, comparing algorithms
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 89
Weka as a Must-Have Tool
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 90
I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool.
A must for anyone even marginally interested in machine learning and classification techniques.
One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus.
Reviews in Sourceforge.net
The interfaces that make up Weka
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 91
Explorer: run various analysis tasks step by step and inspect the results. Typically the first interface to launch
KnowledgeFlow: compose the main modules of a data-processing pipeline as a visual graph and run experiments
Experimenter: batch processing of classification and regression analyses, with comparative analysis of the results - configure various algorithms and parameter settings - analyze several dataset-algorithm combinations at once - statistically compare the resulting models - run large-scale statistical experiments
Simple CLI: a script console that controls the other interfaces. Every Weka feature can be run from the command line
Other main tools
Structure of the Weka practice
Solving a classification problem right away. Problem: iris species identification. Goal: predictive. Core steps: statistical prediction/modeling, interpretation of results. Tools: Weka Explorer, Experimenter
Preliminary work to understand and solve the classification problem better. Goal: exploratory. Core step: exploratory data analysis. Main tasks:
Data preprocessing, evaluating and selecting attributes by their contribution to classification, data clustering, data visualization
Tool: Weka Explorer (c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 92
Practice: iris classification
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 93
Iris virginica, Iris versicolor, Iris setosa
Features used for classification
Practice: iris classification
Define the features (attributes): sepal length, sepal width, petal length, petal width. Class label: the three iris subspecies setosa, versicolor, and virginica
Collect samples and construct the data: 50 samples collected per subspecies (1935). Data table: 150 samples (instances) x 5 attributes. Sir R. Fisher applied a linear discriminant model to this data in his 1936 paper
Learning: choose a classification algorithm and set its parameters. Practice with three algorithms: neural networks, decision trees, SVM. Use the default parameter settings for each algorithm
Evaluate the learned models and select one: check the various evaluation metrics, then compare and select the learned models (algorithm + parameter setting) based on those metrics
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 94
95
Classification algorithms - Neural Networks
MLP (Multilayer Perceptron): a representative classification algorithm that is used very widely in practice. Where to find it in Weka: classifiers - functions - MultilayerPerceptron
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Figure from Andrew Ng’s Machine Learning Lecture Notes, on Coursera, 2013-1
96
Classification algorithms - Decision Trees
J48 (the Java implementation of C4.5)
The learned model gives the classification rules in the form of a 'tree'. Where to find it in Weka: classifiers - trees - J48
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Classification algorithms - Support Vector Machines
SMO (sequential minimal optimization) for training SVMs
A representative classification algorithm based on kernel machines. Where to find it in Weka: classifiers - functions - SMO
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 97
Practice: the iris dataset
Just open "iris.arff" in the 'data' folder
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 98
Weka data format (.ARFF)
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth real
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…
Data (CSV format)
Header
99
Note: after creating a CSV file with Excel, you can produce an ARFF file easily just by adding the header
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Dataset name Attribute name Attribute type
ARFF Example
100
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
Slide from Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Exploratory data analysis in the Preprocess tab
Data composition (current relation); removing attributes; basic statistics per attribute (selected attribute); visualizing the class label distribution over all attributes (Visualize All); preprocessing with 'Filter': explained in Part V
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 101
Running classification in Weka - a neural network example
102
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Classify' tab
• Click the 'Choose' button
• Select 'weka - functions - MultilayerPerceptron'
• Click 'MultilayerPerceptron'
• Set parameters for MLP
• Set parameters for Test
• Click 'Start' for learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
103
Setting the parameters of classification algorithms
Parameter setting is like car tuning: it takes a lot of experience or trial and error. Depending on the parameter settings, the same algorithm can show anywhere from the worst to the best performance
Main parameters of neural networks (MultilayerPerceptron in Weka). Structure: hiddenLayers; learning process: learningRate, momentum, trainingTime (epochs), seed
Main parameters of decision trees (J48 in Weka): unpruned, numFolds, minNumObj. Parameters with a direct effect on the size of the tree: confidenceFactor, pruning, etc.
Main parameters of the Support Vector Machine (SVM) (SMO in Weka). Kernel-related: choice of kernel and additional per-kernel parameters; optimization-related: c (complexity parameter)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Test Options and Classifier Output
104
There are various metrics for evaluation
Setting the data set used for evaluation
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
105
Evaluation Method - Cross Validation
K-fold Cross-Validation: the data set is randomly divided into k subsets. One of the k subsets is used as the 'test set' and the other k-1 subsets are put together to form a 'training set'.
[Figure: 6 folds D1-D6 of 30 instances each; in each round one fold (D6, then D5, ..., then D1) is held out as the test set and the remaining five folds form the training set]
Error = (1/k) Σ_{i=1..k} Error_i
Example: 6-fold cross-validation: split 180 data points into 6 equal parts, then run learning/evaluation 6 times and measure the average performance
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
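The fold mechanics above can be reproduced directly with the Weka API, where Instances.trainCV and Instances.testCV return the training and test portions of each fold. A minimal sketch (J48 is only an example classifier):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int k = 6;
        double errorSum = 0.0;
        for (int i = 0; i < k; i++) {
            Instances train = data.trainCV(k, i);  // k-1 folds for training
            Instances test = data.testCV(k, i);    // the remaining fold for testing

            J48 classifier = new J48();
            classifier.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(classifier, test);
            errorSum += eval.errorRate();          // Error_i for this fold
        }
        // Error = (1/k) * sum of the per-fold errors
        System.out.println("Average error over " + k + " folds: " + errorSum / k);
    }
}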
Classifier Output
Run information
Classifier model (full training set)
Evaluation results: general summary, detailed accuracy by class, confusion matrix
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 106
The output depends on the classifier
107
How to Evaluate the Performance? (1/2)
Usually, build a ‘Confusion Matrix’ on the test data set
Evaluation metrics: accuracy (percent correct), precision / recall, various other metrics: F-measure, Kappa score, etc.
For fair evaluation, the 'cross-validation' scheme is used
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
108
How to Evaluate the Performance? (2/2)
Confusion Matrix (binary class case)
                         Real
Prediction    Positive    Negative
Positive      TP          FP           All with a positive test
Negative      FN          TN           All with a negative test
              All with    All without  Everyone
              disease     disease

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
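As a worked example of these formulas, the small sketch below computes the three metrics from hypothetical confusion-matrix counts (the counts are made up for illustration):

public class ConfusionMatrixMetrics {
    public static void main(String[] args) {
        // Hypothetical confusion-matrix counts for a binary problem
        double tp = 40, fp = 10, fn = 5, tn = 45;

        double accuracy  = (tp + tn) / (tp + fp + tn + fn);  // 85 / 100 = 0.85
        double recall    = tp / (tp + fn);                   // 40 / 45  ≈ 0.889
        double precision = tp / (tp + fp);                   // 40 / 50  = 0.80

        System.out.println("Accuracy  = " + accuracy);
        System.out.println("Recall    = " + recall);
        System.out.println("Precision = " + precision);
    }
}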
THE VARIOUS INTERFACES OF WEKA
Part Ⅲ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 109
Using Experimenter in Weka
Tool for ‘Batch’ experiments
110
Click 'New'
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the 'Run' tab and click 'Start'
• If it has finished successfully, click the 'Analyse' tab and see the summary
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Usages of the Experimenter
Model selection for classification/regression - various approaches: repeated training/test set split, repeated cross-validation (c.f. double cross-validation), averaging
Comparison between models / algorithms: paired t-test, on various metrics (accuracy / RMSE / etc.)
Batch and/or distributed processing: load/save experiment settings, http://weka.wikispaces.com/Remote+Experiment, multi-core support (utilize all the cores on a multi-core machine)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 111
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 112
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 113
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 114
KnowledgeFlow for Analysis Process Design
115
(‘Process Flow Diagram’ of SAS® Enterprise Miner )
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
KNIME
RapidMiner
KnowledgeFlow: Example Usage
Decision tree (J48)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 116
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 117
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 118
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 119
Command Line Interface (CLI)
Example command and result:
java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-7\data\iris.arff"
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 120
You may easily build command line scripts for various batch experiments
Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
Package Manager
Various features can be added to the Explorer
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 121
CLUSTERING WITH WEKA
Part Ⅳ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 122
Motivating Questions for Clustering
What are the natural groupings in a set of data?
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 123
School Employees / Simpson's Family / Males / Females
The way of grouping is not unique
Issues in Clustering
How should one measure the similarity between samples? (similarity measure)
How many clusters would there be? (number of clusters)
How should one evaluate a partitioning of a set of samples into clusters? (criterion function) E.g., high intra-class similarity and low inter-class similarity
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 124
Purposes of Clustering
Quick review of data
Clustering before classification: checking whether the given instances form well-separated clusters
Discovery: possible number of class labels, subclasses, groups of features, modules (feature / instance combinations)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 125
126
k-means Clustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
127
Hierarchical Clustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Resulting dendrograms with the original data matrix
Self-organizing Map (SOM)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 128
World poverty map(http://www.cis.hut.fi/research/som-research/worldmap.html)
K-means in Weka
129
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Cluster' tab
• Click the 'Choose' button
• Select 'weka - clusterers - SimpleKMeans'
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
• Click 'SimpleKMeans'
• Set distanceFunction
• Set other parameters
• Check 'Classes to cluster ~'
• Click Start
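The same clustering run can be done in code. A minimal sketch (the class attribute is removed before clustering so the run is unsupervised; k = 3 is a placeholder for the iris data):

import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        // Remove the class attribute (the last one) so clustering ignores the labels
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);                         // e.g. the three iris subspecies
        kmeans.setDistanceFunction(new EuclideanDistance());
        kmeans.buildClusterer(noClass);

        System.out.println(kmeans);                       // cluster centroids and sizes
    }
}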
130
Evaluation of Clustering Results
Contingency Table Used when we have labels for instances Matching resulting clusters with labels
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Categories of Clustering Algorithms
Hierarchical clustering: agglomerative / divisive, concept clustering
Partitional clustering: K-means clustering, fuzzy c-means clustering, graph-theoretic clustering methods, spectral clustering
Subspace clustering: co-clustering, biclustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 131
DATA PREPROCESSING & VISUALIZATION
Part Ⅴ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 132
Data Preprocessing with Filter in Weka
Attribute: selection, discretize
Instance: re-sampling, selecting specified folds
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 133
Visualization
Descriptive analysis: scatter plot & correlation analysis among features
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 134
Visualization
Unsupervised learning results: dimension reduction, clustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 135
Dendrogram(hierarchical clustering)
cluster assignments
Visualization
Classification: the learned model
tree, graph boundary
Overall performance evaluation of the model: ROC curve (threshold curve), cost curve
Classifier errors
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 136
Example of trees.J48; example of bayes.BayesNet - TAN
ADDITIONAL INFORMATION ON WEKA AND RELATED SOFTWARE
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 137
More Information on Weka
Current version (April, 2016) Stable version: 3.8.0 Developer version: 3.9.0
Collections of datasets in Weka (ARFF) format http://www.cs.waikato.ac.nz/ml/weka/datasets.html Datasets from UCI repository Datasets from UCI KDD repository …
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 138
Weka References
Weka Wiki: http://weka.wikispaces.com/ Primer: good starting point
Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Textbook Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining:
Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
Articles: Data mining with WEKA, Part 1, Part 2, Part 3 in the IBM Technical Library; 'Building prediction programs with Weka', a series in the Korean monthly magazine Maso (July, August, September 2009 issues); blog; MS Live
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 139
Other ML Open Source S/W’s
RapidMiner: the most used analytics software in 2013 & 2014 (KDnuggets poll). Its basic interface for analyses is in the style of KnowledgeFlow in Weka. Provides an integrated environment for ML, DM, and predictive analytics. http://rapidminer.com/
MOA (Massive Online Analysis) Closely related project to the WEKA project Open source framework for data stream mining http://moa.cms.waikato.ac.nz/
KNIME Konstanz Information Miner modular data pipelining concept http://www.knime.org/
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 140
ELKI: Environment for DeveLoping KDD-Applications Supported by Index-Structures
KDD software framework + database. http://elki.dbs.ifi.lmu.de/
Other ML Open Source S/W’s
Mahout: http://mahout.apache.org/ - an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms; classification, clustering, collaborative filtering, and frequent itemset mining. Book: Mahout in Action
MLOSS: http://mloss.org/ - a forum for open source software in machine learning; http://jmlr.org/mloss/ - JMLR Machine Learning Open Source Software (MLOSS)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 141
ANY QUESTIONS?
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 142