Seoul National University Biointelligence Laboratory
Byoung-Hee Kim
Machine Learning Practice using Weka
2016-06-27
Outline
Practice 1: Getting familiar with Weka
Practice 2: Machine learning with Weka
Reference material: Data Mining & Machine Learning using Weka
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 2
Byoung-Hee Kim
Practice #1
Machine Learning Practice using Weka
Outline
Preprocessing and basic analysis with the Weka Explorer: Filter, Visualize. Dataset: diabetes
Classification with the Weka Explorer. Datasets: Iris, diabetes. Classifiers: decision trees (ID3, J48, SimpleCart), Random Forest
Batch analysis with the Weka Experimenter. Dataset: diabetes. Model performance comparison using the t-test
Guide to the deliverables to submit for the practice session
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 4
Structure of Practice 1
Solving classification problems right away. Problems: iris species identification, diabetes diagnosis. Goal: predictive. Core steps: statistical prediction/modeling, interpretation of results. Tools: Weka Explorer, Experimenter
Preliminary work to understand and solve the classification problem better. Goal: exploratory. Core step: exploratory data analysis. Main tasks:
Data preprocessing (evaluating and selecting attributes by their contribution to classification), data visualization
Tool: Weka Explorer
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 5
Introduction to Weka
Weka: a representative collection of machine learning algorithms and a data mining tool. Main features:
Data pre-processing, feature selection, clustering, visualization, classification, regression, time-series forecasting, association rule learning
Software characteristics: free and open source (GNU General Public License); the basis of major analytics software such as RapidMiner and MOA; implemented in Java, so it runs on many platforms; can be used together with Python, C, R, Matlab, and other major tools
Download: search Google for "Weka" (first result), or go to http://www.cs.waikato.ac.nz/ml/weka/
(c)2008-2016, SNU Biointelligence Lab. 6Weka (bird): http://www.arkive.org/weka/gallirallus-australis/video-au00.html
Top 20 Most Popular Tools for Big Data, Data Mining, and Data Science
(c)2008-2016, SNU Biointelligence Lab. 7Source: http://www.kdnuggets.com/2015/06/data-mining-data-science-tools-associations.html
Red: Free/Open Source tools; Green: Commercial tools; Fuchsia: Hadoop/Big Data tools
The interfaces that make up Weka
(c)2008-2016, SNU Biointelligence Lab. 8
Explorer: run various analysis tasks step by step and inspect the results. Typically the first interface to launch
KnowledgeFlow: compose the main modules of a data-processing pipeline as a visual graph and run experiments
Experimenter: batch processing of classification and regression analyses, with comparative analysis of the results - configure various algorithms and parameter settings - analyze several dataset-algorithm combinations at once - statistically compare the resulting models - run large-scale statistical experiments
Simple CLI: a script console that controls the other interfaces. Every Weka feature can be run from the command line
Other main tools
Workbench: a unified interface that integrates the other interfaces (introduced in 3.8.0)
Example: iris classification
(c)2008-2016, SNU Biointelligence Lab. 9
Iris virginica, Iris versicolor, Iris setosa
Features used for classification
Example: iris classification
Define the features (attributes): sepal length, sepal width, petal length, petal width. Class label: the three iris subspecies setosa, versicolor, and virginica
Collect samples and construct the data: 50 samples collected per subspecies (1935). Data table: 150 samples (instances) x 5 attributes. Sir R. Fisher applied a linear discriminant model to this data in his 1936 paper
Learning: choose a classification algorithm and set its parameters. Practice with three algorithms: neural networks, decision trees, SVM. Use the default parameter settings for each algorithm
Evaluate the learned models and select one: check the various evaluation metrics, then compare and select the learned models (algorithm + parameter setting) based on those metrics
(c)2008-2016, SNU Biointelligence Lab. 10
Practice: the iris dataset
Just open "iris.arff" in the 'data' folder
(c)2008-2016, SNU Biointelligence Lab. 11
Weka data format (.ARFF)
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth real
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…
Data (CSV format)
Header
Note: after creating a CSV file with Excel, you can produce an ARFF file easily just by adding the header
(c)2008-2016, SNU Biointelligence Lab.
Dataset name Attribute name Attribute type
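The same CSV-to-ARFF conversion can be done with the Weka API. A minimal Java sketch (the file names are placeholders; Weka generates the @relation/@attribute header from the data automatically):

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;

public class CsvToArff {
    public static void main(String[] args) throws Exception {
        // Load a CSV file created e.g. with Excel (placeholder path)
        CSVLoader loader = new CSVLoader();
        loader.setSource(new File("iris.csv"));
        Instances data = loader.getDataSet();

        // Save it as ARFF; the header section is generated from the data
        ArffSaver saver = new ArffSaver();
        saver.setInstances(data);
        saver.setFile(new File("iris.arff"));
        saver.writeBatch();
    }
}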
ARFF Example
13
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
Slide from Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)(c)2008-2016, SNU Biointelligence Lab.
Exploratory data analysis in the Preprocess tab
Data composition (current relation); removing attributes; basic statistics per attribute (selected attribute); visualizing the class label distribution over all attributes (Visualize All); preprocessing with 'Filter'
(c)2008-2016, SNU Biointelligence Lab. 14
Example of the process of training and evaluating a predictive model
[Pipeline diagram] Data matrix (rows: instances, columns: features) → Dataset preparation/cleaning (fill missing values) → Feature manipulation (normalization, standardization, feature selection, PCA) → Classification/Regression (SVM, decision tree, neural networks) → Evaluation (accuracy, cross-validation, ROC curve, AUC: Area Under ROC Curve)
Compose and test various processes to solve the problem
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 15
Practice: Pima Indian diabetes diagnosis
Description: Pima Indians have the highest prevalence of diabetes in the world. We will build classification models that diagnose whether a patient shows signs of diabetes. http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
Configuration of the data set: 768 instances, 8 attributes (age, number of times pregnant, results of medical tests/analysis), all numeric (integer or real-valued)
Class label = 1 (positive example), interpreted as "tested positive for diabetes": 268 instances
Class label = 0 (negative example): 500 instances
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 16
DATA PREPROCESSING AND DESCRIPTIVE ANALYSIS WITH WEKA
Part Ⅰ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 17
Descriptive analysis in the Preprocess tab
Data composition (current relation); removing attributes; basic statistics per attribute (selected attribute); visualizing the class label distribution over all attributes (Visualize All); preprocessing with 'Filter': explained in Part V
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 18
Data preprocessing with Weka - Explorer - Preprocess - Filter
Main preprocessing filters to practice: Fill in missing values
using weka.filters.unsupervised.attribute.ReplaceMissingValues. Standardization for all the attributes: transform x into z
using weka.filters.unsupervised.attribute.Standardize. Data reduction using PCA
using weka.filters.unsupervised.attribute.PrincipalComponents; set the maximumAttributes parameter to an arbitrary number between 10 and 50
Check the effect of PCA using 'Visualize - Plot Matrix' (next slide)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 19
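The same three filters can also be applied from Java code. A minimal sketch (the file name and the value 20 are placeholders, and the setMaximumAttributes setter is assumed to correspond to the maximumAttributes option shown in the GUI):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.PrincipalComponents;
import weka.filters.unsupervised.attribute.ReplaceMissingValues;
import weka.filters.unsupervised.attribute.Standardize;

public class PreprocessSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("diabetes.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 1. Fill in missing values with the attribute means/modes
        ReplaceMissingValues rmv = new ReplaceMissingValues();
        rmv.setInputFormat(data);
        Instances filled = Filter.useFilter(data, rmv);

        // 2. Standardize all numeric attributes to zero mean and unit variance
        Standardize std = new Standardize();
        std.setInputFormat(filled);
        Instances standardized = Filter.useFilter(filled, std);

        // 3. Data reduction with PCA (setter name assumed; the value is a placeholder)
        PrincipalComponents pca = new PrincipalComponents();
        pca.setMaximumAttributes(20);
        pca.setInputFormat(standardized);
        Instances reduced = Filter.useFilter(standardized, pca);

        System.out.println("Attributes after PCA: " + reduced.numAttributes());
    }
}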
Descriptive analysis with Weka - Explorer - Visualize
Check the effect of PCA using 'Visualize - Plot Matrix': compare the two plot matrices, before and after applying PCA, to confirm the effect of PCA
Capture the Plot Matrix screen before PCA (401-dimensional data) and after PCA (data reduced to the number of principal components), then compare and interpret them
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 20
CLASSIFICATION WITH THE WEKA EXPLORER
Part Ⅱ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 21
22
Classification algorithms - Decision Trees
J48 (the Java implementation of C4.5)
The learned model gives the classification rules in the form of a 'tree'. Where to find it in Weka: classifiers - trees - J48
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
23
Comparison: Neural Networks
MLP (Multilayer Perceptron): a representative classification algorithm that is used very widely in practice. Where to find it in Weka: classifiers - functions - MultilayerPerceptron
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Figure from Andrew Ng’s Machine Learning Lecture Notes, on Coursera, 2013-1
Running classification in Weka - a neural network example
24
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Classify' tab
• Click the 'Choose' button
• Select 'weka - functions - MultilayerPerceptron'
• Click 'MultilayerPerceptron'
• Set parameters for MLP
• Set parameters for Test
• Click 'Start' for learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
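The GUI steps above have a direct Java equivalent. A minimal sketch using the Weka API (10-fold cross-validation stands in for the test options; the file name is a placeholder):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpExample {
    public static void main(String[] args) throws Exception {
        // Corresponds to 'Open file' in the Preprocess tab
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // Corresponds to choosing functions - MultilayerPerceptron in the Classify tab
        MultilayerPerceptron mlp = new MultilayerPerceptron();

        // Corresponds to setting the test option and clicking 'Start'
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toClassDetailsString());
        System.out.println(eval.toMatrixString());
    }
}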
25
Setting the parameters of classification algorithms
Parameter setting is like car tuning: it takes a lot of experience or trial and error. Depending on the parameter settings, the same algorithm can show anywhere from the worst to the best performance
Main parameters of decision trees (J48, SimpleCart in Weka): parameters that directly affect the size of the tree, such as confidenceFactor, pruning, and minNumObj
Main parameter of Random Forest (RandomForest in Weka): numTrees sets the number of trees that take part in learning and prediction. More trees is generally better, but beware of overfitting.
Note: main parameters of neural networks (MultilayerPerceptron in Weka). Structure: hiddenLayers; learning process: learningRate, momentum, trainingTime (epochs), seed
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
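For reference, the tree parameters named above map to setters in the Weka API. A hedged sketch (the concrete values are arbitrary examples, not recommendations; in Weka 3.8 the Random Forest option was renamed to numIterations):

import weka.classifiers.trees.J48;
import weka.classifiers.trees.RandomForest;

public class TreeParameterSketch {
    public static void main(String[] args) throws Exception {
        // J48: confidenceFactor and minNumObj directly influence tree size
        J48 j48 = new J48();
        j48.setConfidenceFactor(0.1f);  // smaller value -> more pruning
        j48.setMinNumObj(5);            // minimum number of instances per leaf
        j48.setUnpruned(false);         // keep pruning enabled

        // RandomForest: numTrees controls how many trees vote (Weka 3.7-style setter)
        RandomForest rf = new RandomForest();
        rf.setNumTrees(30);             // example value; more trees is usually better
    }
}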
Test Options and Classifier Output
26
There are various metrics for evaluation
Setting the data set used for evaluation
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Classifier Output
Run information
Classifier model (full training set)
Evaluation results: general summary, detailed accuracy by class, confusion matrix
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 27
The output depends on the classifier
CLASSIFICATION WITH THE WEKA EXPERIMENTER
Part Ⅲ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 28
Using Experimenter in Weka
Tool for ‘Batch’ experiments
29
Click 'New'
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the 'Run' tab and click 'Start'
• If it has finished successfully, click the 'Analyse' tab and see the summary
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Usages of the Experimenter
Model selection for classification/regression - various approaches: repeated training/test set split, repeated cross-validation (c.f. double cross-validation), averaging
Comparison between models / algorithms: paired t-test, on various metrics (accuracy / RMSE / etc.)
Batch and/or distributed processing: load/save experiment settings, http://weka.wikispaces.com/Remote+Experiment, multi-core support (utilize all the cores on a multi-core machine)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 30
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 31
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 32
Experimenter practice
Example of the experiment results screen (select 1-4 in order in the Analyse tab)
(C) 2014-2015, B.-H Kim 33
• Accuracy: select percent_correct • F1-measure: select F_measure • ROC Area: select Area Under ROC
Note: Package Manager
Various features can be added to the Explorer
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 34
Note: Data Visualization
Part Ⅳ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 35
Visualization
Descriptive analysis: scatter plot & correlation analysis among features
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 36
Visualization
Classification: the learned model
tree, graph boundary
Overall performance evaluation of the model: ROC curve (threshold curve), cost curve
Classifier errors
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 37
Example of trees.J48; example of bayes.BayesNet - TAN
Format for reporting the practice results
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 38
Problem and dataset composition (1 slide)
Classification problem (name of the problem and a brief description)
Dataset composition: size
# instances = ? # attributes = ?
List of attributes (briefly state the name and type of each attribute below)
Training/test set composition (choose one of the two options)
Option 1: training set 66%, test set 34% (random seed = *); Option 2: k-fold cross-validation
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 39
Preprocessing and classification algorithms (1 slide)
Preprocessing (briefly state the name and purpose of each applied preprocessing filter; two or more are allowed)
Classification algorithms: decision trees ID3, J48 (the Java version of C4.5), CART (classification and regression tree), Random Forest
Processes tested and key parameter settings. Example 1: discretization - ID3. Example 2: PCA (5) - J48 / CART. Example 3: Random Forest, # trees = 5, 10, 20, 30
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 40
Learning results (1 slide)
For each decision tree, summarize the following: classification performance
Accuracy = ? Precision = ?, Recall = ? ROC Area = ?
Confusion Matrix
Analysis and discussion of the results (briefly)
(Copy and paste one or two learned decision trees here)
(If you chose random forest, do not capture the learned trees; instead, focus the analysis on how performance changes with #trees)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 41
Batch processing with the Experimenter - setup (1 slide)
Problem: Pima Indians diabetes (capture the final experiment-setup screen of the Experimenter and place it here)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 42
Batch processing with the Experimenter - results (1 slide)
Capture the performance comparison table and the index from the Analyse results screen and place them here
Summarize and analyze the results
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 43
Summary, conclusions, impressions, and questions (1 slide)
Summary of the practice
Feel free to add any other impressions or questions on this slide
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 44
Byoung-Hee Kim
Practice #2
Machine Learning Practice using Weka
Structure of Practice 2
Practice with major classifiers: ensemble methods, ANN, SVM
Getting comfortable with Weka: Explorer, meta tools in weka.classifiers, Experimenter
Exercises
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 46
Practice Outline
Ensemble methods: Boosting + decision stump vs. J48 vs. Random Forest; comparing results under different key boosting parameter settings
Practice with major classifiers: parameter settings. ANN: learning rate, momentum, #iterations. SVM: PolyKernel (exponent), RBF kernel (gamma + C). Run and compare TAN/ANN/SVM/kNN with the Experimenter
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 47
Dataset: Spam Classification
Description: Many email services today provide spam filters that are able to classify emails into spam and non-spam email. You will be training a classifier to classify whether a given email, x, is spam or non-spam
Configuration of the data set: 1899 terms to check for spam; all terms are binary, meaning the term either exists or not (1899 binary attributes); binary class label; 4000 emails in the training set; 1000 emails in the test set
(C) 2014, SNU Biointelligence Lab, http://bi.snu.ac.kr/ 48
Preprocessing and Normalization steps which were applied to the dataset
How to set up the training / test set
Open spamTrain.arff via Preprocess - Open file
(C) 2014-2015, B.-H Kim 49
How to set up the training / test set
Open spamTest.arff via Classify - Test options - Supplied test set
(C) 2014-2015, B.-H Kim 50
You can leave this window open, or press Close to close it
How to set up the training / test set
Now, when you choose a classifier and run learning, the model is trained on the training set and the evaluation results on the test set are printed.
(C) 2014-2015, B.-H Kim 51
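The same train/test setup can be reproduced in code: load both ARFF files, build on the training set, and evaluate on the supplied test set. A minimal sketch (J48 is used only as an example classifier):

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SuppliedTestSetExample {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("spamTrain.arff").getDataSet();
        Instances test  = new DataSource("spamTest.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Train on the training set only
        J48 classifier = new J48();
        classifier.buildClassifier(train);

        // Evaluate on the supplied test set
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(classifier, test);
        System.out.println(eval.toSummaryString());
    }
}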
Dataset: handwritten digit recognition (MNIST)
Description: The MNIST database of handwritten digits contains digits written by office workers and students. We will build a recognition model based on classifiers with a reduced set of MNIST. http://yann.lecun.com/exdb/mnist/
Configuration of the data set: for our practice we use a 'subset' of the MNIST set. The full MNIST set contains 60,000 training and 10,000 test samples; 500 samples are randomly sampled for this practice
Attributes: pixel values in gray level in a 20x20 image, 400 attributes (real-valued)
Class attribute: 0-9, representing the digits 0 to 9
52(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Practice - Feature Manipulation: apply dimensionality reduction with Attribute Selection or PCA, then try classification with five or more classification models
Summarize the classification results and select one final model based on three or more evaluation metrics
(C) 2014-2015, B.-H Kim 53
Classification after Attribute Selection: choose meta - AttributeSelectedClassifier as the classifier
Choose the base classifier; choose InfoGainAttributeEval or GainRatioAttributeEval as the evaluator, and 'Ranker' as the search method
In Test options, select 'Supplied test set' and load spamTest.arff
Classification after PCA: use meta - FilteredClassifier as the classifier
Choose the base classifier; apply PrincipalComponents as the filter
In Test options, select 'Supplied test set' and load spamTest.arff
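Both setups can also be assembled in code. A sketch of the two meta-classifiers (the number of attributes to keep and the J48 base classifier are placeholder choices):

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.classifiers.meta.AttributeSelectedClassifier;
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.PrincipalComponents;

public class FeatureManipulationSketch {
    public static void main(String[] args) throws Exception {
        // 1. Attribute Selection (InfoGain + Ranker) wrapped around a base classifier
        AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100);          // placeholder: keep the top 100 attributes
        asc.setEvaluator(new InfoGainAttributeEval());
        asc.setSearch(ranker);
        asc.setClassifier(new J48());

        // 2. PCA via FilteredClassifier: the filter is fit on the training data only
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new PrincipalComponents());
        fc.setClassifier(new J48());
        // Build either meta-classifier on spamTrain.arff and evaluate on spamTest.arff
    }
}

Wrapping the selection or PCA step inside the meta-classifier means the reduction is learned from the training data only, so no information from the supplied test set leaks into it.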
Practice - applying various classification processes
Summary report of the classification experiment setups: report on five or more process configurations
Apply at least two dimensionality reduction algorithms (e.g., InfoGain, PCA) and at least two different classifiers (e.g., J48, NaiveBayes)
Either describe it in text, e.g., Attribute Selection: InfoGain - Classifier: J48,
or draw a diagram as in the figure on the right
Then make a separate section for each [dimensionality reduction - classifier] combination and report: the dimensionality reduction algorithm and its parameter settings, and the classifier algorithm and its parameter settings
Summary of experimental results: for five or more configurations, tabulate the following classification performance metrics
Accuracy, precision, recall, F-measure, ROC Area. Report on the best configuration: state the [dimensionality reduction - classifier] combination and the parameter settings of each
(C) 2014-2015, B.-H Kim 54
Practice - cautions
Caution 1: If a run is interrupted midway, set a larger Java heap memory for running Weka.
In the 'RunWeka.ini' file in the Weka installation folder, edit the 'maxheap' entry with a text editor such as Notepad: maxheap=1024M (or 2048M)
Caution 2: Because the data is large, most tasks take considerable time to run. Start running Weka early. For classification tasks, using the Experimenter is one option.
(C) 2014-2015, B.-H Kim 55
ENSEMBLE LEARNING WITH WEKA
Practice Ⅰ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 56
Boosting: AdaBoost.M1, RealAdaBoost
Boosting in Weka: AdaBoost.M1 for multiclass problems. Note: RealAdaBoost for binary class problems (can be installed additionally in Weka 3.7.9 and later). Several boosting tools are available in the 'meta' category
Major options for boosting. classifier: select a classifier; usually a 'weak learner' is enough. numIterations: usually, the more iterations, the better. shrinkage: 1.0 means no shrinkage in the step size; usually set to around 0.1.
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 57
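A minimal Java sketch of configuring AdaBoostM1 with a decision stump as the weak learner (the values are examples only):

import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.DecisionStump;

public class BoostingSketch {
    public static void main(String[] args) throws Exception {
        AdaBoostM1 boost = new AdaBoostM1();
        boost.setClassifier(new DecisionStump());  // a weak learner is usually enough
        boost.setNumIterations(50);                // more iterations usually help
        // boost can now be evaluated like any other classifier,
        // e.g. with Evaluation.crossValidateModel(...)
    }
}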
PRACTICE WITH REPRESENTATIVE CLASSIFIERS
ANN & SVM
Practice Ⅱ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 58
Running classification in Weka - a neural network example
59
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Classify' tab
• Click the 'Choose' button
• Select 'weka - functions - MultilayerPerceptron'
• Click 'MultilayerPerceptron'
• Set parameters for MLP
• Set parameters for Test
• Click 'Start' for learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Artificial neural network (ANN) practice
Explanation of the main ANN parameters (based on functions - MultilayerPerceptron). learningRate -- the amount the weights are updated. momentum -- momentum applied to the weights during updating. hiddenLayers --
defines the hidden layers of the neural network as a comma-separated list of positive whole numbers, one for each hidden layer.
Ex) 3: one hidden layer with 3 hidden nodes. Ex) 5,3: two hidden layers with 5 and 3 hidden nodes, respectively
To have no hidden layers, put a single 0 here. This will only be used if autobuild is set. There are also wildcard values: 'a' = (attribs + classes) / 2, 'i' = attribs, 'o' = classes, 't' = attribs + classes.
trainingTime -- the number of epochs to train through. If the validation set size is non-zero, training can terminate early
Experiments. Practice goal: set extreme values for the main parameters and evaluate performance to see their effects
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 60
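These parameters map to setters on MultilayerPerceptron. A sketch with the values of Setting 3 from the report template later in these slides:

import weka.classifiers.functions.MultilayerPerceptron;

public class MlpParameterSketch {
    public static void main(String[] args) throws Exception {
        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setLearningRate(0.3);     // amount the weights are updated
        mlp.setMomentum(0.2);         // momentum applied during weight updates
        mlp.setTrainingTime(50);      // number of epochs
        mlp.setHiddenLayers("10,7");  // two hidden layers with 10 and 7 nodes
    }
}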
SVM practice
Explanation of the main SVM parameters (based on functions - SMO). c -- the complexity parameter C. kernel -- the kernel to use.
PolyKernel -- the polynomial kernel: K(x, y) = <x, y>^p or K(x, y) = (<x, y> + 1)^p,
where "exponent" represents p in the equations. RBFKernel -- K(x, y) = e^(-gamma * ||x - y||^2)
gamma (γ) controls the width (range of neighborhood) of the kernel
Experiments. Practice goal: practice with the representative kernels and their key parameter settings. PolyKernel: test several exponents {1, 2, 5}. RBF kernel: "grid search" on C and γ using cross-validation,
with C = {0.1, 1, 10}, γ = {0.1, 1, 10}
Reference A practical guide to SVM classification (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 61
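The kernels and their key parameters can be configured as follows; the nested loop only illustrates the grid over C and gamma (evaluating each combination with cross-validation is omitted):

import weka.classifiers.functions.SMO;
import weka.classifiers.functions.supportVector.PolyKernel;
import weka.classifiers.functions.supportVector.RBFKernel;

public class SvmParameterSketch {
    public static void main(String[] args) throws Exception {
        // PolyKernel with exponent p
        SMO poly = new SMO();
        PolyKernel pk = new PolyKernel();
        pk.setExponent(2.0);           // p in K(x, y) = <x, y>^p
        poly.setKernel(pk);

        // Grid over C and gamma for the RBF kernel
        double[] grid = {0.1, 1.0, 10.0};
        for (double c : grid) {
            for (double gamma : grid) {
                SMO smo = new SMO();
                smo.setC(c);
                RBFKernel rbf = new RBFKernel();
                rbf.setGamma(gamma);
                smo.setKernel(rbf);
                // evaluate this (C, gamma) combination with cross-validation here
            }
        }
    }
}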
Reference material
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 62
Using Experimenter in Weka
Tool for ‘Batch’ experiments
63
Click 'New'
• Results Destination: set the log file that records the prediction results
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the 'Run' tab and click 'Start'
• If it has finished successfully, click the 'Analyse' tab, get the experimental logs, and see the summary of the model comparison
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Note: Package Manager
Various features can be added to the Explorer (Weka 3.7 and later)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 64
Note: more tips on 'meta' tools
If the class label is ordinal, look at meta - OrdinalClassClassifier
Ex) Abalone data: classifying age into 10 levels. Ex) HeartDisease data: classifying heart disease into 5 levels
If only outlier detection, or detection of positives, carries a clear meaning, look at meta - OneClassClassifier
If you want to speed up the search for optimal parameters with cross-validation, look at meta - CVParameterSelection
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 65
Format for reporting the practice results
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 66
Practice 1: Ensemble of Decision Trees
Experiment setup (use the default settings for each algorithm). Dataset: MNIST_handwritten_digit_subset_500.arff. Test option: 5-fold cross-validation. Settings 1-1, 1-2: AdaBoostM1 with Decision Stump / J48 as the classifier. Setting 2: J48. Setting 3: Random Forest
Experimental results (report the Weighted Avg.; fill in the table below)
Analysis and discussion (state at least two points)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 67
Setting   Accuracy   F1 Measure   ROC Area
1-1
1-2
2
3
Practice 2: Artificial neural network (ANN) practice
Experiment setup (for each setting, use the defaults except for the specified parameters). Dataset: MNIST_handwritten_digit_subset_500.arff. Test option: 5-fold cross-validation. Classifier: functions - MultilayerPerceptron. Parameters:
Experimental results (report the Weighted Avg.; fill in the table below)
Analysis and discussion (state at least two points)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 68
Setting   Accuracy   F1 Measure   ROC Area
1
2
3

Setting   hiddenLayers   learningRate   momentum   trainingTime
1         10             0.3            0.2        50
2         10             0.9            0.5        50
3         10, 7          0.3            0.2        50
Practice 3: SVM practice
Experiment setup (for each setting, use the defaults except for the parameters under test). Dataset: MNIST_handwritten_digit_subset_500.arff. Test option: 5-fold cross-validation. Classifier: functions - SMO. Setting 1: PolyKernel, testing several exponents {1, 2, 5}. Setting 2: RBF kernel, "grid search" on C and γ using cross-validation,
with C = {0.1, 1, 10}, γ = {0.1, 1, 10}
Experimental results (report the Weighted Avg.; fill in the tables below)
Analysis and discussion (state at least two points)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 69
Setting 1
Setting   Accuracy   F1 Measure   ROC Area
1-1
1-2
1-3

Setting 2 (F1 Measure)
C \ γ    0.1    1.0    10.0
0.1
1.0
10.0
Seoul National University Biointelligence Laboratory
Byoung-Hee Kim
Machine Learning Practice using Weka
Reference: Overview
Outline
Data Mining, Analytics, & Machine Learning
Classification with Weka
The various interfaces of Weka
Clustering with Weka
Data preprocessing and visualization
Additional information on Weka and related software
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 71
DATA MINING, ANALYTICS, &MACHINE LEARNING
Part I
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 72
What Is Data Mining?
Data mining (knowledge discovery from data): extraction of interesting patterns or knowledge from huge amounts of data. 'Interesting' means: non-trivial, implicit, previously unknown, and potentially useful
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Watch out: Is everything "data mining"? Simple search and query processing; (deductive) expert systems
73Slide from Lecture Slide of Ch. 1 by J. Han, et al., for Data Mining: Concepts and Techniques(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Fields related to data mining
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
[Diagram: fields related to data mining - Data Mining, Artificial Intelligence (AI), Machine Learning (ML), Deep Learning, Data Science, Information Retrieval (IR), Knowledge Discovery from Data (KDD), Big Data, Analytics, Business Intelligence]
74
Analytics
Analytics (analysis solutions): the process of discovering and communicating meaningful patterns in data. The term refers to the overall methodology that encompasses data analysis. Terms used in different fields:
Text Analytics, Social Analytics, Business Analytics, Predictive Analytics, Advanced Analytics, Cognitive Analytics
Example: Google Analytics, a tool for tracking and analyzing marketing performance based on web logs. Google Partners (from 2013.12)
Google Analytics individual qualification exam (formerly the Google Analytics Individual Qualification (IQ))
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 75
Demand for advanced technologies based on data mining and machine learning has grown rapidly in recent years
Hype Cycle of Emerging Technologies 2010, Gartner
Analytics as Mainstream Technology
76(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Analytics as Mainstream Technology
77(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Machine Learning & Data Mining
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 78Slide from GECCO 2009 Tutorial on ‘Large Scale Data Mining usingGenetics-Based Machine Learning’, by Jaume Bacardit and Xavier Llorà
Human Learning & Machine Learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 79
[Figure: human learning vs. (passive) supervised learning]
• Gather data (data collection)
• Preprocessing, feature selection
• 'Feature - correct label' combinations
• Knowledge becomes available (knowledge accumulation)
• Complexity reduction (refining the complexity of the knowledge representation)
• 'Experience - consequence' combinations, e.g., cumulus clouds in midsummer - rain showers
Scaffolding: tutoring theory to enhance human learning
R. Elwell and R. Polikar, "Incremental learning of concept drift in nonstationary environments," IEEE Trans. Neural Netw., vol. 22, no. 10, pp. 1517-31, Oct. 2011.
Data Science & Machine Learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 80
The process of data analysis
Define the question; prepare the data (dataset)
Define the ideal data set; determine what data you can access; obtain the data; clean the data
Exploratory data analysis: clustering / data visualization
Statistical prediction/modeling: classification / prediction
Interpret results: various evaluation metrics and methods (evaluation), selection of the learned model (model selection)
Challenge every step and result (challenge results); synthesize/write up the results; create reproducible code
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 81
J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013
Choosing a dataset according to the purpose of the analysis
Descriptive: a whole population
Exploratory: a random sample with many variables measured
Inferential: the right population, randomly sampled
Predictive: a training and a test data set from the same population
Causal: data from a randomized study
Mechanistic: data about all components of the system
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 82
J. Leek, Data Analysis – Structure of a Data Analysis, Lecture at Coursera, 2013
Example: an automatic photo-based classifier
What should be classified?
Which measurements should the classification be based on?
Collect and analyze the measurement data
'Train' the automatic classifier - test - release (c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 83
[Figure: example classification tasks - fish type (salmon / sea bass / rockfish), gender, emotion. Photo source: http://jamja.tistory.com/1705]
Solution: machine learning - supervised learning - classification
What to classify: the class label
Which measurements to base the classification on: features (attributes), variables
Collect and organize the measurement data: dataset, data matrix
Train the automatic classifier - test - release: training dataset, test dataset, classification model, evaluation & model selection
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 84
[Figure: data matrix layout - features x1, x2, x3, ... and label y form pairs (X, y); rows are instances, columns are features; the data is split into a training set and a test set]
The role of classification methods
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 85
[Figure: decision boundaries in a two-feature (x1, x2) space for binary classification and multi-class classification]
A. Ng, Machine Learning, Lecture at Coursera, 2013
Terminology
Features (attributes, or variables): features are the individual measurable properties of the phenomena being observed. Choosing discriminating and independent features is key to any pattern recognition algorithm being successful in classification
Training set / test set. Training set: a set of examples used for learning, that is, to fit the parameters [i.e., weights] of the classifier. Test set: a set of examples used only to assess the performance [generalization] of a fully specified classifier
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 86
CLASSIFICATION WITH WEKA
Part Ⅱ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 87
Introduction to Weka
Weka: a representative collection of machine learning algorithms and a data mining tool. Main features:
Data pre-processing, feature selection, clustering, visualization, classification, regression, time-series forecasting, association rule learning
Software characteristics: free and open source (GNU General Public License); the basis of major analytics software such as RapidMiner and MOA; implemented in Java, so it runs on many platforms
Download: search Google for "Weka" (first result), or go to http://www.cs.waikato.ac.nz/ml/weka/
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 88Weka (bird): http://www.arkive.org/weka/gallirallus-australis/video-au00.html
Components of Data Mining
• Input: concepts, instances, attributes; preparing the input
• Algorithm: inferring rules, statistical modeling, divide-and-conquer, association, linear models, ...
• Output: knowledge representation - tables, trees, rules, clusters, ...
• Credibility: evaluating what's been learned - training and testing, performance, comparing algorithms
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 89
Weka as a Must-Have Tool
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 90
I use Weka constantly in my speech work. It is the first thing I reach for when encountering a new problem. What a terrific tool.
A must for anyone even marginally interested in machine learning and classification techniques.
One of the most useful AI software packages available. Its only serious flaw is being infected with the GNU virus.
Reviews in Sourceforge.net
The interfaces that make up Weka
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 91
Explorer: run various analysis tasks step by step and inspect the results. Typically the first interface to launch
KnowledgeFlow: compose the main modules of a data-processing pipeline as a visual graph and run experiments
Experimenter: batch processing of classification and regression analyses, with comparative analysis of the results - configure various algorithms and parameter settings - analyze several dataset-algorithm combinations at once - statistically compare the resulting models - run large-scale statistical experiments
Simple CLI: a script console that controls the other interfaces. Every Weka feature can be run from the command line
Other main tools
Structure of the Weka practice
Solving a classification problem right away. Problem: iris species identification. Goal: predictive. Core steps: statistical prediction/modeling, interpretation of results. Tools: Weka Explorer, Experimenter
Preliminary work to understand and solve the classification problem better. Goal: exploratory. Core step: exploratory data analysis. Main tasks:
Data preprocessing, evaluating and selecting attributes by their contribution to classification, data clustering, data visualization
Tool: Weka Explorer (c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 92
Practice: iris classification
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 93
Iris virginica, Iris versicolor, Iris setosa
Features used for classification
Practice: iris classification
Define the features (attributes): sepal length, sepal width, petal length, petal width. Class label: the three iris subspecies setosa, versicolor, and virginica
Collect samples and construct the data: 50 samples collected per subspecies (1935). Data table: 150 samples (instances) x 5 attributes. Sir R. Fisher applied a linear discriminant model to this data in his 1936 paper
Learning: choose a classification algorithm and set its parameters. Practice with three algorithms: neural networks, decision trees, SVM. Use the default parameter settings for each algorithm
Evaluate the learned models and select one: check the various evaluation metrics, then compare and select the learned models (algorithm + parameter setting) based on those metrics
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 94
95
Classification algorithms - Neural Networks
MLP (Multilayer Perceptron): a representative classification algorithm that is used very widely in practice. Where to find it in Weka: classifiers - functions - MultilayerPerceptron
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Figure from Andrew Ng’s Machine Learning Lecture Notes, on Coursera, 2013-1
96
Classification algorithms - Decision Trees
J48 (the Java implementation of C4.5)
The learned model gives the classification rules in the form of a 'tree'. Where to find it in Weka: classifiers - trees - J48
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Classification algorithms - Support Vector Machines
SMO (sequential minimal optimization) for training SVMs
A representative classification algorithm based on kernel machines. Where to find it in Weka: classifiers - functions - SMO
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 97
Practice: the iris dataset
Just open "iris.arff" in the 'data' folder
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 98
Weka data format (.ARFF)
@RELATION iris
@ATTRIBUTE sepallength REAL
@ATTRIBUTE sepalwidth real
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth numeric
@ATTRIBUTE class {Iris-setosa, Iris-versicolor, Iris-virginica}
@DATA
5.1, 3.5, 1.4, 0.2, Iris-setosa
4.9, 3.0, 1.4, 0.2, Iris-setosa
4.7, 3.2, 1.3, 0.2, Iris-setosa
…
7.0, 3.2, 4.7, 1.4, Iris-versicolor
6.4, 3.2, 4.5, 1.5, Iris-versicolor
6.9, 3.1, 4.9, 1.5, Iris-versicolor
…
Data (CSV format)
Header
99
Note: after creating a CSV file with Excel, you can produce an ARFF file easily just by adding the header
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Dataset name Attribute name Attribute type
ARFF Example
100
%
% ARFF file for weather data with some numeric features
%
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
Slide from Data Mining: Practical Machine Learning Tools and Techniques (Chapter 2)(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Exploratory data analysis in the Preprocess tab
Data composition (current relation); removing attributes; basic statistics per attribute (selected attribute); visualizing the class label distribution over all attributes (Visualize All); preprocessing with 'Filter': explained in Part V
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 101
Running classification in Weka - a neural network example
102
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Classify' tab
• Click the 'Choose' button
• Select 'weka - functions - MultilayerPerceptron'
• Click 'MultilayerPerceptron'
• Set parameters for MLP
• Set parameters for Test
• Click 'Start' for learning
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
103
Setting the parameters of classification algorithms
Parameter setting is like car tuning: it takes a lot of experience or trial and error. Depending on the parameter settings, the same algorithm can show anywhere from the worst to the best performance
Main parameters of neural networks (MultilayerPerceptron in Weka). Structure: hiddenLayers; learning process: learningRate, momentum, trainingTime (epochs), seed
Main parameters of decision trees (J48 in Weka): unpruned, numFolds, minNumObj. Parameters with a direct effect on the size of the tree: confidenceFactor, pruning, etc.
Main parameters of the Support Vector Machine (SVM) (SMO in Weka). Kernel-related: choice of kernel and additional per-kernel parameters; optimization-related: c (complexity parameter)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Test Options and Classifier Output
104
There are various metrics for evaluation
Setting the data set used for evaluation
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
105
Evaluation Method - Cross Validation
K-fold Cross-Validation: the data set is randomly divided into k subsets. One of the k subsets is used as the 'test set' and the other k-1 subsets are put together to form a 'training set'.
[Figure: 6 folds D1-D6 of 30 instances each; in each round one fold (D6, then D5, ..., then D1) is held out as the test set and the remaining five folds form the training set]
Error = (1/k) Σ_{i=1..k} Error_i
Example: 6-fold cross-validation: split 180 data points into 6 equal parts, then run learning/evaluation 6 times and measure the average performance
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
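The fold mechanics above can be reproduced directly with the Weka API, where Instances.trainCV and Instances.testCV return the training and test portions of each fold. A minimal sketch (J48 is only an example classifier):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ManualCrossValidation {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        int k = 6;
        double errorSum = 0.0;
        for (int i = 0; i < k; i++) {
            Instances train = data.trainCV(k, i);  // k-1 folds for training
            Instances test = data.testCV(k, i);    // the remaining fold for testing

            J48 classifier = new J48();
            classifier.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(classifier, test);
            errorSum += eval.errorRate();          // Error_i for this fold
        }
        // Error = (1/k) * sum of the per-fold errors
        System.out.println("Average error over " + k + " folds: " + errorSum / k);
    }
}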
Classifier Output
Run information
Classifier model (full training set)
Evaluation results: general summary, detailed accuracy by class, confusion matrix
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 106
The output depends on the classifier
107
How to Evaluate the Performance? (1/2)
Usually, build a ‘Confusion Matrix’ on the test data set
Evaluation metrics: accuracy (percent correct), precision / recall, various other metrics: F-measure, Kappa score, etc.
For fair evaluation, the 'cross-validation' scheme is used
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
108
How to Evaluate the Performance? (2/2)
Confusion Matrix (binary class case)
                         Real
Prediction    Positive    Negative
Positive      TP          FP           All with a positive test
Negative      FN          TN           All with a negative test
              All with    All without  Everyone
              disease     disease

Accuracy = (TP + TN) / (TP + FP + TN + FN)
Recall = TP / (TP + FN)
Precision = TP / (TP + FP)

As recall ↑, precision ↓; conversely, as recall ↓, precision ↑
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
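As a worked example of these formulas, the small sketch below computes the three metrics from hypothetical confusion-matrix counts (the counts are made up for illustration):

public class ConfusionMatrixMetrics {
    public static void main(String[] args) {
        // Hypothetical confusion-matrix counts for a binary problem
        double tp = 40, fp = 10, fn = 5, tn = 45;

        double accuracy  = (tp + tn) / (tp + fp + tn + fn);  // 85 / 100 = 0.85
        double recall    = tp / (tp + fn);                   // 40 / 45  ≈ 0.889
        double precision = tp / (tp + fp);                   // 40 / 50  = 0.80

        System.out.println("Accuracy  = " + accuracy);
        System.out.println("Recall    = " + recall);
        System.out.println("Precision = " + precision);
    }
}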
THE VARIOUS INTERFACES OF WEKA
Part Ⅲ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 109
Using Experimenter in Weka
Tool for ‘Batch’ experiments
110
Click 'New'
• Set experiment type / iteration control
• Set datasets / algorithms
• Select the 'Run' tab and click 'Start'
• If it has finished successfully, click the 'Analyse' tab and see the summary
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Usages of the Experimenter
Model selection for classification/regression - various approaches: repeated training/test set split, repeated cross-validation (c.f. double cross-validation), averaging
Comparison between models / algorithms: paired t-test, on various metrics (accuracy / RMSE / etc.)
Batch and/or distributed processing: load/save experiment settings, http://weka.wikispaces.com/Remote+Experiment, multi-core support (utilize all the cores on a multi-core machine)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 111
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 112
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 113
Experimenter practice
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 114
KnowledgeFlow for Analysis Process Design
115
(‘Process Flow Diagram’ of SAS® Enterprise Miner )
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
KNIME
RapidMiner
KnowledgeFlow: Example Usage
Decision tree (J48)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 116
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 117
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 118
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 119
Command Line Interface (CLI)
Example command and result:
java weka.classifiers.functions.MultilayerPerceptron -L 0.3 -M 0.2 -N 500 -V 0 -S 0 -E 20 -H a -t "C:\Program Files\Weka-3-7\data\iris.arff"
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 120
You may easily build command line scripts for various batch experiments
Refer to Ch. 1 of WekaManual-3-*-*.pdf for further information
Package Manager
Various features can be added to the Explorer
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 121
CLUSTERING WITH WEKA
Part Ⅳ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 122
Motivating Questions for Clustering
What are the natural groupings in a set of data?
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 123
School Employees / Simpson's Family / Males / Females
The way of grouping is not unique
Issues in Clustering
How should one measure the similarity between samples? (similarity measure)
How many clusters would there be? (number of clusters)
How should one evaluate a partitioning of a set of samples into clusters? (criterion function) E.g., high intra-class similarity and low inter-class similarity
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 124
Purposes of Clustering
Quick review of data
Clustering before classification: checking whether the given instances form well-separated clusters
Discovery: possible number of class labels, subclasses, groups of features, modules (feature / instance combinations)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 125
126
k-means Clustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
127
Hierarchical Clustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Resulting dendrograms with the original data matrix
Self-organizing Map (SOM)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 128
World poverty map(http://www.cis.hut.fi/research/som-research/worldmap.html)
K-means in Weka
129
• Load a file that contains the training data by clicking the 'Open file' button
• 'ARFF' or 'CSV' formats are readable
• Click the 'Cluster' tab
• Click the 'Choose' button
• Select 'weka - clusterers - SimpleKMeans'
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
• Click 'SimpleKMeans'
• Set distanceFunction
• Set other parameters
• Check 'Classes to cluster ~'
• Click Start
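The same clustering run can be done in code. A minimal sketch (the class attribute is removed before clustering so the run is unsupervised; k = 3 is a placeholder for the iris data):

import weka.clusterers.SimpleKMeans;
import weka.core.EuclideanDistance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class KMeansSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();

        // Remove the class attribute (the last one) so clustering ignores the labels
        Remove remove = new Remove();
        remove.setAttributeIndices("last");
        remove.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, remove);

        SimpleKMeans kmeans = new SimpleKMeans();
        kmeans.setNumClusters(3);                         // e.g. the three iris subspecies
        kmeans.setDistanceFunction(new EuclideanDistance());
        kmeans.buildClusterer(noClass);

        System.out.println(kmeans);                       // cluster centroids and sizes
    }
}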
130
Evaluation of Clustering Results
Contingency Table Used when we have labels for instances Matching resulting clusters with labels
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr
Categories of Clustering Algorithms
Hierarchical clustering: agglomerative / divisive, concept clustering
Partitional clustering: K-means clustering, fuzzy c-means clustering, graph-theoretic clustering methods, spectral clustering
Subspace clustering: co-clustering, biclustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 131
DATA PREPROCESSING & VISUALIZATION
Part Ⅴ
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 132
Data Preprocessing with Filter in Weka
Attribute: selection, discretize
Instance: re-sampling, selecting specified folds
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 133
Visualization
Descriptive analysis: scatter plot & correlation analysis among features
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 134
Visualization
Unsupervised learning results: dimension reduction, clustering
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 135
Dendrogram(hierarchical clustering)
cluster assignments
Visualization
Classification: the learned model
tree, graph boundary
Overall performance evaluation of the model: ROC curve (threshold curve), cost curve
Classifier errors
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 136
Example of trees.J48; example of bayes.BayesNet - TAN
ADDITIONAL INFORMATION ON WEKA AND RELATED SOFTWARE
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 137
More Information on Weka
Current version (April, 2016) Stable version: 3.8.0 Developer version: 3.9.0
Collections of datasets in Weka (ARFF) format http://www.cs.waikato.ac.nz/ml/weka/datasets.html Datasets from UCI repository Datasets from UCI KDD repository …
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 138
Weka References
Weka Wiki: http://weka.wikispaces.com/ Primer: good starting point
Weka online documentation: http://www.cs.waikato.ac.nz/ml/weka/index_documentation.html
Textbook Ian H. Witten, Eibe Frank, Mark A. Hall, Data Mining:
Practical Machine Learning Tools and Techniques (Third Edition), Morgan Kaufmann, Jan. 2011.
Articles: Data mining with WEKA, Part 1, Part 2, Part 3 in the IBM Technical Library; 'Building prediction programs with Weka', a series in the Korean monthly magazine Maso (July, August, September 2009 issues); blog; MS Live
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 139
Other ML Open Source S/W’s
RapidMiner: the most used analytics software in 2013 & 2014 (KDnuggets poll). Its basic interface for analyses is in the style of KnowledgeFlow in Weka. Provides an integrated environment for ML, DM, and predictive analytics. http://rapidminer.com/
MOA (Massive Online Analysis) Closely related project to the WEKA project Open source framework for data stream mining http://moa.cms.waikato.ac.nz/
KNIME Konstanz Information Miner modular data pipelining concept http://www.knime.org/
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 140
ELKI: Environment for DeveLoping KDD-Applications Supported by Index-Structures
KDD software framework + database. http://elki.dbs.ifi.lmu.de/
Other ML Open Source S/W’s
Mahout: http://mahout.apache.org/ - an Apache project to produce free implementations of distributed or otherwise scalable machine learning algorithms; classification, clustering, collaborative filtering, and frequent itemset mining. Book: Mahout in Action
MLOSS: http://mloss.org/ - a forum for open source software in machine learning; http://jmlr.org/mloss/ - JMLR Machine Learning Open Source Software (MLOSS)
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 141
ANY QUESTIONS?
(c)2008-2016, SNU Biointelligence Laboratory, http://bi.snu.ac.kr 142