DATA ANALYTICS164.115.41.179/d756/sites/default/files/Data Analytics.pdf · 2018. 12. 25. ·...

DATA ANALYTICS

Wanida Saetang, Ph.D. (candidate), King Mongkut's University of Technology North Bangkok

Agenda

Data Analytics: Predictive analytics

Data Mining Techniques: Decision Tree, K-means

Apply Model & Validation Model

Rapid Miner Studio

Workshop

Data Analytics predictive analytics

The Progression of Analytics

Predictive Analytics

Predictive analytics is carried out in an attempt to determine the outcome of an event that might occur in the future.

The models used for predictive analytics have implicit dependencies on the conditions under which the past events occurred.

Data Mining Techniques Decision tree

K-Means

CRISP-DM

http://mlwiki.org/index.php/CRISP-DM

Types of data

Numerical / Quantitative Data

Discrete Data: data obtained by counting, e.g. the number of customers

Continuous Data: data obtained by measuring, e.g. weight, height

Categorical / Qualitative Data

Nominal Data: data divided into groups that cannot be used in calculations, e.g. gender

Ordinal Data: data divided into groups whose order can be stated, e.g. education level

Data mining techniques

Decision Tree

Naive Bayes

Neural Network

Support Vector Machines (SVM)

K-Means

DBSCAN

EM Clustering using GMMs.

Agglomerative Hierarchical

Apriori algorithm

Eclat algorithm

FP-growth algorithm

Classification: Decision Tree

Decision Tree

A decision tree is a classification technique in which the number of target groups is known in advance. The goal of classification is to predict a value, that is, the target variable (class/label). A decision tree resembles a graph or diagram shaped like an upside-down tree. It consists of nodes, each representing an input attribute in the data set, and edges, which represent the values of those attributes (numerical attributes). The topmost node is called the root node, and it branches out down to the leaf nodes.
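The node-and-edge structure described above can be sketched as nested dictionaries. This is only an illustration, not how RapidMiner stores its trees: the attribute "Free" is borrowed from the spam example later in the slides, while the attribute "Won" and the shape of the tree are hypothetical.

```python
# A hand-built decision tree stored as nested dicts (illustrative only).
# Internal nodes test one input attribute, edges are keyed by attribute value,
# and leaf nodes hold the predicted class label.
tree = {
    "attribute": "Free",              # root node
    "branches": {
        "Y": {"label": "spam"},       # leaf node
        "N": {
            "attribute": "Won",       # hypothetical second attribute
            "branches": {
                "Y": {"label": "spam"},
                "N": {"label": "normal"},
            },
        },
    },
}


def predict(node, example):
    """Walk from the root to a leaf, following the edge that matches the example."""
    while "label" not in node:
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


print(predict(tree, {"Free": "Y", "Won": "N"}))
print(predict(tree, {"Free": "N", "Won": "N"}))
```

The first call stops at the leaf directly under the root; the second follows two edges before it reaches a leaf.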

Decision Tree: Information Gain

When a decision tree is built, the procedure computes and selects the attributes that are related to the class. The Information Gain can be computed from the equation below.

Information Gain = Entropy(initial) – [P(c1) × Entropy(c1) + P(c2) × Entropy(c2) + …]

where Entropy(c1) = –P(c1) × log2 P(c1)

and P(c1) is the probability of c1.

Decision Tree: spam e-mail classification

ID Type

1 spam

2 spam

5 spam

6 spam

8 spam

3 normal

4 normal

7 normal

9 normal

10 normal

P(spam) = 5/10 = 0.5
P(normal) = 5/10 = 0.5
Entropy(initial) = -[P(spam) × log2 P(spam) + P(normal) × log2 P(normal)]
= -[0.5 × log2(0.5) + 0.5 × log2(0.5)]
= -[0.5 × (-1) + 0.5 × (-1)]
= 1

ID Free Type

1 Y spam

5 Y spam

6 Y spam

2 N spam

3 N normal

4 N normal

7 N normal

8 N spam

9 N normal

10 N normal

Free = Y (3 examples):
P(spam) = 3/3 = 1.0
P(normal) = 0/3 = 0.0
Entropy(Free = Y) = -[1.0 × log2(1.0) + 0.0 × log2(0.0)]
= -[1.0 × 0 + 0] (the term 0 × log2(0) is taken as 0 by convention)
= 0

Free = N (7 examples):
P(spam) = 2/7 = 0.29
P(normal) = 5/7 = 0.71
Entropy(Free = N) = -[0.29 × log2(0.29) + 0.71 × log2(0.71)]
= -[0.29 × (-1.79) + 0.71 × (-0.49)]
= 0.87

Information Gain(Free) = Entropy(initial) – [P(Free = Y) × Entropy(Free = Y) + P(Free = N) × Entropy(Free = N)]
= 1 – [0.3 × 0 + 0.7 × 0.87]
= 0.39
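The hand calculation above can be checked with a short script. This is a minimal sketch: the function names `entropy` and `information_gain` are chosen here for illustration, and the rows are copied from the ID / Free / Type table. Because the script does not round intermediate values, the gain comes out at about 0.396, slightly above the 0.39 obtained with two-decimal rounding.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels: the sum of -p * log2(p) over the classes present."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the whole set minus the weighted entropy of each attribute-value subset."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# (ID, Free, Type) rows copied from the table above
table = [
    (1, "Y", "spam"), (5, "Y", "spam"), (6, "Y", "spam"),
    (2, "N", "spam"), (3, "N", "normal"), (4, "N", "normal"),
    (7, "N", "normal"), (8, "N", "spam"), (9, "N", "normal"), (10, "N", "normal"),
]
rows = [{"ID": i, "Free": f, "Type": t} for i, f, t in table]

initial = entropy([r["Type"] for r in rows])
gain = information_gain(rows, "Free", "Type")
print(round(initial, 3), round(gain, 3))
```

Only classes that actually occur in a subset are counted, so the 0 × log2(0) case never has to be evaluated.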

Building the model (Classification model)

http://dataminingtrend.com/2014/data-mining-techniques/ensemble-model/

classification model

Clustering: K-means

Clustering

Clustering is the division or grouping of data when the number of target groups is not known in advance.

Data items with similar characteristics are placed in the same group, and data items that differ greatly are placed in different groups. Each group is called a cluster.

Cluster A

Cluster B

Cluster C

Clustering

To place data into groups, the similarity or the distance between each pair of data items (examples) must be measured.

A commonly used distance measure is the Euclidean distance.

P1 (x1,y1)

P2 (x2,y2)

$C = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$
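The Euclidean distance between P1 and P2 is the square root of the sum of the squared coordinate differences. A minimal sketch follows; the function name `euclidean_distance` and the sample points are chosen here for illustration.

```python
from math import sqrt


def euclidean_distance(p1, p2):
    """Square root of the summed squared differences between matching coordinates."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p1, p2)))


# P1 = (1, 2) and P2 = (4, 6): sqrt(3^2 + 4^2) = 5.0
print(euclidean_distance((1, 2), (4, 6)))
```

The same function works for points with more than two coordinates, as long as both points have the same number of them.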

Clustering

For K-means clustering, one parameter must be specified: the number of groups to divide the data into, or the number of clusters, denoted by the variable K.

Procedure

1. Choose the number of clusters (K).

2. Randomly select centroids, one for each cluster.

3. Assign each data item to the nearest cluster.

4. Recompute the centroid of each cluster.

5. Repeat steps 3 and 4 until the centroids no longer change.
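The five steps above can be sketched in plain Python. This is an illustrative implementation, not the one used by RapidMiner. The function name `k_means`, the `seed` and `max_iter` parameters, and the sample points are assumptions made for this example. Choosing the initial centroids from the data points themselves is one common way to do step 2.

```python
import random
from math import dist  # Euclidean distance, available from Python 3.8


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, clusters) for a list of coordinate tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 2: random initial centroids
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign every point to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centroid as the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[i])  # an empty cluster keeps its old centroid
        # Step 5: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]   # step 1: K = 2
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

The `max_iter` cap is a safety limit added for the sketch. On small, well-separated data such as the sample points, the loop stops after a few passes.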

Apply Model & Validation Model

Applying the model (Apply model)

http://dataminingtrend.com/2014/data-mining-techniques/ensemble-model/

Build the model

Put the model to use

Validate Model

Self Consistency Test

Split-validation

Cross-validation

http://dataminingtrend.com/2014/data-mining-techniques/cross-validation/
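The three validation schemes listed above differ only in how the examples are divided between training and testing. A self consistency test, as the term is usually understood, evaluates the model on the same data it was trained on, so it needs no split. The other two can be sketched as follows; the function names, the 70/30 ratio, and the choice of five folds are assumptions made for this illustration.

```python
import random


def split_validation(examples, train_ratio=0.7, seed=0):
    """Shuffle once, then cut the data into a training part and a testing part."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def cross_validation_folds(examples, k=5, seed=0):
    """Yield k (train, test) pairs; every example is tested exactly once."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


data = list(range(10))                      # stand-in for ten examples
train, test = split_validation(data)
print(len(train), len(test))
for fold_train, fold_test in cross_validation_folds(data, k=5):
    print(sorted(fold_test))
```

In practice a model would be trained on each `train` part and scored on the matching `test` part, with the cross-validation scores averaged over the folds.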

Repository

Operators

Process Parameters

Rapid Miner Studio

Input Ports (inp): example set (exa), training set (tra)

Output Ports (res): output (out), model (mod), example set (exa)

Workshop Decision tree

K-Means

Data preparation

Training Data, Testing Data, Unknown Data

Preprocessing (cleansing)

Build the model (Modeling)

Apply the model (Apply model)

Test the model (Validation)

Measure model performance (Performance)

Blending. Ex.: Select Attributes (select columns), Filter Examples (select rows)

Cleansing. Ex.: Replace Missing Values (fill in missing values with another value)

Modeling. Ex.: Decision Tree, Random Forest, k-means, Rule Induction, Deep Learning

Scoring. Ex.: Apply Model

Validation. Ex.: Performance (Classification), Cluster Distance Performance

Validation. Ex.: Cross Validation, Split Validation

Modeling Process

Workshop 1 Decision Tree

Validate Model

Test Model

Apply Model

Decision Tree

Workshop 2 K-means

• Lab1 K-means
• Lab2 Apply Model
• Lab3 Test Model
• Lab4 Validate Model
