DIY: A Data Mining Based Network Traffic Analysis Tool. Raymond, OSDC, 2014

A Concept of Network Analysis Tool by Data Mining

DESCRIPTION

Implementing a concept for observing network status with data mining. The methods include data description, clustering, and frequent pattern mining.

Page 1: A Concept of Network Analysis Tool by Data Mining

DIY: A Data Mining Based

Network Traffic Analysis Tool

Raymond, OSDC, 2014

Raymond

http://systw.net

Interests / expertise / research: networking (plus a little data mining and a little programming)

Why?

Why build this at all?

First

As network administrators, understanding the state of the network is our job.

Instinct?

We want to see more (and more, and more) information in the network traffic.

Tool review

A quick look back at the tools available for watching a network.

MRTG

Monitorix

Cacti

ntop

sFlowTrend

nfsen

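No tool does the data analysis (looking at the data from another angle)

Focused on data presentation / visualization (the administrator then does the analysis): MRTG, ntop, Cacti, sFlowTrend, nfsen

Analyze first (data mining), then present (the administrator only reads the analyzed results): ????????

I searched and searched, and still could not find one.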

Getting the ball rolling

Fine, then I will build a "concept tool" myself.

Hopefully, before long, similar tools will be available to use off the shelf.

Let's go

Time to DIY.

The rough approach

Dish name: data mining traffic analysis tool

1. Prepare the ingredients (data)
2. Prepare the tools
3. Process them (data preprocessing, data conversion, ...)
4. Fry, boil, stir-fry, deep-fry (data mining)
5. A little plating (lay out the results)

Done.

Ingredients (the data)

Data in flow format (the slide shows a sample)

In short, one record represents one flow.
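To make "one record is one flow" concrete, here is a minimal Python sketch of such a record. The field names are borrowed from the pmacct columns used in the SQL later in the deck; the class name, the timestamp, the port numbers, and the counters are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One flow: traffic from one source to one destination in one time bucket."""
    stamp_inserted: str  # start of the time bucket
    ip_src: str
    ip_dst: str
    ip_proto: str
    port_src: int
    port_dst: int
    packets: int
    bytes: int


# Made-up sample values; only the two addresses appear elsewhere in the slides.
example = FlowRecord("2014-04-11 10:00:00", "10.5.5.11", "163.28.5.9",
                     "tcp", 51234, 80, 12, 3400)
print(example.ip_src, "->", example.ip_dst, example.bytes, "bytes")
```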

The tools

• pmacct (delivers the fresh ingredients)
• MySQL (stores the ingredients)
• PHP and Perl (cut, peel, ... process the ingredients)
• Mahout (fry, boil, stir-fry, deep-fry, ...)
• Bootstrap and Chart.js (the fine tableware)

Find a network device (find the ingredients)

[Network device] -- NetFlow or sFlow --> [Analysis machine]

Enable NetFlow or sFlow on the network device and export it to the machine used for analysis.

Delivering the fresh ingredients

On the analysis machine: NetFlow or sFlow -> pmacct -> MySQL

Install pmacct on the analysis machine and the flow data will go into MySQL.

Checking that the ingredients arrived safely

pmacct writes the flows it receives to the acct_v8 table.

Let's check whether the ingredients have arrived:

SELECT * FROM acct_v8 LIMIT 1;

Start processing the ingredients (thawing, washing, ...)

We want to turn the raw ingredients into a processed form (the slide shows the table before and after).

With something like this:

INSERT INTO `10m_flowg_mem`
  (addtime, srcaddr, dstaddr, prot, srcport, dstport, byte, pkt, flow)
SELECT stamp_inserted AS addtime,
       ip_src AS srcaddr,
       ip_dst AS dstaddr,
       ip_proto AS prot,
       CONCAT(ip_proto, port_src) AS srcport,
       CONCAT(ip_proto, port_dst) AS dstport,
       SUM(bytes) AS byte,
       SUM(packets) AS pkt,
       COUNT(*) AS flow
FROM `'.$table.'`
WHERE stamp_inserted = "'.$addtime.'"
GROUP BY ip_src, ip_dst, ip_proto, port_src, port_dst;
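The same grouping can be sketched in plain Python to show what the query does: rows that share the same source, destination, protocol, and ports collapse into one row with summed bytes and packets and a flow count. This is only an illustration with made-up rows; the function name is hypothetical and the real tool does this in MySQL.

```python
from collections import defaultdict

# Made-up rows: (ip_src, ip_dst, ip_proto, port_src, port_dst, bytes, packets)
raw_flows = [
    ("10.5.5.11", "163.28.5.9", "tcp", 51234, 80, 1000, 4),
    ("10.5.5.11", "163.28.5.9", "tcp", 51234, 80, 500, 2),
    ("10.5.5.11", "224.0.0.251", "udp", 5353, 5353, 120, 1),
]


def aggregate_flows(rows):
    """Group by the 5-tuple; sum bytes and packets, count rows as flows."""
    groups = defaultdict(lambda: {"byte": 0, "pkt": 0, "flow": 0})
    for ip_src, ip_dst, proto, sport, dport, nbytes, npkts in rows:
        # Mirror CONCAT(ip_proto, port): each port is prefixed with its protocol.
        key = (ip_src, ip_dst, proto, f"{proto}{sport}", f"{proto}{dport}")
        groups[key]["byte"] += nbytes
        groups[key]["pkt"] += npkts
        groups[key]["flow"] += 1
    return dict(groups)


flow_groups = aggregate_flows(raw_flows)
for key, totals in flow_groups.items():
    print(key, totals)
```

Prefixing the port with the protocol keeps, for example, TCP port 80 and UDP port 80 apart when distinct ports are counted later.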

Slicing the ingredients

We slice the ingredients from the previous step into a per-IP form (the slide shows the resulting table).

With something like this:

INSERT INTO `10m_stateip_mem`
  (addtime, addr,
   srcaddr, sdtsport, sdtdip, sdtdport, sbyte, spkt, sflow, sflowg,
   ddtsip, ddtsport, dstaddr, ddtdport, dbyte, dpkt, dflow, dflowg)
SELECT * FROM (
  (
    (SELECT addtime, srcaddr AS addr FROM '.$table.'
      WHERE addtime = "'.$addtime.'" AND srcaddr LIKE "%'.$x.'")
    UNION
    (SELECT addtime, dstaddr AS addr FROM '.$table.'
      WHERE addtime = "'.$addtime.'" AND dstaddr LIKE "%'.$x.'")
  ) AS tip
  LEFT JOIN (
    SELECT srcaddr,
           COUNT(DISTINCT srcport) AS sdtsport,
           COUNT(DISTINCT dstaddr) AS sdtdip,
           COUNT(DISTINCT dstport) AS sdtdport,
           SUM(byte) AS sbyte,
           SUM(pkt) AS spkt,
           SUM(flow) AS sflow,
           COUNT(*) AS sflowg
    FROM '.$table.'
    WHERE addtime = "'.$addtime.'" AND srcaddr LIKE "%'.$x.'"
    GROUP BY srcaddr
  ) AS ts ON addr = srcaddr
)
LEFT JOIN (
  SELECT COUNT(DISTINCT srcaddr) AS ddtsip,
         COUNT(DISTINCT srcport) AS ddtsport,
         dstaddr,
         COUNT(DISTINCT dstport) AS ddtdport,
         SUM(byte) AS dbyte,
         SUM(pkt) AS dpkt,
         SUM(flow) AS dflow,
         COUNT(*) AS dflowg
  FROM '.$table.'
  WHERE addtime = "'.$addtime.'" AND dstaddr LIKE "%'.$x.'"
  GROUP BY dstaddr
) AS td ON addr = dstaddr

The ts subquery produces the upload-related figures.

The td subquery produces the download-related figures.
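As a rough illustration of what this per-IP slicing computes, here is a minimal Python sketch over made-up rows. It uses the column names from the INSERT above, but the function name is hypothetical, the address filter (the LIKE on $x) is left out, and an IP that never appears in one role simply has no keys for that role, where the SQL would store NULL.

```python
from collections import defaultdict

# Made-up rows shaped like 10m_flowg_mem:
# (srcaddr, dstaddr, srcport, dstport, byte, pkt, flow)
flow_groups = [
    ("10.5.5.11", "163.28.5.9", "tcp51234", "tcp80", 1500, 6, 2),
    ("10.5.5.11", "163.28.5.10", "tcp51240", "tcp80", 700, 3, 1),
    ("163.28.5.9", "10.5.5.11", "tcp80", "tcp51234", 9000, 8, 1),
]


def per_ip_stats(groups):
    """Return {ip: {attribute: value}} using the column names from the slide."""
    distinct = defaultdict(lambda: defaultdict(set))
    totals = defaultdict(lambda: defaultdict(int))
    for src, dst, sport, dport, nbyte, npkt, nflow in groups:
        # The IP acting as source (upload side): "s" columns.
        distinct[src]["sdtsport"].add(sport)
        distinct[src]["sdtdip"].add(dst)
        distinct[src]["sdtdport"].add(dport)
        for name, value in (("sbyte", nbyte), ("spkt", npkt),
                            ("sflow", nflow), ("sflowg", 1)):
            totals[src][name] += value
        # The IP acting as destination (download side): "d" columns.
        distinct[dst]["ddtsip"].add(src)
        distinct[dst]["ddtsport"].add(sport)
        distinct[dst]["ddtdport"].add(dport)
        for name, value in (("dbyte", nbyte), ("dpkt", npkt),
                            ("dflow", nflow), ("dflowg", 1)):
            totals[dst][name] += value
    stats = {}
    for ip in distinct:
        row = {name: len(members) for name, members in distinct[ip].items()}
        row.update(totals[ip])
        stats[ip] = row
    return stats


stats = per_ip_stats(flow_groups)
for ip, row in sorted(stats.items()):
    print(ip, row)
```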

Side note: how an IP communicates

[Diagram: an IP in the middle. Outbound, it sends from its Src Port to a peer's Dst Port, counted as (Src) byte, (Src) packet, and (Src) flow. Inbound, peers send from their Src Port to its Dst Port, counted as (Dst) byte, (Dst) packet, and (Dst) flow.]

[Diagram: the same IP talking to peers IP 1 and IP 2 over port pairs Src Port1 / Dst Port1 and Src Port2 / Dst Port2, with the distinct counts added:]

(src) distinct dip: how many dst IPs are used
(src) distinct dport: how many dst ports are used
(src) distinct sport: how many src ports are used
(dst) distinct sip
(dst) distinct sport
(dst) distinct dport

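Attributes

Attribute              Column     Meaning
(src) distinct dip     sdtdip     Number of distinct destination IPs this IP reaches
(src) distinct dport   sdtdport   Number of distinct destination ports this IP reaches
(src) distinct sport   sdtsport   Number of distinct source ports this IP uses to reach destinations
(src) byte             sbyte      Bytes this IP sends to destinations (upload)
(src) packet           spkt       Packets this IP sends to destinations
(src) flow             sflow      Flows this IP sends to destinations
(dst) distinct sip     ddtsip     Number of distinct source IPs reaching this IP
(dst) distinct sport   ddtsport   Number of distinct source ports reaching this IP
(dst) distinct dport   ddtdport   Number of distinct destination ports sources use to reach this IP
(dst) byte             dbyte      Bytes sources send to this IP (download)
(dst) packet           dpkt       Packets sources send to this IP
(dst) flow             dflow      Flows sources send to this IP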

Pull the data with PHP and lay it out with Bootstrap

Your data set will look roughly like this

Data description

Get an overview of the data

Measuring data location

One of the methods of data description

Q: What is this good for?
A: Finding outliers

Z-score

• What it is: a measure of data location that shows where each data point sits in the distribution

• Use: finding outliers (Z-score > 3)

Anomaly

Photo http://zh.wikipedia.org/wiki/File:Standard_deviation_diagram.svg
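For reference, the usual textbook definition of the Z-score is written out below. The symbols x_i (a value), n (the number of values), mu (the mean), and sigma (the standard deviation) are notation introduced here, not taken from the slides.

```latex
z_i = \frac{x_i - \mu}{\sigma}, \qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu\right)^2}
```

A value with z_i > 3 lies more than three standard deviations above the mean, which is the outlier rule used here.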

Finding outlier IPs with Z-score

• Put simply: suppose there are 100 IPs on this network. With this method we can spot the few among them that behave very differently (also called anomalous IPs).

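(The grunt work) Computing it with PHP and SQL

1. Normalize all of the attributes above so the values fall between 0 and 1 (min-max normalization)

2. For each IP, add up its own attributes

3. Compute the threshold: statistically, take the standard deviation and multiply it by 3 (Z-score > 3)

4. Compare: if an IP's value is larger than the threshold, treat it as an outlier IP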
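The four steps can be sketched in Python instead of PHP and SQL. This is a minimal sketch, not the tool's code: the function names, the 192.0.2.x addresses, and the attribute values are made up, and the threshold is taken as the mean plus 3 standard deviations, which is one reading of "Z-score > 3".

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale a column of numbers into the range 0..1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def find_outlier_ips(table, z=3.0):
    """table maps IP -> list of attribute values (same order for every IP)."""
    ips = list(table)
    columns = list(zip(*(table[ip] for ip in ips)))
    # Step 1: min-max normalize every attribute column.
    normalized = [min_max_normalize(col) for col in columns]
    # Step 2: each IP adds up its own normalized attributes.
    scores = {ip: sum(col[i] for col in normalized) for i, ip in enumerate(ips)}
    # Step 3: threshold, assumed here to be mean + z * standard deviation.
    values = list(scores.values())
    threshold = mean(values) + z * pstdev(values)
    # Step 4: anything above the threshold is an outlier IP.
    return sorted(ip for ip, score in scores.items() if score > threshold)


# Made-up data: 19 quiet hosts and one very busy one, two attributes each.
table = {f"192.0.2.{i}": [100 + i, 10 + i % 3] for i in range(1, 20)}
table["192.0.2.200"] = [50000, 900]
print(find_outlier_ips(table))
```

With only a handful of IPs no value can ever reach a Z-score of 3, so the rule needs a reasonably large population (the 100 IPs in the example above are plenty).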

Outlier IPs

These are all highly abnormal IPs

The result looks roughly like this

What else can we do: draw a chart

Every 10 minutes, compute the ratio of normal IPs to outlier IPs to see how many oddballs (outlier IPs) are on the network right now.

With Chart.js, we can draw it roughly like this

What else can we do: build a table

We can also list, for each outlier IP, the times at which it was an outlier alongside the times it was online, and compare the two.

It looks roughly like this

Clustering

Finding the cliques

What is clustering?

Suppose there are 100 people here. Based on some traits (personality, hobbies, and so on), I can divide them into several small groups.

Enter Mahout

Canopy + k-means = automatic clustering (the simplest approach)

Take the IP data set from before. With well-chosen parameters, clustering separates it into one normal cluster and several abnormal clusters.

Hands on

mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -t1 0.6 -t2 0.3 -x 10 -i ".$input." -o ".$output

We then post-process the Mahout output (data conversion).

With Chart.js, the result looks roughly like this
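To show the idea behind canopy + k-means without a Hadoop setup, here is a small pure-Python sketch. It is not Mahout's implementation: the function names and the seven 2-D points are made up, the thresholds 0.6 and 0.3 simply reuse the -t1 and -t2 values from the command above, and the iteration limit of 10 assumes that -x is the maximum number of iterations. Canopy clustering picks the number of clusters and their starting centers; k-means then refines them.

```python
import math


def canopy_centers(points, t1, t2):
    """Canopy clustering (t1 > t2): return one center per canopy."""
    remaining = list(points)
    centers = []
    while remaining:
        seed = remaining.pop(0)
        # Everything within the loose threshold t1 joins this canopy.
        canopy = [seed] + [p for p in remaining if math.dist(seed, p) < t1]
        centers.append(tuple(sum(c) / len(canopy) for c in zip(*canopy)))
        # Everything within the tight threshold t2 can no longer seed a canopy.
        remaining = [p for p in remaining if math.dist(seed, p) >= t2]
    return centers


def kmeans(points, centers, max_iter=10):
    """Plain k-means starting from the given centers."""
    centers = list(centers)
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members
            else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


# Made-up, already normalized points: one dense "normal" group and two odd ones.
points = [(0.10, 0.10), (0.12, 0.08), (0.09, 0.11), (0.11, 0.12),
          (0.90, 0.85), (0.92, 0.88), (0.10, 0.95)]
seeds = canopy_centers(points, t1=0.6, t2=0.3)
centers, clusters = kmeans(points, seeds, max_iter=10)
print([len(c) for c in clusters])
```

The largest cluster plays the role of the normal group; the small ones are the candidates for abnormal groups.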

You can see that different clusters look different

Expanding each cluster to look inside

What else can we do: export JSON

• The results above can also be exported in JSON format

• so that other systems (firewalls, ...) can read the results of this analysis
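A JSON export could look like the following minimal sketch. The layout (the addtime, clusters, id, size, and ips keys), the function name, and the sample addresses are all assumptions made for illustration; the slides do not show the tool's actual JSON format.

```python
import json


def clusters_to_json(addtime, clusters):
    """Serialize one 10-minute clustering result; the layout is hypothetical."""
    payload = {
        "addtime": addtime,
        "clusters": [
            {"id": i, "size": len(ips), "ips": sorted(ips)}
            for i, ips in enumerate(clusters)
        ],
    }
    return json.dumps(payload, indent=2)


# Made-up result: cluster 0 has two hosts, cluster 1 has one.
report = clusters_to_json("2014-04-11 10:00:00",
                          [["192.0.2.7", "192.0.2.3"], ["192.0.2.200"]])
print(report)
```

A firewall-side script would then only need to parse this document and pick the clusters it cares about.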

Frequent pattern

Finding events that occur together

What is a frequent pattern?

• The most famous example: people who buy diapers usually buy beer too

When one thing happens, another tends to happen along with it

• Common application: market basket analysis

So we can apply the same idea to the network

When hosts connect to a certain IP, which other IP do they usually connect to as well?

Hands on

We first take data like the following

and process it into data like this

• 112.124.20.221

• 10.5.5.11 101.227.14.222 111.1.53.158 113.31.88.150 119.235.235.16 163.28.5.9 173.194.127.62 173.194.72.103 173.194.72.104 173.194.72.95 173.252.110.27 173.252.110.29 183.61.112.124 183.61.112.37 202.169.175.79 203.17.63.21 209.177.95.163 209.177.95.169 211.79.36.170 220.181.109.158 220.181.184.186 221.204.10.89 23.61.194.10 23.61.194.26 23.61.194.34 23.61.194.65 23.61.194.72 23.76.204.153 239.255.255.250 31.13.69.36 42.99.128.138 42.99.128.152 42.99.128.154 42.99.128.160 42.99.128.176 42.99.128.177 54.251.119.61 54.254.98.219 61.135.185.18 61.145.124.85 64.208.5.26 64.208.5.41 64.208.5.9 69.171.233.33 69.171.245.49 74.125.203.157 74.125.23.188 74.125.31.139 74.125.31.188

• 224.0.0.22 224.0.0.251

• 10.5.5.11 239.255.255.250

• 10.5.5.11 163.28.5.24 17.135.64.4 17.149.36.104 17.172.233.129 224.0.0.251 23.48.139.164

• 10.5.5.11 101.226.200.130 106.187.40.88 107.20.172.41 107.23.186.214 113.106.27.210 119.147.211.135 119.161.22.33 119.235.235.91 121.14.228.118 122.143.12.163 123.130.123.164 124.95.142.209 124.95.142.210 14.17.110.21 162.159.243.187 163.28.5.10 163.28.5.11 163.28.5.17 163.28.5.18 163.28.5.19 163.28.5.25 163.28.5.27 163.28.5.33 163.28.5.8 163.28.5.9 17.132.73.88 17.134.126.131 17.134.126.132 17.134.62.130 17.134.62.131 17.149.32.58 17.149.36.121 17.151.225.120 17.151.225.65 17.151.225.77 17.151.225.8 17.154.66.105 17.154.66.107 17.154.66.109 17.154.66.111 17.154.66.79 17.172.232.55 17.172.233.103 17.173.254.222 17.173.254.223 173.194.72.103 173.194.72.104 173.194.72.105 173.194.72.156 173.194.72.95 173.223.232.11 173.223.232.19 173.223.232.24 173.223.232.33 173.223.232.35 173.223.232.42 173.223.232.74 173.223.232.75 173.252.103.16 173.252.110.27 173.252.79.23 175.6.0.102 175.6.0.104 175.6.0.106 175.6.0.124 180.97.81.120 180.97.81.122 183.60.41.120 202.169.175.117 202.169.175.82 202.169.175.89 203.104.131.5 203.17.63.21 203.83.220.250 206.190.38.30 218.59.209.165 218.59.209.182 218.59.209.197 220.130.123.76 221.204.214.156 222.186.3.142 222.186.3.143 222.218.45.222 223.26.70.11 223.26.70.37 224.0.0.251 23.22.88.133 23.3.105.106 23.3.105.49 23.3.105.56 23.3.105.57 23.3.105.59 23.3.105.66 23.3.105.72 23.3.105.73 23.3.105.74 23.3.105.81 23.3.105.89 23.48.130.217 23.48.143.139 23.61.194.11 23.61.194.33 23.61.194.34 23.61.194.40 23.61.194.41 23.61.194.42 23.61.194.43 23.61.194.48 23.61.194.49 23.61.194.58 23.61.194.73 23.61.194.75 23.76.204.153 23.76.204.155 23.76.204.162 23.76.204.163 23.76.204.168 23.76.204.170 23.76.204.176 23.76.204.177 255.255.255.255 42.121.149.41 42.121.149.44 42.156.140.138 42.99.128.137 42.99.128.144 54.208.251.43 54.236.156.216 54.236.90.22 54.240.226.0 54.240.227.64 60.209.6.137 60.55.32.68 60.55.32.84 60.55.32.90 63.151.118.145 64.208.5.16 64.208.5.51 74.125.203.156

• …omit…

Srcaddr Dstaddr
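The conversion from (Srcaddr, Dstaddr) rows to the lines above can be sketched as follows: every source address becomes one "basket" holding all the destinations it talked to. This is a minimal Python sketch; the function name and the 192.0.2.x source addresses are made up, and only the destination addresses are taken from the sample lines.

```python
from collections import defaultdict

# Made-up (srcaddr, dstaddr) rows; the sources are hypothetical.
pairs = [
    ("192.0.2.1", "239.255.255.250"),
    ("192.0.2.1", "10.5.5.11"),
    ("192.0.2.2", "224.0.0.251"),
    ("192.0.2.2", "224.0.0.22"),
    ("192.0.2.1", "10.5.5.11"),  # repeated destination, kept only once
]


def build_transactions(pairs):
    """One line per source: its distinct destinations, space separated."""
    destinations = defaultdict(set)
    for src, dst in pairs:
        destinations[src].add(dst)
    return [" ".join(sorted(dsts)) for _, dsts in sorted(destinations.items())]


transactions = build_transactions(pairs)
for line in transactions:
    print(line)
```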

Enter Mahout

mahout fpg -i ".$input." -o ".$output." -k 50 -s 2 -method mapreduce -regex ['\ ']

Then process the output with PHP and put it into a Bootstrap table

It looks roughly like this
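To give a feel for what comes out of this step, here is a brute-force Python sketch that counts which pairs of destination IPs appear together in at least two baskets. It is deliberately not FP-Growth, the algorithm behind mahout fpg; it only handles pairs and would not scale. The function name is hypothetical, the four sample lines are shortened from the data above, and min_support=2 assumes that -s 2 in the command is the minimum support.

```python
from collections import Counter
from itertools import combinations

# Shortened, made-up baskets in the same space-separated format as above.
transactions = [
    "10.5.5.11 239.255.255.250",
    "10.5.5.11 163.28.5.24 224.0.0.251",
    "224.0.0.22 224.0.0.251",
    "10.5.5.11 224.0.0.251 239.255.255.250",
]


def frequent_pairs(transactions, min_support=2):
    """Count co-occurring destination pairs; keep those seen often enough."""
    counts = Counter()
    for line in transactions:
        items = sorted(set(line.split(" ")))
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


patterns = frequent_pairs(transactions)
for pair, support in patterns.items():
    print(pair, support)
```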

What else can we do: NTOP

• We can also build a traditional NTOP-style view

• Compare it with the frequent patterns

• Understand more about the state of the network

Putting it together

Just combine everything

Every 10 minutes

data -> Data description + clustering + Frequent pattern -> 10 minute information 1, 10 minute information 2, 10 minute information 3, ..., 10 minute information n

Add a Bootstrap front end and it looks like this

Download

http://flowdm.openfoundry.org

Caveats

Since this is only a demonstration of a concept, I can guarantee three things:

• No security protection of any kind (remember to set up your firewall)

• The code is a bit dumb (some operations will be slow)

• You will probably run into bugs (it has not been tested for long enough)

Conclusion

The goal is to spark the interest of experts in developing more powerful network traffic analysis tools

Finally

Thank you, everyone

If you have any ideas, suggestions for improvement, or questions

feel free to write to me

http://systw.net