User Entity Behavior Analysis for Cyber Security
Dr. Chin-Hao, Eric, Mao ([email protected]), Institute for Information Industry
2016.09.13
About me
• Section Manager, Cyber Trust Technology Institute, Institute for Information Industry
• Education and experience
– Ph.D., Computer Science, National Taiwan University of Science and Technology
– Visiting scholar, CyLab, Carnegie Mellon University
– Presentations: FIRST 2013/2014, CSA-APAC 2013, APCERT 2014, (ISC)2 APAC Congress 2016
• Research experience
– Big data analytics, network security, intrusion detection, mobile app static analysis
– More than 20 academic journal and conference papers
Outline
• What “Data Analytics in Information Security” does
• Supervised Learning
– Decision Tree
– iF2
• Unsupervised Learning
– GMiner
• Anomaly Detection & Graph Mining
– ChainSpot (Sequential)
– PlayWright (Graph)
• Recommendation System & Data Visualization
– CloudOrion
• Conclusions
What “Data Analytics in Information Security” does
• Log Analysis
– IDS Logs
– Firewall Logs
– Web Server Access Logs
– Proxy Logs
– Active Directory Logs
• Social Media
– Twitter, FB, …
Intrusion Detection System (IDS)
• Host IDS (HIDS) vs. Network IDS (NIDS)
• Detects known attack signatures
– Port, attack vector, or payload
– Depends heavily on manual fine-tuning
– Prone to false alarms
• Logs usually contain the following information:
– Event Name, Source IP, Source Port, Dest. IP, Dest. Port, Timestamp
• A well-studied domain:
– False alarm reduction
– Multi-step attack correlation
• Applied scenarios: SOC (security operations center), CERT, ISAC, ...
Firewall Logs
• Firewall: blocks connections or transmissions based on a known blacklist or pre-defined policy
• Easy to deploy and use
• Lacks analysis ability when facing unseen attack signatures (e.g., IP, domain name)
• Logs contain the following information:
– Source IP, Source Port, Dest. IP, Dest. Port, Permit/Deny
• Also well studied for large-scale connection correlation
– Honeypot attack pattern analysis
– Botnet structure analysis
Web Server Access Logs
• Application-layer malicious behavior detection
• Web server access logs contain:
– Source IP, Accessed Path, Access Status, Timestamp, HTTP header (optional), File Size (optional)
• There is usually a tradeoff between log visibility and server execution performance
• Only application-layer malicious behavior, such as scripts, can be detected (the entry point into system-level threats)
• Research emerged around 2005:
– Analyze the accessed path to detect SQL injection
– Graph mining on the association graph of web accesses
– Build a classifier to predict the risk of a given request
Users’ (visitors’) behavior!!
Proxy Logs
• A proxy is an agent deployed on the gateway of an enterprise network
• It records web surfing behavior
• Why “web surfing behavior” is important:
– URL connections are used for C&C server communication and control
– Detecting phishing websites
– Detecting malicious scanning attacks
• Logs contain:
– Source IP, Dest. URL, Timestamp, Web Agent Info., MIME, …
• Research emerged around 2010:
– Analyze malware behavior inside the enterprise network
– Correlate other logs (IDS, FW, AD logs, etc.) to evaluate the risk of an account
User’s behavior!!
Active Directory Logs
• The operating system records connections or events, between client side and server side, of registered accounts
• Logs contain:
– Timestamp, Account, Source IP, Event Name, Sub-Event Annotation, …
• Can be used to detect hidden malicious behavior:
– Anomalous login/logout behavior
– Anomalous resource allocation
– Anomalous permission granting
• AD is now widely used for event tracking; as a result, AD data has become large in scale
– Companies in industry have focused on AD-related research in the last 2-3 years
User’s behavior!!
A Decision Tree Example – The Weather Data
ID code  Outlook   Temperature  Humidity  Windy  Play
a        Sunny     Hot          High      False  No
b        Sunny     Hot          High      True   No
c        Overcast  Hot          High      False  Yes
d        Rainy     Mild         High      False  Yes
e        Rainy     Cool         Normal    False  Yes
f        Rainy     Cool         Normal    True   No
g        Overcast  Cool         Normal    True   Yes
h        Sunny     Mild         High      False  No
i        Sunny     Cool         Normal    False  Yes
j        Rainy     Mild         Normal    False  Yes
k        Sunny     Mild         Normal    True   Yes
l        Overcast  Mild         High      True   Yes
m        Overcast  Hot          Normal    False  Yes
n        Rainy     Mild         High      True   No
A Decision Tree Example
Outlook
– sunny: Humidity
  • high: no
  • normal: yes
– overcast: yes
– rainy: Windy
  • false: yes
  • true: no
Decision tree for the weather data.
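The slides do not say which algorithm produced this tree. As a minimal sketch, assuming an ID3-style information-gain criterion (an assumption, not something the deck states), the following pure-Python snippet shows why Outlook is a natural root for the 14-row weather table. The names `ROWS`, `entropy`, and `information_gain` are illustrative.

```python
from collections import Counter
from math import log2

# Weather data rows: (Outlook, Temperature, Humidity, Windy, Play)
ROWS = [
    ("Sunny", "Hot", "High", False, "No"),
    ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Rainy", "Mild", "High", True, "No"),
]
ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, column):
    """Entropy reduction obtained by splitting the rows on one attribute column."""
    labels = [row[-1] for row in rows]
    remainder = 0.0
    for value in {row[column] for row in rows}:
        subset = [row[-1] for row in rows if row[column] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
root = max(gains, key=gains.get)  # attribute with the largest gain becomes the root
```

Under this criterion the root is the attribute with the highest gain; repeating the same computation on each branch's subset would grow the rest of the tree.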
iF2: An Interpretable Fuzzy Rule Filter for Web Log Post-Compromised Malicious Activity Monitoring
Chih-Hung Hsieh, Yu-Siang Shen, Chao-Wen Li, and Jain-Shing Wu
AsiaJCIS 2015
Motivation
• Traditionally, log tracking depends on heavy human effort
• Common methods for identifying APT attacks become less useful because of various camouflages: VPN, packing
• That means you need lots of time, domain experience, and a great deal of good luck!
Interpretable Fuzzy Rule Filter (iF2)
• In fuzzy logic, a fuzzy rule base consists of multiple fuzzy rules with the following structure
Exp. 3 – Using iF2 to Analyze Web Access Log
• A real web access log dataset from an organization in Taiwan was collected to demonstrate the performance of iF2.
• In total, 5.36 gigabytes and 45,669,917 records from 545 distinct internet addresses were analyzed and labeled.
• Only 22 internet addresses are suspicious and 523 are benign, so this is a typical, extremely unbalanced dataset.
Exp. – Resulting Rules
• The human-interpretable rule base of iF2 that differentiates suspicious internet addresses from benign ones among the 545 IPs.
Motivation
• The target of an APT is usually specific and personal
– Attackers usually invade an enterprise and access sensitive data by using compromised accounts.
• We focus on detecting anomalous behavior or behavioral deviation for each employee (account) in an enterprise.
ChainSpot - Method
• We are concerned with the anomalous behaviors extracted from Active Directory (Kerberos or NTLM authentication) and Proxy Server (web surfing usage) logs for each account.
• Sequential Data Synchronizer
– Correlates heterogeneous data (AD and Proxy) based on IP address and the time interval (3 days) of SOC tickets
[Figure: timeline of the Proxy and AD log sequences, with a 3-day interval marked at SOC Ticket i]
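The synchronizer step can be sketched as a simple join on IP address and time. This is a minimal illustration, not the authors' implementation: the record fields (`ip`, `time`, `event`, `url`) are hypothetical, and the code assumes the 3-day interval is the window immediately before the ticket time, which the slide does not specify.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=3)  # interval taken from the slide


def synchronize(ticket, ad_logs, proxy_logs, window=WINDOW):
    """Collect time-ordered AD and proxy events for the ticket's IP inside the window."""
    start = ticket["time"] - window  # assumption: window precedes the ticket

    def select(logs):
        matched = [
            entry for entry in logs
            if entry["ip"] == ticket["ip"] and start <= entry["time"] <= ticket["time"]
        ]
        return sorted(matched, key=lambda entry: entry["time"])

    return {"ad": select(ad_logs), "proxy": select(proxy_logs)}


# Illustrative records only (documentation IP range, made-up times).
ticket = {"ip": "192.0.2.10", "time": datetime(2015, 8, 20, 12, 0)}
ad_logs = [
    {"ip": "192.0.2.10", "time": datetime(2015, 8, 19, 9, 0), "event": "4624"},
    {"ip": "192.0.2.10", "time": datetime(2015, 8, 10, 9, 0), "event": "4634"},  # outside window
    {"ip": "192.0.2.99", "time": datetime(2015, 8, 19, 9, 5), "event": "4624"},  # other IP
]
proxy_logs = [
    {"ip": "192.0.2.10", "time": datetime(2015, 8, 18, 15, 0), "url": "example.com"},
]
merged = synchronize(ticket, ad_logs, proxy_logs)
```

The result holds one AD sequence and one proxy sequence per ticket, which is the shape a per-account sequential model would consume.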
ChainSpot - Method
• ChainSpot models sequential behaviors as a probabilistic baseline model; any change in an employee’s behavior results in an anomaly, which may imply:
– The account has been compromised
– An APT exists on the employee’s host
• Goal: properly model the sequential behaviors of each account as a probabilistic model, and detect anomalies based on the deviation of the account’s behaviors.
– Markov model for building the probabilistic model
– Graph edit distance for estimating behavioral deviation
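As a rough illustration of measuring behavioral deviation, the sketch below compares two transition graphs by counting the nodes and edges that appear in only one of them. This is a simplified stand-in for graph edit distance (unit costs, nodes identified by their labels), not necessarily the distance ChainSpot uses; the two event sequences are made up for illustration, using event IDs that appear in the deck's AD case study.

```python
def transition_graph(sequence):
    """Build (nodes, edges) where each edge is an observed state-to-state transition."""
    nodes = set(sequence)
    edges = set(zip(sequence, sequence[1:]))
    return nodes, edges


def edit_distance(graph_a, graph_b):
    """Count node and edge insertions/deletions needed to turn one graph into the other."""
    nodes_a, edges_a = graph_a
    nodes_b, edges_b = graph_b
    return len(nodes_a ^ nodes_b) + len(edges_a ^ edges_b)


# Hypothetical sequences: a complete service path vs. repeated bad-password failures.
baseline = transition_graph(["4768", "4661", "4662", "4658"])
observed = transition_graph(["4768", "4771 0x18", "4768"])
deviation = edit_distance(baseline, observed)
```

A larger value means the observed behavior shares fewer states and transitions with the baseline; a threshold on this value would be one simple way to flag an anomaly.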
ChainSpot Method – Markov Model
[Figure: an example of a Markov model and its transition probability matrix]
For each account, we build a Markov model from the account's normal action sequences.
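A first-order Markov model of this kind can be estimated by counting transitions between consecutive states. The sketch below is a minimal example under that assumption; the two "normal" sequences are hypothetical, and `transition_probabilities` is an illustrative name rather than part of ChainSpot.

```python
from collections import Counter, defaultdict


def transition_probabilities(sequences):
    """Estimate P(next state | current state) from a list of state sequences."""
    counts = defaultdict(Counter)
    for sequence in sequences:
        for current, following in zip(sequence, sequence[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(counter.values()) for nxt, n in counter.items()}
        for state, counter in counts.items()
    }


# Hypothetical normal AD event sequences for one account.
normal_sequences = [
    ["4624", "4661", "4662", "4658", "4634"],
    ["4624", "4661", "4658", "4634"],
]
model = transition_probabilities(normal_sequences)
```

Each row of the resulting nested dictionary corresponds to one row of a transition probability matrix, and the probabilities in a row sum to 1.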
Heterogeneous Data Correlation
• For each account, we build a personal profile of state sequences that describes the account’s sequential behaviors.
• Each state in a sequence consists of the following:
– For an AD log sequence
• Event (e.g., No. 4624, 4634)
• Reply code (only 4771 -> 0x12, 0x18)
– For a Proxy log sequence
• A meta behavior describing web surfing (e.g., GET and Download on microsoft.com, then Failed), composed of:
  – HTTP method (e.g., GET, POST, …)
  – Download or Upload: Download when size in > size out; Upload when size in < size out
  – Domain name: second-level domain
  – Access result (e.g., Allowed, Failed, …)
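The proxy meta behavior can be sketched as a small function that turns one proxy record into a state tuple. The field names (`method`, `size_in`, `size_out`, `host`, `result`) are hypothetical, the second-level-domain extraction is deliberately naive, and the label for equal sizes is a placeholder because the slide does not define that case.

```python
def second_level_domain(host):
    """Naive second-level domain: last two labels (ignores suffixes such as .com.tw)."""
    parts = host.lower().rstrip(".").split(".")
    return ".".join(parts[-2:])


def meta_behavior(record):
    """Map one proxy log record to (method, direction, domain, result)."""
    if record["size_in"] > record["size_out"]:
        direction = "Download"
    elif record["size_in"] < record["size_out"]:
        direction = "Upload"
    else:
        direction = "Equal"  # not specified on the slide; placeholder label
    return (
        record["method"],
        direction,
        second_level_domain(record["host"]),
        record["result"],
    )


# Mirrors the slide's example: GET and Download on microsoft.com, then Failed.
state = meta_behavior({
    "method": "GET",
    "size_in": 5000,
    "size_out": 300,
    "host": "download.microsoft.com",
    "result": "Failed",
})
```

Treating each distinct tuple as one state keeps the proxy sequence in the same form as the AD event sequence, so both can feed the same sequential model.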
Experiment in a Real Environment
• Environment & dataset description:
– Contains about 1,089 accounts.
– Duration: 2015/08/01 to 2015/08/31.
– The Active Directory domain service of Windows Server 2008 R2 produced 27,902,857 logs.
– The proxy service generated 78,044,332 logs over the same period.
• Dataset
– Number of categories in SOC tickets: 6
Experiments (Cont’d)
• Evaluation
– We divide the August logs of each account into three partitions:
• Training data
– A half of the unlabeled data mapped by SOC tickets
• Abnormal Testing data
– Labeled data mapped by SOC tickets
• Normal Testing data
– Selected from all unlabeled data except the Training data
– Compare how the Training data differs from the Abnormal Testing data and from the Normal Testing data
– Experiment hypothesis
Effectiveness of ChainSpot
Hypothesis:
The general effectiveness measure gives success rates of 85.66% and 87.17% when averaging over the various types and over the different accounts, respectively.
AD Case Study - d110382
Multiple accounts log in from a single IP in a short time
[Figure: Markov models for the Training, Normal Testing, and Abnormal Testing data]
• Abnormal data has 4768 -> 4771 0x18 and 4771 0x18 -> 4768
– A Kerberos authentication ticket (TGT) was requested (4768)
– Kerberos pre-authentication failed (bad password) (4771 0x18)
• Abnormal data has no 4661 -> 4662 or 4662 -> 4658
– A handle to an object was requested (4661)
– An operation was performed on an object (4662)
– The handle to an object was closed (4658)
• There is no complete service path in the abnormal data, and it contains authentication errors caused by bad passwords
AD Case Study - administrator
Too many logins in closing time
[Figure: Markov models for the Training, Normal Testing, and Abnormal Testing data, annotated with two groups of events: those that both abnormal and normal data have, and those that abnormal data does not have]
Event labels shown in the figure:
– An account failed to log on (4625)
– Special privileges assigned to new logon (4672)
– The domain controller attempted to validate the credentials for an account (4776)
– The handle to an object was closed (4658)
– A handle to an object was requested (4661)
– A new process has been created (4688)
– A Kerberos authentication ticket (TGT) was requested (4768)
– Kerberos pre-authentication failed (bad password) (4771 0x18)
– A user account was locked out (4740)
The Anomaly Logs Filter Module
• Intention 1
– Web log volume is too large to inspect with the naked eye
• Input
– Raw logs from web access records
• Output
– Filtered logs
• Example
– Illegal Characters Verify Module
– Unusual Parameter Usage Verify
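An illegal-character check of this kind can be sketched as a regular-expression filter over URL-decoded log lines. The character set below is hypothetical (the slides do not list which characters count as illegal), and the log lines are invented examples using documentation IP addresses.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern: quote/angle/semicolon characters, SQL comment marker, path traversal.
ILLEGAL = re.compile(r"""[<>'";]|--|\.\./""")


def filter_logs(lines):
    """Keep only the log lines whose decoded content contains an illegal pattern."""
    return [line for line in lines if ILLEGAL.search(unquote(line))]


raw_logs = [
    "192.0.2.1 GET /index.asp?id=1 200",
    "192.0.2.2 GET /search.asp?q=1%27%20OR%201=1-- 200",
    "192.0.2.3 GET /view.asp?f=../../etc/passwd 200",
]
filtered = filter_logs(raw_logs)
```

Decoding before matching matters because attackers often percent-encode characters such as the single quote (`%27`) to slip past naive string checks.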
The Attack Scene Matching Module
• Intention 2
– There may be more than one candidate set of attack Tactics, Techniques, and Procedures (TTPs).
• Input
– Kinds of filtered logs
• Output
– Attack scene with associated filtered logs
The Graph Constructor Module
• Intention 3
– Concretize the entities and associated relations of suspicious activities.
• Input
– Each attack scene with associated filtered logs
• Output
– Suspicious Activity Graph
• Heterogeneous graph
• Bipartite graph
• Homogeneous graph
[Figure: example graph with IP nodes 168.144.85.166, 59.124.20.38, and 118.140.69.162 and file nodes 1.asp, 2.asp, personal_taglist.asp, cisc_search.asp, and hack.asp; caption: Hacker maintains 3 IPs]
The Graph Pattern Matching Module
• Intention 4
– Find trigger points (or threat seeds) where an attacker attempts to invade the target
• Input
– Suspicious Activity Graph
• Heterogeneous graph
• Bipartite graph
• Homogeneous graph
• Output
– Entities labeled as suspects
• Case 1: IIS Logs – System Compromise
• Case 2: Apache Logs – SQL Injection for Data Leakage
• Case 3: Apache Logs – System Compromise
Suspicious Activity Graph Construction - IIS
[Figure: Suspicious Activity Graph; legend: File Name, IP Address, Status Code, Illegal Character, Hacker’s Action (File), Hacker’s Action (IP Address)]
• Scope of Attacking
– Each IP correlates with multiple files containing illegal characters.
Initial Threat Seed Finding: with prior knowledge about the threat pattern
• Our forensics team gives us prior knowledge to form the following patterns:
[Figure: initial seed finding with respect to a known pattern]
PageRank Algorithm
• How to rank the nodes in a graph based on how they refer to each other
• Originally used to rank the pages retrieved by Google
[Figure: example graph with nodes A, B, and C]
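For readers unfamiliar with the algorithm, here is a minimal power-iteration sketch of PageRank in pure Python. The damping factor of 0.85 is the commonly used default, not a value from the slides, and the three-node link structure is an assumption chosen for illustration (the edges in the slide's figure are not recoverable).

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict mapping node -> list of linked nodes."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / len(nodes)
        rank = new_rank
    return rank


# Assumed example: A and B both refer to C, and C refers back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
scores = pagerank(links)
```

Nodes that many others refer to accumulate score, which is why the same idea can surface the files or IPs at the center of suspicious activity in a graph.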
Is the IP the hacker used different from the others in the Suspicious Activity Graph?! - IIS
• Problem:
– Investigate the role of each IP address in the Suspicious Activity Graph
• Input:
– |IPs| x |IPs| matrix
• n_ij: number of times the same file was queried by IP_i and IP_j
• 0: the IPs queried different files
• Output:
– Clustering result
IP X IP matrix:
       IP1   IP2   …   IPN
IP1    n11   0     …   n1N
IP2    n21   0     …   0
…
IPN    0     0     …   nNN
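One plausible way to build such an IP-by-IP matrix is to count, for each pair of IPs, how many files both of them queried. This is a sketch of that reading, not the authors' exact definition of n_ij; the records are invented, with documentation IP addresses and file names borrowed from the deck.

```python
from collections import defaultdict


def co_access_matrix(records):
    """Return (ips, matrix) where matrix[i][j] counts files queried by both IP i and IP j."""
    files_by_ip = defaultdict(set)
    for ip, filename in records:
        files_by_ip[ip].add(filename)
    ips = sorted(files_by_ip)
    matrix = [
        [len(files_by_ip[a] & files_by_ip[b]) for b in ips]
        for a in ips
    ]
    return ips, matrix


# Hypothetical (IP, file) pairs extracted from filtered logs.
records = [
    ("192.0.2.1", "1.asp"),
    ("192.0.2.1", "hack.asp"),
    ("192.0.2.2", "hack.asp"),
    ("192.0.2.2", "2.asp"),
    ("192.0.2.3", "index.asp"),
]
ips, matrix = co_access_matrix(records)
```

IPs that share no files get a zero entry, so groups of IPs operated by the same actor tend to form dense blocks that a clustering or decomposition step can pick out.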
[Figure: two bar charts of per-IP scores for the IPs in the Suspicious Activity Graph, one for PageRank (PR) and one for Belief Propagation (BP). Ni: the score of the i-th IP; Nmax: the maximum score. Blue: true positive; green: suspicious.]
Evidence IPs labeled in one chart (never miss): 118.140.69.162, 140.92.10.60, 140.92.53.48, 168.144.85.166, 59.124.20.38
Evidence IPs labeled in the other chart: 118.140.69.162, 140.92.10.60, 59.124.20.38 (miss: 168.144.85.166)
Initial Threat Seed Finding: with prior knowledge about the threat pattern
Initial Seed Finding without a Known Pattern
• What about the case where you don’t have any prior knowledge?
• Effect of SVD
Is the IP the hacker used different from the others in the Suspicious Activity Graph?! (Cont’d)
[Figure: SVD scatter plots over components c0, c1, and c2, separating benign from malicious IPs; labeled IPs: 168.144.85.166, 118.140.69.162, 59.124.20.38]
Limitation - SVD
• Hackers usually use different IPs to achieve their goals
– Test the Trojan (China Chopper)
• Weak-link IP
– Scan the architecture of the website
• Outlier (decomposition) & pattern (PageRank)
[Figure: graph with IP1, China Chopper, and IP2, where IP2 is a weak link; legend: File Name, IP Address, Hacker’s Action (IP Address)]
Weak link: this kind of IP is not an outlier and is easily ignored by SVD
Limitation - SVD
• How to resolve the weak-link problem?!
– Use a graph mining algorithm to propagate the threat score from IP1 (seed) to IP2
– Random Walk with Restart (RWR)
• Threat seed as the start point
[Figure: graph with IP1, China Chopper, and IP2 (weak link); legend: File Name, IP Address, Hacker’s Action (IP Address)]
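The propagation idea can be sketched with a small Random Walk with Restart over an undirected IP-file graph. This is an illustrative implementation, not the authors' code: the restart probability of 0.15 is an assumed value, and the node names (including the file name `chopper.asp`) are hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.15, iterations=200):
    """Iteratively compute RWR scores; the walker jumps back to a seed with prob. `restart`."""
    nodes = sorted(adjacency)
    restart_vector = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(restart_vector)
    for _ in range(iterations):
        new_score = {n: restart * restart_vector[n] for n in nodes}
        for node in nodes:
            neighbors = adjacency[node]
            if not neighbors:
                continue
            share = (1.0 - restart) * score[node] / len(neighbors)
            for neighbor in neighbors:
                new_score[neighbor] += share
        score = new_score
    return score


# Hypothetical graph: IP1 (seed) and IP2 touch the same file; IP3 is unrelated.
adjacency = {
    "IP1": ["chopper.asp"],
    "chopper.asp": ["IP1", "IP2"],
    "IP2": ["chopper.asp"],
    "IP3": ["index.asp"],
    "index.asp": ["IP3"],
}
scores = random_walk_with_restart(adjacency, seeds={"IP1"})
```

Because the walk keeps returning to the seed, score concentrates on nodes close to it: the weak-link IP2 receives a positive score through the shared file, while the unrelated IP3 stays at zero.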
Random walk with restart
How to propagate confidence once you have a suspect candidate
[Figure: example graph with nodes 1 to 12]
Threat Seed Propagation – Given Data- or Knowledge-Driven Threat Seeds
[Figure: bar chart of per-IP RWR scores. Ni: the RWR score of the i-th IP; Nmax: the maximum PR score. Evidence IPs: never miss. Blue: true positive; green: suspicious.]
IPs labeled in the chart: 59.124.20.38, 106.185.46.126, 140.92.77.78, 125.227.251.25, 140.92.10.60, 66.249.77.107, 140.92.18.143, 118.140.69.162, 140.92.16.59, 1.162.96.56, 89.145.95.43, 140.92.253.6, 168.144.85.166, 89.145.95.41, 140.92.142.60, 61.65.12.50
Threat Seed Propagation – Given Data- or Knowledge-Driven Threat Seeds
[Figure: bar chart of per-IP RWR scores. Ni: the RWR score of the i-th IP; Nmax: the maximum PR score. Evidence IPs: never miss. Blue: true positive; green: suspicious.]
IPs labeled in the chart: 59.124.20.38, 1.162.96.56, 118.140.69.162, 168.144.85.166, 66.249.77.107, 106.185.46.126
• Case 1: IIS Logs – System Compromise
• Case 2: Apache Logs – SQL Injection for Data Leakage
• Case 3: Apache Logs – System Compromise
Suspicious Activity Graph Construction - Apache
• Case 2: Frequent SQL Injection
1. The previous pattern can’t be used because ./ is the index page (e.g., index.php, index.html, and so on)
2. Need to consider the attribute: Illegal Character
[Figure: Suspicious Activity Graph; legend: File Name, IP Address, Illegal Character, Hacker’s Action (File), Hacker’s Action (IP Address), Hacker’s Illegal Character]
Initial Threat Seed Finding: without prior knowledge
• Problem:
– Investigate the role of each IP address in the Suspicious Activity Graph
– Outliers may be candidates for initial threat seeds
• Input:
– |IPs| x |(File_m, IllegalChar_n)| matrix
• n_ij: number of queries of the same (File_m, IllegalChar_n) by IP_i and IP_j
• 0: never queried (File_m, IllegalChar_n)
• Output:
– Clustering result
IP X (File_m, IllegalChar_n) matrix:
       (File_m, IllegalChar_n)_1   …   (File_m, IllegalChar_n)_mXn
IP1    n11   0   …   nmXnX1
IP2    n21   0   …   0
…
IPN    0     0   …   nmXnXN
Initial Threat Seed Finding: with prior knowledge about the threat pattern
• Our forensics team gives us prior knowledge to form the following patterns:
Initial Threat Seed Finding: with prior knowledge about the threat pattern
[Figure: bar chart of per-IP PageRank scores. Ni: the PR score of the i-th IP; Nmax: the maximum PR score. Blue: true positive; green: suspicious.]
IPs labeled in the chart: 83.168.250.50, 39.12.198.8, 110.30.221.76, 91.194.84.115
Evidence IPs: miss 120.236.174.19
Initial Threat Seed Finding: with prior knowledge about the threat pattern
[Figure: bar chart of per-IP PageRank scores. Ni: the PR score of the i-th IP; Nmax: the maximum PR score. Blue: true positive; green: suspicious.]
Evidence IPs (never miss): 120.236.174.190, 91.194.84.115, 83.168.250.50
Initial Threat Seed Finding: without prior knowledge
[Figure: tensor decomposition scatter plots over components c0, c1, and c2, separating benign from malicious IPs; labeled IPs: 83.168.250.50, 120.236.174.190, 91.194.84.115; annotation: “We change same # Query of 83.168.250.50 with others”]
• 83.168.250.50 only queries ./
• Its number of queries is similar to normal
Threat Seed Propagation – Given Data- or Knowledge-Driven Threat Seeds
• Use 120.236.174.190 and 91.194.84.115 (the outliers in the tensor results) as start points for Random Walk with Restart (RWR)
• We expect that 83.168.250.50 (in the unknown data) can be detected by RWR
• Input:
– |IPs| x |IPs| matrix
• n_ij: number of queries of the same (File_m, IllegalChar_n) by IP_i and IP_j
• 0: never queried (File_m, IllegalChar_n)
IP X IP matrix:
       IP1   IP2   …   IPN
IP1    n11   0     …   n1N
IP2    n21   0     …   0
…
IPN    0     0     …   nNN
[Figure: scatter plot over components c1 and c2 with 83.168.250.50, 120.236.174.190, and 91.194.84.115 labeled; region marked “Unknown Data”]
Threat Seed Propagation – Given Data- or Knowledge-Driven Threat Seeds
[Figure: two bar charts of per-IP RWR scores. Ni: the RWR score of the i-th IP; Nmax: the maximum PR score.]
• Case 1: IIS Logs – System Compromise
• Case 2: Apache Logs – SQL Injection for Data Leakage
• Case 3: Apache Logs – System Compromise
Case 3: Suspicious Activity Graph Construction - Apache
[Figure: Suspicious Activity Graph; legend: File Name, IP Address, Status Code, Illegal Character, Hacker’s Action (File), Hacker’s Action (IP Address)]
• Scope of Attacking
– Each IP correlates with multiple files containing illegal characters.
Case 3: Is the role of the IP the hacker used different from the others in the Suspicious Activity Graph?! - Apache
• Problem:
– Investigate the role of each IP address in the Suspicious Activity Graph
• Input:
– |IPs| x |IPs| matrix
• n_ij: number of times the same file was queried by IP_i and IP_j
• 0: the IPs queried different files
• Output:
– Clustering result
IP X IP matrix:
       IP1   IP2   …   IPN
IP1    n11   0     …   n1N
IP2    n21   0     …   0
…
IPN    0     0     …   nNN
Case 3: Is the role of the IP the hacker used different from the others in the Suspicious Activity Graph?! - Apache (Cont’d)
[Figure: SVD scatter plots over components c0, c1, and c2, separating benign from malicious IPs; labeled IPs: 61.209.216.6, 223.75.11.56, 220.133.111.221; 192.225.226.197 and 192.225.226.196 appear next to the Benign label]
Case 3: Why use propagation to evaluate suspicious IPs?! - Apache
• The correlation between files and IPs can’t be evaluated by SVD decomposition
• We note that the evidence has the following patterns:
[Figure: pattern graph with one IP node connected to multiple File nodes]
Case 3: Why use propagation to evaluate suspicious IPs?! - Apache (Cont’d)
• Design a directed homogeneous graph to illustrate the patterns we found above
• With this pattern, we expect to obtain high-score IPs from the graph when using a propagation algorithm
– PageRank
[Figure: pattern graph with one IP node connected to multiple File nodes]
Case 3: Why use propagation to evaluate suspicious IPs?! (Cont’d) – High-Score IPs
[Figure: bar chart of per-IP PageRank scores. Ni: the PR score of the i-th IP; Nmax: the maximum PR score. Blue: true positive; green: suspicious.]
Evidence IPs: 223.75.11.56, 220.133.111.221, 61.209.216.6 (miss: 192.225.226.197, 192.225.226.196)
CloudOrion - Google Drive Logs
• Analyzing cloud user behavior
[Figure: access logs from Home, Mobile, and Office; Scenario Policy set by the Administrator]
Visualization Threats Explorer
• Intuitive awareness of potential threats in the cloud
• Drill-down and quick-response visual analytics
– Full and deep views of the relationship between cloud resources and user behavior
[Figure: User and Entity Behavior Analysis over the user access log, with a Danger Zone highlighted]
CTA UEBA
• Collaborative Filtering
• Temporal Analysis (On Duty, Off Duty, Malicious Behavior)
• Spatial Analysis
• Deep Learning
• Time-Series Analysis
Using machine learning and artificial intelligence to analyze cloud logs via API sets CTA in the cloud apart from state-of-the-art cloud security solutions.
Scenario Policy Enforcement
• Manage Google Drive and OAuth tokens
• Decrease the risk of data leakage via Scenario Policy deployment
Super Travel Policy
– 2016-07-01 18:10:23 GMT: Modify a file in Drive from a Taiwan IP
– 2016-07-01 18:20:24 GMT: Login from a US IP
Detecting account compromise with 10+ scenario policies
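A scenario policy like the Super Travel example can be sketched as a rule over consecutive events of one account: flag any pair of events from different countries that are closer in time than travel would allow. The one-hour threshold and the event fields (`time`, `country`, `action`) are assumptions for illustration; the slides do not give the actual policy parameters.

```python
from datetime import datetime, timedelta


def impossible_travel(events, min_gap=timedelta(hours=1)):
    """Return consecutive event pairs from different countries closer than `min_gap`."""
    ordered = sorted(events, key=lambda event: event["time"])
    alerts = []
    for previous, current in zip(ordered, ordered[1:]):
        different_place = previous["country"] != current["country"]
        too_fast = current["time"] - previous["time"] < min_gap
        if different_place and too_fast:
            alerts.append((previous, current))
    return alerts


# The two events from the slide, for a single (hypothetical) account.
events = [
    {"time": datetime(2016, 7, 1, 18, 10, 23), "country": "TW", "action": "modify file in Drive"},
    {"time": datetime(2016, 7, 1, 18, 20, 24), "country": "US", "action": "login"},
]
alerts = impossible_travel(events)
```

Here the Taiwan and US events are about ten minutes apart, so the pair is flagged; a production rule would likely derive the threshold from geographic distance rather than a fixed gap.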
Abnormal Behavior Detection
• Construct a user behavior model over files in Drive
– User and Entity Behavior Analysis (UEBA)
• Quickly identify potential threats
[Figure: pipeline from the user access log: Drive file proximity score measurement; structure-oriented risk propagation; reversed collaborative filtering from file access behavior, using the users' past access behavior matrix and the user's recent access behavior vector (edit, view, and other behavior)]
Detecting insider suspects with high-risk behavior
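To make the matrix-versus-vector comparison concrete, the sketch below scores a user's recent access vector against a profile summed from that user's past access matrix, using cosine similarity. This is a simplified stand-in, not the reversed collaborative filtering the slide names; the matrix layout (rows are past days, columns are files) and all values are hypothetical.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def anomaly_score(past_matrix, recent_vector):
    """1 minus the cosine similarity between the summed past profile and recent accesses."""
    profile = [sum(column) for column in zip(*past_matrix)]
    return 1.0 - cosine(profile, recent_vector)


# Hypothetical access counts: rows are past days, columns are four Drive files.
past = [
    [3, 2, 0, 0],
    [4, 1, 0, 0],
]
usual_day = [2, 1, 0, 0]    # touches the same files as before
unusual_day = [0, 0, 5, 3]  # touches only files never accessed before

usual_score = anomaly_score(past, usual_day)
unusual_score = anomaly_score(past, unusual_day)
```

A score near 0 means recent activity matches the historical profile, while a score of 1 means no overlap at all, which is the kind of record an analyst would want surfaced first.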
Case Study
• Inspected more than 1,200 user accounts with more than 400,000 event logs every day
• About 100,000 distinct user-file access records every day
• Maintain 1,200 incremental individual behavior models
• Automatically find anomalous access records in a humongous volume of logs
Conclusions
• Easy start 1-2-3
– Anaconda Python (https://www.continuum.io/downloads)
• iPython notebook!!
– Scikit-learn (http://scikit-learn.org/stable/)
– TensorFlow (https://www.tensorflow.org/)
– Bokeh (http://bokeh.pydata.org/en/latest/)
– NetworkX (https://networkx.github.io/)
– ElasticSearch (https://www.elastic.co/)
Conclusions
• Good tutorial
– Document Clustering with Python
– http://brandonrose.org/clustering
• Using machine learning to track IoT vulnerabilities and threats
– Unstructured data
– Structured data
– Combining them