Upload
trinhliem
View
216
Download
0
Embed Size (px)
Citation preview
Faloutsos 15826
1
CMU SCS
15-826: Multimedia Databases and Data Mining
Bonus Lecture : Anomaly detection in large graphs
Christos Faloutsos
CMU SCS
Repeat: BONUS LECTURE I.e., you may skip it, and still get 100/100 in the course
15-826 Copyright: C. Faloutsos (2016) 2
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs – Time-evolving graphs
• Supervised: BP • Conclusions
15-826 Copyright: C. Faloutsos (2016) 3
CMU SCS Fraud Detection: Graph Analysis
Problem
[www.buyfollowz.org]
[buymorelikes.com]
15-826 Copyright: C. Faloutsos (2016) 4
Faloutsos 15826
2
CMU SCS Fraud Detection: Graph Analysis
Problem
[buycheaplikes.com]
[reviewsteria.com]
15-826 Copyright: C. Faloutsos (2016) 5
CMU SCS
Other applications: • Cyber-security (DDoS attacks – distributed
denial of service) • Abnormal individuals / social groups
– dysfunctional departments in a company – Onset of depression/hijacking wrt a FB user – …
15-826 Copyright: C. Faloutsos (2016) 6
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs • Eigenspokes, and extensions • fbox
– Time-evolving graphs
• Supervised: BP • Conclusions
15-826 Copyright: C. Faloutsos (2016) 7
CMU SCS
EigenSpokes B. Aditya Prakash, Mukund Seshadri, Ashwin
Sridharan, Sridhar Machiraju and Christos Faloutsos: EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs, PAKDD 2010, Hyderabad, India, 21-24 June 2010.
15-826 Copyright: C. Faloutsos (2016) 8
Faloutsos 15826
3
CMU SCS
EigenSpokes • Eigenvectors of adjacency matrix
! equivalent to singular vectors (symmetric, undirected graph)
A = U�UT
�u1 �ui
N
N
details
15-826 Copyright: C. Faloutsos (2016) 9
CMU SCS
EigenSpokes • EE plot: • Scatter plot of
scores of u1 vs u2 • One would expect
– Many points @ origin
– A few scattered ~randomly u1
u2 90o
15-826 Copyright: C. Faloutsos (2016) 10
CMU SCS
Bipartite Communities!
magnified bipartite community
patents from same inventor(s)
`cut-and-paste’ bibliography!
15-826 Copyright: C. Faloutsos (2016) 11
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs • Eigenspokes, and extensions • fbox
– Time-evolving graphs
• Supervised: BP • Conclusions
15-826 Copyright: C. Faloutsos (2016) 12
Faloutsos 15826
4
CMU SCS
Inferring Strange Behavior from Connectivity Pattern in Social Networks
Meng Jiang, Peng Cui, Shiqiang Yang (Tsinghua, Beijing)
Alex Beutel, Christos Faloutsos (CMU)
CMU SCS
Lockstep and Spectral Subspace Plot
• Case #0: No lockstep behavior in random power law graph of 1M nodes, 3M edges
• Random “Scatter”
Adjacency Matrix Spectral Subspace Plot
15-826 Copyright: C. Faloutsos (2016) 14
CMU SCS
Lockstep and Spectral Subspace Plot
• Case #1: non-overlapping lockstep • “Blocks” “Rays”
Adjacency Matrix Spectral Subspace Plot
15-826 Copyright: C. Faloutsos (2016) 15
CMU SCS
Lockstep and Spectral Subspace Plot
• Case #2: non-overlapping lockstep • “Blocks; low density” Elongation
Adjacency Matrix Spectral Subspace Plot
15-826 Copyright: C. Faloutsos (2016) 16
Faloutsos 15826
5
CMU SCS
Lockstep and Spectral Subspace Plot
• Case #3: non-overlapping lockstep
Adjacency Matrix Spectral Subspace Plot
??
15-826 Copyright: C. Faloutsos (2016) 17
CMU SCS
Lockstep and Spectral Subspace Plot
• Case #3: non-overlapping lockstep • “Camouflage” (or “Fame”) Tilting
“Rays” Adjacency Matrix Spectral Subspace Plot
15-826 Copyright: C. Faloutsos (2016) 18
CMU SCS
Lockstep and Spectral Subspace Plot
• Case #4: ? lockstep • “?” “Pearls”
Adjacency Matrix Spectral Subspace Plot
?
15-826 Copyright: C. Faloutsos (2016) 19
CMU SCS
Lockstep and Spectral Subspace Plot
• Case #4: overlapping lockstep • “Staircase” “Pearls”
Adjacency Matrix Spectral Subspace Plot
15-826 Copyright: C. Faloutsos (2016) 20
Faloutsos 15826
6
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs • Eigenspokes, and extensions • fbox
– Time-evolving graphs
• Supervised: BP • Conclusions
15-826 Copyright: C. Faloutsos (2016) 21
CMU SCS
Issue: Flying below the radar • Plotting the top $k$ eigenspokes, will miss
small groups • Q: Can we spot them, too? • A: ‘fBox’
15-826 Copyright: C. Faloutsos (2016) 22
CMU SCS Problem: Social Network Link Fraud
Target: find “stealthy” attackers missed by other algorithms
Clique
Bipartite core
41.7M nodes 1.5B edges
15-826 Copyright: C. Faloutsos (2016) 23
CMU SCS Problem: Social Network Link Fraud
Neil Shah, Alex Beutel, Brian Gallagher and Christos Faloutsos. Spotting Suspicious Link Behavior with fBox: An Adversarial Perspective. ICDM 2014, Shenzhen, China.
Target: find “stealthy” attackers missed by other algorithms
Takeaway: use reconstruction error between true/latent representation!
15-826 Copyright: C. Faloutsos (2016) 24
Faloutsos 15826
7
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs • Eigenspokes, and extensions • fBox • Spot strange, individual nodes: oddBall
– Time-evolving graphs • Supervised: BP • Conclusions 15-826 Copyright: C. Faloutsos (2016) 25
CMU SCS
OddBall: Spotting Anomalies in Weighted Graphs
Leman Akoglu, Mary McGlohon, Christos Faloutsos
Carnegie Mellon University School of Computer Science
PAKDD 2010, Hyderabad, India
CMU SCS
Main idea For each node, • extract ‘ego-net’ (=1-step-away neighbors) • Extract features (#edges, total weight, etc
etc) • Compare with the rest of the population
15-826 Copyright: C. Faloutsos (2016) 27
CMU SCS What is an egonet?
ego egonet
15-826 Copyright: C. Faloutsos (2016) 28
Faloutsos 15826
8
CMU SCS
Selected Features ! Ni: number of neighbors (degree) of ego i ! Ei: number of edges in egonet i ! Wi: total weight of egonet i ! λw,i: principal eigenvalue of the weighted
adjacency matrix of egonet I
15-826 Copyright: C. Faloutsos (2016) 29
CMU SCS Near-Clique/Star
15-826 Copyright: C. Faloutsos (2016) 30
CMU SCS Near-Clique/Star
15-826 Copyright: C. Faloutsos (2016) 31
CMU SCS Near-Clique/Star
15-826 Copyright: C. Faloutsos (2016) 32
Faloutsos 15826
9
CMU SCS
Andrew Lewis (director)
Near-Clique/Star
15-826 Copyright: C. Faloutsos (2016) 33
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs – Time-evolving graphs
• ‘copyCatch’ in FaceBook • Patterns of IAT • [tensors]
• Supervised: BP • Conclusions 15-826 Copyright: C. Faloutsos (2016) 34
CMU SCS
Fraud • Given
– Who ‘likes’ what page, and when
• Find – Suspicious users and suspicious
products
CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks, Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, Christos Faloutsos WWW, 2013.
15-826 Copyright: C. Faloutsos (2016) 35
CMU SCS
Fraud • Given
– Who ‘likes’ what page, and when
• Find – Suspicious users and suspicious
products
CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks, Alex Beutel, Wanhong Xu, Venkatesan Guruswami, Christopher Palow, Christos Faloutsos WWW, 2013.
Users Pages
A
B
C
D
E
1
2
3
4
40 Z
Likes
15-826 Copyright: C. Faloutsos (2016) 36
Faloutsos 15826
10
CMU SCS
Our intuition ▪ Lockstep behavior: Same Likes, same time
Graph Patterns and Lockstep Behavior
Pages
Users
0
10
20
30
40
50
60
70
80
90
Tim
e
Users Pages
A
B
C
D
E
1
2
3
4
40 Z
Likes 15-826 Copyright: C. Faloutsos (2016) 37
CMU SCS
Our intuition ▪ Lockstep behavior: Same Likes, same time
Graph Patterns and Lockstep Behavior
Users Pages
A
B
C
D
E
1
2
3
4
40 Z
Pages
Users
0
10
20
30
40
50
60
70
80
90
Tim
e
B CD A E
Reordered Pages
4
3 2 1
Reord
ere
d U
sers
0
10
20
30
40
50
60
70
80
90
Tim
e
Likes 15-826 Copyright: C. Faloutsos (2016) 38
CMU SCS
Our intuition ▪ Lockstep behavior: Same Likes, same time
Graph Patterns and Lockstep Behavior
Users Pages
A
B
C
D
E
1
2
3
4
40 Z
Pages
Users
0
10
20
30
40
50
60
70
80
90
Tim
e
B CD A E
Reordered Pages
4
3 2 1
Reord
ere
d U
sers
0
10
20
30
40
50
60
70
80
90
Tim
e
Suspicious Lockstep Behavior Likes
15-826 Copyright: C. Faloutsos (2016) 39
CMU SCS
MapReduce Overview ▪ Use Hadoop to search for
many clusters in parallel:
1. Start with randomly seed
2. Update set of Pages and center Like times for each cluster
3. Repeat until convergence
Users Pages
A
B
C
D
E
1
2
3
4
40 Z
Likes
15-826 Copyright: C. Faloutsos (2016) 40
Faloutsos 15826
11
CMU SCS
Deployment at Facebook ▪ CopyCatch runs regularly (along with many other
security mechanisms, and a large Site Integrity team)
08/25 09/08 09/22 10/06 10/20 11/03 11/17 12/01
Num
ber
of u
sers
cau
ght
Date of CopyCatch run
3 months of CopyCatch @ Facebook
#users caught
time 15-826 Copyright: C. Faloutsos (2016) 41
CMU SCS
Deployment at Facebook
23%
58%
5%9%
5%
Fake AccountsMalicious Browser ExtensionsOS MalwareCredential StealingSocial Engineering
Manually labeled 22 randomly selected clusters from February 2013
Most clusters (77%) come from real but compromised users
Fake acct
15-826 Copyright: C. Faloutsos (2016) 42
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs – Time-evolving graphs
• ‘copyCatch’ in FaceBook • Patterns of IAT • [tensors]
• Supervised: BP • Conclusions 15-826 Copyright: C. Faloutsos (2016) 43
CMU SCS
RSC: Mining and Modeling Temporal Activity in Social Media
Alceu F. Costa* Yuto Yamaguchi Agma J. M. Traina
Caetano Traina Jr. Christos Faloutsos
Universidade de São Paulo
KDD 2015 – Sydney, Australia
Faloutsos 15826
12
CMU SCS
Reddit Dataset Time-stamp from comments 21,198 users 20 Million time-stamps
Twitter Dataset Time-stamp from tweets 6,790 users 16 Million time-stamps
Pattern Mining: Datasets
For each user we have: Sequence of postings time-stamps: T = (t1, t2, t3, …) Inter-arrival times (IAT) of postings: (∆1, ∆2, ∆3, …)
t1 t2 t3 t4
∆1 ∆2 ∆3
time 15-826 Copyright: C. Faloutsos (2016) 45
CMU SCS
Guess the IAT pdf • Gaussian (10’ +/- a bit)? • Power law? Slope? • Log-logistic? • Periodic (spike for 1day, 1 week)?
IAT
15-826 Copyright: C. Faloutsos (2016) 46
CMU SCS
Human? Robots?
log
linear
15-826 Copyright: C. Faloutsos (2016) 47
CMU SCS
Human? Robots?
log
linear 2’ 3h 1day
15-826 Copyright: C. Faloutsos (2016) 48
Faloutsos 15826
13
CMU SCS
Pattern Mining Pattern 1: Distribution of IAT is heavy-tailed
Users can be inactive for long periods of time before making new postings IAT Complementary Cumulative Distribution Function (CCDF)
(log-log axis)
105 106 10710−5
100
∆, IAT (seconds)
CC
DF,
P(x
≥ ∆
)
1 day 10 days105 106 107
10−5
100
∆, IAT (seconds)
CC
DF,
P(x
≥ ∆
)1 day10 days
Reddit Users Twitter Users 15-826 Copyright: C. Faloutsos (2016) 49
CMU SCS
Pattern Mining Pattern 2: Bimodal IAT distribution
Users have highly active sections and resting periods
Log-binned histogram of postings IAT
102 104 1060
0.005
0.01
0.015
∆, IAT (seconds)
1st Mode (1min) 2nd Mode (3h)
15-826 Copyright: C. Faloutsos (2016) 50
CMU SCS
102 104 1060
0.005
0.01
∆, IAT (seconds)
Pattern Mining
Pattern 3: Periodic spikes in the IAT distribution
Caused by daily sleeping intervals
1050
0.005
0.01
0.015
∆, IAT (seconds)
7h 12h 24h 48h 72h
15-826 Copyright: C. Faloutsos (2016) 51
CMU SCS
Experiments: Can RSC-Spotter Detect Bots?
Methodology Datasets
Users were manually labeled as bot or humans
Training
Same size for train and test subsets (preserved class distribution)
Baseline features:
1,963 Humans 37 Bots Reddit
1353 Humans 64 Bots Twitter
1. IAT Histogram Log-binned IAT histogram
2. Entropy Entropy of the IAT histogram
3. Week Hist. # of postings for day of week
4. All features Combination of 1, 2 and 3
15-826 Copyright: C. Faloutsos (2016) 52
Faloutsos 15826
14
CMU SCS
Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves Good performance: curve close to the top
0 0.2 0.4 0.6 0.8 10
0.20.40.60.8
1
Sensitivity (Recall)
Prec
isio
n
0 0.2 0.4 0.6 0.8 10
0.20.40.60.8
1
Sensitivity (Recall)
Prec
isio
n
RSC−Spotter
IAT Hist.Entropy [6]Weekday Hist.
All Features
Precision > 94% Sensitivity >
70%
With strongly imbalanced
datasets
# humans >> # bots
Twitter Dataset
15-826 Copyright: C. Faloutsos (2016) 53
CMU SCS
Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves Good performance: curve close to the top
0 0.2 0.4 0.6 0.8 10
0.20.40.60.8
1
Sensitivity (Recall)
Prec
isio
n
RSC−Spotter
IAT Hist.Entropy [6]Weekday Hist.
All Features
Precision > 96% Sensitivity >
47%
With strongly imbalanced
datasets
# humans >> # bots
Reddit Dataset
15-826 Copyright: C. Faloutsos (2016) 54
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs – Time-evolving graphs
• ‘copyCatch’ in FaceBook • Patterns of IAT • [tensors]
• Supervised: BP • Conclusions 15-826 Copyright: C. Faloutsos (2016) 55
CMU SCS Anomaly detection in time-
evolving graphs
• Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks
~200 calls to EACH receiver on EACH day!
1 caller 5 receivers 4 days of activity
=
15-826 Copyright: C. Faloutsos (2016) 56
Faloutsos 15826
15
CMU SCS Anomaly detection in time-
evolving graphs
• Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks
~200 calls to EACH receiver on EACH day!
1 caller 5 receivers 4 days of activity
=
15-826 Copyright: C. Faloutsos (2016) 57
CMU SCS Anomaly detection in time-
evolving graphs
• Anomalous communities in phone call data: – European country, 4M clients, data over 2 weeks
~200 calls to EACH receiver on EACH day!
=
Miguel Araujo, Spiros Papadimitriou, Stephan Günnemann, Christos Faloutsos, Prithwish Basu, Ananthram Swami, Evangelos Papalexakis, Danai Koutra. Com2: Fast Automatic Discovery of Temporal (Comet) Communities. PAKDD 2014, Tainan, Taiwan.
15-826 Copyright: C. Faloutsos (2016) 58
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs – Time-evolving graphs
• Supervised: BP – Ebay fraud (‘NetProbe’) – Malware (Polonium)
• Conclusions
15-826 Copyright: C. Faloutsos (2016) 59
CMU SCS
E-bay Fraud detection
w/ Polo Chau & Shashank Pandit, CMU [www’07]
15-826 Copyright: C. Faloutsos (2016) 60
Faloutsos 15826
16
CMU SCS
E-bay Fraud detection
15-826 Copyright: C. Faloutsos (2016) 61
CMU SCS
E-bay Fraud detection
15-826 Copyright: C. Faloutsos (2016) 62
CMU SCS
E-bay Fraud detection - NetProbe
15-826 Copyright: C. Faloutsos (2016) 63
CMU SCS
Popular press
And less desirable attention: • E-mail from ‘Belgium police’ (‘copy of
your code?’) 15-826 Copyright: C. Faloutsos (2016) 64
Faloutsos 15826
17
CMU SCS
Roadmap
• Introduction – Motivation • Part#1: Patterns in graphs
– Patterns – Anomaly / fraud detection
• CopyCatch • Spectral methods (‘fBox’) • Belief Propagation; antivirus app
• Part#2: time-evolving graphs; tensors • Conclusions 15-826 Copyright: C. Faloutsos (2016) 65
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised
– static graphs – Time-evolving graphs
• Supervised: BP – Ebay fraud (‘NetProbe’) – Malware (Polonium)
• Conclusions
15-826 Copyright: C. Faloutsos (2016) 66
CMU SCS
PoloChauMachineLearningDept
CareyNachenbergVicePresident&Fellow
JeffreyWilhelmPrincipalSo9wareEngineer
AdamWrightSo9wareEngineer
Prof.ChristosFaloutsosComputerScienceDept
Polonium:Tera-ScaleGraphMiningandInferenceforMalwareDetecCon
PATENTPENDING
SDM 2011, Mesa, Arizona
CMU SCS
Polonium:TheData60+terabytesofdataanonymouslycontributedbyparCcipantsofworldwideNortonCommunityWatchprogram
50+millionmachines900+millionexecutablefiles
Constructedamachine-filebiparCtegraph(0.2TB+)
1billionnodes(machinesandfiles)37billionedges
15-826 Copyright: C. Faloutsos (2016) 68
Faloutsos 15826
18
CMU SCS
Polonium:KeyIdeas• UseBeliefPropagaContopropagatedomainknowledgeinmachine-filegraphtodetectmalware
• Use“guilt-by-associaCon”(i.e.,homophily)– E.g.,filesthatappearonmachineswithmanybadfilesaremorelikelytobebad
• Scalability:handles37billion-edgegraph
15-826 Copyright: C. Faloutsos (2016) 69
CMU SCS
Polonium:One-InteracDonResults
84.9%TruePosiCveRate1%FalsePosiCveRate
True Positive Rate %ofmalware
correctlyidenCfied
Ideal
False Positive Rate %ofnon-malwarewronglylabeledasmalware15-826 Copyright: C. Faloutsos (2016) 70
CMU SCS
Roadmap
• Introduction – Motivation • Unsupervised • Supervised: BP (Belief Propagation)
– Ebay fraud (‘NetProbe’) – Malware (Polonium) – Intuition behind BP
• Conclusions
15-826 Copyright: C. Faloutsos (2016) 71
CMU SCS
UnifyingGuilt-by-AssociationApproaches:TheoremsandFastAlgorithms
Danai Koutra U Kang
Hsing-Kuo Kenneth Pao
Tai-You Ke Duen Horng (Polo) Chau
Christos Faloutsos
ECML PKDD, 5-9 September 2011, Athens, Greece
Faloutsos 15826
19
CMU SCS Problem Definition:
GBA techniques
Given: Graph; & few labeled nodes Find: labels of rest (assuming network effects)
15-826 Copyright: C. Faloutsos (2016) 73
CMU SCS
Are they related? • RWR (Random Walk with Restarts)
– google’s pageRank (‘if my friends are important, I’m important, too’)
• SSL (Semi-supervised learning) – minimize the differences among neighbors
• BP (Belief propagation) – send messages to neighbors, on what you
believe about them
15-826 Copyright: C. Faloutsos (2016) 74
CMU SCS
Are they related? • RWR (Random Walk with Restarts)
– google’s pageRank (‘if my friends are important, I’m important, too’)
• SSL (Semi-supervised learning) – minimize the differences among neighbors
• BP (Belief propagation) – send messages to neighbors, on what you
believe about them
YES!
15-826 Copyright: C. Faloutsos (2016) 75
CMU SCS
Correspondence of Methods
Method Matrix Unknown known RWR [I – c AD-1] × x = (1-c)ySSL [I + a(D - A)] × x = y
FABP [I + a D - c’A] × bh = φh
0 1 0 1 0 1 0 1 0
? 0 1 1
1 1
1
d1 d2 d3 final
labels/ beliefs
prior labels/ beliefs
adjacency matrix
15-826 Copyright: C. Faloutsos (2016) 76
Faloutsos 15826
20
CMU SCS
Cast
Akoglu, Leman
Chau, Polo
Kang, U
Prakash, Aditya
Koutra, Danai
Beutel, Alex
Papalexakis, Vagelis
Shah, Neil
Lee, Jay Yoon
Araujo, Miguel
15-826 Copyright: C. Faloutsos (2016) 77
CMU SCS
CONCLUSION#1 – Big data • Patterns Anomalies
• Large datasets reveal patterns/outliers that are invisible otherwise
15-826 Copyright: C. Faloutsos (2016) 78
CMU SCS
Conclusions, cont’d • Spectral/tensor methods: spot
lockstep behavior
• OddBall: strange, single nodes
• BP: good for the supervised setting (ie., some labels are given)
15-826 Copyright: C. Faloutsos (2016) 79