U Kang
Advanced Data Mining
Anomaly Detection
U Kang, Seoul National University
In This Lecture
Anomaly Detection
• Graph Structure Based Method
• Random Walk Based Method
Outline
• Overview
• Graph Structure Based Method
• Random Walk Based Method
• Conclusion
Data Mining
Data mining: find patterns and anomalies
• To spot anomalies, we have to discover patterns
• Large datasets reveal patterns and anomalies that may be invisible otherwise
Anomaly Detection
Anomaly detection
• Find suspicious data points which deviate significantly from normal data
Anomaly detection in graphs
• Find “strange” nodes in a graph
Anomaly Detection
Applications
• Network intrusion detection: find suspicious attackers (e.g. DDoS attack, spammer, etc.)
• Call network: find heavy telemarketers
• Social network: spot people adding friends indiscriminately in a “popularity contest”
• Credit card fraud
• (the list continues...)
Anomaly Detection
More Applications
• Campaign donation irregularity
• Extremely cross-disciplinary authors in an author-paper graph
• Electronic auction fraud
Overview
We will look at two methods for anomaly detection in graphs:
• Graph Structure Based Method
• Random Walk Based Method
Outline
• Overview
• Graph Structure Based Method
• Random Walk Based Method
• Conclusion
L. Akoglu, M. McGlohon, C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. PAKDD, 2010
Problem Definition
Given: a weighted and unlabeled graph,
Q1: how can we spot strange, abnormal, extreme nodes?
Q2: how can we explain why the spotted nodes are anomalous?
OddBall: approach
• For each node
  • Extract “ego-net” (= 1-step neighborhood)
  • Extract features (#edges, total weight, etc.)
    • Features that could yield “laws”
    • Features fast to compute and interpret
• Detect patterns: regularities
• Detect anomalies: nodes that deviate significantly from the patterns
Anomaly, Event, and Fraud Detection in Large Graph Datasets, L. Akoglu, C. Faloutsos, in ACM WSDM 2013 tutorial, Rome, Italy
What is Odd?
Main Idea
• For each egonet, extract features
• Find “rules” in features
• Anomalies deviate significantly from the rules
Which Features?
• N_i: # of neighbors (degree) of ego i
• E_i: # of edges in egonet i
• W_i: total weight of egonet i
• λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i
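The four features can be computed directly from an adjacency structure. The following is a minimal sketch, assuming an undirected weighted graph without self-loops stored as a dict of dicts; the function name `egonet_features` and the toy graph are hypothetical and not taken from the OddBall paper.

```python
import numpy as np

def egonet_features(adj, ego):
    """Return (N, E, W, lam) for the egonet of `ego`.

    adj: dict mapping node -> {neighbor: weight}; undirected, no self-loops,
    so adj[u][v] == adj[v][u].
    """
    nodes = [ego] + sorted(adj[ego])           # ego plus its 1-step neighbors
    index = {u: k for k, u in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))     # weighted adjacency of the egonet
    for u in nodes:
        for v, w in adj[u].items():
            if v in index:                     # keep only edges inside the egonet
                A[index[u], index[v]] = w
    N = len(nodes) - 1                         # number of neighbors of the ego
    E = int(np.count_nonzero(A)) // 2          # each undirected edge is stored twice
    W = float(A.sum() / 2.0)                   # total weight of the egonet edges
    lam = float(np.linalg.eigvalsh(A)[-1])     # principal (largest) eigenvalue
    return N, E, W, lam

# Toy graph (hypothetical data): triangle a-b-c plus a pendant node d on a.
adj = {
    "a": {"b": 1.0, "c": 1.0, "d": 2.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"a": 1.0, "b": 1.0},
    "d": {"a": 2.0},
}
features_a = egonet_features(adj, "a")   # egonet of a contains all four nodes
features_d = egonet_features(adj, "d")   # egonet of d is the single edge d-a
```

A dense matrix per egonet keeps the sketch short; for large graphs a sparse representation would be the more realistic choice.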
Why Principal Eigenvalue?
N: #neighbors, W: total weight
OddBall: pattern #1
[Figure: # edges E vs. # neighbors N for each egonet]
Annotated outliers: discussion group, “rank boosting”, etc.
Annotated outliers: telemarketer, spammer, port scanner, “popularity contests”, etc.
OddBall: pattern #2
[Figure: total weight W vs. # edges E for each egonet]
Annotated outliers: uniform, robot-like behavior
Annotated outliers: high $ vs. #accounts, high $ vs. #donors, etc.
OddBall: pattern #3
[Figure: largest eigenvalue vs. total weight W for each egonet]
OddBall: anomaly detection
‣ Can tell what type of anomaly a node belongs to
‣ Can quantify “anomalous-ness” of nodes using score
score_dist = distance to fitting line
score_outl = outlier-ness score (e.g. LOF)
score = func(score_dist, score_outl)
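The scoring above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the fitting line is taken to be a least-squares power law in log-log space, `score_dist` is modelled on the distance-to-line score described in the OddBall paper, `score_outl` uses a mean k-nearest-neighbor distance as a simple stand-in for LOF, and `func` is assumed to be the sum of the two scores after each is scaled by its maximum. All function names and the data are hypothetical.

```python
import numpy as np

def fit_power_law(x, y):
    """Least-squares line in log-log space: log y = theta * log x + log C."""
    theta, log_c = np.polyfit(np.log(x), np.log(y), 1)
    return theta, np.exp(log_c)

def score_dist(x, y, theta, c):
    """Distance to the fitted line: ratio between observed and expected value,
    weighted by the log of the absolute gap (0 for a point on the line)."""
    expected = c * x ** theta
    ratio = np.maximum(y, expected) / np.minimum(y, expected)
    return ratio * np.log(np.abs(y - expected) + 1.0)

def score_outl(x, y, k=2):
    """Stand-in for LOF (assumption): mean distance to the k nearest
    other points in log-log space."""
    pts = np.column_stack([np.log(x), np.log(y)])
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    d.sort(axis=1)                       # column 0 is the distance to itself (0)
    return d[:, 1:k + 1].mean(axis=1)

def combined_score(x, y, k=2):
    """func(score_dist, score_outl): sum of both scores, each scaled by its max."""
    theta, c = fit_power_law(x, y)
    sd = score_dist(x, y, theta, c)
    so = score_outl(x, y, k)
    return sd / sd.max() + so / so.max()

# Hypothetical (N, E) pairs: E grows roughly with N, except the last egonet,
# which has far more edges than its number of neighbors suggests.
N = np.array([2, 3, 4, 5, 6, 8, 10, 12, 9], dtype=float)
E = np.array([2, 3, 5, 6, 7, 9, 12, 14, 40], dtype=float)
scores = combined_score(N, E)
most_anomalous = int(np.argmax(scores))   # index of the highest-scoring egonet
```

Combining the two scores lets a node rank high either because it is far from the fitted line or because it sits in a sparse region of the feature plot.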
OddBall: datasets
Bipartite graphs        |V|     |E|
1. FEC Don2Com          1.6M    2M
2. FEC Com2Cand         6K      125K
3. DBLP Auth2Conf       21K     1M

Unipartite graphs       |V|     |E|
4. BlogNet              27K     126K
5. PostNet              223K    217K
6. Enron                36K     183K
7. AS peering           11K     8K
OddBall at work (Posts)
[Figure: POSTS, # cross-citations vs. # citations; 223K posts, 217K citations]
Annotated outliers:
http://instapundit.com/archives/025235.php
http://www.sizemore.co.uk/2005/08/i-feel-some-movies-coming-on.html
OddBall at work (FEC)
[Figure: COM2CANDIDATES, $ vs. # checks; 6K candidates, 125K checks]
Annotated outliers: Russo, Aaron; Snyder, James E. Jr; Kerry, John F.
Image sources:
https://upload.wikimedia.org/wikipedia/commons/6/62/John_F._Kerry.jpg
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Aaron_russo-cannes.jpg/220px-Aaron_russo-cannes.jpg
OddBall at work (DBLP)
[Figure: AUTHORS (AUTH2CONF), λ_w vs. # publications]
Annotated outliers: Averill M. Law, Toshio Fukuda, Wei Li
Outline
• Overview
• Graph Structure Based Method
• Random Walk Based Method
• Conclusion
J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. ICDM, 2005
Anomalies in Bipartite Graphs
Q1. Neighborhood formation (NF): given a query node q in V1, what are the relevance scores of all the nodes in V1 to q?
Q2. Anomaly detection (AD): given a query node q in V1, what are the normality scores for nodes in V2 that link to q?
[Figure: bipartite graph over V1 and V2 with query node q and nodes A to G, annotated with example scores]
Examples of Bipartite Graphs
• Publication network: Author-paper
• P2P network: User-file
• Recommendation: User-product
• Stock market: Stock-trader
[Figure: bipartite graph with node sets V1 = {a1, ..., ak} and V2 = {t1, ..., tn} joined by edge set E]
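1) Neighborhood Formation
Main idea
• Compute the Random Walk with Restart (RWR) score from query node q
• Steady-state probability = relevance
[Figure: bipartite graph over V1 and V2 with query node q and nodes A to G, annotated with example relevance scores (.3, .2, .05, .01, .002, .01)]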
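Random walk with restart can be sketched with a short power iteration. This is a minimal illustration, assuming an undirected graph with no isolated nodes and a restart probability of 0.15; the function name `rwr`, the parameter values, and the small bipartite graph are hypothetical.

```python
import numpy as np

def rwr(A, q, c=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart from query node q.

    A: symmetric (n, n) adjacency matrix with no isolated nodes.
    c: restart probability (assumed value).
    Returns the steady-state visiting probabilities, which sum to 1.
    """
    P = A / A.sum(axis=0, keepdims=True)    # column-stochastic transition matrix
    e = np.zeros(A.shape[0])
    e[q] = 1.0                              # restart vector: always jump back to q
    r = e.copy()
    for _ in range(max_iter):
        r_next = (1 - c) * P @ r + c * e    # one walk step, then restart with prob. c
        if np.abs(r_next - r).sum() < tol:
            return r_next
        r = r_next
    return r

# Hypothetical bipartite graph: rows of M are V1 nodes a0..a2, columns are V2
# nodes t0..t2. a0 and a1 share both of their neighbors; a2 shares only t1.
M = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 1, 1]], dtype=float)
k, n = M.shape
A = np.block([[np.zeros((k, k)), M],
              [M.T, np.zeros((n, n))]])     # full adjacency over V1 followed by V2
relevance = rwr(A, q=0)                     # scores of every node with respect to a0
```

In this toy graph `relevance[1]` exceeds `relevance[2]`, matching the intuition that a1, which shares both neighbors with a0, is more relevant to a0 than a2 is.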
1) Neighborhood Formation
• Exact Neighborhood Formation (NF): exact RWR score
• Approximate NF
  • Partition the original graph into pieces by METIS
  • Compute similarities only on the partition containing the query node
2) Anomaly Detection
Main idea: to compute the anomaly score of t
• Compute pairwise “relevance” scores for the neighbors of t
• Compute the mean of the relevance scores
[Figure: node t and its neighbor set S]
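The two steps above can be sketched as follows: a node t whose neighbors are barely relevant to one another receives a low mean score. This is a minimal illustration, assuming RWR relevance solved in closed form with restart probability 0.15; the names `rwr_matrix` and `normality_score` and the toy graph are hypothetical.

```python
import numpy as np

def rwr_matrix(A, c=0.15):
    """Column j holds the exact RWR scores for query node j.

    Solves r = (1 - c) P r + c e for every query at once:
    R = c * (I - (1 - c) P)^(-1).
    """
    P = A / A.sum(axis=0, keepdims=True)    # column-stochastic transition matrix
    n = A.shape[0]
    return c * np.linalg.inv(np.eye(n) - (1 - c) * P)

def normality_score(M, t, c=0.15):
    """Mean pairwise relevance among the V1 neighbors of V2 node t.

    M: (k, n) biadjacency matrix, rows = V1 nodes, columns = V2 nodes.
    """
    k, n = M.shape
    A = np.block([[np.zeros((k, k)), M],
                  [M.T, np.zeros((n, n))]])
    R = rwr_matrix(A, c)
    S = np.flatnonzero(M[:, t])             # neighbors of t, as V1 indices
    pairs = [R[i, j] for i in S for j in S if i != j]
    return float(np.mean(pairs)) if pairs else float("nan")

# Hypothetical data: a0..a2 share t0 and t1, a3..a5 share t2 and t3,
# and t4 links a0 and a3, two nodes from otherwise separate groups.
M = np.array([[1, 1, 0, 0, 1],
              [1, 1, 0, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 1, 1, 1],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 1, 0]], dtype=float)
score_t0 = normality_score(M, t=0)          # neighbors from a single group
score_t4 = normality_score(M, t=4)          # neighbors from two different groups
```

Here `score_t4` comes out lower than `score_t0`, so t4 would be the more suspicious node under this sketch.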
Experiment
Datasets
• DBLP Conf-Auth
• DBLP Author-Paper
• IMDB movie-actor
Questions
• Q1) What are the discoveries?
• Q2) Anomaly detection quality?
1) NF discovery
[Figure: neighborhoods found for the queries (a) ICDM and (c) Robert DeNiro]
2) Anomaly Detection Quality
Setting: injected 100 random nodes connecting high degree nodes
• Normality scores between genuine and injected nodes across 3 datasets
Outline
• Overview
• Graph Structure Based Method
• Random Walk Based Method
• Conclusion
Conclusion
• Anomaly detection: find suspicious data points which deviate significantly from normal data
• Anomaly detection in graphs
  • Graph Structure Based Method
  • Random Walk Based Method
    • Neighborhood Formation (NF)
    • Anomaly detection using NF
Questions?