Advanced Data Mining: Anomaly Detection. U Kang, Seoul National University

Advanced Data Mining · ukang/courses/20F-ADM/L24... · 2020. 12. 1.


  • U Kang

    Advanced Data Mining

    Anomaly Detection

    U Kang, Seoul National University

  • U Kang

    In This Lecture

    Anomaly Detection
      Graph Structure Based Method
      Random Walk Based Method

  • U Kang

    Outline

    Overview
    Graph Structure Based Method
    Random Walk Based Method
    Conclusion

  • U Kang

    Data Mining

    Data mining: find patterns and anomalies
    To spot anomalies, we first have to discover patterns
    Large datasets reveal patterns and anomalies that may be invisible otherwise

  • U Kang

    Anomaly Detection

    Anomaly detection
      Find suspicious data points that deviate significantly from normal data

    Anomaly detection in graphs
      Find “strange” nodes in a graph

  • U Kang

    Anomaly Detection

    Applications
      Network intrusion detection: find suspicious attackers (e.g., DDoS attackers, spammers)
      Call network: find heavy telemarketers
      Social network: spot people adding friends indiscriminately in a “popularity contest”
      Credit card fraud
      (the list continues...)

  • U Kang

    Anomaly Detection

    More Applications
      Campaign donation irregularity
      Extremely cross-disciplinary authors in an author-paper graph
      Electronic auction fraud

  • U Kang

    Overview

    We will look at two methods for anomaly detection in graphs
      Graph Structure Based Method
      Random Walk Based Method

  • U Kang

    Outline

    Overview
    Graph Structure Based Method
    Random Walk Based Method
    Conclusion

    L. Akoglu, M. McGlohon, C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. PAKDD, 2010

  • U Kang

    Problem Definition

    Given: a weighted and unlabeled graph

    Q1: How can we spot strange, abnormal, extreme nodes?

    Q2: How can we explain why the spotted nodes are anomalous?

  • U Kang

    OddBall: approach

    For each node
      Extract “ego-net” (= 1-step neighborhood)
      Extract features (# edges, total weight, etc.)
        Features that could yield “laws”
        Features that are fast to compute and interpret
    Detect patterns
      Regularities
    Detect anomalies
      Deviate significantly from patterns

    Anomaly, Event, and Fraud Detection in Large Graph Datasets, L. Akoglu, C. Faloutsos, in ACM WSDM 2013 tutorial, Rome, Italy

  • U Kang

    What is Odd?


  • U Kang

    Main Idea

    For each egonet, extract features
    Find “rules” in features
    Anomalies deviate significantly from the rules

  • U Kang

    Which Features?

    N_i: # of neighbors (degree) of ego i
    E_i: # of edges in egonet i
    W_i: total weight of egonet i
    λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i
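
The four egonet features listed above can be sketched in a few lines of Python. This is a minimal illustration, not the OddBall authors' code: the function name `egonet_features`, the dict-of-dicts graph layout, and the shifted power iteration used for the eigenvalue are all assumptions made here. It assumes an undirected graph with positive weights stored symmetrically.

```python
def egonet_features(adj, ego, iters=200):
    """Return (N, E, W, lam) for the 1-step egonet of `ego`.

    adj maps each node to {neighbor: weight} and is assumed symmetric
    (undirected graph) with positive weights.
    """
    nodes = [ego] + [v for v in adj[ego] if v != ego]
    index = {u: i for i, u in enumerate(nodes)}
    k = len(nodes)

    # Weighted adjacency matrix restricted to the egonet (self-loops ignored).
    A = [[0.0] * k for _ in range(k)]
    for u in nodes:
        for v, w in adj[u].items():
            if v in index and v != u:
                A[index[u]][index[v]] = float(w)

    upper = [(i, j) for i in range(k) for j in range(i + 1, k)]
    n_neighbors = k - 1
    n_edges = sum(1 for i, j in upper if A[i][j] != 0.0)
    total_weight = sum(A[i][j] for i, j in upper)

    # Power iteration on A + shift*I. The shift (largest weighted degree)
    # avoids oscillation on star-like egonets, whose spectrum is symmetric.
    shift = max(sum(row) for row in A)
    x = [1.0] * k
    for _ in range(iters):
        y = [shift * x[i] + sum(A[i][j] * x[j] for j in range(k)) for i in range(k)]
        norm = sum(v * v for v in y) ** 0.5
        if norm == 0.0:
            break
        x = [v / norm for v in y]

    # Rayleigh quotient with the unshifted matrix estimates the eigenvalue.
    Ax = [sum(A[i][j] * x[j] for j in range(k)) for i in range(k)]
    lam = sum(x[i] * Ax[i] for i in range(k)) / sum(v * v for v in x)
    return n_neighbors, n_edges, total_weight, lam


# Toy graph: "c" is the center of a 4-leaf star; "z" lies outside c's egonet.
graph = {
    "c": {"a": 1.0, "b": 1.0, "d": 1.0, "e": 1.0},
    "a": {"c": 1.0, "z": 5.0},
    "b": {"c": 1.0},
    "d": {"c": 1.0},
    "e": {"c": 1.0},
    "z": {"a": 5.0},
}
print(egonet_features(graph, "c"))
```

For large graphs a sparse eigen-solver would be the more usual choice; the pure-Python loop is only meant to make the definitions concrete.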


  • U Kang

    Why Principal Eigenvalue?

    N: #neighbors, W: total weight


  • U Kang

    OddBall: pattern #1

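    [Figure: # edges E vs. # neighbors N, one point per egonet]
    Many edges for the number of neighbors (near-clique): discussion group, “rank boosting”, etc.
    Few edges for the number of neighbors (near-star): telemarketer, spammer, port scanner, “popularity contests”, etc.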


  • U Kang

    OddBall: pattern #2

    [Figure: total weight W vs. # edges E, one point per egonet]
    Annotations: uniform, robot-like behavior; high $ vs. # accounts, high $ vs. # donors, etc.


  • U Kang

    OddBall: pattern #3

    [Figure: largest eigenvalue vs. total weight W, one point per egonet]


  • U Kang

    OddBall: anomaly detection

    ‣ Can tell what type of anomaly a node belongs to

    ‣ Can quantify “anomalous-ness” of nodes using score

    score_dist = distance to fitting line
    score_outl = outlier-ness score (e.g., LOF)
    score = func(score_dist, score_outl)
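
As a rough illustration of the "distance to fitting line" idea, the sketch below fits a power law y = C·x^θ by least squares in log-log space and scores each point by how far it falls from the fitted curve. The helper names `fit_power_law` and `distance_score`, and the specific score (deviation ratio times the log of the absolute gap), are assumptions for illustration; the slides only say "distance to fitting line" and leave `func` unspecified. The outlier-ness term (e.g., LOF) is not shown.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta).

    Assumes all xs and ys are positive and xs has at least two distinct values.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    theta = sxy / sxx
    C = math.exp(my - theta * mx)
    return C, theta


def distance_score(x, y, C, theta):
    """Assumed score: how many times off the fit, weighted by the log gap."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1.0)


# Toy data lying exactly on y = 2 * x^1.5 (e.g., x = # neighbors, y = # edges).
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [2.0 * x ** 1.5 for x in xs]
C, theta = fit_power_law(xs, ys)

on_line = distance_score(4.0, 16.0, C, theta)   # follows the pattern
off_line = distance_score(4.0, 64.0, C, theta)  # four times the expected value
print(C, theta, on_line, off_line)
```

A point that follows the pattern scores near zero, while a point far above or below the curve scores high; which side it falls on hints at the type of anomaly.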


  • U Kang

    OddBall: datasets

    Bipartite graphs        |V|     |E|
    1. FEC Don2Com          1.6M    2M
    2. FEC Com2Cand         6K      125K
    3. DBLP Auth2Conf       21K     1M

    Unipartite graphs       |V|     |E|
    4. BlogNet              27K     126K
    5. PostNet              223K    217K
    6. Enron                36K     183K
    7. AS peering           11K     8K


  • U Kang

    OddBall at work (Posts)

    [Figure: POSTS, # cross-citations vs. # citations; 223K posts, 217K citations]
    Highlighted posts:
      http://instapundit.com/archives/025235.php
      http://www.sizemore.co.uk/2005/08/i-feel-some-movies-coming-on.html


  • U Kang

    OddBall at work (FEC)

    [Figure: COM2CANDIDATES, $ vs. # checks; 6K candidates, 125K checks]
    Highlighted candidates: Russo, Aaron; Snyder, James E. Jr; Kerry, John F.
    Image sources:
      https://upload.wikimedia.org/wikipedia/commons/6/62/John_F._Kerry.jpg
      https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Aaron_russo-cannes.jpg/220px-Aaron_russo-cannes.jpg


  • U Kang

    OddBall at work (DBLP)

    [Figure: AUTHORS (AUTH2CONF), λ_w vs. # publications]
    Highlighted authors: Averill M. Law, Toshio Fukuda, Wei Li


  • U Kang

    Outline

    Overview
    Graph Structure Based Method
    Random Walk Based Method
    Conclusion

    J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Neighborhood formation and anomaly detection in bipartite graphs. ICDM, 2005

  • U Kang

    Anomalies in Bipartite Graphs

    Q1. Neighborhood formation (NF)
      Given a query node q in V1, what are the relevance scores of all the nodes in V1 to q?

    Q2. Anomaly detection (AD)
      Given a query node q in V1, what are the normality scores for nodes in V2 that link to q?

    [Figure: bipartite graph with node sets V1 and V2, query node q, nodes A to G, and example scores (.3, .2, .05, .01, .002, .01 and .05, .25, .25)]

  • U Kang

    Examples of Bipartite Graphs

    Publication network: author-paper
    P2P network: user-file
    Recommendation: user-product
    Stock market: stock-trader

    [Figure: bipartite graph with V1 = {a1, ..., ak}, V2 = {t1, ..., tn}, and edge set E]

    Sun, Jimeng, et al. "Neighborhood formation and anomaly detection in bipartite graphs." Fifth IEEE International Conference on Data Mining (ICDM'05). IEEE, 2005.

  • U Kang

    1) Neighborhood Formation

    Main idea
      Compute the Random Walk with Restart (RWR) score from query node q
      Steady-state probability = relevance

    [Figure: bipartite graph with node sets V1 and V2, query node q, nodes A to G, and example scores (.3, .2, .05, .01, .002, .01)]
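
The Random Walk with Restart score can be sketched with plain power iteration: at each step the walker follows an edge with probability 1 - c and jumps back to q with probability c, so the steady state satisfies p = (1 - c)·P·p + c·e_q, where P is the column-normalized transition matrix and e_q is the indicator vector of q. The function name, the restart value 0.15, the iteration count, and the dict-of-dicts layout are assumptions for illustration, not values taken from the slides.

```python
def random_walk_with_restart(adj, query, restart=0.15, iters=100):
    """Return {node: steady-state probability} for a walk restarting at `query`.

    adj maps each node to {neighbor: weight}; every node must appear as a key
    (store each undirected edge in both directions).
    """
    nodes = list(adj)
    p = {u: 0.0 for u in nodes}
    p[query] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in nodes}
        for u in nodes:
            out = sum(adj[u].values())
            if out == 0.0:
                # Isolated node: return its mass to the query node.
                nxt[query] += (1.0 - restart) * p[u]
                continue
            for v, w in adj[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / out
        nxt[query] += restart  # total mass is 1, so the restart share is `restart`
        p = nxt
    return p


# Toy bipartite graph: V1 = {a1, a2, a3}, V2 = {t1, t2}.
edges = [("a1", "t1"), ("a2", "t1"), ("a2", "t2"), ("a3", "t2")]
graph = {}
for u, v in edges:
    graph.setdefault(u, {})[v] = 1.0
    graph.setdefault(v, {})[u] = 1.0

scores = random_walk_with_restart(graph, "a1")
# Relevance of the other V1 nodes to a1: a2 shares t1 with a1, a3 does not.
print(scores["a2"], scores["a3"])
```

Because the restart term shrinks the error by a factor of 1 - c per step, the iteration converges even though a bipartite graph is periodic.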

  • U Kang

    1) Neighborhood Formation

    Exact Neighborhood Formation (NF)
      Exact RWR score

    Approximate NF
      Partition the original graph into pieces using METIS
      Compute similarities only on the partition containing the query node

  • U Kang

    2) Anomaly Detection

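    Main idea: to compute the anomaly score of a node t
      Compute pairwise “relevance” scores for the neighbors of t
      Compute the mean of the relevance scores
    The mean serves as the normality score of t: a low mean suggests that t links otherwise unrelated nodes.

    [Figure: node t and its neighbor set S]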
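
A minimal sketch of this scoring step, assuming RWR is used for the pairwise relevance: for a node t, run RWR from each neighbor of t and average the scores the other neighbors receive. The names `rwr`, `normality_score`, and `build_graph`, the restart value, and the toy graph are hypothetical choices for illustration; the paper may differ in details such as how t itself is treated during the walk.

```python
def rwr(adj, query, restart=0.15, iters=100):
    """Random Walk with Restart by power iteration (every node is a key of adj)."""
    p = {u: 0.0 for u in adj}
    p[query] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        for u, nbrs in adj.items():
            out = sum(nbrs.values())
            if out == 0.0:
                nxt[query] += (1.0 - restart) * p[u]
                continue
            for v, w in nbrs.items():
                nxt[v] += (1.0 - restart) * p[u] * w / out
        nxt[query] += restart
        p = nxt
    return p


def normality_score(adj, t, restart=0.15):
    """Mean pairwise RWR relevance among the neighbors of t (None if < 2 neighbors)."""
    neighbors = list(adj[t])
    if len(neighbors) < 2:
        return None
    total, pairs = 0.0, 0
    for a in neighbors:
        scores = rwr(adj, a, restart)
        for b in neighbors:
            if a != b:
                total += scores[b]
                pairs += 1
    return total / pairs


def build_graph(edges):
    """Store each undirected, unit-weight edge in both directions."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, {})[v] = 1.0
        adj.setdefault(v, {})[u] = 1.0
    return adj


# Two tight groups {a1, a2, a3} and {b1, b2, b3}; tx bridges the two groups.
edges = [(t, a) for t in ("t1", "t2") for a in ("a1", "a2", "a3")]
edges += [(t, b) for t in ("t3", "t4") for b in ("b1", "b2", "b3")]
edges += [("tx", "a1"), ("tx", "b1")]
graph = build_graph(edges)

print(normality_score(graph, "t1"), normality_score(graph, "tx"))
```

In this toy graph the bridging node tx gets a lower normality score than t1, whose neighbors are also tied together through t2.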

  • U Kang

    Experiment

    Datasets
      DBLP Conf-Auth
      DBLP Author-Paper
      IMDB movie-actor

    Questions
      Q1) What are the discoveries?
      Q2) What is the anomaly detection quality?

  • U Kang

    1) NF discovery


    [Figure panels: (a) ICDM, (c) Robert DeNiro]

  • U Kang

    2) Anomaly Detection Quality

    Setting: inject 100 random nodes that connect to high-degree nodes


    • Normality scores between genuine and injected nodes across 3 datasets

  • U Kang

    Outline

    Overview
    Graph Structure Based Method
    Random Walk Based Method
    Conclusion

  • U Kang

    Conclusion

    Anomaly detection
      Find suspicious data points that deviate significantly from normal data
    Anomaly detection in graphs
      Graph Structure Based Method
      Random Walk Based Method
        Neighborhood Formation (NF)
        Anomaly detection using NF

  • U Kang

    Questions?