Introduction Data Mining Methods Density-based anomaly Commute Time Approach Anomaly Detection: A Tutorial Theory and Applications Sanjay Chawla 1 Varun Chandola 2 1 School of Information Technologies University of Sydney NSW, Australia [email protected] 2 Computational Sciences and Engineering Division Oak Ridge National Laboratory Oak Ridge, TN, USA [email protected] December 14, 2011 Chawla and Chandola Anomaly Detection



Tutorial Outline

Introduction and Overview

Theory
  Statistical Methods
  Distance and Density Based Methods
  Addressing Scalability
  Anomalies in Complex Data
  Evaluation Methods

Applications
  Network Intrusion Detection
  Fraud Detection
  Epidemiological Studies
  Climate and Weather Data Analysis


Anomaly Detection - Overview

In Data Mining, anomaly or outlier detection is one of the four tasks:
  Classification
  Clustering
  Pattern Mining
  Anomaly Detection

Historically, detection of anomalies has led to the discovery of new theories. Famous examples include:
  El Nino and Southern Oscillation Index (SOI)
  The discovery of the planet Neptune
  The use of fluoride in toothpaste

Anomalies often lead to "surprise", a form of inference known as abduction (different from induction and deduction)


Definition

Hawkins: "an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [15]

[Figure: scatter plot of dimension 1 against dimension 2 (both axes 0 to 70), with points p1, p2, p3, and p4 labeled]


Statistical Methods

Let's begin with the univariate Normal distribution:

$$f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \, e^{-\left[(x-\mu)/\sigma\right]^2 / 2}$$

Notice that the exponent measures the square of the deviation from the mean, normalized by the standard deviation:

$$\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu)$$

For d dimensions, the exponent is called the (square of the) Mahalanobis distance:

$$(x-\mu)' \Sigma^{-1} (x-\mu)$$

where $\Sigma$ is the $d \times d$ variance-covariance matrix
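As an illustration of the quadratic form above, here is a minimal sketch for the two-dimensional case, where the inverse of the covariance matrix can be written out by hand. The function name `mahalanobis_sq_2d` and the example covariance are assumptions made for this sketch, not part of the tutorial.

```python
def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu) for d = 2.

    `cov` is a 2x2 covariance matrix given as nested lists.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


# Hypothetical data: variance 4 along the first axis, 1 along the second.
cov = [[4.0, 0.0], [0.0, 1.0]]
along_wide_axis = mahalanobis_sq_2d((2.0, 0.0), (0.0, 0.0), cov)
along_narrow_axis = mahalanobis_sq_2d((0.0, 2.0), (0.0, 0.0), cov)
```

Both test points are at Euclidean distance 2 from the mean, but the point along the low-variance axis gets the larger Mahalanobis value, since each deviation is scaled by the spread in its direction.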


Anomaly Detection with Mahalanobis Distance

The key observation is that if data x follows a d-dimensional Gaussian distribution, then

$$(x-\mu)' \Sigma^{-1} (x-\mu) \sim \chi^2_d$$

Anomalies can be found in the tail of the distribution

There are three major weaknesses of the above approach:
  Data may not follow a Normal distribution, or may be a mixture of distributions
  The mean of $\chi^2_d$ is d and its variance is 2d, so both grow with d. For high-dimensional data this is a problem
  The mean, and thus the variance, is extremely sensitive to outliers, and we are using them to find anomalies. This often leads to false negatives
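The tail test can be sketched for d = 2, where the chi-square distribution has the closed-form CDF 1 - exp(-t/2), so no statistics library is needed. The function names, the tail probability `alpha`, and the synthetic data below are assumptions for illustration. Note that the mean and covariance are estimated from the same (contaminated) data, which is exactly the weakness listed above.

```python
import math
import random


def chi2_tail_threshold_2d(alpha):
    """Value t with P(chi^2_2 > t) = alpha, using the d = 2 CDF 1 - exp(-t/2)."""
    return -2.0 * math.log(alpha)


def flag_gaussian_tail_2d(points, alpha=0.001):
    """Indices of 2-D points whose squared Mahalanobis distance lies in the tail."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance entries, estimated from the data itself.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    threshold = chi2_tail_threshold_2d(alpha)
    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mx, y - my
        d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        if d2 > threshold:
            flagged.append(i)
    return flagged


# Hypothetical data: 2000 standard-normal points plus one planted anomaly.
rng = random.Random(0)
data = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(2000)]
data.append((8.0, 8.0))
flagged = flag_gaussian_tail_2d(data)
```

For other values of d, the threshold would come from a chi-square quantile function rather than this closed form.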


Mahalanobis vs Euclidean Distance

Mahalanobis normalizes for variance

[Figure: scatter plot over the range -10 to 10 on both axes, with points 14, 16, 29, and 61 labeled]

Point Pairs   Mahalanobis   Euclidean
(14, 29)      5.07          11.78
(16, 61)      4.83          6.84


Distance-based anomalies

Intuition: a data point which is far away from its nearest neighbors is a candidate anomaly

Several definitions capture the above intuition

DB(p, D) anomaly [20]: an object o in a data set T is a DB(p, D) anomaly if at least a fraction p of the objects in T have distances greater than D from o

Generalizes the notion of "three standard deviations from the mean"

This definition had a huge influence on subsequent development in outlier detection


DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution N(0, 1). Then, if O is a DB(p, D) outlier:

$$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2} \, dx \le 1 - p$$

Example: if O is 3 (3 standard deviations away from the mean), then it is a DB(p, D) outlier with p = 0.999 and D = 0.1

Thus, for particular settings of D and p, DB(p, D) captures standard outliers

But it is much more general (e.g., any distance metric)
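A small sketch of both views of the definition: a direct check on a finite data set, and the Normal-distribution integral evaluated with the error function. It reads the slide's example as D = 0.1 and p = 0.999, which is an assumption about which number plays which role. The function names and toy data are hypothetical.

```python
import math


def is_db_outlier(o, data, p, D, dist):
    """True if at least a fraction p of the objects in `data` lie farther than D from o."""
    far = sum(1 for x in data if dist(o, x) > D)
    return far >= p * len(data)


def normal_mass_within(o, D):
    """P(o - D <= X <= o + D) for X ~ N(0, 1), computed from the error function."""
    def cdf(z):
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return cdf(o + D) - cdf(o - D)


# Finite-sample check with absolute difference as the distance (toy data).
toy = [0.0, 1.0, 2.0, 3.0, 100.0]
abs_diff = lambda a, b: abs(a - b)
far_point_is_outlier = is_db_outlier(100.0, toy, p=0.8, D=10.0, dist=abs_diff)
near_point_is_outlier = is_db_outlier(1.0, toy, p=0.8, D=10.0, dist=abs_diff)

# Normal example: probability mass within D = 0.1 of O = 3.
mass_at_three_sigma = normal_mass_within(3.0, 0.1)
```

Because `dist` is passed in as a parameter, the same check works for any distance metric, which is the generality the definition is meant to provide.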


Distance-based methods (DB(k, N))

DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest

Several advantages:
  Ranking of anomalies is more intuitive
  Setting of parameters is generally easier

A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Time complexity is $O(n^2 d)$, where n is the database size and d is the dimensionality
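The nested-loop idea can be sketched as follows. This is a plain quadratic version for illustration, with hypothetical function names, and it assumes the data set has more than k points.

```python
import heapq
import math


def knn_distance_scores(points, k):
    """Distance from each point to its k-th nearest neighbor (nested loop, O(n^2 d))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n_top):
    """Indices of the n_top points with the largest k-th nearest neighbor distance."""
    scores = knn_distance_scores(points, k)
    return heapq.nlargest(n_top, range(len(points)), key=lambda i: scores[i])


# Toy data: four points on a unit square and one point far away.
square_and_far = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
snl_result = top_n_outliers(square_and_far, k=2, n_top=1)
```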


Pruning rule

Pruning rule for DB(k, N) anomalies [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far

A large number of non-anomalies can be pruned without carrying out a full data search

Complexity: nearly O(n)
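A minimal sketch of the pruning rule, assuming more than k points and no particular data ordering. The current k-th nearest neighbor distance of a point can only shrink as more neighbors are examined, so once it drops below the weakest score in the top-N list the inner loop can stop early. The function name is hypothetical, and the near-linear behavior quoted above is an empirical property that this sketch does not guarantee.

```python
import heapq
import math


def top_n_outliers_pruned(points, k, n_top):
    """Top n_top points by k-th nearest neighbor distance, with early termination."""
    top = []        # min-heap of (score, index); top[0] is the weakest anomaly so far
    cutoff = 0.0    # score of the weakest anomaly once the top list is full
    for i, p in enumerate(points):
        nearest = []    # negated distances; -nearest[0] is the current k-th NN distance
        pruned = False
        for j, q in enumerate(points):
            if j == i:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # Pruning rule: the upper bound on this point's score fell below the cutoff.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -nearest[0]
        if len(top) < n_top:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_top:
            cutoff = top[0][0]
    return [i for _, i in sorted(top, reverse=True)]


pruned_result = top_n_outliers_pruned(
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)], k=2, n_top=1
)
```

Pruning tends to be most effective when a strong cutoff is established early, which is why implementations commonly shuffle the data before scanning.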


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable to large datasets (nearly O(n) with pruning)

Capable of finding only global anomalies

Can lead to non-intuitive results in Top-k situations

[Figure: scatter plot of dimension 1 against dimension 2 (both axes 0 to 70), with points p1, p2, p3, and p4 labeled]


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours:

$$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \mathrm{dist}_k(p, q)}{|N_k(p)|} \right)$$

$$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$

LOF indicates a degree of local outlier-ness [6]
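The three formulas can be sketched directly. As a simplifying assumption, this sketch uses the plain Euclidean distance for dist_k(p, q), so it is not the full LOF of [6] (which uses reachability distances). It also assumes no duplicate points, since a zero average distance would cause a division by zero. Function names are hypothetical.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i] (excluding i itself)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def density_anomaly_scores(points, k):
    """anomaly-score(p) = 1 / relative-density(p), following the formulas above."""
    nbrs = [k_nearest(points, i, k) for i in range(len(points))]
    # density(p): inverse of the average distance to the k nearest neighbors.
    density = [
        len(nb) / sum(math.dist(points[i], points[j]) for j in nb)
        for i, nb in enumerate(nbrs)
    ]
    scores = []
    for i, nb in enumerate(nbrs):
        avg_neighbor_density = sum(density[j] for j in nb) / len(nb)
        relative = density[i] / avg_neighbor_density
        scores.append(1.0 / relative)
    return scores


# Toy data: four points on a unit square and one point far away.
lof_like = density_anomaly_scores(
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)], k=2
)
```

Points whose density matches that of their neighbors score close to 1, while a point in a much sparser region than its neighbors scores well above 1.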


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

Require a method combining the strengths of distance and density based approaches: a distance based approach which can capture density

[Figure: two clusters, C1 and C2, with points O1 and O2]


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix L:

$$c_{ij} = V_G \, (e_i - e_j)^T L^+ (e_i - e_j)$$

  $L^+$: pseudo-inverse of L
  $V_G$: graph volume
  $e_i$: i-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of L:

$$c_{ij} = V_G \, [(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$$

  V, S: eigenvectors and eigenvalues of L
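For a small connected graph, the first formula can be evaluated without forming the pseudo-inverse explicitly. The sketch below relies on a standard identity: for a connected graph, L + (1/n) 11^T is invertible, and for any right-hand side b orthogonal to the all-ones vector its solution equals L^+ b. The function names, the dense Gaussian elimination, and the example graph are assumptions for illustration; this is cubic in n and not the fast estimator discussed later.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def commute_time(W, i, j):
    """c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j) for a connected, symmetric weight matrix W."""
    if i == j:
        return 0.0
    n = len(W)
    deg = [sum(row) for row in W]
    volume = sum(deg)  # V_G: sum of all degrees
    # A = L + (1/n) * ones * ones^T, where L = D - W.
    A = [
        [(deg[r] if r == c else 0.0) - W[r][c] + 1.0 / n for c in range(n)]
        for r in range(n)
    ]
    b = [0.0] * n
    b[i], b[j] = 1.0, -1.0
    x = solve(A, b)  # x = L^+ (e_i - e_j), since b is orthogonal to the ones vector
    return volume * (x[i] - x[j])


# Path graph 0 - 1 - 2 with unit edge weights.
path = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
end_to_end = commute_time(path, 0, 2)
adjacent = commute_time(path, 0, 1)
```

On the three-node path, a random walk needs 4 expected steps to go from one end to the other and 4 to return, so the end-to-end commute time is 8, which matches the value of the formula.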


Anomaly detection using commute time (CDOF)

Construct the mutual k nearest neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure where we can compute the commute time between two nodes in O(log n) time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near linear time solver for a linear system of equations Ax = b

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for Distance-based methods does not apply to Density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

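The PINN Algorithm provides probabilistic guarantees

Under certain assumptions about intrinsic dimensionality (c), with high probability:

$$\frac{1-\varepsilon}{1+\varepsilon} \cdot \mathrm{LOF}(p) \;\le\; \widehat{\mathrm{LOF}}(p) \;\le\; \frac{1+\varepsilon}{1-\varepsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ denotes the LOF estimate computed by PINN

In practice we do not know the intrinsic dimensionality of data. However, random projections are quite robust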


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data parallel algorithms
  Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  Utilization of device memory
  Minimize CPU↔GPU transfer of data
  Keep threads homogeneous

Most model based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model and memory hierarchy. The host launches a kernel as a grid of thread blocks, Block(0,0) to Block(1,2), on the device; each thread block contains threads, Thread(0,0) to Thread(2,3). The device has streaming multiprocessors SM 0 to SM 2, each with registers, stream processors SP 0 to SP 7, and shared memory, plus global, texture, and constant memory; the host has host memory]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory
  Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

[Figure: matrix diagram with dimensions labeled d, n, and T]


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed)
  Fraud Detection
  Cyber Networks

Time Series
  Sensor Networks
  Healthcare

Discrete Sequences
  Genomic
  System Calls

Spatial
  GIS
  Image analysis

Spatio-temporal
  Remote sensing
  Climate

Graphs
  Social networks
  Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  Binarize data
  Learn rules ($X \Rightarrow Y$)
    Choose high confidence rules ($P(Y|X)$)
  For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
    Y is not observed when X is observed

Using Bayesian Networks [33]
  Learn Bayesian network structure and parameters
  Compute $P(Z)$ for test data record Z
  Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  Use a similarity measure ($S(X_1, X_2)$)
  Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t:
  For each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$
  Score t based on all r-values
    Assign the minimum r-value as score
    Take the product of all r-values

Need to compare an exponential number of pairs of subsets
  Only consider subsets up to size k
  Ignore subsets with frequency less than a threshold $\alpha$
  Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  $C(a_t)$: number of training instances with $A = a_t$
  N: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks
  Replace rare attribute values with a generic attribute (reduce arity)
  Use an efficient data structure for querying counts (AD Trees [26])
  AD Trees work faster for low arity attributes
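A minimal sketch of the Laplace-smoothed r-value, restricted to single attributes rather than attribute sets and using plain scans instead of AD Trees. The function names and the toy training records are hypothetical.

```python
def laplace(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(records, attr_a, attr_b, test):
    """Smoothed ratio E(a_t, b_t) / (E(a_t) * E(b_t)) for two attributes of a test record."""
    n = len(records)
    a, b = test[attr_a], test[attr_b]
    c_a = sum(1 for r in records if r[attr_a] == a)
    c_b = sum(1 for r in records if r[attr_b] == b)
    c_ab = sum(1 for r in records if r[attr_a] == a and r[attr_b] == b)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Toy training data: convex caps are always edible, bell caps always poisonous.
train = (
    [{"cap-shape": "convex", "type": "edible"}] * 4
    + [{"cap-shape": "bell", "type": "poisonous"}] * 4
)
r_unseen_combo = r_value(train, "cap-shape", "type",
                         {"cap-shape": "convex", "type": "poisonous"})
r_common_combo = r_value(train, "cap-shape", "type",
                         {"cap-shape": "convex", "type": "edible"})
```

A combination never seen in training gets an r-value below 1 (the values co-occur less often than independence would predict), while a common combination gets a value above 1.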


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd webmail ssh webmail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
  Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$$

  Standard Dynamic Programming method is slow
  Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
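Both measures are short to write down. The sketch below uses the standard quadratic dynamic program for the LCS length (not the faster Hunt-Szymanski variant); the function names and the tokenized example sequences are assumptions for illustration.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


def lcs_length(a, b):
    """Length of the longest common subsequence, by row-wise dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
distance_s1_s2 = nlcs_distance(s1, s2)
```

Unlike the Hamming distance, the LCS-based distance accepts sequences of different lengths and tolerates insertions, which addresses the misalignment issue noted above.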


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing
  For each window, find the frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low frequency windows in the normal dictionary
  Rewards high frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
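The training and testing steps can be sketched with a counter as the normal dictionary. Adding 1 to the average frequency before inverting is an assumption of this sketch, made only to avoid division by zero when no window has been seen; the function names and the small training set are hypothetical.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 windows of size k from a sequence."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_sequences, k):
    """Counts of every window seen in the normal sequences."""
    counts = Counter()
    for s in normal_sequences:
        counts.update(windows(s, k))
    return counts


def window_anomaly_score(seq, counts, k):
    """Inverse of the average window frequency (plus 1, an assumption of this sketch)."""
    ws = windows(seq, k)
    if not ws:
        raise ValueError("sequence is shorter than the window size")
    avg_freq = sum(counts[w] for w in ws) / len(ws)
    return 1.0 / (avg_freq + 1.0)


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
dictionary = build_dictionary(normal, k=3)
score_normal = window_anomaly_score(normal[0], dictionary, k=3)
score_repeated_login = window_anomaly_score(
    "login pwd login pwd login pwd logout".split(), dictionary, k=3
)
```

The repeated-login sequence contains no window from the dictionary, so it receives the maximum score of 1 under this scoring choice.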


Using Probabilistic Models

Probability of occurrence of sequence S:

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences:

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$:

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?
  Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees
  Significantly reduces the size of the model
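A fixed-order sketch of this model follows. Two simplifications are assumptions of the sketch and differ from the slide: unseen contexts are handled with add-one smoothing rather than by backing off to a shorter suffix as a Probabilistic Suffix Tree would, and the score is the average negative log-probability per symbol, which increases as the normalized probability decreases. Function names and data are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(sequences, k):
    """Counts of length-k contexts and of the (k+1)-grams that extend them."""
    context, extended = Counter(), Counter()
    for s in sequences:
        for i in range(len(s) - k):
            context[tuple(s[i:i + k])] += 1
            extended[tuple(s[i:i + k + 1])] += 1
    return context, extended


def markov_anomaly_score(seq, context, extended, k, alphabet_size):
    """Average negative log-probability per symbol; higher means more anomalous."""
    if len(seq) <= k:
        raise ValueError("sequence must be longer than the context length k")
    total = 0.0
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        ext = tuple(seq[i - k:i + 1])
        # Add-one smoothing (assumption): keeps unseen contexts from giving zero probability.
        p = (extended[ext] + 1) / (context[ctx] + alphabet_size)
        total -= math.log(p)
    return total / (len(seq) - k)


normal_logs = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
ctx_counts, ext_counts = ngram_counts(normal_logs, k=2)
alphabet = {sym for s in normal_logs for sym in s}
markov_normal = markov_anomaly_score(
    normal_logs[0], ctx_counts, ext_counts, 2, len(alphabet))
markov_repeated_login = markov_anomaly_score(
    "login pwd login pwd login pwd logout".split(), ctx_counts, ext_counts, 2, len(alphabet))
```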


Finding Contextual Anomalies

Sometimes contextual information is available about data
  Not used directly as a feature
  Are well understood, no anomalies in the context
  Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: Number of Fever Cases plotted against Maximum Daily Temperature, with one point marked "Clearest Outlier" and another marked "Conditional Anomaly"]

Contextual anomalies - anomalous with respect to a context
  Context is defined using environmental variables
    Spatial (Latitude, Longitude)
    Graph context (Edges, Weights)
    Temporal location
    Domain specific (Demographic, other)

How to incorporate context?
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

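Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$
  $d_U$ environmental attributes
  $d_V$ indicator attributes

[Figure: Gaussian components U1, U2, U3 in the environmental space (u1, u2) mapped to components V1, V2, V3 in the indicator space (v1, v2), with p(V1|U3) = 0.2, p(V2|U3) = 0.2, p(V3|U3) = 0.6]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j | U_i)$
4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j) \, p(V_j | U_i)$$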
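The scoring step can be sketched with one-dimensional Gaussians standing in for the learned components. Everything numeric here is hypothetical (component means, standard deviations, and the mapping probabilities are made up for illustration, loosely following the temperature and fever-cases example), and the learning steps 1 to 3 are not shown.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a univariate Normal distribution."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i), with mapping[i][j] = p(V_j|U_i)."""
    return sum(
        gauss_pdf(u, *Ui)
        * sum(gauss_pdf(v, *Vj) * mapping[i][j] for j, Vj in enumerate(V))
        for i, Ui in enumerate(U)
    )


# Hypothetical components as (mean, std): temperature context and fever-case counts.
U = [(5.0, 3.0), (30.0, 3.0)]      # cold days, hot days
V = [(10.0, 3.0), (50.0, 5.0)]     # few cases, many cases
mapping = [[0.1, 0.9],             # cold days mostly go with many cases
           [0.9, 0.1]]             # hot days mostly go with few cases

score_hot_few = conditional_score(30.0, 10.0, U, V, mapping)
score_hot_many = conditional_score(30.0, 50.0, U, V, mapping)
```

Since S behaves like a likelihood, a lower value indicates a more unusual indicator value for the given context: under these made-up parameters, many fever cases on a hot day scores lower than few cases on a hot day.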


Finding Collective Anomalies

Find a collection of data points
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples
1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

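Find anomalous groups in data [34]

Example: spatial clusters of galaxies
- Topics: red, green, emissive
- Words: continuous features

Flexible Genre Model (FGM)

For each group $m$:

1. Draw a genre: $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$
2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw $K$ topics: $\beta_{m,k} \sim P(\beta_{m,k}|\eta_k)$, $k = 1, 2, \ldots, K$
4. For each point $n$ in the group:
   1. Draw the topic membership: $z_{mn} \sim M(\theta_m)$
   2. Generate the point: $x_{mn} \sim P(x_{mn}|\beta_{m,z_{mn}})$

[Figure: plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\eta$, $\beta$ and plates $N$, $M$, $K$, $T$]

Model Parameters

- $M(\pi)$: Multinomial
- Genre: $\mathrm{Dir}(\alpha_t)$
- Topic generators $P(\cdot|\eta_k)$: GIW
- Point generators $P(x_{mn}|\beta_{m,k})$: Multivariate Gaussian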


Inference and Testing for FGM

Inference and Learning Parameters

- Approximate inference of latent variables (Gibbs sampling)
- Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection

- Infer the topic distribution $\theta_m$
- Compute the negative log likelihood w.r.t. $\alpha_t$
- Rationale: an anomalous group will be unlikely to be generated from any genre
- Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
- Confusion matrix
- Traditional evaluation metrics
  - Class imbalance
- ROC curve

Validation set does not exist:
- Use domain expertise to find TP and FP
- FN is harder to estimate
- Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
Actual a      TP            FN
Actual n      FP            TN

$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}$

$\mathrm{Rec}\ (R) = \frac{TP}{TP + FN}$

$\mathrm{Prec}\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
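The metrics above can be computed directly from the confusion matrix counts. The following Python sketch is illustrative; the function names and the label encoding ("a" for anomalous, "n" for normal) are assumptions made for the example.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Count TP, FN, FP, TN, treating `positive` as the anomalous class."""
    tp = fn = fp = tn = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            tp += 1
        elif act == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from the four counts."""
    total = tp + fn + fp + tn
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"acc": (tp + tn) / total, "rec": rec, "prec": prec, "f": f}
```

With 2 anomalies among 10 instances, a detector that labels everything normal still reaches an accuracy of 0.8 while its recall is 0, which is why class imbalance makes accuracy a poor guide here.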


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
- Use threshold $\delta$ on the score (scale issues [21])
- Take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output:
- Does the output "suggest" $x$ or $\delta$?
- Which output is better?
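Both conversions, and the ROC curve obtained by sweeping the threshold, can be sketched in Python as below. The sketch assumes that higher scores mean more anomalous, that labels are 1 for anomalies and 0 for normal instances, and that both classes are present; the function names are illustrative.

```python
def top_x(scores, x):
    """Indices of the x highest-scoring instances."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:x]


def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold delta over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

The area under the curve gives one way to compare two score outputs without committing to a particular $x$ or $\delta$.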


Unifying Scores

Different methods assign scores in different ranges:
- kNN-based scores: [0, 1]
- LOF scores: [1, 6]
- ABOD scores: [0, 80000]; anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized procedure for normalizing outlier scores [21]:
- Regularity: $S(o) \ge 0 \;\forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
- Normality: $S$ is regular and $S(o) \in [0, 1] \;\forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\, S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{\max} - S(o)$
3. $R(o) = -\log\frac{S(o)}{S_{\max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{\max}}$
2. $N(o) = \max\left(0,\, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling). Suited for high dimensional data.
3. $N(o) = \max\left(0,\, \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling), where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function. Suited for low dimensional data.
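A minimal Python sketch of the regularization options and of linear and Gaussian scaling follows. It is an illustration rather than the reference procedure of [21]: the function names are hypothetical, the mean and standard deviation are estimated from the scores themselves, and Gamma scaling is omitted because the regularized Gamma function is not in the Python standard library.

```python
import math
import statistics


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base): for scores whose normal level is `base` (e.g. LOF, base 1)."""
    return [max(0.0, s - base) for s in scores]


def regularize_linear_inversion(scores):
    """R(o) = S_max - S(o): for scores where low values mean anomalous."""
    s_max = max(scores)
    return [s_max - s for s in scores]


def regularize_log_inversion(scores):
    """R(o) = -log(S(o) / S_max): inversion for positive scores with a very wide range."""
    s_max = max(scores)
    return [-math.log(s / s_max) for s in scores]


def normalize_linear(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]


def normalize_gaussian(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2)))); assumes the scores are not all equal."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

On LOF-like scores such as [1.0, 1.1, 0.9, 1.0, 6.0], Gaussian scaling maps the four typical values to 0 and the outlying value close to 1.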


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

- Use generative models for normal and anomalous behavior
- Several generators are available:
  - Multivariate continuous data [27]
  - Multivariate categorical data [5]
  - Discrete sequences using HMM [12]
- Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

- Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
- Take a random sample $D$ of the entire data set
- Let $z' = (x', y') \in D$ be such that the distance between $y$ and $y'$ is maximum
- Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane, with the injected anomaly marked]
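The perturbation steps listed above can be sketched in Python as follows. The sketch assumes each data point is stored as a pair of feature tuples `(x, y)` and uses Euclidean distance; the function name `inject_anomaly` and its parameters are illustrative.

```python
import math
import random


def inject_anomaly(z, data, sample_size, rng):
    """Return a perturbed copy of z = (x, y).

    A random sample D of `data` is drawn, the point z' = (x', y') in D whose
    y' is farthest from y is selected, and x is replaced with x'.
    """
    _, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda zp: math.dist(y, zp[1]))
    return (x_far, y)
```

A usage example would be `inject_anomaly(data[0], data, 50, random.Random(0))`; the returned pair combines an x part and a y part that rarely occur together, which is what makes it a contextual anomaly.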


Applications Overview

How to set up an anomaly detection solution for a given application domain?

- Available data
- Define anomalies, define normal behavior
- Identify requirements and constraints (online, real-time, limited resources)
- What domain knowledge is available?
  - Feature identification
  - Defining normal and anomalous behavior
  - Tuning parameters
- Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

- Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures. Such a system is therefore vulnerable to zero-day attacks.
- Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents.

Network anomaly detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart with a y-axis from 0 to 3,000,000. Bar labels:]

Period                   Count
2002                     20,547
2003                     18,827
2004                     69,107
2005                     113,025
2006                     140,690
2007                     624,267
2008                     1,656,227
July 2009 - June 2010    2,704,836


Network Topology

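[Figure: a typical network. A router connected to nodes 1 to 4, with OD flows labeled $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and a link measurement labeled $y_{1t}$]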

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$).

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$).

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$.
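The role of the routing matrix can be illustrated with a small Python sketch. It assumes the usual linear relationship $y = A x$, where `A[l][f]` is 1 if OD flow `f` traverses link `l` and 0 otherwise; the toy topology and the function name `link_loads` are made up for the example.

```python
def link_loads(A, x):
    """Link traffic y = A x for routing matrix A (links x flows) and OD flow vector x."""
    return [sum(a * flow for a, flow in zip(row, x)) for row in A]


# Two links, three OD flows: flow 1 crosses both links.
A = [
    [1, 1, 0],
    [0, 1, 1],
]
normal = link_loads(A, [100, 20, 300])  # [120, 320]
spike = link_loads(A, [100, 60, 300])   # [160, 360]
```

In this toy case OD flow 1 triples, yet the second link's load rises only from 320 to 360, which shows how an OD-level volume anomaly can be hard to see in link measurements.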


Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach will not work if OD flow information is not available.

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
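The five steps can be sketched in pure Python as below. This is a simplified illustration, not the implementation behind [22]: the eigenvectors are found by power iteration with deflation (a linear algebra library would normally be used instead), the data is mean-centered before projection, and the threshold on the residual norm is left to the caller. All function names are hypothetical.

```python
import math


def covariance(rows):
    """Column means and sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return means, cov


def top_components(cov, k, iters=200):
    """Steps 1 and 2: top-k eigenvectors of cov via power iteration with deflation."""
    d = len(cov)
    m = [row[:] for row in cov]
    comps = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d
        for _ in range(iters):
            w = [sum(m[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(m[a][b] * v[b] for b in range(d)) for a in range(d))
        comps.append(v)
        # Deflate: remove the component just found before searching for the next.
        m = [[m[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    return comps


def residual_norm(z, means, comps):
    """Steps 3 to 5: norm of (I - P P^T) applied to the centered vector z."""
    centered = [zi - mi for zi, mi in zip(z, means)]
    resid = centered[:]
    for p in comps:
        coef = sum(pi * ci for pi, ci in zip(p, centered))
        resid = [r - coef * pi for r, pi in zip(resid, p)]
    return math.sqrt(sum(r * r for r in resid))
```

A measurement that follows the dominant correlation pattern leaves a small residual, while one that breaks the pattern leaves a large one, even if each coordinate is within its usual range.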


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably.

Thus anomalies can be detected by setting a threshold on the residual component.

Figure source [22]


Discussion: Typical Characteristics of Anomalies

- Most anomalies induce a change in distributional aspects of packet header fields (called features)
- The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  - DoS attack: multiple source IP addresses concentrated on a single destination IP address
  - Network scan: dispersed distribution of destination addresses
  - Most worms/viruses also induce some change in the distribution of certain features
  - However, these changes can be very subtle, and mining them is like searching for needles in a haystack
- Unlike many previous approaches, this work aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly: the traffic feature distribution changes, but the traffic volume remains more or less the same.

[Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure. Figure source [22]]

We can use entropy to capture the variations in the traffic features:
- Takes value 0 when the distribution is maximally concentrated
- Takes value $\log_2 N$ when the distribution is maximally dispersed
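The two boundary cases of entropy can be checked with a short Python sketch. It computes the empirical (sample) entropy of one traffic feature, for example the destination ports seen in a time bin; the function name `sample_entropy` and the example values are illustrative.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Empirical entropy in bits: H = -sum_i (n_i / S) * log2(n_i / S)."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


concentrated = sample_entropy([80] * 8)           # every packet to one port
dispersed = sample_entropy([22, 25, 80, 443])     # four ports, equally likely
```

Here `concentrated` is 0 and `dispersed` is log2(4) = 2 bits. A port scan pushes the destination-port entropy toward the dispersed end even when the packet count barely changes.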


Entropy Based versus Volume Based

- DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
- Scan anomaly: a spike in traffic from a dominant source IP
- Flash crowd anomaly: again, a spike in traffic toward a dominant destination IP
- Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data:
- Transactions between different players in the system
- Meta information about the individuals
- An underlying graph structure

Challenges:
- Track and model human behavior
- Anomalies caused by adaptive human adversaries
- Massive data sizes

Domains and their players:
- Insurance (auto, health [a]): claimant, provider, payer
- Telecommunications: customer, provider
- Credit cards: customer, supplier, bank
- Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]:
1. Build profiles for individuals (customers, users, etc.) based on historic data. Example: user X makes n calls on average in January.
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]:
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
- Anomaly detection is fast (good for real time)
- Results are easy to explain

Weaknesses:
- Need to create and maintain a large number of profiles
- Not dynamic
- Adequate historical data might not be available
- Too many false positives
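The activity monitoring idea can be illustrated with a very small Python sketch. It is one possible reading, not the method of [13]: the profile is assumed to be just the mean and standard deviation of a user's historical counts, and a deviation is measured in standard deviations. The name `deviation_score` is hypothetical.

```python
import statistics


def deviation_score(history, current):
    """Number of standard deviations between `current` and the historical profile."""
    mu = statistics.fmean(history)
    sigma = statistics.stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


# Calls made by user X in January of previous years (made-up numbers).
january_calls = [40, 42, 38, 41, 39]
```

With this profile, 41 calls gives a score below 1, while 120 calls gives a score above 50. The sketch also shows the stated weakness: one such profile has to be stored and refreshed per user and per period.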


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
- Communication networks (phone, email, SMS)
- Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of the node
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space
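The first three egonet features can be computed with a few lines of Python. The sketch assumes an undirected graph without self-loops, stored as a symmetric dictionary `{node: {neighbor: weight}}`; the eigenvalue feature is left out to keep the example free of a linear algebra dependency. The name `egonet_features` is illustrative.

```python
def egonet_features(graph, i):
    """Return (N_i, E_i, W_i) for node i of a symmetric weighted adjacency dict."""
    ego = {i} | set(graph[i])
    # Each undirected edge appears twice in a symmetric dict, once per endpoint.
    edges = [(u, v, w) for u in ego for v, w in graph[u].items() if v in ego]
    n_i = len(graph[i])
    e_i = len(edges) // 2
    w_i = sum(w for _, _, w in edges) / 2
    return n_i, e_i, w_i


star = {"c": {"a": 1, "b": 1, "d": 1}, "a": {"c": 1}, "b": {"c": 1}, "d": {"c": 1}}
print(egonet_features(star, "c"))
```

For the center of a star the number of egonet edges equals the degree, while for a node in a clique it grows roughly with the square of the degree; these are the two extremes the feature pair $N_i$ vs. $E_i$ is meant to separate.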


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:
- Identify relevant feature pairs and their power law relationship
- E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i,\, C x_i^{\theta})}{\min(y_i,\, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
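The anomaly score formula translates directly into Python. The sketch assumes that the power law parameters $C$ and $\theta$ have already been fitted and that both the observed and the expected values are positive; the function name `power_law_score` is illustrative.

```python
import math


def power_law_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)
```

A node that lies exactly on the fitted line scores 0, and the score grows with both the ratio and the absolute distance between the observed and the expected value.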


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
- Healthcare data (beneficiaries vs. providers)
- Insider trading (traders vs. stocks)

Objective: identify anomalous links

- Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
- RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$
- Use the relevance scores to compute the normality score for a node $t \in V_2$:
  - Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  - Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  - Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$]


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
- Anthrax attack

Data:
- Emergency department visits
- Grocery data (example: [14])
- Clinical visits
- Weather/climate data

Challenges:
- Weak signals in the data: ED cases involving cough $\Rightarrow$ flu or SARS?
- Integration of multiple signals (lag analysis)
- Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   - Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   - Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
   - Null hypothesis: rows and columns of the table for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value. Repeat Step 2.
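Steps 3 to 5 can be sketched for single rules with a 2x2 contingency table. The Python below is an illustration under stated assumptions: it uses the plain Pearson $\chi^2$ statistic without continuity correction, obtains the p-value for one degree of freedom from the identity $p = \mathrm{erfc}(\sqrt{\chi^2 / 2})$, and treats the rule with the smallest p-value as the most significant. The function names and the example counts are made up.

```python
import math


def chi2_pvalue_2x2(cur_match, cur_other, base_match, base_other):
    """p-value of the chi-square independence test on a 2x2 table.

    Rows: current vs. baseline data set; columns: records matching the rule or not.
    """
    a, b, c, d = cur_match, cur_other, base_match, base_other
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 1.0  # a zero margin carries no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / margins
    return math.erfc(math.sqrt(chi2 / 2))  # chi-square survival function, 1 dof


def most_significant_rule(tables):
    """Name of the rule whose table yields the smallest p-value."""
    return min(tables, key=lambda rule: chi2_pvalue_2x2(*tables[rule]))
```

For example, with `{"cough": (40, 60, 100, 900), "fever": (10, 90, 100, 900)}` the rule "cough" is returned, because its share of matching records rose from 10% in the baseline to 40% in the current data while "fever" stayed at 10%.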


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
- What happened yesterday?
- What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
- Add nodes for temporal and spatial dependencies
- Need more data to train


Anomaly Detection in Climate and Weather

Science questions:
- Identify natural and anthropogenic disasters
- Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
- Ground observations
- Remote sensing data (satellites, air-borne)
- Climate model simulation outputs
- Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
- Model spatio-temporal relationships across multiple variables
- Explain the cause of anomalies
- Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:
- Difference from a "base period" (too simplistic)
- Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
- Removes seasonality

Understand climate and weather phenomena:
- Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
- Mean of all data for location $i$
- Monthly mean values (account for seasonality)
- Monthly z-score values
- Median (more robust)
- Using a shorter "reference period" (30 year moving window)

Different methods show statistically significant differences [19]:
- What is the right strategy?
- Weighted mean of different strategies (pick weights using Monte Carlo sampling)
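Two of the base choices, monthly means and monthly z-scores, can be sketched in Python as follows. The sketch assumes a monthly series for a single location that starts in January and covers at least one full year; the function name `monthly_anomalies` is illustrative.

```python
import statistics


def monthly_anomalies(values, use_zscore=False):
    """v'_t = v_t - b_month(t), optionally divided by the month's standard deviation."""
    by_month = [values[m::12] for m in range(12)]
    base = [statistics.fmean(v) for v in by_month]
    spread = [statistics.pstdev(v) for v in by_month]
    anomalies = []
    for t, v in enumerate(values):
        m = t % 12
        a = v - base[m]
        if use_zscore:
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies
```

Subtracting a separate base per calendar month removes the seasonal cycle, so what remains is the departure of each month from its own typical value.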


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time (Matlab: bwlabel, bwlabeln)

Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
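The two-step approach can be sketched in pure Python as a stand-in for the Matlab labeling functions. The sketch assumes precipitation anomalies stored in a dictionary keyed by integer `(t, row, col)` cells and uses face connectivity (cells differing by 1 along exactly one axis), which may differ from the connectivity used in the original analysis. The function names are hypothetical.

```python
from collections import deque


def connected_components(cells):
    """Group integer coordinate tuples into face-connected components (BFS)."""
    remaining = set(cells)
    components = []
    while remaining:
        seed = remaining.pop()
        comp, queue = [seed], deque([seed])
        while queue:
            cell = queue.popleft()
            for axis in range(len(cell)):
                for step in (-1, 1):
                    nb = cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]
                    if nb in remaining:
                        remaining.remove(nb)
                        comp.append(nb)
                        queue.append(nb)
        components.append(comp)
    return components


def drought_candidates(precip, threshold, min_size):
    """Step 1: keep cells below `threshold`; step 2: keep components of at least `min_size` cells."""
    dry = {cell for cell, value in precip.items() if value < threshold}
    return [sorted(c) for c in connected_components(dry) if len(c) >= min_size]
```

Because time is one of the coordinates, a component that spans many time steps corresponds to a persistent dry region rather than a single dry month.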


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]:
- Water vapor content
- Anomalies using a threshold
- Connected components
- Example src: http://newscenter.lbl.gov

Cold Fronts [23]:
- Surface winds and potential temperature fields

Methodology

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
- Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
- Anomalies are not actionable:
  - Not identified in a timely fashion
  - Resolution is not fine enough
  - Cause not explained
- Anomalies lost among false positives

Solution:
- Good validation data during design
- Clear definition of a domain anomaly and its distinction from other potential competitors


L Akoglu M McGlohon and C Faloutsos

OddBall Spotting Anomalies in Weighted GraphsIn In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) June 2010

A Asuncion and D J Newman

UCI machine learning repository[httparchiveicsucieduml] Irvine CA University of California 2007

S D Bay and M Schwabacher

Mining distance-based outliers in near linear time with randomization and a simple pruning ruleIn Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and datamining pages 29ndash38 ACM Press 2003

R Bolton and D Hand

Unsupervised profiling methods for fraud detectionIn Credit Scoring and Credit Control VII 1999

S Boriah V Chandola and V Kumar

Similarity measures for categorical data A comparative evaluationIn SDM pages 243ndash254 2008

M M Breunig H-P Kriegel R T Ng and J Sander

Lof identifying density-based local outliersIn Proceedings of 2000 ACM SIGMOD International Conference on Management of Data pages 93ndash104ACM Press 2000

S Budalakoti A Srivastava and M Otey

Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airlinesafetyProceedings of the IEEE International Conference on Systems Man and Cybernetics 37(6) 2007

S Byna Prabhat M Wehner and K Wu

Detecting atmospheric rivers in large climate datasets

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

In 2nd International Workshop on Petascale Data Analytics Challenges and Opportunities (PDAC-11)2011

V. Chandola, A. Banerjee, and V. Kumar.

Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

V. Chandola, S. Boriah, and V. Kumar.

A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

V. Chandola, V. Mithal, and V. Kumar.

A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

V. Chandola, V. Mithal, and V. Kumar.

Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

T. Fawcett and F. Provost.

Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg.

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

D. Hawkins.

Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

S. A. Hofmeyr, S. Forrest, and A. Somayaji.

Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

X. Jiang and G. F. Cooper.

A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

K. Das and J. Schneider.

Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar.

Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

E. M. Knorr and R. T. Ng.

Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek.

Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

A. Lakhina, M. Crovella, and C. Diot.

Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem.

Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


M. V. Mahoney and P. K. Chan.

Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

S. V. Mane.

False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

A. Moore and M. S. Lee.

Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

Y. Pei and O. Zaïane.

A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

S. Ramaswamy, R. Rastogi, and K. Shim.

Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

N. Satish, M. Harris, and M. Garland.

Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

X. Song, M. Wu, C. Jermaine, and S. Ranka.

Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

D. A. Spielman and N. Srivastava.

Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


P. Sun, S. Chawla, and B. Arunasalam.

Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner.

Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Póczos, and J. Schneider.

Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Tutorial Outline

Introduction and Overview

Theory
  • Statistical Methods
  • Distance and Density Based Methods
  • Addressing Scalability
  • Anomalies in Complex Data
  • Evaluation Methods

Applications
  • Network Intrusion Detection
  • Fraud Detection
  • Epidemiological Studies
  • Climate and Weather Data Analysis


Anomaly Detection - Overview

In data mining, anomaly (or outlier) detection is one of the four tasks:

  • Classification
  • Clustering
  • Pattern Mining
  • Anomaly Detection

Historically, the detection of anomalies has led to the discovery of new theories. Famous examples include:

  • El Niño and the Southern Oscillation Index (SOI)
  • The discovery of the planet Neptune
  • The use of fluoride in toothpaste

Anomalies often lead to "surprise", a form of inference known as abduction (different from induction and deduction).


Definition

Hawkins: "an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [15]

[Figure: scatter plot of dimension 1 versus dimension 2 (both axes from 0 to 70) with four labeled points: p1, p2, p3, p4.]


Statistical Methods

Let's begin with the univariate Normal distribution:

$$f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \, e^{-[(x-\mu)/\sigma]^2/2}$$

Notice that the exponent measures the squared deviation from the mean, normalized by the standard deviation:

$$\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu)$$

For d dimensions, the corresponding term in the exponent is called the (squared) Mahalanobis distance:

$$(x-\mu)'\,\Sigma^{-1}(x-\mu)$$

where Σ is the d × d variance-covariance matrix.
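
As an illustration of the formula above, the following sketch computes the squared Mahalanobis distance for 2-D points in plain Python and compares it with the Euclidean distance. The function name `mahalanobis_sq_2d`, the closed-form 2 × 2 inverse, and the toy mean and covariance are assumptions made for this example; they do not come from the tutorial.

```python
import math

def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)' Sigma^{-1} (x - mu) for 2-D data."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2x2 matrix (illustrative; general d needs a solver).
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

mu = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))   # hypothetical: variance 4 along dimension 1, 1 along dimension 2
p, q = (2.0, 0.0), (0.0, 2.0)    # both at Euclidean distance 2 from the mean

print(math.dist(p, mu), math.dist(q, mu))                            # 2.0 2.0
print(mahalanobis_sq_2d(p, mu, cov), mahalanobis_sq_2d(q, mu, cov))  # 1.0 4.0
```

The two points are equally far from the mean in Euclidean terms, but q deviates along the low-variance dimension and therefore gets the larger Mahalanobis distance.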


Anomaly Detection with Mahalanobis Distance

The key observation is that if data x follows a d-dimensional Gaussian distribution, then

$$(x-\mu)'\,\Sigma^{-1}(x-\mu) \sim \chi^2_d$$

Anomalies can be found in the tail of the distribution.

There are three major weaknesses of the above approach:

  • Data may not follow a Normal distribution, or may be a mixture of distributions.
  • The mean of χ²_d is d and its variance is 2d, so both grow with the dimension. For high-dimensional data this is a problem.
  • The estimated mean, and thus the estimated variance, are extremely sensitive to outliers, yet we are using them to find anomalies. This often leads to false negatives.
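
A minimal sketch of the tail test, assuming d = 2 so that the chi-square quantile has a closed form (the survival function of χ² with 2 degrees of freedom is exp(−t/2)). The function names, the tail probability `alpha`, and the list of squared distances are hypothetical; for general d a chi-square quantile routine would be needed instead.

```python
import math

def chi2_tail_threshold_2d(alpha):
    """Value t with P(chi^2_2 > t) = alpha; for 2 degrees of freedom
    the survival function is exp(-t / 2), so t = -2 ln(alpha)."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return -2.0 * math.log(alpha)

def flag_anomalies(sq_distances, alpha=0.01):
    """Indices whose squared Mahalanobis distance falls in the upper alpha tail."""
    t = chi2_tail_threshold_2d(alpha)
    return [i for i, d2 in enumerate(sq_distances) if d2 > t]

sq_distances = [0.3, 1.9, 12.5, 4.2]            # hypothetical squared Mahalanobis distances
print(round(chi2_tail_threshold_2d(0.01), 3))   # 9.21
print(flag_anomalies(sq_distances))             # [2]
```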


Mahalanobis vs Euclidean Distance

Mahalanobis normalizes for variance

[Figure: scatter plot on axes from −10 to 10 with four labeled points: 14, 16, 29, and 61.]

Point pair   Mahalanobis   Euclidean
(14, 29)     5.07          11.78
(16, 61)     4.83          6.84


Distance-based anomalies

Intuition: a data point that is far away from its nearest neighbors is a candidate anomaly.

Several definitions capture the above intuition.

DB(p, D) anomaly [20]: an object o in a data set T is a DB(p, D) anomaly if at least a fraction p of the objects in T have distances greater than D from o.

This generalizes the notion of "three standard deviations from the mean".

This definition had a huge influence on subsequent developments in outlier detection.
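
The DB(p, D) definition translates almost directly into code. The sketch below is one possible reading, assuming Euclidean distance and counting o itself among the objects of T; the function name `is_db_outlier` and the toy data are illustrative only.

```python
import math

def is_db_outlier(o, data, p, D):
    """True if at least a fraction p of the objects in data lie farther than D from o.

    The object o itself (distance 0) is counted among the objects in data.
    """
    far = sum(1 for x in data if math.dist(o, x) > D)
    return far >= p * len(data)

# Toy data set T: four clustered points and one isolated point.
T = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print([is_db_outlier(o, T, p=0.75, D=3.0) for o in T])
# [False, False, False, False, True]
```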


DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution N(0, 1). Then, if O is a DB(p, D) outlier,

$$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2}\,dx \le 1 - p$$

Example: if O is 3 (3 standard deviations away from the mean), then it is a DB(0.999, 0.1) outlier, that is, p = 0.999 and D = 0.1.

Thus, for particular settings of D and p, DB(p, D) captures standard outliers.

But it is much more general (e.g., it works with any distance metric).
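
The example can be checked numerically. This sketch evaluates the integral with the standard Normal CDF (via `math.erf`) for O = 3, taking p = 0.999 and D = 0.1 as the parameter reading; the helper names are assumptions for illustration.

```python
import math

def normal_cdf(z):
    """CDF of the standard Normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mass_within(o, D):
    """Probability mass of N(0, 1) inside the interval [o - D, o + D]."""
    return normal_cdf(o + D) - normal_cdf(o - D)

o, D, p = 3.0, 0.1, 0.999
mass = mass_within(o, D)
print(mass <= 1 - p)                     # True: O = 3 satisfies the DB(p, D) condition
print(mass_within(0.0, D) <= 1 - p)      # False: the mean itself does not
```

Roughly 0.09% of the probability mass lies within 0.1 of the point 3, which is below the allowed 1 − p = 0.1%.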


Distance-based methods (DB(k, N))

DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest.

Several advantages: ranking anomalies is more intuitive, and setting the parameters is generally easier.

A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Time complexity is O(n²d), where n is the database size and d is the dimensionality.
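
A minimal sketch of the simple nested loop, assuming Euclidean distance and more than k data points. The function name and toy data are hypothetical, and the per-point sort is a convenience that adds a logarithmic factor on top of the O(n²d) distance computations.

```python
import heapq
import math

def top_n_knn_outliers(data, k, n_top):
    """Simple nested loop: score each point by its distance to its k-th nearest neighbour."""
    scores = []
    for i, x in enumerate(data):
        dists = sorted(math.dist(x, y) for j, y in enumerate(data) if j != i)
        scores.append((dists[k - 1], i))
    return heapq.nlargest(n_top, scores)   # (score, index) pairs, largest score first

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (9, 0)]
print([i for _, i in top_n_knn_outliers(data, k=2, n_top=2)])   # [5, 4]
```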


Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far.

A large number of non-anomalies can be pruned without carrying out a full data search.

Complexity: nearly O(n).
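
The pruning rule can be sketched as follows. This is an illustrative reading of the rule, not the implementation from [3]: the function name, the random scan order, and the heap bookkeeping are assumptions, and the data set is assumed to hold more than k points.

```python
import heapq
import math
import random

def top_n_with_pruning(data, k, n_top, seed=0):
    """Top-N k-th-nearest-neighbour outliers with early termination of hopeless candidates."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)   # random scan order makes early pruning more likely
    top = []        # min-heap of (score, index); top[0] is the weakest anomaly so far
    cutoff = 0.0    # score of the weakest anomaly once n_top candidates are known
    for i in range(len(data)):
        neighbours = []   # negated distances: a max-heap of the k smallest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(data[i], data[j])
            if len(neighbours) < k:
                heapq.heappush(neighbours, -d)
            elif d < -neighbours[0]:
                heapq.heapreplace(neighbours, -d)
            # The current k-th NN distance can only shrink, so it bounds the final score.
            if len(neighbours) == k and -neighbours[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -neighbours[0]
        if len(top) < n_top:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_top:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (9, 0)]
print([i for _, i in top_n_with_pruning(data, k=2, n_top=2)])   # [5, 4]
```

On this tiny data set the result matches the nested-loop scores; the benefit of pruning only shows on larger data, where most points sit in dense regions and are discarded after a few distance computations.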


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data.

Scalable to large datasets (nearly O(n) with the pruning rule).

Capable of finding only global anomalies.

Can lead to non-intuitive results in top-k situations.

[Figure: scatter plot of dimension 1 versus dimension 2 (both axes from 0 to 70) with four labeled points: p1, p2, p3, p4.]


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours:

$$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)$$

$$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$

LOF indicates a degree of local outlier-ness [6].
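
The three formulas can be sketched in a few lines. This is a simplified illustration, assuming plain Euclidean distance in place of dist_k and distinct data points; it is not the full LOF algorithm from [6], and all function names are hypothetical.

```python
import math

def knn(data, i, k):
    """Indices of the k nearest neighbours of data[i] (excluding i itself)."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: math.dist(data[i], data[j]))[:k]

def density(data, i, k):
    """Inverse of the average distance from data[i] to its k nearest neighbours."""
    nbrs = knn(data, i, k)
    avg = sum(math.dist(data[i], data[j]) for j in nbrs) / len(nbrs)
    return 1.0 / avg   # assumes distinct points, so avg > 0

def anomaly_score(data, i, k):
    """Inverse relative density: about 1 inside a uniform cluster, larger for local outliers."""
    nbrs = knn(data, i, k)
    mean_nbr_density = sum(density(data, j, k) for j in nbrs) / len(nbrs)
    relative_density = density(data, i, k) / mean_nbr_density
    return 1.0 / relative_density

data = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3)]
for i in range(len(data)):
    print(data[i], round(anomaly_score(data, i, k=2), 3))
```

The four corner points of the unit square score 1.0, while the point (3, 3) scores above 3 because it is much sparser than its neighbours.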


Strengths and Weaknesses

Can detect both global and local anomalies.

Cannot use the pruning technique, and has a complexity of O(n²).

This calls for a method that combines the strengths of the distance-based and density-based approaches: a distance-based approach that can capture density.

[Figure: two clusters, C1 and C2, and two outlying points, O1 and O2.]


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time.

Commute time can capture both the distance between points and the data densities.


Computation of commute time

Commute time can be computed using the graph Laplacian matrix L:

$$c_{ij} = V_G\,(e_i - e_j)^T L^{+} (e_i - e_j)$$

  • L⁺: pseudo-inverse of L
  • V_G: graph volume
  • e_i: i-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of L:

$$c_{ij} = V_G\,[(S^{-1/2}V^T)(e_i - e_j)]^T\,[(S^{-1/2}V^T)(e_i - e_j)]$$

V, S: eigenvectors and eigenvalues of L.
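
To make the first formula concrete, the sketch below computes all-pairs commute times for a small connected graph in plain Python. It assumes the standard identity L⁺ = (L + J/n)⁻¹ − J/n for connected graphs (J is the all-ones matrix) and uses a small Gauss-Jordan inverse; the helper names are hypothetical, and a dense O(n³) inverse like this is only suitable for tiny examples.

```python
def invert(m):
    """Gauss-Jordan inverse with partial pivoting (small dense matrices only)."""
    n = len(m)
    a = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        pv = a[col][col]
        a[col] = [v / pv for v in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[col])]
    return [row[n:] for row in a]

def commute_times(w):
    """All-pairs commute times for a connected graph with symmetric weight matrix w."""
    n = len(w)
    deg = [sum(row) for row in w]
    volume = sum(deg)                                    # V_G: sum of all degrees
    lap = [[(deg[i] if i == j else 0.0) - w[i][j] for j in range(n)] for i in range(n)]
    shifted = [[lap[i][j] + 1.0 / n for j in range(n)] for i in range(n)]
    inv = invert(shifted)
    lplus = [[inv[i][j] - 1.0 / n for j in range(n)] for i in range(n)]   # pseudo-inverse L+
    # c_ij = V_G (e_i - e_j)^T L+ (e_i - e_j) = V_G (L+_ii + L+_jj - 2 L+_ij)
    return [[volume * (lplus[i][i] + lplus[j][j] - 2 * lplus[i][j]) for j in range(n)]
            for i in range(n)]

# Path graph 0 - 1 - 2 with unit edge weights.
w = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
c = commute_times(w)
print(round(c[0][1], 6), round(c[0][2], 6))   # 4.0 8.0
```

For the path graph the volume is 4 and the effective resistances are 1 and 2, which gives commute times of 4 and 8.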


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset.

Compute the Laplacian matrix L of G and its eigensystem.

Find the top N anomalies using the distance-based technique in commute time, with the pruning rule.

Complexity: O(n³).

The commute time method can detect global, local, and group anomalies.
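
The first step above, building a mutual k-nearest-neighbor graph, can be sketched as follows. The function name and toy data are assumptions, Euclidean distance is used, and the slides do not say how disconnected components are handled, so this sketch simply returns the edge list.

```python
import math

def mutual_knn_graph(data, k):
    """Edge (i, j) exists only if i is among j's k nearest neighbours and vice versa."""
    n = len(data)
    nbrs = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(data[i], data[j]))
        nbrs.append(set(others[:k]))
    return sorted((i, j) for i in range(n) for j in nbrs[i] if i < j and i in nbrs[j])

data = [(0, 0), (0, 1), (1, 0), (5, 5)]
print(mutual_knn_graph(data, k=2))   # [(0, 1), (0, 2), (1, 2)]
```

The isolated point (5, 5) lists cluster points among its nearest neighbours, but none of them lists it back, so it gets no mutual edge.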


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a near-linear-time solver to build a structure from which the commute time between two nodes can be computed in O(log n) time.

Complexity of CDOF: O(n³) → O(n log n).

Uses a near-linear-time solver for a linear system of equations Ax = b.

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU.


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches.

We can go from O(n²) to nearly O(n log n) by using an index.

One remedy for the curse of high dimensionality is to use random projections.
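
As a rough illustration of a random projection, the sketch below maps points to a lower-dimensional space with a Gaussian random matrix. The function name, the Gaussian entries, and the 1/sqrt(target_dim) scaling are assumptions for this example; it is not the PINN algorithm itself.

```python
import math
import random

def random_projection(data, target_dim, seed=0):
    """Project each point with a random Gaussian matrix of shape (target_dim, d)."""
    rng = random.Random(seed)
    d = len(data[0])
    # Entries are scaled so that squared lengths are preserved in expectation.
    r = [[rng.gauss(0.0, 1.0) / math.sqrt(target_dim) for _ in range(d)]
         for _ in range(target_dim)]
    return [[sum(rij * xj for rij, xj in zip(row, x)) for row in r] for x in data]

points = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
projected = random_projection(points, target_dim=2, seed=1)
print(len(projected), len(projected[0]))   # 3 2
```

A nearest-neighbour index or a density-based score can then be computed on the projected points instead of the original high-dimensional ones.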


PINN Algorithm (ICDM 2010)


PINN Guarantee

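The PINN algorithm provides probabilistic guarantees.

Under certain assumptions about the intrinsic dimensionality (c), with high probability

$$\frac{1-\varepsilon}{1+\varepsilon} \cdot \mathrm{LOF}(p) \;\le\; \widehat{\mathrm{LOF}}(p) \;\le\; \frac{1+\varepsilon}{1-\varepsilon} \cdot \mathrm{LOF}(p)$$

where the middle term denotes the LOF of p computed after random projection.

In practice, we do not know the intrinsic dimensionality of the data. However, random projections are quite robust.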


Examples: high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers.


Examples: high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers.


Examples: local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

  • Well suited for data-parallel algorithms
    • Using CUDA - Compute Unified Device Architecture (Nvidia)
  • Need to re-engineer existing algorithms
    • Utilization of device memory
    • Minimize CPU⇔GPU transfer of data
    • Keep threads homogeneous
  • Most model-based algorithms are naturally set up for the testing phase
  • Model building needs careful redesign
  • What about unsupervised algorithms?

[Figure: CUDA execution and memory model. The host launches a kernel on the device as a grid of thread blocks (Block(0,0) to Block(1,2)); each thread block contains threads (Thread(0,0) to Thread(2,3)). The device has streaming multiprocessors (SM 0 to SM 2), each with registers, scalar processors (SP 0 to SP 7), and shared memory, plus global, texture, and constant memory. The host has its own host memory.]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is O(n²)).

Involves computing pairwise distances.

Load block i and block j to shared memory. Data layout in memory should be optimized.

Each thread computes the distance between a pair of instances.

Can utilize this time to load the next chunk of data from host to device memory.

Writes results to the corresponding output block.

Sorting can be done efficiently in CUDA [29].
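
The block-wise scheme above can be emulated on the CPU to show its structure. In this Python sketch each (block i, block j) tile stands in for the data a thread block would load into shared memory, and each inner (i, j) pair for the work of one thread; the function name, the tile size, and this mapping are assumptions, and a real implementation would be written as a CUDA kernel.

```python
import math

def tiled_kth_nn_distances(data, k, block=2):
    """k-th nearest-neighbour distance per instance, computed tile by tile."""
    n = len(data)
    best = [[] for _ in range(n)]        # k smallest distances found so far, per instance
    for bi in range(0, n, block):        # "block i" of instances
        for bj in range(0, n, block):    # "block j" of instances
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, n)):   # one (i, j) pair per GPU thread
                    if i != j:
                        best[i].append(math.dist(data[i], data[j]))
                best[i] = sorted(best[i])[:k]             # keep only the k smallest
    return [b[k - 1] for b in best]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (9, 0)]
print([round(s, 3) for s in tiled_kth_nn_distances(data, k=2)])
# [1.0, 1.0, 1.0, 1.0, 6.403, 8.0]
```

Sorting these scores in descending order and taking the first N entries yields the DB(k, N) anomalies.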

[Figure: layout of the n × d data and of the n × n pairwise distance output.]


Moving Beyond Multi-dimensional Record Data

  • Categorical (mixed): fraud detection, cyber networks
  • Time series: sensor networks, healthcare
  • Discrete sequences: genomic, system calls
  • Spatial: GIS, image analysis
  • Spatio-temporal: remote sensing, climate
  • Graphs: social networks, epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   · · ·   habitat    type
convex      smooth                urban      poisonous
convex      smooth                grasses    edible
bell        smooth                meadows    edible
convex      scaly                 urban      poisonous
convex      smooth                grasses    edible
· · ·

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

  • Binarize data
  • Learn rules (X ⇒ Y)
  • Choose high-confidence rules (P(Y|X))
  • For a test record Z = ⟨X, Y⟩, find rules of the form P(Y|X)
    • Y is not observed when X is observed

Using Bayesian Networks [33]

  • Learn Bayesian network structure and parameters
  • Compute P(Z) for test data record Z
  • Flag an anomaly if P(Z) < δ

Using Similarity Metrics [10]

  • Use a similarity measure (S(X1, X2))
  • Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values:

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$$

  • A ∩ B = ∅
  • Assumption: if r(a_t, b_t) is low and is observed in test record t, then t is anomalous

For a test record t:

  • For each mutually exclusive pair of attribute sets {A, B}, compute r(a_t, b_t)
  • Score t based on all r-values
    • Assign the minimum r-value as the score
    • Take the product of all r-values

Need to compare an exponential number of pairs of subsets:

  • Only consider subsets up to size k
  • Ignore subsets with frequency less than a threshold α
  • Avoid comparing independent subsets of attributes: μ(A, B) ≥ β_μ


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  • C(a_t): number of training instances with A = a_t
  • N: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

  • Replace rare attribute values with a generic attribute value (reduce arity)
  • Use an efficient data structure for querying counts (AD Trees [26])
    • AD Trees work faster for low-arity attributes
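
A small sketch of the r-value with Laplace smoothing, using direct counting over a list of records instead of AD Trees. The function names are hypothetical, and the training records are a toy sample built from the habitat and type values in the mushroom table excerpt.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(train, test, A, B):
    """r(a_t, b_t) for disjoint attribute sets A and B (tuples of attribute names)."""
    n = len(train)
    a_t = tuple(test[x] for x in A)
    b_t = tuple(test[x] for x in B)
    c_a = sum(1 for rec in train if tuple(rec[x] for x in A) == a_t)
    c_b = sum(1 for rec in train if tuple(rec[x] for x in B) == b_t)
    c_ab = sum(1 for rec in train
               if tuple(rec[x] for x in A) == a_t and tuple(rec[x] for x in B) == b_t)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))

train = [
    {"habitat": "urban", "type": "poisonous"},
    {"habitat": "grasses", "type": "edible"},
    {"habitat": "meadows", "type": "edible"},
    {"habitat": "urban", "type": "poisonous"},
    {"habitat": "grasses", "type": "edible"},
]
unusual = {"habitat": "urban", "type": "edible"}
typical = {"habitat": "urban", "type": "poisonous"}
print(round(r_value(train, unusual, ("habitat",), ("type",)), 3))   # 0.583
print(round(r_value(train, typical, ("habitat",), ("type",)), 3))   # 2.333
```

The combination that never co-occurs in training gets the lower r-value; a full scorer would take the minimum (or the product) of such r-values over all considered attribute-set pairs.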


Finding Anomalies in Discrete Sequences

Many problem formulations:

  1. Anomalous symbols in a sequence
  2. Anomalous subsequence in a sequence
  3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey.

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd webmail ssh webmail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance-, density-, or clustering-based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence (LCS)

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

Standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
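As a sketch of the normalized LCS distance, the following uses the standard quadratic dynamic program (not the faster Hunt-Szymanski variant). The function names and the handling of empty sequences are assumptions made for illustration.

```python
import math


def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a| * |b|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| * |B|)."""
    if not a and not b:
        return 0.0  # assumption: two empty sequences are identical
    if not a or not b:
        return 1.0  # assumption: an empty sequence is maximally distant
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = lcs_distance(s1, s2)
```

The resulting pairwise distances can then be passed to any distance-based detector, such as a nearest-neighbor score.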


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence (n - k + 1 windows)

Training (creating a normal dictionary)
  Store all unique windows in all normal sequences and their counts

Testing
  For each window, find its frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequency over all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low-frequency windows in the normal dictionary
  Rewards high-frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
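The training and testing steps can be sketched as below. This is one possible reading of the slide, not a reference implementation: the names are hypothetical, frequencies are taken relative to the total window count, and a small constant `eps` (an assumption) avoids division by zero when no test window was seen in training.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Count every window seen in the normal training sequences."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k, eps=1e-9):
    """Inverse of the average relative window frequency."""
    ws = windows(seq, k)
    if not ws:
        return 0.0
    total = sum(dictionary.values())
    avg_freq = sum(dictionary[w] / total for w in ws) / len(ws)
    return 1.0 / (avg_freq + eps)


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
normal_dict = build_dictionary(normal, k=3)
good = anomaly_score(normal[0], normal_dict, k=3)
bad = anomaly_score("login pwd login pwd login pwd logout".split(),
                    normal_dict, k=3)
```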


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimate for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is the frequency estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?
  Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  Significantly reduces the size of the model
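A fixed-order version of this model can be sketched as follows. Two simplifications are assumptions of this sketch rather than part of the slide: unseen contexts are handled with a probability floor instead of backing off to a shorter suffix as in [32], and the score is the average negative log probability, a monotone stand-in for the inverse of the normalized probability.

```python
import math
from collections import Counter


def train(normal_seqs, k):
    """Count length-k contexts and the (context, next symbol) extensions."""
    ctx_counts, ext_counts = Counter(), Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            ctx = tuple(seq[i - k:i])
            ctx_counts[ctx] += 1
            ext_counts[ctx + (seq[i],)] += 1
    return ctx_counts, ext_counts


def sequence_score(seq, ctx_counts, ext_counts, k, floor=1e-6):
    """Average of -log P(S_i | S_{i-k} ... S_{i-1}); higher is more anomalous."""
    terms = []
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        denom = ctx_counts[ctx]
        p = ext_counts[ctx + (seq[i],)] / denom if denom else 0.0
        terms.append(-math.log(max(p, floor)))  # floor is a hypothetical choice
    return sum(terms) / len(terms) if terms else 0.0


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
ctx_counts, ext_counts = train(normal, k=2)
good = sequence_score(normal[0], ctx_counts, ext_counts, k=2)
bad = sequence_score("login pwd login pwd login pwd logout".split(),
                     ctx_counts, ext_counts, k=2)
```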


Finding Contextual Anomalies

Sometimes contextual information is available about the data
  Not used directly as a feature
  Is well understood; there are no anomalies in the context
  Can reduce false positives and yield interesting anomalies

[Figure, example adapted from [30]: number of fever cases versus maximum daily temperature, with the clearest outlier and a conditional anomaly marked]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

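Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$
  $d_U$ environmental attributes
  $d_V$ indicator attributes

[Figure: environmental space $(u_1, u_2)$ with Gaussian components $U_1, U_2, U_3$ mapped to indicator space $(v_1, v_2)$ with components $V_1, V_2, V_3$; for example $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j | U_i)$
4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$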
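The scoring step (step 4) can be sketched once the components and the mapping are given. The sketch below assumes one-dimensional Gaussians for brevity and uses made-up parameters loosely following the temperature and fever example; learning the GMMs and the mapping (steps 1 to 3) is not shown.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def conditional_score(u, v, env_comps, ind_comps, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i); low S is anomalous."""
    score = 0.0
    for i, (mu_u, sd_u) in enumerate(env_comps):
        inner = sum(gauss_pdf(v, mu_v, sd_v) * mapping[i][j]
                    for j, (mu_v, sd_v) in enumerate(ind_comps))
        score += gauss_pdf(u, mu_u, sd_u) * inner
    return score


# Hypothetical parameters: (mean, std) per component.
env_comps = [(5.0, 3.0), (30.0, 3.0)]    # cold days, hot days
ind_comps = [(10.0, 3.0), (50.0, 5.0)]   # few fever cases, many fever cases
mapping = [[0.2, 0.8],                   # p(V_j | cold)
           [0.9, 0.1]]                   # p(V_j | hot)

many_cases_cold = conditional_score(5.0, 50.0, env_comps, ind_comps, mapping)
many_cases_hot = conditional_score(30.0, 50.0, env_comps, ind_comps, mapping)
```

With these parameters the same fever count receives a lower score on a hot day than on a cold day, which is the behavior a conditional anomaly detector is meant to capture.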


Finding Collective Anomalies

Find a collection of data points
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples
1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

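Find anomalous groups in data [34]

Example: spatial clusters of galaxies
  topics: red, green, emissive
  words: continuous features

Flexible Genre Model (FGM)

For each group m
1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$
2. Draw the topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point n in the group
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of FGM with nodes $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\eta$, $\beta$ and plates over N, M, K, T]

Model Parameters
  $\mathcal{M}(\pi)$: multinomial
  Genre: $\mathrm{Dir}(\alpha_t)$
  Topic generators $P(\cdot \mid \eta_k)$: GIW
  Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian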


Inference and Testing for FGM

Inference and Learning Parameters
  Approximate inference of latent variables (Gibbs sampling)
  Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection
  Infer the topic distribution $\theta_m$
  Compute the negative log likelihood w.r.t. $\alpha_t$
  Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies there


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal)

              Predicted a   Predicted n
  Actual a    TP            FN
  Actual n    FP            TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN}$$

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
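The metrics above can be computed directly from the confusion matrix. A minimal sketch follows; the label strings "a" and "n" mirror the table, and returning 0.0 for an undefined ratio is an assumption of this sketch.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN) with the anomalous class as positive."""
    tp = fn = fp = tn = 0
    for y, p in zip(actual, predicted):
        if y == positive and p == positive:
            tp += 1
        elif y == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure; 0.0 when undefined."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total if total else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"accuracy": acc, "recall": rec, "precision": prec, "f": f}


actual = list("aannnnnnnn")
detector = metrics(*confusion_counts(actual, list("annannnnnn")))
all_normal = metrics(*confusion_counts(actual, list("nnnnnnnnnn")))
```

Both outputs reach the same accuracy, but the detector that labels everything normal has zero recall, which illustrates why accuracy alone is misleading under class imbalance.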


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use a threshold δ on the score (scale issues [21])
  Take the top x as anomalies

ROC curve by varying x or δ

Quality of output
  Does the output “suggest” x or δ?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization
1. $R(o) = \max\{0,\ S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization
1. $N(o) = S(o) / S_{max}$
2. Gaussian scaling: $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$
   Suited for high-dimensional data
3. Gamma scaling: $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$
   where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function
   Suited for low-dimensional data
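Gaussian scaling is straightforward to sketch with the standard library. The function name is hypothetical, and estimating $\mu_S$ and $\sigma_S$ from the scores themselves with the population standard deviation is an assumption of this sketch.

```python
import math
from statistics import mean, pstdev


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores equal: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
            for s in scores]


# Hypothetical regularized scores: four normal points and one anomaly.
raw = [1.0, 1.1, 0.9, 1.0, 6.0]
normalized = gaussian_scaling(raw)
```

Scores at or below the mean map to 0, while the single large score maps close to 1, so the output can be compared across methods with different raw ranges.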


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawbacks
  Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given a data point $z = \{x, y\}$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = \{x', y'\} \in D$ be such that the distance between y and $y'$ is maximum

Replace x with $x'$ and add z back to the data set

[Figure: points z and $z'$ in the (x, y) plane and the resulting injected anomaly]
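The perturbation procedure can be sketched as follows, assuming scalar x and y partitions and absolute difference as the distance. The function name, the sample size parameter, and the use of a seeded `random.Random` are illustrative choices, not details from [30].

```python
import random


def inject_anomaly(data, index, sample_size, rng):
    """Swap the x part of data[index] with x' from the sampled point whose
    y' is farthest from y, and return the perturbed point {x', y}."""
    _, y = data[index]
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda z: abs(z[1] - y))
    return (x_far, y)


# Hypothetical data set in which y = 2x holds for every normal point.
data = [(i, 2 * i) for i in range(10)]
injected = inject_anomaly(data, index=0, sample_size=len(data),
                          rng=random.Random(0))
```

The injected point keeps plausible marginal values for x and y but breaks the relationship between them, which is what makes it hard to detect with per-attribute checks.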


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Bar chart, values by period (axis from 0 to 3,000,000): 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 to June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router network with nodes 1 to 4, OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$, and a link measurement $y_{1t}$]

Origin-Destination (OD) flow: the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}, \ldots, x_{4t}$)

Link measurement: the traffic that enters at a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable; thus a naive approach won't work if OD flow information is not available

[Figure source: [22]]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a residual vector in $\tilde{Z}$ is large, it is an "anomaly"
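The five steps can be sketched with NumPy as below. This is an illustrative sketch under assumptions: rows of `X` are time bins and columns are links, the data is mean-centered before projection, and the synthetic traffic and the function name are made up for the example.

```python
import numpy as np


def subspace_residual_norms(X, k):
    """Squared norm of each row's projection onto the residual subspace."""
    Xc = X - X.mean(axis=0)                       # center each link
    cov = np.cov(Xc, rowvar=False)                # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues, ascending
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                           # Step 2: top-k PCs
    C_res = np.eye(X.shape[1]) - P @ P.T          # Step 4: I - P P^T
    residual = Xc @ C_res                         # C_res is symmetric
    return np.sum(residual ** 2, axis=1)          # Step 5: norm per time bin


# Synthetic link traffic driven by one common pattern plus small noise.
rng = np.random.default_rng(0)
common = rng.normal(size=200)
X = np.outer(common, [1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(200, 3))
X[50] += np.array([5.0, -5.0, 0.0])               # injected anomaly
norms = subspace_residual_norms(X, k=1)
flagged = int(np.argmax(norms))
```

A threshold on `norms` then separates anomalous time bins from normal ones; how to pick that threshold is not covered by this sketch.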


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by thresholding the residual component

[Figure source: [22]]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  DoS attack: multiple source IP addresses concentrated on a single destination IP address
  Network scan: dispersed distribution of destination addresses
  Most worms/viruses also induce some change in the distribution of certain features
  However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this approach aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly (the traffic feature distribution changes, while traffic volume remains more or less the same)

We can use entropy to capture the variations in a traffic feature
  Takes value 0 when the distribution is maximally concentrated
  Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure, source [22]: the port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure]
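The entropy of a traffic feature can be sketched as the empirical Shannon entropy of its observed values within a time bin. The function name and the sample port lists are hypothetical; N here is the number of distinct values observed.

```python
import math
from collections import Counter


def feature_entropy(values):
    """Empirical Shannon entropy (in bits) of a list of feature values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical destination ports seen in one time bin.
concentrated = feature_entropy([80] * 8)              # all traffic to one port
dispersed = feature_entropy([21, 22, 23, 25, 53, 80, 110, 443])  # scan-like
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$ bits.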


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Domain                        Players
Insurance (auto, health^a)    Claimant, Provider, Payer
Telecommunications            Customer, Provider
Credit Cards                  Customer, Supplier, Bank
Web Advertising               User, Advertiser, Publisher

a. http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
1. $N_i$: degree of node i
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  Identify relevant feature pairs and their power law relationship
  E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$)

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
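The score is easy to evaluate once the power law has been fitted. In the sketch below the constants `c` and `theta` are made-up values, not fitted ones, and both feature values are assumed to be positive.

```python
import math


def power_law_score(x, y, c, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = c * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)


# Hypothetical fit E = N^1.5 for the egonet density power law.
on_the_line = power_law_score(x=4, y=8, c=1.0, theta=1.5)
star_like = power_law_score(x=100, y=100, c=1.0, theta=1.5)
```

A node whose egonet follows the fitted law scores 0, while a star-like egonet (far fewer edges than expected for its degree) scores much higher.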


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (beneficiaries vs. providers)
  Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the “relevance” of all other nodes in $V_1$ to a
  RelevanceScore(a, b) ∝ number of times a “random walk” from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges E]


Detecting Disease Outbreaks

Domain Question
  Early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department visits
  Grocery data (example: [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ flu or SARS?
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$-test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value. Repeat Step 2
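Step 4 can be sketched for a single rule as a $\chi^2$ test on a 2x2 contingency table (one degree of freedom). The function name and the example counts are hypothetical, all row and column totals are assumed to be nonzero, and this covers only the per-rule test, not the rest of the WSARE procedure.

```python
import math


def chi2_p_value(cur_match, cur_other, base_match, base_other):
    """p-value of the chi-square independence test on a 2x2 table.

    Rows: current vs. baseline data; columns: records that match the
    rule vs. records that do not.
    """
    table = [[cur_match, cur_other], [base_match, base_other]]
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 dof.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for the rule "symptom = cough".
unchanged = chi2_p_value(50, 50, 50, 50)
shifted = chi2_p_value(40, 60, 10, 90)
```

A small p-value indicates that the rule matches a noticeably different fraction of records today than in the baseline.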


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science Questions
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on “anomaly” time series
  Difference from a “base period” (too simplistic?)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

[Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk]


Constructing Anomalies from Raw Data

Anomaly time series for a given location i

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter “reference period”
    30-year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
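One of the listed choices for $b_i$, the monthly mean, can be sketched as follows for a single location. The function name and the tiny input series are hypothetical; the other choices (median, z-score, moving reference window) would replace only the computation of `base`.

```python
from collections import defaultdict
from statistics import mean


def monthly_anomalies(values, months):
    """v' = v - b, where b is the mean of all values in the same month."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]


# Hypothetical temperatures: two Januaries and two Julys.
temps = [10.0, 20.0, 12.0, 26.0]
months = [1, 7, 1, 7]
anoms = monthly_anomalies(temps, months)
```

Subtracting a per-month base removes the seasonal cycle, so a warm January and a cool July are both judged against their own month.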


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time (Matlab: bwlabel, bwlabeln)

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
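The two-step approach can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the grid is indexed as [time][row][column], cells are linked with 6-connectivity (each cell touches its neighbors in time, row, and column), and the names `find_droughts`, `threshold`, and `min_size` are hypothetical.

```python
from collections import deque

# Neighbor offsets for 6-connectivity in (time, row, column); an assumption of this sketch.
OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def connected_components(mask):
    """Group True cells of a 3-D boolean grid into connected components (lists of cells)."""
    T, H, W = len(mask), len(mask[0]), len(mask[0][0])
    seen = set()
    components = []
    for t in range(T):
        for y in range(H):
            for x in range(W):
                if not mask[t][y][x] or (t, y, x) in seen:
                    continue
                # Breadth-first search from an unvisited anomalous cell.
                queue = deque([(t, y, x)])
                seen.add((t, y, x))
                cells = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    cells.append((ct, cy, cx))
                    for dt, dy, dx in OFFSETS:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < H and 0 <= nx < W
                                and mask[nt][ny][nx] and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                components.append(cells)
    return components


def find_droughts(precip_anomaly, threshold, min_size):
    """Step 1: threshold the anomalies. Step 2: keep large space-time components."""
    mask = [[[v < threshold for v in row] for row in grid] for grid in precip_anomaly]
    return [c for c in connected_components(mask) if len(c) >= min_size]


# Made-up precipitation anomalies: 2 time steps on a 2 x 3 grid.
precip = [
    [[-2.0, -1.5, 0.3],
     [0.1, 0.2, -1.2]],
    [[-1.8, 0.4, 0.0],
     [0.2, 0.1, 0.5]],
]
droughts = find_droughts(precip, threshold=-1.0, min_size=3)
print(droughts)
```

The isolated dry cell at time 0 is discarded because it does not persist across space or time.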

More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water vapor content

Anomalies using a threshold

Connected components

Example src: http://newscenter.lbl.gov

Cold Fronts [23]

Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid

2. Cluster grids into K clusters

3. Label clusters as anomalous or not using thresholds

4. Filter out false positives using domain knowledge

Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:

Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)

Anomalies are not actionable: not identified in a timely fashion, resolution is not fine enough, cause not explained

Anomalies are lost among false positives

Solution:

Good validation data during design

Clear definition of a domain anomaly and its distinction from other potential competitors

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)

  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
      • Anomalies in Climate
      • Validation of Anomalies
Anomaly Detection - Overview

In data mining, anomaly (or outlier) detection is one of the four tasks:

Classification

Clustering

Pattern Mining

Anomaly Detection

Historically, detection of anomalies has led to the discovery of new theories. Famous examples include:

El Niño and the Southern Oscillation Index (SOI)

The discovery of the planet Neptune

The use of fluoride in toothpaste

Anomalies often lead to "surprise", which prompts a form of inference known as abduction (different from induction and deduction)

Definition

Hawkins: "an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" [15]

Figure: scatter plot of dimension 1 versus dimension 2 (both axes from 0 to 70) with points p1, p2, p3, and p4 marked

Statistical Methods

Let's begin with the univariate Normal distribution:

$$f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \, e^{-[(x-\mu)/\sigma]^2/2}$$

Notice that the exponent measures the square of the deviation from the mean, normalized by the standard deviation:

$$\left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu)$$

For d dimensions, the exponent is called the (square of the) Mahalanobis distance:

$$(x-\mu)'\,\Sigma^{-1}(x-\mu)$$

where $\Sigma$ is the $d \times d$ variance-covariance matrix
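To make the formula concrete, here is a small sketch that evaluates the squared Mahalanobis distance in two dimensions, using the closed-form inverse of a 2 x 2 matrix so that no external library is needed. The function name `mahalanobis_sq_2d` and the example covariance are assumptions made for illustration.

```python
def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)' cov^{-1} (x - mu) for 2-D points."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a 2 x 2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


mu = (0.0, 0.0)
cov = ((4.0, 0.0), (0.0, 1.0))  # variance 4 along dimension 1, variance 1 along dimension 2

# Both points are at Euclidean distance 2 from the mean.
along_wide = mahalanobis_sq_2d((2.0, 0.0), mu, cov)
along_narrow = mahalanobis_sq_2d((0.0, 2.0), mu, cov)
print(along_wide, along_narrow)
```

The point along the high-variance direction gets the smaller distance, which is the normalization for variance that the Mahalanobis distance provides.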

Anomaly Detection with Mahalanobis Distance

The key observation is that if data x follows a d-dimensional Gaussian distribution, then

$$(x-\mu)'\,\Sigma^{-1}(x-\mu) \approx \chi^2_d$$

Anomalies can be found in the tail of the distribution

There are three major weaknesses of the above approach:

Data may not follow a Normal distribution, or may be a mixture of distributions

The mean of $\chi^2_d$ is d and its variance is 2d, so both grow with the dimension. For high-dimensional data this is a problem

Mean and thus variance are extremely sensitive to outliers, and we are using them to find anomalies; this often leads to false negatives
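As an illustration of looking in the tail, the sketch below handles only the special case d = 2, where the chi-square tail probability has the closed form exp(-t/2). The significance level `alpha` and the function names are hypothetical; for general d one would need a chi-square quantile function from a statistics library instead.

```python
import math


def chi2_tail_2d(m_sq):
    """P(chi^2_2 > m_sq). For d = 2 the chi-square survival function is exp(-m_sq / 2)."""
    return math.exp(-m_sq / 2.0)


def is_anomaly(m_sq, alpha=0.01):
    """Flag a point whose squared Mahalanobis distance falls in the upper alpha tail."""
    return chi2_tail_2d(m_sq) < alpha


# Squared distance at which the tail probability equals alpha = 0.01.
threshold = -2.0 * math.log(0.01)
print(round(threshold, 4), is_anomaly(12.0), is_anomaly(4.0))
```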

Mahalanobis vs Euclidean Distance

Mahalanobis normalizes for variance

Figure: scatter plot (both axes from -10 to 10) with points 14, 16, 29, and 61 marked

Point Pairs    Mahalanobis    Euclidean
(14, 29)       5.07           11.78
(16, 61)       4.83           6.84

Distance-based anomalies

Intuition: A data point which is far away from its nearest neighbors is a candidate anomaly

Several definitions capture the above intuition

DB(p, D) anomaly [20]: an object o in a data set T is a DB(p, D) anomaly if at least a fraction p of the objects in T have distances greater than D from o

Generalizes the notion of "three standard deviations from the mean"

This definition had a huge influence on subsequent development in outlier detection

DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution N(0, 1). Then, if O is a DB(p, D) outlier,

$$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2}\,dx \le 1 - p$$

Example: If O is 3 (3 standard deviations away from the mean), then it is a DB(p, D) outlier with p = 0.999 and D = 0.1

Thus, for particular settings of D and p, DB(p, D) captures standard outliers

But it is much more general (e.g., any distance metric)
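The inequality can be checked numerically. The sketch below evaluates the N(0, 1) probability mass within distance D of a point using the error function, and assumes the example above means p = 0.999 and D = 0.1. The function names are hypothetical.

```python
import math


def normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def is_db_outlier_std_normal(o, p, d):
    """True if the N(0, 1) mass within distance d of o is at most 1 - p."""
    mass_within_d = normal_cdf(o + d) - normal_cdf(o - d)
    return mass_within_d <= 1.0 - p


# A point 3 standard deviations from the mean satisfies the condition; a point at 2 does not.
at_three = is_db_outlier_std_normal(3.0, p=0.999, d=0.1)
at_two = is_db_outlier_std_normal(2.0, p=0.999, d=0.1)
print(at_three, at_two)
```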

Distance-based methods (DB(k, N))

DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest

Several advantages: ranking of anomalies is more intuitive; setting of parameters is generally easier

A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Time complexity is $O(n^2 d)$, where n is the database size and d is the dimensionality
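A simple nested loop can be sketched as below. This is a minimal illustration, not a tuned implementation: it assumes Euclidean distance (`math.dist`, Python 3.8+), more than k points, and it sorts each point's distances for brevity, which adds a logarithmic factor to the cost. The function name `top_n_outliers` and the sample points are hypothetical.

```python
import math


def top_n_outliers(points, k, n_top):
    """Score each point by the distance to its k-th nearest neighbor; return the top n_top."""
    scores = []
    for i, p in enumerate(points):
        # Inner loop: distances from p to every other point.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    scores.sort(reverse=True)
    return [(i, s) for s, i in scores[:n_top]]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
top = top_n_outliers(points, k=2, n_top=1)
print(top)
```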

Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far

A large number of non-anomalies can be pruned without carrying out a full data search

Complexity: nearly O(n)
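The pruning rule can be sketched as follows. This is a simplified illustration in the spirit of [3], not a reproduction of that algorithm: it scans neighbors in a shuffled order, keeps the k smallest distances seen so far in a heap, and abandons a point as soon as its current k-th nearest neighbor distance drops below the weakest score in the top N. It assumes Euclidean distance and more than k points; the names `top_n_outliers_pruned`, `cutoff`, and `seed` are hypothetical.

```python
import heapq
import math
import random


def top_n_outliers_pruned(points, k, n_top, seed=0):
    """Top n_top points by k-th nearest neighbor distance, with early pruning."""
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)  # a random scan order makes early pruning likely
    top = []      # min-heap of (score, index); top[0] is the weakest anomaly so far
    cutoff = 0.0  # score of the weakest anomaly once n_top candidates are known
    for i in range(len(points)):
        neighbors = []  # negated distances: a max-heap of the k smallest distances so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(points[i], points[j])
            if len(neighbors) < k:
                heapq.heappush(neighbors, -d)
            elif d < -neighbors[0]:
                heapq.heapreplace(neighbors, -d)
            # The current k-th NN distance can only shrink, so i can be discarded early.
            if len(neighbors) == k and -neighbors[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -neighbors[0]
        if len(top) < n_top:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_top:
            cutoff = top[0][0]
    return sorted(((i, s) for s, i in top), key=lambda t: -t[1])


points = [(10.0, 10.0), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
result = top_n_outliers_pruned(points, k=2, n_top=1)
print(result)
```

Because the strong outlier is scored first here, every later point is pruned after examining at most three neighbors.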

Examples of pruning technique

Non-anomalies are pruned earlier

Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable to large datasets (nearly O(n) with the pruning rule)

Capable of finding only global anomalies

Can lead to non-intuitive results in Top-k situations

Figure: scatter plot of dimension 1 versus dimension 2 (both axes from 0 to 70) with points p1, p2, p3, and p4 marked

Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours

$$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)$$

$$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$

LOF indicates a degree of local outlier-ness [6]
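The three formulas above translate almost directly into code. The sketch below is a simplified version that assumes plain Euclidean distance for dist_k and distinct points (duplicates would give a zero average distance); it is not the full LOF definition of [6]. The function names and sample points are hypothetical.

```python
import math


def k_nearest(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    nbrs = k_nearest(points, i, k)
    avg_dist = sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs)
    return 1.0 / avg_dist


def anomaly_score(points, i, k):
    """Inverse of the density of points[i] relative to the mean density of its neighbors."""
    nbrs = k_nearest(points, i, k)
    avg_nbr_density = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    relative_density = density(points, i, k) / avg_nbr_density
    return 1.0 / relative_density


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = [anomaly_score(points, i, k=2) for i in range(len(points))]
print([round(s, 2) for s in scores])
```

Points inside the cluster score close to 1, while the isolated point scores well above 1.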

Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

We require a method combining the strengths of distance-based and density-based approaches: a distance-based approach which can capture density

Figure: two clusters, C1 and C2, with outlying points O1 and O2

Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities

Computation of commute time

Commute time can be computed using the graph Laplacian matrix L

$$c_{ij} = V_G \, (e_i - e_j)^T L^+ (e_i - e_j)$$

$L^+$: pseudo-inverse of L

$V_G$: graph volume

$e_i$: i-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of L

$$c_{ij} = V_G \, [(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$$

V, S: eigenvectors and eigenvalues of L
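For a small connected graph, the first formula can be checked without an eigen-decomposition. The sketch below swaps in a different but equivalent computation: instead of forming the pseudo-inverse explicitly, it grounds the last node, solves the reduced Laplacian system by Gaussian elimination, and reads off the potential difference, which equals $(e_i - e_j)^T L^+ (e_i - e_j)$ for a connected graph. The helper names and the path graph are assumptions for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def commute_time(weights, i, j):
    """c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j) for a connected graph with symmetric weights."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    volume = sum(degree)  # graph volume V_G
    laplacian = [[(degree[r] if r == c else 0.0) - weights[r][c] for c in range(n)]
                 for r in range(n)]
    # Ground the last node: dropping its row and column makes the system non-singular.
    reduced = [row[:n - 1] for row in laplacian[:n - 1]]
    rhs = [0.0] * n
    rhs[i] += 1.0
    rhs[j] -= 1.0
    potentials = solve(reduced, rhs[:n - 1]) + [0.0]
    return volume * (potentials[i] - potentials[j])


# Path graph 0 - 1 - 2 with unit edge weights.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
print(commute_time(path, 0, 2), commute_time(path, 0, 1))
```

On this path graph the end-to-end commute time is 8 steps and the commute time between adjacent nodes 0 and 1 is 4, which matches a direct random-walk calculation.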

Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies

Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure where we can compute the commute time between two nodes in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations Ax = b

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU

Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections

PINN Algorithm (ICDM 2010)

PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

Under certain assumptions about intrinsic dimensionality (c), with high probability

$$\frac{1-\epsilon}{1+\epsilon} \cdot LOF(p) \;\le\; \widehat{LOF}(p) \;\le\; \frac{1+\epsilon}{1-\epsilon} \cdot LOF(p)$$

where $\widehat{LOF}(p)$ denotes the LOF score that PINN computes after random projection

In practice, we do not know the intrinsic dimensionality of the data. However, random projections are quite robust

Examples: high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers

Examples: high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers

Examples: local density-based outliers

Examples of images ranked by LOF

Addressing Scalability

Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms

Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms:

Utilization of device memory

Minimize CPU ⇔ GPU transfer of data

Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

Figure: CUDA execution and memory model. The host launches a kernel on the device as a grid of thread blocks, each made up of threads. On the device, each multiprocessor (SM) has registers, stream processors (SP), and shared memory; all SMs access global, texture, and constant memory, and the host has its own host memory

Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory

Data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

Figure: schematic of the n × d data matrix and the n × n pairwise distance matrix

Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks
Time Series: Sensor Networks, Healthcare
Discrete Sequences: Genomic, System Calls
Spatial: GIS, Image analysis
Spatio-temporal: Remote sensing, Climate
Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth              urban     poisonous
convex      smooth              grasses   edible
bell        smooth              meadows   edible
convex      scaly               urban     poisonous
convex      smooth              grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  Binarize data
  Learn rules (X ⇒ Y)
    Choose high-confidence rules (P(Y|X))
  For a test record Z = ⟨X, Y⟩, find rules of the form P(Y|X)
    Flag Z when Y is not observed although X is observed

Using Bayesian Networks [33]
  Learn Bayesian network structure and parameters
  Compute P(Z) for test data record Z
  Flag anomaly if P(Z) < δ

Using Similarity Metrics [10]
  Use a similarity measure (S(X_1, X_2))
  Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and the combination is observed in test record $t$, then $t$ is anomalous

For a test record $t$
  For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$
  Score $t$ based on all r-values
    Assign the minimum r-value as the score, or
    Take the product of all r-values

Need to compare an exponential number of pairs of subsets
  Only consider subsets up to size $k$
  Ignore subsets with frequency less than a threshold $\alpha$
  Avoid comparing independent subsets of attributes: require $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  $C(a_t)$: number of training instances with $A = a_t$
  $N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks
  Replace rare attribute values with a generic attribute (reduce arity)
  Use an efficient data structure for querying counts (AD Trees [26])
    AD Trees work faster for low-arity attributes
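
A minimal sketch of the conditional probability test with Laplace-smoothed estimates follows. It is restricted to pairs of single attributes (subsets of size 1) and uses plain counters instead of AD Trees; the function names and the toy mushroom-like records are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_values(train, record):
    """r(a_t, b_t) for every pair of single attributes of `record`."""
    n = len(train)
    d = len(record)
    single = Counter((i, row[i]) for row in train for i in range(d))
    pair = Counter((i, j, row[i], row[j])
                   for row in train for i, j in combinations(range(d), 2))
    out = {}
    for i, j in combinations(range(d), 2):
        e_ab = laplace(pair[(i, j, record[i], record[j])], n)
        e_a = laplace(single[(i, record[i])], n)
        e_b = laplace(single[(j, record[j])], n)
        out[(i, j)] = e_ab / (e_a * e_b)
    return out

def cpt_score(train, record):
    """Score a record by its minimum r-value (low means anomalous)."""
    return min(r_values(train, record).values())

# Hypothetical training data: (cap-shape, cap-surface, habitat).
train = [("convex", "smooth", "grasses")] * 20 + [("bell", "scaly", "urban")] * 20
normal_score = cpt_score(train, ("convex", "smooth", "grasses"))
odd_score = cpt_score(train, ("convex", "scaly", "grasses"))
print(normal_score, odd_score)
```

The second record combines values that are individually common but never co-occur in training, so its minimum r-value is far below that of the first record.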


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
  Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

  Standard dynamic programming method is slow
  Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
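
Both measures can be sketched in a few lines. The example below uses the standard quadratic dynamic program for the LCS length (not the faster Hunt-Szymanski method) and assumes non-empty sequences; the function names are chosen for this illustration.

```python
import math

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
s5 = "login pwd login pwd login pwd logout".split()
print(hamming(s1, s5), nlcs_distance(s1, s2), nlcs_distance(s1, s5))
```

The resulting distances can be passed to any distance-based detector; note that `hamming` rejects `s1` versus `s2` because their lengths differ, which is the first issue listed above.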


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary)
  Store all unique windows in all normal sequences and their counts

Testing
  For each window, find the frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low-frequency windows in the normal dictionary
  Rewards high-frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
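
A minimal sketch of the window dictionary follows. It takes one reading of the scoring rule: the inverse of the mean relative frequency of a sequence's windows. The small constant that avoids division by zero and the function names are assumptions of this example.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 sliding windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Count every window that occurs in the normal sequences."""
    counts = Counter()
    for s in normal_seqs:
        counts.update(windows(s, k))
    return counts

def window_score(seq, dictionary, k, eps=1e-9):
    """Inverse of the mean relative window frequency (higher = more anomalous)."""
    total = sum(dictionary.values())
    ws = windows(seq, k)
    mean_freq = sum(dictionary[w] / total for w in ws) / len(ws)
    return 1.0 / (mean_freq + eps)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
test_seq = "login pwd login pwd login pwd logout".split()
dictionary = build_dictionary(normal, k=3)
print(window_score(normal[0], dictionary, 3), window_score(test_seq, dictionary, 3))
```

None of the windows of the repeated-login sequence occur in the dictionary, so its score is bounded only by `eps`.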


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

  $f$ is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?
  Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees
  Significantly reduces the size of the model
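
The fixed-order model can be sketched as follows. Two simplifications are assumptions of this example and not the method of [32]: unseen suffixes are handled with add-one smoothing rather than a probabilistic suffix tree, and the score is the negative mean log-probability, which orders sequences the same way as the inverse of the length-normalized probability. Test sequences are assumed to be longer than k.

```python
import math
from collections import Counter

def train_counts(normal_seqs, k):
    """Frequencies of length-k contexts and of contexts plus the next symbol."""
    ctx, full = Counter(), Counter()
    for s in normal_seqs:
        for i in range(len(s) - k):
            ctx[tuple(s[i:i + k])] += 1
            full[tuple(s[i:i + k + 1])] += 1
    return ctx, full

def markov_score(seq, ctx, full, k, alphabet_size):
    """Negative mean log-probability of the symbols after the first k."""
    logp = 0.0
    steps = 0
    for i in range(k, len(seq)):
        c = tuple(seq[i - k:i])
        f = tuple(seq[i - k:i + 1])
        # Add-one smoothing keeps unseen suffixes from producing log(0).
        p = (full[f] + 1) / (ctx[c] + alphabet_size)
        logp += math.log(p)
        steps += 1
    return -logp / steps

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
test_seq = "login pwd login pwd login pwd logout".split()
alphabet = {sym for s in normal for sym in s}
ctx, full = train_counts(normal, k=2)
score_normal = markov_score(normal[0], ctx, full, 2, len(alphabet))
score_test = markov_score(test_seq, ctx, full, 2, len(alphabet))
print(score_normal, score_test)
```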


Finding Contextual Anomalies

Sometimes contextual information is available about data
  Not used directly as a feature
  Are well understood, no anomalies in the context
  Can reduce false positives and yield interesting anomalies

[Figure, example adapted from [30]: number of fever cases versus maximum daily temperature, with the clearest outlier and a conditional anomaly marked.]

Contextual anomalies - anomalous with respect to a context
  Context is defined using environmental variables
    Spatial (latitude, longitude)
    Graph context (edges, weights)
    Temporal location
    Domain specific (demographic, other)

How to incorporate context?
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

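Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$
  $d_U$ environmental attributes
  $d_V$ indicator attributes

[Figure: environmental attributes $(u_1, u_2)$ grouped into Gaussians $U_1, U_2, U_3$ and indicator attributes $(v_1, v_2)$ grouped into Gaussians $V_1, V_2, V_3$, with example mappings $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$.]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j|U_i)$
4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\, p(V_j|U_i)$$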
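
The scoring step (step 4) can be sketched once the Gaussians and the mapping are given. The example below does not learn anything: the component parameters, the mapping table and the temperature/fever setting are hypothetical values chosen for illustration, and the Gaussians are assumed to have diagonal covariance.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of an axis-aligned Gaussian (diagonal covariance assumed)."""
    p = 1.0
    for xi, m, v in zip(x, mean, var):
        p *= math.exp(-(xi - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return p

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i); low S is anomalous."""
    s = 0.0
    for i, (mu_u, var_u) in enumerate(U):
        inner = sum(gaussian_pdf(v, mu_v, var_v) * mapping[i][j]
                    for j, (mu_v, var_v) in enumerate(V))
        s += gaussian_pdf(u, mu_u, var_u) * inner
    return s

# Hypothetical model: context = max daily temperature, indicator = fever cases.
U = [([5.0], [9.0]), ([30.0], [9.0])]      # cold days, hot days
V = [([10.0], [4.0]), ([50.0], [25.0])]    # few cases, many cases
mapping = [[0.1, 0.9],                     # p(V_j | cold)
           [0.9, 0.1]]                     # p(V_j | hot)

many_cases_cold = conditional_score([5.0], [50.0], U, V, mapping)
many_cases_hot = conditional_score([30.0], [50.0], U, V, mapping)
print(many_cases_cold, many_cases_hot)
```

Fifty fever cases score much lower on a hot day than on a cold day, although the same count is unremarkable when the context is ignored.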


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure, and
when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples
  1. Time series
  2. Image
  3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

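Example: spatial clusters of galaxies
  Topics: red, green, emissive
  Words: continuous features

Flexible Genre Model (FGM)

For each group $m$
  1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$
  2. Draw a topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
  3. Draw $K$ topics $\{\beta_{m,k} \sim P(\beta_{m,k}|\eta_k)\}_{k=1,2,\ldots,K}$
  4. For each point in the group
     1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
     2. Generate point $x_{mn} \sim P(x_{mn}|\beta_{m,z_{mn}})$

[Figure: plate diagram of FGM with nodes $\pi$, $\alpha$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$ and plates $N$, $K$, $M$, $T$.]

Model Parameters
  $\mathcal{M}(\pi)$ - Multinomial
  Genre - $\mathrm{Dir}(\alpha_t)$
  Topic generators $P(\cdot|\eta_k)$ - GIW (Gaussian-Inverse-Wishart)
  Point generators $P(x_{mn}|\beta_{m,k})$ - Multivariate Gaussian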


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: illustration with Topic 1, Topic 2 and Topic 3.]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

              Predicted
              a     n
Actual   a    TP    FN
         n    FP    TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN}$$

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
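
The metrics above can be computed directly from the confusion matrix. The sketch below also shows why class imbalance matters: a detector that labels everything normal reaches high accuracy with zero recall. The label strings "a" and "n" follow the table; the function names and the returned zero for undefined ratios are choices made for this example.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN) with anomalies as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure (0.0 when undefined)."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f

# 5 anomalies among 100 instances; the detector predicts "normal" everywhere.
actual = ["a"] * 5 + ["n"] * 95
all_normal = ["n"] * 100
lazy = metrics(*confusion_counts(actual, all_normal))
# A detector that finds 3 of the 5 anomalies at the cost of 4 false alarms.
useful = metrics(tp=3, fn=2, fp=4, tn=91)
print(lazy, useful)
```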


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use threshold δ on score (scale issues [21])
  Take top x as anomalies

ROC curve by varying x or δ

Quality of output
  Does the output "suggest" x or δ?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  Regularity $\Rightarrow$ $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal and $S(o) \gg 0$ if $o$ is anomalous
  Normality $\Rightarrow$ $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization

1. $N(o) = S(o) / S_{max}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
   Suited for high dimensional data
3. $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
   Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   Suited for low dimensional data
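
A minimal sketch of the regularization steps and of Gaussian scaling, using only the standard library. Gamma scaling is left out because it needs the regularized Gamma function. The function names and the toy score lists are assumptions of this example; the logarithmic inversion is written as $-\log(S(o)/S_{max})$ so that the result is non-negative, and it requires strictly positive scores.

```python
import math

def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base), e.g. base = 1 for LOF-like scores."""
    return [max(0.0, s - base) for s in scores]

def regularize_linear_inversion(scores):
    """R(o) = S_max - S(o), for scores where low means anomalous."""
    s_max = max(scores)
    return [s_max - s for s in scores]

def regularize_log_inversion(scores):
    """R(o) = -log(S(o) / S_max); requires all scores > 0."""
    s_max = max(scores)
    return [-math.log(s / s_max) for s in scores]

def normalize_gaussian(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

lof_like = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]        # hypothetical LOF-style scores
unified = normalize_gaussian(regularize_baseline(lof_like, base=1.0))
abod_like = [80000.0, 60000.0, 70000.0, 10.0]      # low score = anomalous
inverted = regularize_log_inversion(abod_like)
print(unified, inverted)
```

After Gaussian scaling only the last instance keeps a score well above zero, which is the intended stretching of the interesting range.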


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawbacks
  Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane, with the injected anomaly marked.]
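
The perturbation steps can be sketched as below. The function name, the index lists `x_dims` and `y_dims` that define the two partitions, the Euclidean distance and the toy data are assumptions made for this example.

```python
import math
import random

def inject_anomaly(z, data, x_dims, y_dims, sample_size, rng):
    """Return a copy of z whose x-part is taken from the sampled point z'
    whose y-part is farthest from the y-part of z."""
    sample = rng.sample(data, min(sample_size, len(data)))
    y = [z[i] for i in y_dims]
    z_prime = max(sample, key=lambda p: math.dist(y, [p[i] for i in y_dims]))
    out = list(z)
    for i in x_dims:
        out[i] = z_prime[i]  # replace x with x'
    return out

# Toy data: x and y are perfectly correlated, so (x', y) breaks the pattern.
data = [(i, i) for i in range(100)]
rng = random.Random(0)
injected = inject_anomaly(data[0], data, x_dims=[0], y_dims=[1],
                          sample_size=20, rng=rng)
print(injected)
```

The injected point keeps the original y-value but carries an x-value from a distant part of the data, so each attribute looks normal on its own while the combination does not.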


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart, vertical axis from 0 to 3,000,000, with the following values.]

2002                    20,547
2003                    18,827
2004                    69,107
2005                    113,025
2006                    140,690
2007                    624,267
2008                    1,656,227
July 2009 - June 2010   2,704,836


Network Topology

A typical network

[Figure: a router with four attached nodes (1 to 4), link measurement $y_{1t}$ and OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$.]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

Link measurement is the traffic that enters at a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1 - Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the data set

Step 2 - Choose the top k principal components with the highest eigenvalues as the columns of matrix $P$

Step 3 - Project onto the normal traffic subspace, giving $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4 - Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5 - If the norm of a residual vector in $\tilde{Z}$ is large, then it is an "anomaly"
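
The five steps can be sketched with NumPy. This is an illustration under stated assumptions: rows of `Z` are time bins and columns are links, the data is centered before projection, the synthetic traffic and the planted spike are hypothetical, and no particular threshold rule is implied (the example simply looks at the largest residual).

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Squared norm of the residual (abnormal-subspace) part of each row of Z."""
    Zc = Z - Z.mean(axis=0)                          # center each column
    cov = np.cov(Zc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # top-k principal components
    C_tilde = np.eye(Z.shape[1]) - P @ P.T           # projector onto residual subspace
    residual = Zc @ C_tilde                          # C_tilde is symmetric
    return (residual ** 2).sum(axis=1)

# Synthetic link traffic: one shared daily pattern plus noise on 5 links.
rng = np.random.default_rng(0)
t = np.arange(200)
latent = 5.0 * np.sin(2 * np.pi * t / 50)
weights = np.array([1.0, 0.8, 0.6, 1.2, 0.9])
Z = np.outer(latent, weights) + 0.05 * rng.normal(size=(200, 5))
Z[120, 2] += 3.0                                     # planted anomaly on one link
spe = subspace_residual_norms(Z, k=1)
print(int(np.argmax(spe)), float(spe.max()))
```

The planted spike is small compared with the swing of the shared pattern, so it is hard to see in the raw link volume, but it dominates the residual norm.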


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes substantially

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source [22]

Port scan anomaly (the traffic feature changes, but the traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature
  Takes value 0 when the distribution is maximally concentrated
  Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure.]
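
A short sketch of the entropy of a traffic feature, here the destination port. The two packet lists are hypothetical and contain the same number of packets, so a volume metric cannot tell them apart.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy (base 2) of the distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Same volume (1000 packets), very different destination-port distributions.
normal_ports = [80] * 900 + [443] * 100
scan_ports = list(range(1, 1001))        # a port scan touches many ports once

print(len(normal_ports), sample_entropy(normal_ports))
print(len(scan_ports), sample_entropy(scan_ports))
```

The entropy of the scan is close to the maximum $\log_2 N$ for the number of distinct ports observed, while the normal traffic stays near the concentrated end.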


Entropy Based versus Volume Based

DoS/DDoS attacks - a spike in traffic toward a dominant destination IP

Scan anomaly - a spike in traffic from a dominant source IP

Flash crowd anomaly - again a spike in traffic to a dominant destination IP

Worm anomaly - a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Insurance (auto/health [a]): Claimant, Provider, Payer
Telecommunications: Customer, Provider
Credit Cards: Customer, Supplier, Bank
Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     Example: user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  Identify relevant feature pairs and their power-law relationship
  E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ - detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \qquad 1 \le \alpha \le 2$$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$)

$$S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
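
The score can be evaluated directly once the power law has been fitted. In the sketch below the fitted constants `C = 1` and `theta = 1.5` are hypothetical, the natural logarithm is assumed, and both the observed and the expected value must be positive.

```python
import math

def power_law_score(y, x, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Egonet density: x = N_i (degree), y = E_i (edges in the egonet).
C, theta = 1.0, 1.5                              # hypothetical fitted power law
typical = power_law_score(32, 10, C, theta)      # close to the fitted curve
star = power_law_score(10, 10, C, theta)         # only ego-to-neighbor edges
near_clique = power_law_score(55, 10, C, theta)  # all 11 egonet nodes connected
print(typical, star, near_clique)
```

Nodes far below the fitted curve (stars) and far above it (near-cliques) both receive high scores, while a node on the curve scores zero.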


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (beneficiaries vs. providers)
  Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$.]
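
One way to make the random-walk relevance concrete is a random walk with restart, computed by fixed-point iteration. This choice, the restart probability of 0.15, the use of the relevance of b to a as the similarity entry for the pair (a, b), and the toy trader/stock graph are all assumptions of this sketch. Every node needs at least one neighbor, and $S_t$ needs at least two members.

```python
def relevance(adj, a, restart=0.15, iters=100):
    """Random-walk-with-restart relevance of every node to query node `a`.

    adj maps each node to the list of its neighbors (undirected graph).
    """
    score = {v: 0.0 for v in adj}
    score[a] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, s in score.items():
            share = (1 - restart) * s / len(adj[v])
            for w in adj[v]:
                nxt[w] += share          # walk to a uniformly chosen neighbor
        nxt[a] += restart                # restart at the query node
        score = nxt
    return score

def normality(adj, t, restart=0.15):
    """Mean pairwise relevance among the neighbors S_t of node t."""
    s_t = adj[t]
    rel = {a: relevance(adj, a, restart) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs)

# Two trader communities, plus stock t7 traded by one trader from each.
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2", "t3")]
edges += [(a, t) for a in ("a4", "a5", "a6") for t in ("t4", "t5", "t6")]
edges += [("a1", "t7"), ("a6", "t7")]
adj = {}
for a, t in edges:
    adj.setdefault(a, []).append(t)
    adj.setdefault(t, []).append(a)

print(normality(adj, "t1"), normality(adj, "t7"))
```

Stock t7 links traders who otherwise have nothing in common, so its neighbors are weakly relevant to each other and its normality score is lower than that of t1.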


Detecting Disease Outbreaks

Domain Question
  Early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department visits
  Grocery data (Example [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ Flu or SARS?
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data (environmental and response variables)

2. Sample from the BN ($DB_{baseline}$ | current environment)

3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$. Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find the p-value for each rule using a $\chi^2$ test. Null hypothesis: the rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent

5. Find the rule with the smallest p-value (the most significant rule). Repeat from Step 2
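
Steps 3 to 5 can be illustrated with a small Python sketch. It assumes records are dictionaries, rules are single `(attribute, value)` assignments, and a plain Pearson $\chi^2$ test on a 2x2 table; the function names are hypothetical, and the Bayesian network sampling of Steps 1 and 2 is not shown.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table.

    table[0] = [matches, non-matches] in DB_current,
    table[1] = [matches, non-matches] in DB_baseline.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    if min(row + col) == 0:
        return 0.0, 1.0  # degenerate table: no evidence against independence
    stat = 0.0
    for r, obs_row in enumerate(table):
        for k, obs in enumerate(obs_row):
            expected = row[r] * col[k] / n
            stat += (obs - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2.0))

def rank_rules(current, baseline, rules):
    """Return (p_value, attribute, value) tuples, most significant rule first."""
    scored = []
    for attr, value in rules:
        cur_hits = sum(1 for rec in current if rec.get(attr) == value)
        base_hits = sum(1 for rec in baseline if rec.get(attr) == value)
        table = [[cur_hits, len(current) - cur_hits],
                 [base_hits, len(baseline) - base_hits]]
        scored.append((chi_square_2x2(table)[1], attr, value))
    return sorted(scored)
```

The first element of `rank_rules(...)` is the rule whose frequency differs most significantly between the current and baseline data.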


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships

What happened yesterday?

What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]

Add nodes for temporal and spatial dependencies

Need more data to train


Anomaly Detection in Climate and Weather

Science questions: identify natural and anthropogenic disasters; identify long-time-scale events (droughts, atmospheric rivers, cold fronts, etc.)

Data: ground observations; remote sensing data (satellite, airborne); climate model simulation outputs; multiple variables, spatio-temporal (often with a height dimension as well)

Challenges: model spatio-temporal relationships across multiple variables; explain the cause of anomalies; massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on “anomaly” time series

Difference from a “base period” (too simplistic)

Brings spatial smoothness (e.g., a mountaintop and a nearby valley can have very different temperatures)

Removes seasonality

Used to understand climate and weather phenomena

Southern Oscillation Index (SOI): difference between the Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?

Mean of all data for location $i$

Monthly mean values (account for seasonality)

Monthly z-score values

Median (more robust)

Using a shorter “reference period”

30-year moving window

Different methods show statistically significant differences [19]

What is the right strategy? A weighted mean of different strategies (pick the weights using Monte Carlo sampling)
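
As a small illustration of two of the choices for $b_i$ listed above, the following Python sketch computes monthly-mean anomalies and, optionally, monthly z-scores for one location. It assumes the series is a plain list of monthly values starting in January; the function name is hypothetical.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, use_zscore=False):
    """Subtract the per-calendar-month mean (the base b_i) from each value.

    values: monthly observations for one location, starting in January.
    use_zscore: additionally divide by the per-month standard deviation.
    """
    by_month = {m: values[m::12] for m in range(12)}
    base = {m: mean(v) for m, v in by_month.items() if v}
    spread = {m: pstdev(v) for m, v in by_month.items() if v}
    anomalies = []
    for t, v in enumerate(values):
        m = t % 12
        a = v - base[m]
        if use_zscore:
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies
```

Because each month is compared only with the same calendar month in other years, the seasonal cycle is removed from the resulting series.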


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time (Matlab: bwlabel / bwlabeln)

Follow-up science question: explain the cause

Figure: video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
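
The two-step approach can be sketched in plain Python as follows. This is a minimal illustration, not the tutorial's implementation: it assumes a dense `[time][row][col]` grid of precipitation anomalies, treats cells below a threshold as dry, and uses 6-connectivity (face neighbours in space and time), which may differ from the connectivity used with `bwlabeln`. All names are hypothetical.

```python
from collections import deque

def drought_components(precip_anomaly, threshold, min_size):
    """Connected space-time regions where the anomaly is below `threshold`.

    precip_anomaly[t][y][x]: anomaly value at time t and grid cell (y, x).
    Returns components (lists of (t, y, x)) with at least `min_size` cells,
    largest first.
    """
    T = len(precip_anomaly)
    Y = len(precip_anomaly[0])
    X = len(precip_anomaly[0][0])
    # Step 1: threshold the anomalies.
    dry = [[[precip_anomaly[t][y][x] < threshold for x in range(X)]
            for y in range(Y)] for t in range(T)]
    # Step 2: breadth-first search for connected components.
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    components = []
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not dry[t][y][x] or (t, y, x) in seen:
                    continue
                queue = deque([(t, y, x)])
                seen.add((t, y, x))
                comp = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    comp.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                                and dry[nt][ny][nx] and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                if len(comp) >= min_size:
                    components.append(comp)
    return sorted(components, key=len, reverse=True)
```

The `min_size` filter keeps only persistent, spatially extended dry regions, which is what makes a thresholded anomaly a drought candidate rather than noise.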


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water vapor content

Anomalies using a threshold

Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]

Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid

2. Cluster grids into K clusters

3. Label clusters as anomalous or not using thresholds

4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:

Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)

Anomalies are not actionable: not identified in a timely fashion; resolution is not fine enough; cause not explained

Anomalies are lost among false positives

Solution:

Good validation data during design

Clear definition of a domain anomaly and its distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC ’08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Definition

Hawkins: “an outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [15]

[Figure: scatter plot of dimension 1 vs. dimension 2 (both axes from 0 to 70) with four labeled points p1, p2, p3, p4]


Statistical Methods

Let's begin with the univariate Normal distribution:

$$f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \, e^{-[(x - \mu)/\sigma]^2 / 2}$$

Notice that the exponent measures the square of the deviation from the mean, normalized by the standard deviation:

$$\left(\frac{x - \mu}{\sigma}\right)^2 = (x - \mu)(\sigma^2)^{-1}(x - \mu)$$

For $d$ dimensions, the exponent is called the (square of the) Mahalanobis distance:

$$(x - \mu)' \Sigma^{-1} (x - \mu)$$

where $\Sigma$ is the $d \times d$ variance-covariance matrix
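
To make the definition concrete, here is a small Python sketch for the two-dimensional case, where the inverse of $\Sigma$ has a simple closed form. It is an illustration only: the function names are hypothetical, the sample covariance uses the $n - 1$ denominator, and a general $d$-dimensional version would use a linear algebra library instead.

```python
def mean_and_cov_2d(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def mahalanobis_sq_2d(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)' cov^{-1} (x - mu) in 2D."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # cov^{-1} = 1/det * [[d, -b], [-c, a]]
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
```

With covariance `((4, 0), (0, 1))` and mean `(0, 0)`, the points `(2, 0)` and `(0, 2)` are equally far from the mean in Euclidean terms, but their squared Mahalanobis distances are 1 and 4, because the second point lies along the low-variance direction.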


Anomaly Detection with Mahalanobis Distance

The key observation is that if data $x$ follows a $d$-dimensional Gaussian distribution, then

$$(x - \mu)' \Sigma^{-1} (x - \mu) \approx \chi^2_d$$

Anomalies can be found in the tail of the distribution

There are three major weaknesses of the above approach:

Data may not follow a Normal distribution, or may be a mixture of distributions

The mean of $\chi^2_d$ is $d$ and its variance is $2d$, so both grow with the dimension; for high-dimensional data this is a problem

The mean and thus the variance are extremely sensitive to outliers, and we are using them to find anomalies; this often leads to false negatives
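
Finding anomalies “in the tail” amounts to comparing each squared Mahalanobis distance with a $\chi^2_d$ tail probability. The sketch below is a hedged illustration: it uses the closed-form upper tail of the chi-square distribution that holds for even $d$ only, takes the squared distances as given, and the names `chi2_sf_even` and `tail_anomalies` are hypothetical.

```python
import math

def chi2_sf_even(x, d):
    """Upper-tail probability P(X > x) for X ~ chi-square with even d."""
    if d <= 0 or d % 2 != 0:
        raise ValueError("this closed form only covers even, positive d")
    half = x / 2.0
    term, total = 1.0, 1.0
    # P(X > x) = exp(-x/2) * sum_{k=0}^{d/2 - 1} (x/2)^k / k!
    for k in range(1, d // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

def tail_anomalies(sq_distances, d, alpha=0.01):
    """Indices whose squared Mahalanobis distance lies in the upper alpha tail."""
    return [i for i, d2 in enumerate(sq_distances)
            if chi2_sf_even(d2, d) < alpha]
```

For $d = 2$ and $\alpha = 0.01$ this flags squared distances above roughly 9.21. The third weakness listed above still applies: if the mean and covariance were estimated from contaminated data, the distances themselves are distorted before the threshold is ever applied.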


Mahalanobis vs Euclidean Distance

Mahalanobis normalizes for variance

[Figure: scatter plot (both axes from −10 to 10) with labeled points 14, 16, 29, 61]

Point pair    Mahalanobis    Euclidean
(14, 29)      5.07           11.78
(16, 61)      4.83           6.84


Distance-based anomalies

Intuition: a data point which is far away from its nearest neighbors is a candidate anomaly

Several definitions capture the above intuition

DB(p, D) anomaly [20]: an object $o$ in a data set $T$ is a DB(p, D) anomaly if at least a fraction $p$ of the objects in $T$ have distances greater than $D$ from $o$

Generalizes the notion of “three standard deviations from the mean”

This definition had a huge influence on subsequent development in outlier detection
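
The DB(p, D) definition translates almost directly into code. The following is a minimal Python sketch with a hypothetical function name; it defaults to Euclidean distance via `math.dist`, but any distance function can be passed in.

```python
import math

def is_db_outlier(o, data, p, D, dist=math.dist):
    """True if at least a fraction p of the objects in data lie farther than D from o."""
    far = sum(1 for q in data if dist(o, q) > D)
    return far >= p * len(data)
```

For example, with four points near the origin and one point at `(10, 10)`, the distant point is a DB(0.8, 5) outlier while the others are not.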


DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution $N(0, 1)$. Then if $O$ is a DB(p, D) outlier,

$$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2} \, dx \le 1 - p$$

Example: if $O$ is 3 (3 standard deviations away from the mean), then it is a DB(p = 0.999, D = 0.1) outlier

Thus, for particular settings of $D$ and $p$, DB(p, D) captures standard outliers

But it is much more general (e.g., any distance metric can be used)
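
The example can be checked numerically with the standard library. This sketch evaluates the integral above through the standard Normal CDF; the function names are hypothetical.

```python
import math

def std_normal_cdf(x):
    """CDF of the standard Normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mass_within(o, D):
    """Probability mass of N(0, 1) within distance D of the point o."""
    return std_normal_cdf(o + D) - std_normal_cdf(o - D)

def is_db_outlier_normal(o, p, D):
    """DB(p, D) test for a point o when the data follow N(0, 1)."""
    return mass_within(o, D) <= 1.0 - p
```

For $O = 3$ and $D = 0.1$ the mass within the interval is about 0.0009, which is below $1 - p = 0.001$, so the point qualifies; a point at the mean does not.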


Distance-based methods: DB(k, N)

DB(k, N) anomaly [28]: the top $N$ data instances whose distances to their $k$-th nearest neighbor are largest

Several advantages: ranking of anomalies is more intuitive; setting of parameters is generally easier

A Simple Nested Loop (SNL) algorithm can be used to select the top $N$ DB(k, N) outliers. Time complexity is $O(n^2 d)$, where $n$ is the database size and $d$ is the dimensionality
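
A simple nested loop over all pairs can be sketched as follows. This is an illustrative Python version with hypothetical names; it sorts each point's distances for brevity, so it is slightly more expensive than the $O(n^2 d)$ bound, and it assumes the data set has more than $k$ points.

```python
import heapq
import math

def kth_nn_distance(i, data, k):
    """Distance from data[i] to its k-th nearest neighbour (inner loop)."""
    dists = sorted(math.dist(data[i], q) for j, q in enumerate(data) if j != i)
    return dists[k - 1]

def top_n_outliers_snl(data, k, n_top):
    """Top n_top points ranked by k-th nearest neighbour distance (outer loop)."""
    scores = [(kth_nn_distance(i, data, k), i) for i in range(len(data))]
    return heapq.nlargest(n_top, scores)
```

Each result is a `(score, index)` pair, so the ranking and the outlier score are available together.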


Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if its distance to its current $k$-th nearest neighbor is less than the score of the weakest anomaly among the top $N$ anomalies found so far

A large number of non-anomalies can be pruned without carrying out a full data search

Complexity: nearly $O(n)$
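
The pruning rule can be added to a nested loop with two heaps. The Python sketch below is in the spirit of [3] but is not a reproduction of that algorithm: the names are hypothetical, the data are scanned in a random order (an assumption that makes early pruning likely), and the data set is assumed to contain more than $k$ points.

```python
import heapq
import math
import random

def top_n_outliers_pruned(data, k, n_top, seed=0):
    """Top n_top points by k-th nearest neighbour distance, with pruning."""
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)
    top = []       # min-heap of (score, index); top[0] is the weakest anomaly
    cutoff = 0.0   # score of the weakest anomaly once the heap is full
    for i in order:
        knn = []   # negated distances: max-heap of the k smallest seen so far
        pruned = False
        for j in order:
            if j == i:
                continue
            d = math.dist(data[i], data[j])
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # The current k-th NN distance can only shrink, so it is an
            # upper bound on the final score of point i.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -knn[0]
        if len(top) < n_top:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_top:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

Once the cutoff is large, most normal points are abandoned after a handful of distance computations, which is where the near-linear behaviour comes from.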


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable to large datasets (nearly $O(n)$ with pruning)

Capable of finding only global anomalies

Can lead to non-intuitive results in top-k situations

[Figure: scatter plot of dimension 1 vs. dimension 2 (both axes from 0 to 70) with four labeled points p1, p2, p3, p4]


Density-based anomaly

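Calculate the density of an object based on the density of its $k$ nearest neighbours:

$$\mathrm{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \mathrm{dist}_k(p, q)}{|N_k(p)|} \right)$$

$$\text{relative-density}(p) = \frac{\mathrm{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \mathrm{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$

LOF indicates a degree of local outlier-ness [6]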
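
The three formulas can be followed step by step in a short Python sketch. It is a simplified illustration, not the full LOF algorithm of [6]: it assumes plain Euclidean distance for $\mathrm{dist}_k$ (rather than a reachability distance), distinct points so that no mean distance is zero, and hypothetical function names.

```python
import math

def knn(i, data, k):
    """(distance, index) pairs of the k nearest neighbours of data[i]."""
    others = sorted((math.dist(data[i], q), j)
                    for j, q in enumerate(data) if j != i)
    return others[:k]

def density(i, data, k):
    """Inverse of the mean distance from data[i] to its k nearest neighbours."""
    neighbours = knn(i, data, k)
    return len(neighbours) / sum(d for d, _ in neighbours)

def anomaly_score(i, data, k):
    """1 / relative-density: mean neighbour density divided by own density."""
    neighbours = knn(i, data, k)
    mean_nbr_density = sum(density(j, data, k) for _, j in neighbours) / len(neighbours)
    relative_density = density(i, data, k) / mean_nbr_density
    return 1.0 / relative_density
```

Points inside a uniform cluster score close to 1, while a point whose neighbours live in a much denser region scores well above 1.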


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

What is needed is a method combining the strengths of distance-based and density-based approaches: a distance-based approach which can capture density

[Figure: two clusters C1 and C2 with outliers O1 and O2]


Commute time

Commute time between $i$ and $j$ is the expected number of steps that a random walk starting at $i$ will take to reach $j$ once and go back to $i$ for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$:

$$c_{ij} = V_G (e_i - e_j)^T L^+ (e_i - e_j)$$

$L^+$: pseudo-inverse of $L$

$V_G$: graph volume

$e_i$: $i$-th column of the identity matrix

Commute time is a squared Euclidean distance in the space spanned by the eigenvectors of $L$:

$$c_{ij} = V_G \left[(S^{-1/2} V^T)(e_i - e_j)\right]^T \left[(S^{-1/2} V^T)(e_i - e_j)\right]$$

$V$, $S$: eigenvectors and eigenvalues of $L$
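
As a sanity check of the formula, consider the smallest possible graph, which is an illustrative example and not part of the original slides: two nodes joined by a single edge of weight 1. The graph volume is taken here to be the sum of the node degrees.

```latex
L = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad
L^+ = \frac{1}{4}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad
V_G = 1 + 1 = 2

c_{12} = V_G \, (e_1 - e_2)^T L^+ (e_1 - e_2)
       = 2 \cdot \frac{1}{4}\,(1 + 1 + 1 + 1)
       = 2
```

This matches the definition: a random walk starting at node 1 must step to node 2 and then step back, so the commute time is exactly 2 steps.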


Anomaly detection using commute time (CDOF)

Construct the mutual $k$ nearest neighbor graph $G$ from the dataset

Compute the Laplacian matrix $L$ of $G$ and its eigensystem

Find the top $N$ anomalies using the distance-based technique in commute time, with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies
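
The first step, building the mutual $k$ nearest neighbor graph, can be sketched in Python as follows. This is an illustration with hypothetical names; it stores the Euclidean distance on each edge as a placeholder, whereas the edge weighting actually used by CDOF is not specified here.

```python
import math

def knn_indices(i, data, k):
    """Set of indices of the k nearest neighbours of data[i]."""
    ranked = sorted((math.dist(data[i], q), j)
                    for j, q in enumerate(data) if j != i)
    return {j for _, j in ranked[:k]}

def mutual_knn_graph(data, k):
    """Edges (i, j, distance) where i and j are among each other's k nearest neighbours."""
    nbrs = [knn_indices(i, data, k) for i in range(len(data))]
    return [(i, j, math.dist(data[i], data[j]))
            for i in range(len(data)) for j in nbrs[i]
            if i < j and i in nbrs[j]]
```

An edge is kept only when the neighbour relation holds in both directions, so a far-away point that merely "borrows" neighbours from a dense cluster gets no edges to that cluster.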


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections
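
A random projection itself is only a matrix multiplication. The Python sketch below is a generic Gaussian random projection, offered as an illustration and not as the specific projection used by PINN; the function name and the `seed` parameter are hypothetical.

```python
import math
import random

def random_projection(data, target_dim, seed=0):
    """Project each point onto target_dim random Gaussian directions."""
    rng = random.Random(seed)
    d = len(data[0])
    # Entries N(0, 1) / sqrt(target_dim) preserve squared distances in expectation.
    R = [[rng.gauss(0.0, 1.0) / math.sqrt(target_dim) for _ in range(d)]
         for _ in range(target_dim)]
    return [[sum(r * x for r, x in zip(row, point)) for row in R]
            for point in data]
```

Neighbour searches can then be run with an index in the low-dimensional space, where index structures remain effective.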


PINN Algorithm (ICDM 2010)


PINN Guarantee

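The PINN algorithm provides probabilistic guarantees

Under certain assumptions about the intrinsic dimensionality ($c$), with high probability

$$\frac{1 - \epsilon}{1 + \epsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1 + \epsilon}{1 - \epsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ is the LOF score estimated by PINN

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust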


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms: using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilization of device memory; minimize CPU ⇔ GPU transfer of data; keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA programming and memory model. The host launches a kernel as a grid of thread blocks, Block(0,0) to Block(1,2), on the device; each thread block holds threads Thread(0,0) to Thread(2,3). The device comprises streaming multiprocessors (SM 0 to SM 2), each with stream processors (SP 0 to SP 7), registers and shared memory, alongside global, texture and constant memory; the host has its own host memory.]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory; data layout in memory should be optimized

Each thread computes the distance between a pair of instances; this time can be used to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
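
As a point of reference for the GPU version, the serial DB(k, N) computation can be sketched in plain Python. This is a minimal sketch and not the CUDA implementation; the function names `kth_nn_distance` and `db_k_n` and the toy data are illustrative assumptions.

```python
import heapq
import math


def kth_nn_distance(data, i, k):
    """Euclidean distance from data[i] to its k-th nearest neighbor."""
    dists = [math.dist(data[i], q) for j, q in enumerate(data) if j != i]
    return heapq.nsmallest(k, dists)[-1]


def db_k_n(data, k, n_top):
    """Indices of the n_top instances with the largest k-th NN distance.

    Serial O(n^2) version: every pairwise distance is computed.
    """
    scores = [(kth_nn_distance(data, i, k), i) for i in range(len(data))]
    return [i for _, i in heapq.nlargest(n_top, scores)]


# Hypothetical toy data: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top = db_k_n(points, k=2, n_top=1)
```

The pairwise distance computation inside `kth_nn_distance` is the part that a GPU version would tile into blocks, with one thread per pair.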

[Figure: pairwise distance computation over the data matrix (dimensions labeled n and d)]


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules (X ⇒ Y); choose high-confidence rules (P(Y|X))

For a test record Z = ⟨X, Y⟩, find rules of the form P(Y|X); the record is anomalous when Y is not observed although X is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute P(Z) for test data record Z

Flag anomaly if P(Z) < δ

Using Similarity Metrics [10]

Use a similarity measure (S(X1, X2))

Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and the combination is observed in test record t, then t is anomalous

For a test record t: for each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$

Score t based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets: only consider subsets up to size k; ignore subsets with frequency less than a threshold α; avoid comparing independent subsets of attributes, $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$

N: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}, \qquad r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure to query for counts (AD Trees [26])

AD Trees work faster for low-arity attributes
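
The Laplace-smoothed r-value can be illustrated with a small Python sketch. It counts by a linear scan rather than with AD Trees, and the function names and the two-attribute toy data are assumptions made for illustration only.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(train, test, a_idx, b_idx):
    """r(a_t, b_t) for disjoint attribute index sets a_idx and b_idx."""
    n = len(train)
    a_t = tuple(test[i] for i in a_idx)
    b_t = tuple(test[i] for i in b_idx)
    c_a = c_b = c_ab = 0
    for row in train:
        match_a = tuple(row[i] for i in a_idx) == a_t
        match_b = tuple(row[i] for i in b_idx) == b_t
        c_a += match_a
        c_b += match_b
        c_ab += match_a and match_b
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Hypothetical training data: (habitat, type) pairs.
train = [("urban", "poisonous")] * 5 + [("grasses", "edible")] * 5
r_seen = r_value(train, ("urban", "poisonous"), [0], [1])
r_unseen = r_value(train, ("urban", "edible"), [0], [1])
```

A combination that co-occurs often in training gets an r-value above 1, while an unseen combination of individually common values gets a low r-value and would be flagged.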


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure; plug into a distance/density/clustering-based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$$

Standard dynamic programming method is slow; faster versions available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
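
Both sequence distances can be sketched in a few lines of Python. The LCS routine below is the standard dynamic programming method (not the faster Hunt-Szymanski variant), and the function names are illustrative assumptions.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1 - lcs_length(a, b) / math.sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = lcs_distance(s1, s2)
```

Unlike the Hamming distance, the LCS-based distance accepts sequences of unequal length, as `s1` and `s2` show.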


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence (n − k + 1)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find its frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; one can construct anomalous sequences that will escape detection
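
A minimal Python sketch of the window dictionary and the frequency-based score follows. The exact normalization is not given on the slide; the form used here (number of windows divided by the summed frequency, with 1 added to avoid division by zero) is an assumption, as are the function names.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 sliding windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Normal dictionary: every window seen in training, with its count."""
    counts = Counter()
    for s in normal_seqs:
        counts.update(windows(s, k))
    return counts


def window_score(seq, dictionary, k):
    """Inverse of the aggregate window frequency, normalized by length.

    The +1 in the denominator is an assumed smoothing term.
    """
    wins = windows(seq, k)
    total = sum(dictionary[w] for w in wins)
    return len(wins) / (total + 1)


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
]
dictionary = build_dictionary(normal, k=3)
score_normal = window_score(normal[0], dictionary, k=3)
score_odd = window_score("login pwd login pwd login pwd logout".split(),
                         dictionary, k=3)
```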


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
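
The fixed-order variant of this model can be sketched in Python as follows. It does not implement Probabilistic Suffix Trees: unseen or rare contexts are handled by an assumed probability floor (`floor`), the first k symbols are skipped because they lack a full context, and the function names are illustrative.

```python
import math
from collections import Counter


def ngram_counts(seqs, n):
    """Frequency f of every length-n subsequence in the normal data."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - n + 1):
            counts[tuple(s[i:i + n])] += 1
    return counts


def markov_score(seq, normal_seqs, k, floor=1e-3):
    """Inverse of the length-normalized (geometric mean) probability."""
    if len(seq) <= k:
        raise ValueError("sequence must be longer than k")
    context_counts = ngram_counts(normal_seqs, k)
    full_counts = ngram_counts(normal_seqs, k + 1)
    log_p = 0.0
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        gram = tuple(seq[i - k:i + 1])
        seen = context_counts[context]
        p = full_counts[gram] / seen if seen else 0.0
        log_p += math.log(max(p, floor))  # floor is an assumed fallback
    return math.exp(-log_p / (len(seq) - k))


normal = [["a", "b", "c"]]
score_seen = markov_score(["a", "b", "c"], normal, k=1)
score_unseen = markov_score(["c", "a"], normal, k=1)
```

Working in log space avoids numerical underflow for long sequences.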


Finding Contextual Anomalies

Sometimes contextual information is available about data: not used directly as a feature; well understood, with no anomalies in the context; can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: Number of Fever Cases versus Maximum Daily Temperature, with a "Clearest Outlier" and a "Conditional Anomaly" marked]

Contextual anomalies - anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude); graph context (edges, weights); temporal location; domain specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context); explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

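Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussians U1, U2, U3 in the environmental space (u1, u2) mapped to Gaussians V1, V2, V3 in the indicator space (v1, v2), with p(V1|U3) = 0.2, p(V2|U3) = 0.2, p(V3|U3) = 0.6]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j \mid U_i)$

4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$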
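
The scoring step (step 4) can be sketched in Python for the one-dimensional case. The Gaussians and the mapping are fixed by hand here rather than learned, and all numbers (temperatures, fever counts, mapping probabilities) are hypothetical values chosen for illustration.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    return sum(
        gauss_pdf(u, *U[i])
        * sum(gauss_pdf(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )


# Hypothetical model: environment = temperature, indicator = fever cases.
U = [(10.0, 3.0), (30.0, 3.0)]       # (mean, std): cold days, hot days
V = [(5.0, 2.0), (50.0, 5.0)]        # (mean, std): few cases, many cases
mapping = [[0.1, 0.9], [0.9, 0.1]]   # mapping[i][j] = p(V_j | U_i)

s_expected = conditional_score(30.0, 5.0, U, V, mapping)   # hot, few cases
s_unusual = conditional_score(30.0, 50.0, U, V, mapping)   # hot, many cases
```

A low score marks an instance whose indicator values are unlikely given its context.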


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

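Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$

2. Draw topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$

3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$

4. For each point in the group:

   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$

   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM over the variables π, y_m, α, θ_m, z_mn, x_mn, β and η, with plates N, M, K and T]

Model Parameters

$\mathcal{M}(\pi)$ - Multinomial

Genre - $\mathrm{Dir}(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$ - GIW

Point generators $P(x_{mn} \mid \beta_{m,k})$ - Multivariate Gaussian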


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix; traditional evaluation metrics (class imbalance); ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate (pseudo false negative estimation techniques [25])

                 Predicted
                 a     n
Actual    a      TP    FN
          n      FP    TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN}, \quad Rec\,(R) = \frac{TP}{TP + FN}, \quad Prec\,(P) = \frac{TP}{TP + FP}, \quad F = \frac{2 \cdot R \cdot P}{R + P}$$
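
The effect of class imbalance on these metrics is easy to see with a short Python sketch. The counts below are hypothetical, and the function name `evaluate` is an assumption.

```python
def evaluate(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from a confusion matrix."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * recall * precision / (recall + precision)
                 if recall + precision else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f": f_measure,
    }


# Hypothetical outcome: 10 true anomalies among 1000 instances.
result = evaluate(tp=5, fn=5, fp=15, tn=975)
```

Accuracy is 0.98 here even though only half of the anomalies are found and three out of four alarms are false, which is why recall, precision and F are preferred under class imbalance.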


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use threshold δ on score (scale issues [21]); take top x as anomalies

ROC curve by varying x or δ

Quality of output: does the output "suggest" x or δ? Which output is better?


Unifying Scores

Different methods assign scores in different ranges

kNN-based scores [0, 1]; LOF scores [1, 6]; ABOD scores [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal and anomalous instances

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization

1. $N(o) = S(o) / S_{max}$

2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling); suited for high-dimensional data

3. $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling), where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function; suited for low-dimensional data
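
Two of the regularizations and the Gaussian scaling can be sketched with the Python standard library. The function names and sample scores are illustrative assumptions; the sketch assumes positive scores for the logarithmic regularization and a non-zero standard deviation for the scaling.

```python
import math
import statistics


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base), e.g. base = 1 for LOF."""
    return [max(0.0, s - base) for s in scores]


def regularize_log_inverse(scores):
    """R(o) = -log(S(o) / S_max), for scores where low means anomalous."""
    s_max = max(scores)
    return [-math.log(s / s_max) for s in scores]


def normalize_gaussian(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
            for s in scores]


# Hypothetical LOF scores: four inliers near 1 and one clear outlier.
lof_scores = [1.0, 1.1, 0.9, 1.0, 6.0]
normalized = normalize_gaussian(regularize_baseline(lof_scores, base=1.0))

# Hypothetical ABOD-like scores, where the smallest value is the outlier.
inverted = regularize_log_inverse([80000.0, 40000.0, 10.0])
```

After scaling, inliers collapse to 0 and the outlier is stretched toward 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior described earlier.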


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27]; multivariate categorical data [5]; discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point z = {x, y}, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let z′ = {x′, y′} ∈ D such that the distance between y and y′ is maximum

Replace x with x′ and add z back to the data set

[Figure: points z and z′ in the (x, y) plane and the resulting injected anomaly]
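
The perturbation steps can be sketched in Python as follows. Records are assumed to be pairs `(x, y)` of coordinate tuples and the distance is assumed to be Euclidean; both choices, like the function name, are illustrative.

```python
import math
import random


def inject_anomaly(data, z_index, sample_size, rng):
    """Return a perturbed copy of data[z_index].

    The x part is replaced by x' taken from the sampled record whose
    y' is farthest from y.
    """
    _, y = data[z_index]
    sample = rng.sample(data, sample_size)            # random sample D
    x_prime, _ = max(sample, key=lambda rec: math.dist(y, rec[1]))
    return (x_prime, y)


# Hypothetical data in which x and y are perfectly correlated.
records = [((float(i),), (float(i),)) for i in range(10)]
anomaly = inject_anomaly(records, z_index=0, sample_size=len(records),
                         rng=random.Random(0))
```

The injected point keeps a valid x and a valid y, but their combination breaks the relationship between the two partitions.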


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available? Feature identification; defining normal and anomalous behavior; tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart of counts per period, vertical axis from 0 to 3,000,000]

Period                   Count
2002                     20,547
2003                     18,827
2004                     69,107
2005                     113,025
2006                     140,690
2007                     624,267
2008                     1,656,227
July 2009 - June 2010    2,704,836


Network Topology

A typical network

[Figure: a router network with nodes 1 to 4, link measurement y_{1t} and OD flows x_{1t}, x_{2t}, x_{3t}, x_{4t}]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

Link measurement is the traffic that enters at a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from an OD flow may not be noticeable. Thus the naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
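
The five steps can be sketched in pure Python for the special case k = 1. Power iteration is used here in place of a full eigenvalue decomposition, which is an assumption made to keep the sketch free of external libraries; the function names and toy data are also illustrative.

```python
import math


def leading_component(centered, iters=100):
    """Leading eigenvector of the sample covariance (power iteration)."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary start vector (assumption)
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_norms(data):
    """Norm of (I - P P^T) z for each mean-centered row z, with k = 1."""
    n, d = len(data), len(data[0])
    mean = [sum(r[a] for r in data) / n for a in range(d)]
    centered = [[r[a] - mean[a] for a in range(d)] for r in data]
    p = leading_component(centered)
    norms = []
    for z in centered:
        proj = sum(z[a] * p[a] for a in range(d))        # P^T z
        resid = [z[a] - proj * p[a] for a in range(d)]   # (I - P P^T) z
        norms.append(math.sqrt(sum(x * x for x in resid)))
    return norms


# Hypothetical data: points on the line y = 2x plus one off-line point.
data = [(i, 2 * i) for i in range(-5, 6)] + [(0, 3)]
norms = residual_norms(data)
most_anomalous = max(range(len(norms)), key=norms.__getitem__)
```

The off-line point has by far the largest residual norm, so a threshold on that norm separates it from the rest.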


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

DoS attack: multiple source IP addresses concentrated on a single destination IP address

Network scan: dispersed distribution of destination addresses

Most worms/viruses also induce some change in the distribution of certain features

However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure: port scan anomaly; the traffic feature changes while traffic volume remains more or less the same. Annotations: "Port scan dwarfed in volume metrics…" and "But stands out in feature entropy, which also reveals its structure". Figure source [22]]

We can use entropy to capture the variations in the traffic feature

Takes value 0 when the distribution is maximally concentrated

Takes value $\log_2 N$ when the distribution is maximally dispersed
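
The two extreme values of the entropy are easy to check with a short Python sketch. The function name `sample_entropy` and the port lists are illustrative assumptions.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Empirical entropy (in bits) of the observed feature values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical destination ports seen in two time bins.
concentrated = [80] * 8                 # all traffic goes to one port
dispersed = list(range(1000, 1008))     # a scan touching 8 distinct ports
h_concentrated = sample_entropy(concentrated)
h_dispersed = sample_entropy(dispersed)
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, whereas the entropy moves from 0 to log2(8) = 3 bits.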


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data: transactions between different players in the system; meta information about the individuals; an underlying graph structure

Challenges: track and model human behavior; anomalies caused by adaptive human adversaries; massive data sizes

Domain                         Players
Insurance (auto/health) [a]    Claimant, Provider, Payer
Telecommunications             Customer, Provider
Credit Cards                   Customer, Supplier, Bank
Web Advertising                User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g., user X makes n calls on average in January

2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers

2. Identify small clusters or outlying profiles as anomalies

Strengths: anomaly detection is fast (good for real time); results are easy to explain

Weaknesses: need to create and maintain a large number of profiles; not dynamic; adequate historical data might not be available; too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph: communication networks (phone, email, SMS); provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node

2. $E_i$: number of edges in egonet

3. $W_i$: total weight of egonet

4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
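
The first three egonet features can be computed with a short Python sketch. The edge-list representation `(u, v, weight)`, the absence of self-loops and the function name are assumptions; the principal eigenvalue is omitted to keep the sketch short.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    edges: list of undirected, weighted edges (u, v, weight), no self-loops.
    """
    neighbors = set()
    for u, v, _ in edges:
        if u == node:
            neighbors.add(v)
        elif v == node:
            neighbors.add(u)
    ego = neighbors | {node}
    inside = [w for u, v, w in edges if u in ego and v in ego]
    return len(neighbors), len(inside), sum(inside)


# Hypothetical graphs: a star around node 0 and a 4-node clique.
star = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)]
clique = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (1, 2, 1), (1, 3, 1), (2, 3, 1)]
star_features = egonet_features(star, 0)
clique_features = egonet_features(clique, 0)
```

For the star the edge count equals the degree, while for the clique it grows quadratically with the degree; these are the two extremes of the egonet.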


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure.

Faster method to identify specific types of anomalous structures:

Identify relevant feature pairs and their power-law relationship.

E.g., the Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars.

\[ E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2 \]

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

\[ S_i = \frac{\max(y_i,\, C x_i^{\theta})}{\min(y_i,\, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right) \]
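The score can be written as a small function. This is a sketch that assumes the power-law parameters C and θ have already been fitted elsewhere; the values used in the example are made up for illustration.

```python
import math

def power_law_score(y, x, C, theta):
    """Score how far an observed y deviates from the fitted value C * x**theta.

    Assumes y > 0 and C * x**theta > 0 so that the ratio is defined.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# A node that sits exactly on the fitted curve scores zero.
on_curve = power_law_score(y=8.0, x=4.0, C=1.0, theta=1.5)
# A star-like egonet: 10 neighbors but only 10 edges (hypothetical fit C=1, theta=1.5).
star_like = power_law_score(y=10.0, x=10.0, C=1.0, theta=1.5)
```

The ratio term penalises relative deviation and the log term absolute deviation, so the score is zero on the curve and grows in either direction away from it.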


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph: healthcare data (beneficiaries vs. providers), insider trading (traders vs. stocks).

Objective: identify anomalous links.

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a.

RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b.

Use the relevance scores to compute the normality score for a node $t \in V_2$:

1. Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$.
2. Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$.
3. Normality score = mean of the non-diagonal entries of the similarity matrix.

[Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} connected by edges E]


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks (e.g., an anthrax attack).

Data: emergency department visits; grocery data (example: [14]); clinical visits; weather/climate data.

Challenges:

Weak signals in the data (ED cases involving cough ⇒ flu or SARS).

Integration of multiple signals (lag analysis).

Accounting for spatial and temporal correlations.


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, with environmental and response variables.
2. Sample from the BN: ($DB_{baseline}$ | current environment).
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$. Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions.
4. Find the p-value for each rule using the $\chi^2$ test. Null hypothesis: the rows and columns of the table for $DB_{baseline}$ and $DB_{current}$ are independent.
5. Find the rule with the smallest p-value. Repeat from Step 2.
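The test in Step 4 can be sketched for a single rule as a 2×2 contingency table. The table layout (rows: current vs. baseline; columns: records matching vs. not matching the rule) and the counts are assumptions for illustration; the p-value uses the closed form for one degree of freedom.

```python
import math

def chi2_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts: [matching, not matching] for DB_current and DB_baseline.
same_rate = chi2_2x2([[10, 90], [100, 900]])   # 10% in both tables
elevated = chi2_2x2([[20, 80], [50, 950]])     # 20% today vs. 5% at baseline
```

With identical rates the statistic is zero and the p-value is 1; the elevated table yields a very small p-value, which is what flags a rule as unusual.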


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships:

What happened yesterday? What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:

Adds nodes for temporal and spatial dependencies. Needs more data to train.


Anomaly Detection in Climate and Weather

Science questions: identify natural and anthropogenic disasters; identify long time-scale events (droughts, atmospheric rivers, cold fronts, etc.).

Data: ground observations; remote sensing data (satellites, air-borne); climate model simulation outputs; multiple variables, spatio-temporal (often with a height dimension as well).

Challenges: model spatio-temporal relationships across multiple variables; explain the cause of anomalies; massive data sizes.

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series: the difference from a "base period" (too simplistic).

This brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures) and removes seasonality.

Anomalies are used to understand climate and weather phenomena.

Southern Oscillation Index (SOI): the difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia.

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

\[ v'_i = v_i - b_i \]

where $b_i$ is the base (reference).

How to choose $b_i$? Options include: the mean of all data for location i; monthly mean values (account for seasonality); monthly z-score values; the median (more robust); a shorter "reference period"; a 30-year moving window.

Different methods show statistically significant differences [19]. What is the right strategy? One option is a weighted mean of different strategies (pick the weights using Monte Carlo sampling).
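As an illustration of one of these choices, the sketch below uses the monthly mean as the base $b_i$ for a single location. The function name and the input layout (parallel lists of values and calendar months) are assumptions, not part of the tutorial.

```python
def monthly_anomalies(values, months):
    """Return v' = v - b, where b is the long-term mean of the same calendar month."""
    sums, counts = {}, {}
    for v, m in zip(values, months):
        sums[m] = sums.get(m, 0.0) + v
        counts[m] = counts.get(m, 0) + 1
    base = {m: sums[m] / counts[m] for m in sums}      # one base value per month
    return [v - base[m] for v, m in zip(values, months)]

# Two Januaries and two Julys for one hypothetical location.
values = [10.0, 20.0, 14.0, 22.0]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

Because each month is compared only with its own long-term mean, the seasonal cycle drops out and only the year-to-year departures remain.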


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both.

Find persistent spatiotemporal anomalies in precipitation data.

A two-step approach:

1. Find precipitation anomalies using thresholds.
2. Find large connected components across space and time (Matlab: bwlabel/bwlabeln).
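The two steps can be sketched without Matlab as a breadth-first search over a (time, row, column) grid. The choice of 6-connectivity (face neighbors in space and time), the function name, and the toy data are assumptions made for this example.

```python
from collections import deque

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def connected_anomalies(cube, threshold):
    """Group cells with cube[t][r][c] < threshold into 6-connected components.

    Returns a list of components (lists of (t, r, c) cells), largest first.
    """
    T, R, C = len(cube), len(cube[0]), len(cube[0][0])
    seen, components = set(), []
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if cube[t][r][c] >= threshold or (t, r, c) in seen:
                    continue                      # step 1: keep only anomalous cells
                seen.add((t, r, c))
                queue, comp = deque([(t, r, c)]), []
                while queue:                      # step 2: grow the component
                    ct, cr, cc = queue.popleft()
                    comp.append((ct, cr, cc))
                    for dt, dr, dc in NEIGHBORS:
                        cell = (ct + dt, cr + dr, cc + dc)
                        nt, nr, nc = cell
                        if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                                and cell not in seen
                                and cube[nt][nr][nc] < threshold):
                            seen.add(cell)
                            queue.append(cell)
                components.append(comp)
    return sorted(components, key=len, reverse=True)

# Two time steps of a 2x3 grid of precipitation anomalies (made-up values).
cube = [[[-2, -2, 0], [0, 0, 0]],
        [[-2, 0, 0], [0, 0, -2]]]
components = connected_anomalies(cube, threshold=-1)
```

Filtering the returned list by size then keeps only the large, persistent events.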

Follow-up science question: explain the cause.

Figure: drought video, 1901-2006 (drought_1901_2006_1.avi). Video courtesy of Dr. Arindam Banerjee.


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Data: water vapor content. Method: anomalies using a threshold, then connected components. Example source: http://newscenter.lbl.gov

Cold Fronts [23]

Data: surface winds and potential temperature fields.

Methodology:

1. Compute features for every grid cell.
2. Cluster the grid cells into K clusters.
3. Label clusters as anomalous or not using thresholds.
4. Filter out false positives using domain knowledge.


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:

Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic).

Anomalies are not actionable: not identified in a timely fashion; resolution is not fine enough; cause not explained.

Anomalies are lost among false positives.

Solution: good validation data during design; a clear definition of a domain anomaly and its distinction from other potential competitors.


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Statistical Methods

Let's begin with the univariate Normal distribution:

\[ f(x) = \frac{1}{(2\pi\sigma^2)^{1/2}} \, e^{-[(x-\mu)/\sigma]^2/2} \]

Notice that the exponent measures the square of the deviation from the mean, normalized by the standard deviation:

\[ \left(\frac{x-\mu}{\sigma}\right)^2 = (x-\mu)(\sigma^2)^{-1}(x-\mu) \]

For d dimensions, the exponent is called the (square of the) Mahalanobis distance:

\[ (x-\mu)' \Sigma^{-1} (x-\mu) \]

where $\Sigma$ is the $d \times d$ variance-covariance matrix.
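A minimal numerical sketch of the squared Mahalanobis distance, assuming NumPy is available. The mean and covariance below are made-up values chosen so that the effect of variance normalization is easy to see.

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)' cov^{-1} (x - mu)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solving cov * z = diff avoids forming the inverse explicitly.
    return float(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff))

mu = [0.0, 0.0]
cov = [[4.0, 0.0],
       [0.0, 1.0]]      # variance 4 along the first axis, 1 along the second

along_wide_axis = mahalanobis_sq([2.0, 0.0], mu, cov)
along_narrow_axis = mahalanobis_sq([0.0, 2.0], mu, cov)
```

Both points are at Euclidean distance 2 from the mean, but the one along the low-variance axis is four times farther in squared Mahalanobis distance.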


Anomaly Detection with Mahalanobis Distance

The key observation is that if data x follows a d-dimensional Gaussian distribution, then

\[ (x-\mu)' \Sigma^{-1} (x-\mu) \approx \chi^2_d \]

Anomalies can be found in the tail of the distribution.

There are three major weaknesses of the above approach:

Data may not follow a Normal distribution, or may be a mixture of distributions.

Both the mean and the variance of $\chi^2_d$ grow with d (the mean is d and the variance is 2d). For high-dimensional data this is a problem.

The mean, and thus the variance, are extremely sensitive to outliers, and we are using them to find anomalies. This often leads to false negatives.


Mahalanobis vs Euclidean Distance

Mahalanobis normalizes for variance

[Figure: scatter plot of the data, both axes ranging from −10 to 10, with points 14, 16, 29, and 61 labeled]

Point Pairs    Mahalanobis    Euclidean
(14, 29)       5.07           11.78
(16, 61)       4.83           6.84


Distance-based anomalies

Intuition: a data point which is far away from its nearest neighbors is a candidate anomaly.

Several definitions capture the above intuition.

DB(p, D) anomaly [20]: an object o in a data set T is a DB(p, D) anomaly if at least a fraction p of the objects in T have distances greater than D from o.

This generalizes the notion of "three standard deviations from the mean".

This definition had a huge influence on subsequent development in outlier detection.


DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution N(0, 1). Then, if O is a DB(p, D) outlier,

\[ \frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2} \, dx \le 1 - p \]

Example: if O is 3 (3 standard deviations away from the mean), then it is a DB(p, D) outlier with p = 0.999 and D = 0.1.

Thus, for particular settings of D and p, DB(p, D) captures standard outliers.

But it is much more general (e.g., it works with any distance metric).
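The example can be checked numerically with the standard Normal CDF written in terms of `math.erf`. The helper names are illustrative; the check simply evaluates the integral above for O = 3 and D = 0.1.

```python
import math

def normal_cdf(x):
    """CDF of the standard Normal distribution N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mass_within(o, d):
    """Probability that a N(0, 1) sample falls within distance d of o."""
    return normal_cdf(o + d) - normal_cdf(o - d)

def is_db_outlier(o, p, d):
    """True if at most a fraction 1 - p of the probability mass lies within d of o."""
    return mass_within(o, d) <= 1.0 - p

mass_at_three = mass_within(3.0, 0.1)
three_is_outlier = is_db_outlier(3.0, p=0.999, d=0.1)
mean_is_outlier = is_db_outlier(0.0, p=0.999, d=0.1)
```

The mass within 0.1 of O = 3 is just under 0.001, so the point qualifies, whereas a point at the mean does not.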


Distance-based methods: DB(k, N)

DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest.

Several advantages: the ranking of anomalies is more intuitive, and setting the parameters is generally easier.

A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Time complexity is $O(n^2 d)$, where n is the database size and d is the dimensionality.
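A nested-loop version can be sketched in a few lines. The function names and the toy data are assumptions for illustration; `math.dist` requires Python 3.8 or later, and the data set is assumed to have more than k points.

```python
import heapq
import math

def kth_nn_distance(data, i, k):
    """Distance from data[i] to its k-th nearest neighbor (inner loop over all points)."""
    dists = [math.dist(data[i], q) for j, q in enumerate(data) if j != i]
    return heapq.nsmallest(k, dists)[-1]

def top_n_outliers(data, k, n):
    """Return the n (score, index) pairs with the largest k-th nearest neighbor distance."""
    scores = [(kth_nn_distance(data, i, k), i) for i in range(len(data))]
    return heapq.nlargest(n, scores)

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top = top_n_outliers(data, k=2, n=1)
```

Every point is compared with every other point, which is where the quadratic cost in n comes from.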


Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far.

A large number of non-anomalies can be pruned without carrying out a full data search.

Complexity: nearly O(n).
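The pruning rule can be sketched as an early exit from the inner loop. This is an illustrative simplification rather than the implementation from [3]: it processes points one at a time and does not shuffle the data, whereas the near-linear behaviour reported in [3] relies on randomization. The data set is assumed to have more than k points.

```python
import heapq
import math

def top_n_outliers_pruned(data, k, n):
    """Top-n points by k-th nearest neighbor distance, with the pruning rule."""
    top = []            # min-heap of (score, index); top[0] is the weakest anomaly
    cutoff = 0.0        # score of the weakest anomaly once n have been found
    for i, p in enumerate(data):
        nearest = []    # negated distances: a max-heap of the k smallest seen so far
        pruned = False
        for j, q in enumerate(data):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The current k-th NN distance can only shrink; stop once below the cutoff.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -nearest[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top = top_n_outliers_pruned(data, k=2, n=1)
```

Once a few strong anomalies have raised the cutoff, most ordinary points are discarded after only a handful of distance computations.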


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data.

Scalable for large datasets (nearly O(n)).

Capable of finding only global anomalies.

Can lead to non-intuitive results in top-k situations.

[Figure: scatter plot of dimension 1 vs. dimension 2, both axes ranging from 0 to 70, with points p1, p2, p3, and p4 labeled]


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours:

\[ \text{density}(p) = \frac{1}{\dfrac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|}} \]

\[ \text{relative-density}(p) = \frac{\text{density}(p)}{\dfrac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)} \]

\[ \text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)} \]

LOF indicates a degree of local outlier-ness [6].
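The three formulas translate directly into code. This sketch uses the plain Euclidean distance for $\text{dist}_k(p, q)$, which is an assumption (LOF [6] uses a reachability distance), and it assumes there are no duplicate points, since a zero mean distance would make the density undefined.

```python
import math

def knn(data, i, k):
    """(distance, index) pairs for the k nearest neighbors of data[i]."""
    dists = sorted((math.dist(data[i], q), j) for j, q in enumerate(data) if j != i)
    return dists[:k]

def density(data, i, k):
    """Inverse of the mean distance from data[i] to its k nearest neighbors."""
    nbrs = knn(data, i, k)
    return 1.0 / (sum(d for d, _ in nbrs) / len(nbrs))

def anomaly_score(data, i, k):
    """Inverse of the density of data[i] relative to its neighbors' densities."""
    nbrs = knn(data, i, k)
    mean_nbr_density = sum(density(data, j, k) for _, j in nbrs) / len(nbrs)
    relative_density = density(data, i, k) / mean_nbr_density
    return 1.0 / relative_density

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
inlier_score = anomaly_score(data, 0, k=2)
outlier_score = anomaly_score(data, 4, k=2)
```

A point inside a uniform cluster scores about 1, while a point whose neighbors are much denser than itself scores well above 1.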


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$.

This calls for a method combining the strengths of the distance-based and density-based approaches: a distance-based approach which can capture density.

[Figure: two clusters, C1 and C2, with outliers O1 and O2]


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time.

Commute time can capture both the distance between points and the data densities.


Computation of commute time

Commute time can be computed using the graph Laplacian matrix L:

\[ c_{ij} = V_G \, (e_i - e_j)^T L^{+} (e_i - e_j) \]

where $L^{+}$ is the pseudo-inverse of L, $V_G$ is the graph volume, and $e_i$ is the i-th column of the identity matrix.

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of L:

\[ c_{ij} = V_G \, [(S^{-1/2} V^T)(e_i - e_j)]^T \, [(S^{-1/2} V^T)(e_i - e_j)] \]

where V and S are the eigenvectors and eigenvalues of L.
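The first formula can be evaluated directly for a small graph. This sketch assumes NumPy, a symmetric weighted adjacency matrix, and a connected graph; it uses a dense pseudo-inverse, so it is only practical for small n.

```python
import numpy as np

def commute_times(adjacency):
    """Matrix of commute times c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j)."""
    A = np.asarray(adjacency, dtype=float)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A                 # graph Laplacian
    L_pinv = np.linalg.pinv(L)               # pseudo-inverse L^+
    volume = degrees.sum()                   # graph volume V_G
    d = np.diag(L_pinv)
    # (e_i - e_j)^T L^+ (e_i - e_j) = L^+_ii + L^+_jj - 2 L^+_ij
    return volume * (d[:, None] + d[None, :] - 2.0 * L_pinv)

# Unweighted path graph 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
C = commute_times(path)
```

On this path the round trip between the two end nodes takes 8 steps in expectation, and between adjacent nodes 4, which matches the random-walk definition.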


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset.

Compute the Laplacian matrix L of G and its eigensystem.

Find the top N anomalies using the distance-based technique in commute time, with the pruning rule.

Complexity: $O(n^3)$.

The commute time method can detect global, local, and group anomalies.


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a near-linear-time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations Ax = b

Spielman and Teng solvers; also see the work by Ioannis Koutis from CMU
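The random-projection idea can be sketched on a small dense graph. In this illustration a dense pseudo-inverse stands in for the near-linear-time solver, so it shows the structure of the estimate but not the speedup; the function name and the parameter `k_rp` (number of random projections) are assumptions made for the example.

```python
import numpy as np

def approx_commute_embedding(W, k_rp, seed=0):
    """Rows of the result give points whose squared distances approximate commute times."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    rows, cols = np.nonzero(np.triu(W, k=1))      # one entry per undirected edge
    m = len(rows)
    weights = W[rows, cols]
    B = np.zeros((m, n))                          # signed edge-node incidence matrix
    B[np.arange(m), rows] = 1.0
    B[np.arange(m), cols] = -1.0
    L = B.T @ (weights[:, None] * B)              # Laplacian L = B^T W B
    rng = np.random.default_rng(seed)
    Q = rng.choice([-1.0, 1.0], size=(k_rp, m)) / np.sqrt(k_rp)
    Y = (Q * np.sqrt(weights)) @ B                # Q W^{1/2} B, shape (k_rp, n)
    # Stand-in for the near-linear-time solver: solve L z = y for each row y of Y.
    Z = Y @ np.linalg.pinv(L)                     # uses the symmetry of L^+
    return np.sqrt(W.sum()) * Z.T                 # scale by sqrt(V_G); one row per node

# Path graph 0 - 1 - 2 with unit weights; the exact commute time between 0 and 2 is 8.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
E = approx_commute_embedding(W, k_rp=2000)
approx_02 = ((E[0] - E[2]) ** 2).sum()
```

Once the embedding is stored, each commute-time query costs one distance computation in `k_rp` dimensions, which is where the logarithmic query time comes from when `k_rp` grows like log n.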


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

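The PINN algorithm provides probabilistic guarantees

Under certain assumptions about the intrinsic dimensionality (c), with high probability:

$\frac{1-\varepsilon}{1+\varepsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1+\varepsilon}{1-\varepsilon} \cdot \mathrm{LOF}(p)$

where $\widehat{\mathrm{LOF}}(p)$ is the LOF of p computed by PINN on the randomly projected data

In practice we do not know the intrinsic dimensionality of the data; however, random projections are quite robust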


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms, using CUDA (Compute Unified Device Architecture, Nvidia)

Need to re-engineer existing algorithms: utilize device memory, minimize CPU ⇔ GPU transfer of data, keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA programming model. The host launches a kernel as a grid of thread blocks on the device, and each block holds a 2-D array of threads. Memory hierarchy: host memory on the host; on the device, streaming multiprocessors (SM 0 to SM 2) each with stream processors (SP 0 to SP 7), registers, and shared memory, plus global, texture, and constant memory]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j into shared memory; the data layout in memory should be optimized

Each thread computes the distance between a pair of instances; this time can be used to load the next chunk of data from host to device memory

Each thread writes its result to the corresponding output block

Sorting can be done efficiently in CUDA [29]
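The block-wise structure of DB(k, N) can be illustrated on the CPU. The sketch below is not a GPU implementation: it uses NumPy and processes one block of rows at a time, which mirrors the tiling of the pairwise distance computation but not the thread-level parallelism. The function name and the `block` parameter are illustrative.

```python
import numpy as np

def top_n_knn_outliers(X, k, n_top, block=256):
    """Indices of the n_top points with the largest distance to their k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq_norms = (X ** 2).sum(axis=1)
    kth_dist = np.empty(n)
    for start in range(0, n, block):              # one block of rows at a time
        stop = min(start + block, n)
        # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        d2 = sq_norms[start:stop, None] + sq_norms[None, :] - 2.0 * X[start:stop] @ X.T
        d2 = np.maximum(d2, 0.0)                  # clip tiny negatives from rounding
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf   # exclude self
        kth_dist[start:stop] = np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])
    order = np.argsort(-kth_dist)[:n_top]
    return order, kth_dist[order]

X = np.array([[0.0], [1.0], [2.0], [3.0], [100.0]])
idx, dist = top_n_knn_outliers(X, k=1, n_top=1)
```

On this toy input the point at 100 is returned, with distance 97 to its nearest neighbor. On a GPU each row block would map to a thread block and the final ranking would use a device-side sort.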



Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$)

Choose high-confidence rules ($P(Y \mid X)$)

For a test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y \mid X)$

Anomaly: Y is not observed when X is observed

Using Bayesian Networks [33]

Learn the Bayesian network structure and parameters

Compute $P(Z)$ for test data record Z

Flag an anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$, with $A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t:

For each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$

Score t based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare exponentially many pairs of subsets:

Only consider subsets up to size k

Ignore subsets with frequency less than a threshold $\alpha$

Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

$C(a_t)$: number of training instances with $A = a_t$

N: total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low-arity attributes
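The Laplace-smoothed r-value can be computed directly from counts. The sketch below is a minimal illustration that recounts the training data on every call (no AD Tree) and uses only the four columns visible in the mushroom table; the function names are illustrative.

```python
from collections import Counter

def laplace(count, n):
    """Smoothed probability estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(train, test_record, attrs_a, attrs_b):
    """r(a_t, b_t) for two disjoint attribute sets, estimated from training counts."""
    n = len(train)
    a_t = tuple(test_record[i] for i in attrs_a)
    b_t = tuple(test_record[i] for i in attrs_b)
    count_a = Counter(tuple(rec[i] for i in attrs_a) for rec in train)
    count_b = Counter(tuple(rec[i] for i in attrs_b) for rec in train)
    count_ab = Counter(
        (tuple(rec[i] for i in attrs_a), tuple(rec[i] for i in attrs_b)) for rec in train
    )
    e_ab = laplace(count_ab[(a_t, b_t)], n)
    return e_ab / (laplace(count_a[a_t], n) * laplace(count_b[b_t], n))

# Columns: cap-shape, cap-surface, habitat, type
train = [
    ("convex", "smooth", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
    ("bell", "smooth", "meadows", "edible"),
    ("convex", "scaly", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
]
seen = r_value(train, ("convex", "smooth", "urban", "poisonous"), [2], [3])
unseen = r_value(train, ("convex", "smooth", "urban", "edible"), [2], [3])
```

The combination (urban, edible) never occurs in the training rows, so its r-value is lower than that of (urban, poisonous), which is the signal the test looks for.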


Finding Anomalies in Discrete Sequences

Many problem formulations:

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance/density/clustering-based method

Simplest: Hamming distance

$h(A_i, B_i) = 1$ if $A_i \ne B_i$, and $0$ if $A_i = B_i$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence:

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A| |B|}}$

The standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
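Both distances are short to write down. The sketch below uses the standard quadratic dynamic program for the LCS rather than the faster Hunt-Szymanski method, and the example sequences follow the login traces shown earlier; the function names are illustrative.

```python
from math import sqrt

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence (standard O(|a||b|) dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = nlcs_distance(s1, s2)
```

The two traces have different lengths, so the Hamming distance is undefined for them, while the normalized LCS distance still applies; this is the "unequal lengths" issue noted above.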


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$ windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find its frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequency over all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; anomalous sequences can be constructed that will escape detection
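The training and testing steps can be sketched as below. The exact form of the score is an assumption: the slide only says "inverse of the aggregate frequencies", so this version takes the inverse of the mean window frequency and adds 1 to the denominator to avoid division by zero.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Counts of every window seen in the normal sequences."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts

def window_score(seq, counts, k):
    """Inverse of the mean window frequency; the +1 avoids division by zero (an assumption)."""
    wins = windows(seq, k)
    mean_freq = sum(counts[w] for w in wins) / len(wins)
    return 1.0 / (1.0 + mean_freq)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
counts = build_dictionary(normal, k=3)
normal_score = window_score(normal[0], counts, k=3)
odd_score = window_score("login pwd login pwd login pwd logout".split(), counts, k=3)
```

None of the windows of the repeated-login trace appear in the dictionary, so it receives the maximum score of 1, while the normal trace scores lower.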


Using Probabilistic Models

Probability of occurrence of sequence S:

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences:

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$:

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

where f is the frequency estimated from the normal dictionary

The anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

This significantly reduces the size of the model
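A fixed-order version of this model can be sketched as follows. Two choices here are assumptions and not from the slides: the score is the negative mean log-probability (a monotone stand-in for the inverse normalized probability), and unseen contexts or symbols get a small probability floor in place of the suffix-tree back-off.

```python
from collections import Counter
from math import log

def ngram_counts(normal_seqs, k):
    """Frequencies f of all length-k contexts and length-(k+1) windows in normal data."""
    ctx, full = Counter(), Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            ctx[tuple(seq[i - k:i])] += 1
            full[tuple(seq[i - k:i + 1])] += 1
    return ctx, full

def markov_score(seq, ctx, full, k, floor=1e-6):
    """Negative mean log-probability of the symbols after the first k; higher is more anomalous."""
    total = 0.0
    for i in range(k, len(seq)):
        c = ctx[tuple(seq[i - k:i])]
        p = full[tuple(seq[i - k:i + 1])] / c if c else 0.0
        total += log(max(p, floor))      # floor stands in for unseen suffixes (an assumption)
    return -total / (len(seq) - k)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
ctx, full = ngram_counts(normal, k=2)
normal_score = markov_score(normal[0], ctx, full, k=2)
odd_score = markov_score("login pwd login pwd login pwd logout".split(), ctx, full, k=2)
```

A probabilistic suffix tree would replace the fixed floor by backing off to the longest suffix with enough support, which also shrinks the stored model.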


Finding Contextual Anomalies

Sometimes contextual information is available about the data: it is not used directly as a feature; it is well understood, with no anomalies in the context; it can reduce false positives and yield interesting anomalies

[Figure, adapted from [30]: number of fever cases against maximum daily temperature, marking the clearest outlier and a conditional anomaly]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude); graph context (edges, weights); temporal location; domain-specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context), or explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

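Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space ($u_1$, $u_2$) mapped to components $V_1, V_2, V_3$ in the indicator space ($v_1$, $v_2$), with $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j) \, p(V_j \mid U_i)$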
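The scoring step can be sketched once the Gaussians and the mapping are given. The sketch below skips the learning steps entirely, reads $p(u \mid U_i)$ and $p(v \mid V_j)$ literally as Gaussian densities (an assumption about the notation), and uses made-up one-dimensional components.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal at x."""
    x, mean, cov = np.atleast_1d(x), np.atleast_1d(mean), np.atleast_2d(cov)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) sum_j p(v|V_j) p(V_j|U_i).

    U and V are lists of (mean, cov) pairs; mapping[i][j] holds p(V_j | U_i).
    """
    p_u = np.array([gaussian_pdf(u, m, c) for m, c in U])
    p_v = np.array([gaussian_pdf(v, m, c) for m, c in V])
    return float(p_u @ (np.asarray(mapping) @ p_v))

# Hypothetical 1-D components: context U_1 maps to V_1, context U_2 maps to V_2.
U = [(0.0, 1.0), (10.0, 1.0)]
V = [(0.0, 1.0), (5.0, 1.0)]
mapping = [[1.0, 0.0], [0.0, 1.0]]
expected_pair = conditional_score(0.0, 0.0, U, V, mapping)
unexpected_pair = conditional_score(0.0, 5.0, U, V, mapping)
```

In this toy setup the indicator value 5 is ordinary on its own but unlikely in the context u = 0, so it receives the lower score.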


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples

1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

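Find anomalous groups in data [34]

Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$
2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point in the group:
   1. Draw topic membership $z_{mn} \sim M(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM over $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\eta$, $\beta$, with plates N, M, K, T]

Model Parameters

$M(\pi)$: multinomial

Genre: $\mathrm{Dir}(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$: GIW (Gaussian-Inverse-Wishart)

Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian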


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: panels for Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log-likelihood with respect to $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix; traditional evaluation metrics (class imbalance); ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate (pseudo false negative estimation techniques [25])

Confusion matrix (a = anomalous, n = normal):

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$Acc = \frac{TP + TN}{\sum}$

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
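The four metrics follow directly from the confusion matrix. The counts in this sketch are hypothetical and chosen to show the class-imbalance effect: accuracy stays high even though most flagged instances are false positives.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure with anomalies as the positive class."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * recall * precision / (recall + precision) if recall + precision else 0.0
    return {"acc": (tp + tn) / total, "rec": recall, "prec": precision, "f": f}

# Hypothetical counts: 10 anomalies among 1000 instances.
m = confusion_metrics(tp=5, fn=5, fp=15, tn=975)
```

Here accuracy is 0.98 while recall is 0.5 and precision is 0.25, which is why accuracy alone is a poor guide for rare anomalies.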


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use a threshold $\delta$ on the score (scale issues [21]), or take the top x as anomalies

ROC curve by varying x or $\delta$

Quality of output: does the output "suggest" x or $\delta$? Which output is better?


Unifying Scores

Different methods assign scores in different ranges: kNN-based scores [0, 1]; LOF scores [1, 6]; ABOD scores [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal and anomalous instances

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling); suited for high-dimensional data
3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{cdf}}{1 - \mu_{cdf}}\right)$ (Gamma scaling), where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$, $\mu_{cdf}$ is the mean of the cdf values, and P is the regularized Gamma function; suited for low-dimensional data
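Gaussian scaling needs only the mean and standard deviation of the scores. The sketch below estimates both from the scores themselves using the population standard deviation, which is an assumption, and the LOF values are made up.

```python
from math import erf, sqrt
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2)))) for every score."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]     # all scores equal: nothing stands out
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]

lof_scores = [1.0, 1.1, 0.9, 1.0, 1.2, 6.0]   # hypothetical LOF values
normalized = gaussian_scaling(lof_scores)
```

Scores at or below the mean are mapped to 0 and the single large score is pushed close to 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior described earlier.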


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27]; multivariate categorical data [5]; discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies: Random Perturbation [30]

Given a data point $z = \{x, y\}$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = \{x', y'\} \in D$ be such that the distance between y and $y'$ is maximum

Replace x with $x'$ and add z back to the data set

[Figure: points z and $z'$ in the (x, y) plane and the resulting injected anomaly]
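The perturbation steps above can be sketched in a few lines. This version assumes numeric features, Euclidean distance on the y-partition, and a uniformly chosen starting point z; the function name and parameters are illustrative.

```python
import numpy as np

def inject_anomaly(data, x_cols, y_cols, sample_size, rng):
    """Perturb one random point z = {x, y}: borrow x' from the sampled point whose y' is farthest from y."""
    data = np.asarray(data, dtype=float)
    z = data[rng.integers(len(data))].copy()
    sample = data[rng.choice(len(data), size=sample_size, replace=False)]
    gaps = np.linalg.norm(sample[:, y_cols] - z[y_cols], axis=1)
    z_prime = sample[np.argmax(gaps)]
    z[x_cols] = z_prime[x_cols]          # keep y, replace x with x'
    return z

rng = np.random.default_rng(0)
data = np.column_stack([np.arange(10.0), np.arange(10.0)])   # points on the line y = x
z_new = inject_anomaly(data, x_cols=[0], y_cols=[1], sample_size=10, rng=rng)
```

Every original point lies on the line y = x, while the returned point pairs a y-value with an x-value taken from the far end of the data, so it falls off the line. The caller is expected to append `z_new` to the data set and record its label.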


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available? Feature identification; defining normal and anomalous behavior; tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS) find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS) detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network anomaly detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 to June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connecting nodes 1 to 4, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from a single OD flow may not be noticeable; thus the naive approach will not work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a vector in $\tilde{Z}$ is large, then it is an "anomaly"
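The five steps can be sketched with NumPy on synthetic data. The data layout (one row per link, one column per time step), the mean-centering, and the synthetic traffic are assumptions made for this illustration, and thresholding of the residual norm is left out.

```python
import numpy as np

def residual_norms(Z, k):
    """Squared norm of each column of Z after removing its projection on the top-k principal axes.

    Z has one column per time step and one row per measured link (layout is an assumption).
    """
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=1, keepdims=True)        # center each link's time series
    cov = Zc @ Zc.T / (Z.shape[1] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)        # ascending eigenvalues
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]] # top-k principal components
    residual = Zc - P @ (P.T @ Zc)                # (I - P P^T) Z
    return (residual ** 2).sum(axis=0)

rng = np.random.default_rng(0)
t = np.arange(200)
Z = np.outer([1.0, 1.0, 1.0, 1.0], 10 * np.sin(t / 10)) + 0.01 * rng.standard_normal((4, 200))
Z[:, 50] += [5.0, -5.0, 0.0, 0.0]                 # injected spike
scores = residual_norms(Z, k=1)
```

The shared sinusoidal pattern is absorbed by the first principal component, so the injected spike at time step 50 dominates the residual norms even though it is small relative to the normal traffic swing.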


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes considerably

Thus anomalies can be detected by thresholding the residual component

Figure source: [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

DoS attack: multiple source IP addresses concentrated on a single destination IP address

Network scan: dispersed distribution of destination addresses

Most worms/viruses also induce some change in the distribution of certain features

However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure, source [22]: port scan anomaly. The traffic feature changes, but the traffic volume remains more or less the same: the port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure]

We can use entropy to capture the variations in a traffic feature:

Takes value 0 when the distribution is maximally concentrated

Takes value $\log_2 N$ when the distribution is maximally dispersed
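The two extreme values of the entropy can be checked on a toy feature. This is a minimal sketch using the empirical distribution of observed values; the IP addresses are made up.

```python
from collections import Counter
from math import log2

def feature_entropy(values):
    """Empirical entropy (in bits) of the observed values of one traffic feature."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * log2(c / total) for c in counts.values())

concentrated = feature_entropy(["10.0.0.1"] * 8)
dispersed = feature_entropy([f"10.0.0.{i}" for i in range(8)])
```

Eight packets to a single destination give entropy 0, while eight packets to eight distinct destinations give $\log_2 8 = 3$ bits, even though the packet count (the volume) is identical in both cases.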


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Example domains and their players
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
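The two activity-monitoring steps can be sketched as follows. This is only a minimal illustration, not the method of [13]: the profile is assumed to be a mean and standard deviation, and the call counts, the threshold, and the function names are all hypothetical.

```python
from statistics import mean, stdev

def build_profile(history):
    """Step 1: summarize historical activity (e.g. calls per day) as mean and std."""
    return {"mean": mean(history), "std": stdev(history)}

def deviation_score(profile, current):
    """Step 2: number of standard deviations between current behavior and the profile."""
    if profile["std"] == 0:
        return 0.0 if current == profile["mean"] else float("inf")
    return abs(current - profile["mean"]) / profile["std"]

# Hypothetical daily call counts for "User X" in past Januaries.
history = [10, 12, 9, 11, 10, 13, 11, 10]
profile = build_profile(history)

THRESHOLD = 3.0  # assumed cut-off; in practice tuned to limit false positives
typical_day = deviation_score(profile, 12)
suspicious_day = deviation_score(profile, 40)
flags = [score > THRESHOLD for score in (typical_day, suspicious_day)]
```

Scoring a new observation is a constant-time lookup against the stored profile, which is why this style of detection suits real-time use; the cost is one profile per individual.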


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
1. N_i: degree of the node
2. E_i: number of edges in the egonet
3. W_i: total weight of the egonet
4. λ_i: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional feature space
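A small sketch of the first three egonet features, assuming an undirected graph without self-loops stored as a dictionary from node pairs to weights. The graph and the function name are hypothetical, and the eigenvalue feature is left out to keep the example dependency-free.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for `node`.

    `edges` maps (u, v) tuples to weights; each undirected edge appears once.
    The egonet is the subgraph induced by the node and its direct neighbors.
    """
    neighbors = set()
    for u, v in edges:
        if u == node:
            neighbors.add(v)
        elif v == node:
            neighbors.add(u)
    members = neighbors | {node}
    ego_weights = [w for (u, v), w in edges.items() if u in members and v in members]
    return len(neighbors), len(ego_weights), sum(ego_weights)

# Hypothetical call graph: "hub" talks to four nodes; "a" and "b" also talk to each other.
edges = {
    ("hub", "a"): 1.0,
    ("hub", "b"): 1.0,
    ("hub", "c"): 1.0,
    ("hub", "d"): 1.0,
    ("a", "b"): 2.0,
}

hub_features = egonet_features(edges, "hub")
a_features = egonet_features(edges, "a")
d_features = egonet_features(edges, "d")
```

Each node becomes one feature tuple, so any point-based anomaly detector can then be run on the result.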


Identifying Anomalies

Traditional anomaly detection (LOF)
• Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
• Identify relevant feature pairs and their power-law relationship
• E.g., Egonet Density Power Law, N_i vs. E_i: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
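The score can be evaluated directly once the power law has been fitted. In this sketch the fitted constants C and θ and the example nodes are assumed values chosen for illustration, not results from [1].

```python
import math

def oddball_score(x, y, C, theta):
    """Deviation of an observed y from the fitted power law y = C * x**theta.

    Assumes x, y, and C are positive so that the ratio is well defined.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# Hypothetical fit of E_i against N_i.
C, theta = 1.0, 1.5
expected_edges = C * 10 ** theta                      # about 31.6 edges for N_i = 10

on_line = oddball_score(10, expected_edges, C, theta)  # node exactly on the fitted line
near_clique = oddball_score(10, 45, C, theta)          # more edges than expected
star = oddball_score(10, 10, C, theta)                 # only the ego's own edges
```

A node on the fitted line scores zero, and the score grows with both the ratio and the absolute distance from the line.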


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node a ∈ V1, find the "relevance" of all other nodes in V1 to a
• RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node t ∈ V2
• Find the set S_t = {a | ⟨a, t⟩ ∈ E}
• Compute the |S_t| × |S_t| similarity matrix using the relevance vectors for a ∈ S_t
• Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} connected by edges E


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
• Anthrax attack

Data
• Emergency department visits
• Grocery data (Example: [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data
  • ED cases involving cough ⇒ Flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample DB_baseline from the BN, conditioned on the current environment
3. Compute contingency tables for rules over DB_baseline and DB_current
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
   • Null hypothesis: rows and columns of the table for DB_baseline and DB_current are independent
5. Find the rule with the smallest p-value (the most significant difference). Repeat from Step 2
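Step 4 can be illustrated for a single rule with a 2×2 contingency table (dataset × whether the rule matches). The sketch below uses the plain Pearson statistic without continuity correction, and the counts are invented; it is not the WSARE implementation.

```python
import math

def chi2_p_value_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value). For 1 degree of freedom the chi-square
    survival function equals erfc(sqrt(x / 2)), so only the standard library
    is needed.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical rule "symptom = respiratory".
# Rows: DB_current, DB_baseline. Columns: rule matches, rule does not match.
stat_shift, p_shift = chi2_p_value_2x2(30, 70, 100, 900)   # 30% today vs 10% baseline
stat_same, p_same = chi2_p_value_2x2(10, 90, 100, 900)     # 10% in both datasets
```

A small p-value means the rule's match rate differs between today's records and the baseline, which is what makes a rule a candidate for reporting.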


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Needs more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
• Mean of all data for location i
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  • 30-year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
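To make the choice of base concrete, the sketch below compares two of the options listed above on a tiny made-up temperature series: the mean of all data versus the monthly mean. Function names and values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def overall_anomalies(series):
    """v' = v - b, with b the mean of all values for the location."""
    base = mean(value for _, value in series)
    return [value - base for _, value in series]

def monthly_anomalies(series):
    """v' = v - b, with b the long-term mean of the same calendar month."""
    by_month = defaultdict(list)
    for month, value in series:
        by_month[month].append(value)
    base = {month: mean(values) for month, values in by_month.items()}
    return [value - base[month] for month, value in series]

# Hypothetical (month, temperature) readings for one location over two years.
series = [(1, 2.0), (7, 20.0), (1, 4.0), (7, 24.0)]

from_overall_mean = overall_anomalies(series)   # seasonal swing still dominates
from_monthly_mean = monthly_anomalies(series)   # seasonality removed
```

With the overall mean as base, every January looks anomalously cold; with the monthly mean, only the year-to-year departures remain.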


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
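The two-step approach can be sketched without Matlab as a threshold followed by a breadth-first search over a (time, row, column) grid. This is an illustrative stand-in, not the code behind the video: the grid values, the threshold, the minimum size, and the choice of 6-connectivity are all assumptions.

```python
from collections import deque

# Face-adjacent neighbors in (time, row, col); an assumed connectivity choice.
OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def connected_components(cells):
    """Group (t, r, c) cells into connected components via breadth-first search."""
    remaining = set(cells)
    components = []
    while remaining:
        seed = remaining.pop()
        component, queue = [seed], deque([seed])
        while queue:
            t, r, c = queue.popleft()
            for dt, dr, dc in OFFSETS:
                neighbor = (t + dt, r + dr, c + dc)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.append(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def drought_events(precip, threshold, min_size):
    """Step 1: threshold precipitation anomalies. Step 2: keep large components."""
    dry = [(t, r, c)
           for t, grid in enumerate(precip)
           for r, row in enumerate(grid)
           for c, value in enumerate(row)
           if value < threshold]
    return [comp for comp in connected_components(dry) if len(comp) >= min_size]

# Hypothetical precipitation anomalies: 3 time steps on a 2 x 3 grid.
precip = [
    [[-2.0, 0.1, 0.3], [0.2, 0.0, 0.4]],
    [[-1.8, -1.5, 0.2], [0.1, 0.3, 0.0]],
    [[0.2, -1.7, 0.1], [0.0, 0.2, -1.9]],
]
events = drought_events(precip, threshold=-1.0, min_size=3)
```

The isolated dry cell in the last time step is dropped by the size filter, while the dry patch that persists and drifts across three time steps is kept as one event.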


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology
1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Anomaly Detection with Mahalanobis Distance

The key observation is that if data x follows a d-dimensional Gaussian distribution, then

$(x - \mu)' \Sigma^{-1} (x - \mu) \sim \chi^2_d$

Anomalies can be found in the tail of the distribution

There are three major weaknesses of the above approach
• Data may not follow a Normal distribution, or may be a mixture of distributions
• The mean of $\chi^2_d$ is d and its variance is 2d, so both grow with the dimension. For high-dimensional data this is a problem
• The sample mean, and thus the variance, are extremely sensitive to outliers, and we are using them to find anomalies. This often leads to false negatives
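A minimal two-dimensional sketch of the test: compute the squared Mahalanobis distance from the sample mean and compare it with a chi-square quantile. The data points are made up, and the 0.999 cut-off is an assumed choice; for d = 2 the chi-square survival function is exp(-x/2), so the quantile has a closed form.

```python
import math
from statistics import mean

def mahalanobis_sq_2d(points, x):
    """Squared Mahalanobis distance of x from the sample mean of 2-D `points`."""
    n = len(points)
    mx = mean(p[0] for p in points)
    my = mean(p[1] for p in points)
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = x[0] - mx, x[1] - my
    # Quadratic form with the 2x2 inverse covariance written out explicitly.
    return (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det

# Hypothetical data: wide spread along the first axis, narrow along the second.
points = [(-4.0, 0.0), (4.0, 0.0), (0.0, -1.0), (0.0, 1.0), (-2.0, 0.0), (2.0, 0.0)]

# 0.999 quantile of chi-square with d = 2 degrees of freedom (about 13.8).
threshold = -2 * math.log(1 - 0.999)

along_spread = mahalanobis_sq_2d(points, (3.0, 0.0))    # Euclidean distance 3
across_spread = mahalanobis_sq_2d(points, (0.0, 3.0))   # Euclidean distance 3
```

Both query points are equally far from the mean in Euclidean terms, but only the one that leaves the data along its narrow direction exceeds the threshold.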


Mahalanobis vs Euclidean Distance

Mahalanobis normalizes for variance

Figure: scatter plot over the range −10 to 10 on both axes, with points 14, 16, 29, and 61 highlighted

Point Pairs    Mahalanobis    Euclidean
(14, 29)       5.07           11.78
(16, 61)       4.83           6.84


Distance-based anomalies

Intuition: A data point which is far away from its nearest neighbors is a candidate anomaly

Several definitions capture the above intuition

DB(p, D) anomaly [20]: an object o in a data set T is a DB(p, D) anomaly if at least a fraction p of the objects in T have distances greater than D from o

Generalizes the notion of "three standard deviations from the mean"

This definition had a huge influence on subsequent development in outlier detection


DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution N(0, 1). Then, if O is a DB(p, D) outlier,

$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2}\, dx \le 1 - p$

Example: If O is 3 (3 standard deviations away from the mean), then it is a DB(0.999, 0.1) outlier, i.e. p = 0.999 and D = 0.1

Thus, for particular settings of D and p, DB(p, D) captures standard outliers

But it is much more general (e.g., any distance metric)
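The example can be checked numerically by evaluating the integral with the Normal CDF. The helper names below are made up for this sketch, and the parameter reading (p = 0.999 as the fraction, D = 0.1 as the distance) is the one under which the inequality is informative.

```python
import math

def normal_cdf(x):
    """CDF of the standard Normal distribution N(0, 1)."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def is_db_outlier(o, p, D):
    """True if at most a fraction 1 - p of the N(0, 1) mass lies within D of o."""
    mass_within_D = normal_cdf(o + D) - normal_cdf(o - D)
    return mass_within_D <= 1 - p

mass_near_3 = normal_cdf(3.1) - normal_cdf(2.9)   # roughly 0.0009

three_sigma = is_db_outlier(3.0, p=0.999, D=0.1)
two_sigma = is_db_outlier(2.0, p=0.999, D=0.1)
at_mean = is_db_outlier(0.0, p=0.999, D=0.1)
```

Less than 0.1% of the probability mass lies within 0.1 of the point 3, so more than 99.9% of the data is farther than D away, matching the definition.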


Distance-based methods (DB(k, N))

DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest

Several advantages
• Ranking for anomalies is more intuitive
• Setting of parameters is generally easier

A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Time complexity is $O(n^2 d)$, where n is the database size and d is the dimensionality


Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if the distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far

A large number of non-anomalies can be pruned without carrying out a full data search

Complexity: nearly O(n)
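The pruning rule can be sketched as a nested loop that abandons a point as soon as its running k-th nearest-neighbor distance falls below the current cut-off. This is a simplified illustration in the spirit of [3], not the authors' implementation; the data set, the parameter values, and the function name are assumptions.

```python
import heapq
import math
import random

def top_n_outliers(data, k, n_top, seed=0):
    """Top-N points ranked by distance to their k-th nearest neighbor, with pruning.

    Returns (ranked list of (score, index), number of pruned points).
    """
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)   # random scan order helps early pruning
    top = []          # min-heap of (score, index); top[0] is the weakest anomaly so far
    cutoff = 0.0
    pruned = 0
    for i in range(len(data)):
        knn = []      # negated distances: a max-heap holding the k smallest so far
        candidate = True
        for j in order:
            if j == i:
                continue
            d = math.dist(data[i], data[j])
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # The running k-th NN distance can only shrink, so it bounds the score.
            if len(knn) == k and -knn[0] < cutoff:
                candidate = False
                pruned += 1
                break
        if candidate and len(knn) == k:
            score = -knn[0]
            if len(top) < n_top:
                heapq.heappush(top, (score, i))
            elif score > top[0][0]:
                heapq.heapreplace(top, (score, i))
            if len(top) == n_top:
                cutoff = top[0][0]
    return sorted(top, reverse=True), pruned

# Hypothetical data: a 5 x 5 grid cluster plus two far-away points (indices 25, 26).
data = [(float(x), float(y)) for x in range(5) for y in range(5)]
data += [(20.0, 20.0), (-15.0, 10.0)]
K = 3
top, pruned = top_n_outliers(data, k=K, n_top=2)
```

Pruning never discards a true top-N point, because the running k-th neighbor distance is an upper bound on the final score and the cut-off only increases.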


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable to large datasets (nearly O(n) with the pruning rule)

Capable of finding only global anomalies

Can lead to non-intuitive results in Top-k situations

Figure: scatter plot (dimension 1 vs. dimension 2, both axes from 0 to 70) with labeled points p1, p2, p3, p4


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours

$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)$

$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$

$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$

LOF indicates a degree of local outlier-ness [6]
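The three formulas translate almost line by line into code. This sketch is a simplification rather than LOF as defined in [6]: dist_k(p, q) is read here as the plain Euclidean distance (LOF proper uses a reachability distance), neighbors are found by brute force, and the data set is invented.

```python
import math

def knn(data, i, k):
    """Indices of the k nearest neighbors of point i (brute force)."""
    others = [j for j in range(len(data)) if j != i]
    return sorted(others, key=lambda j: math.dist(data[i], data[j]))[:k]

def density(data, i, k):
    """Inverse of the average distance from point i to its k nearest neighbors."""
    nbrs = knn(data, i, k)
    return len(nbrs) / sum(math.dist(data[i], data[j]) for j in nbrs)

def anomaly_score(data, i, k):
    """1 / relative-density: mean neighbor density divided by the point's own density."""
    nbrs = knn(data, i, k)
    neighbor_density = sum(density(data, j, k) for j in nbrs) / len(nbrs)
    return neighbor_density / density(data, i, k)

# Hypothetical data: a dense cluster, a sparse cluster, and one point near the dense one.
dense = [(float(x), float(y)) for x in range(4) for y in range(4)]
sparse = [(50.0 + 5 * x, 5.0 * y) for x in range(4) for y in range(4)]
local_outlier = (6.0, 1.0)
data = dense + sparse + [local_outlier]

scores = [anomaly_score(data, i, k=3) for i in range(len(data))]
most_anomalous = max(range(len(data)), key=scores.__getitem__)
```

Points inside the sparse cluster score close to 1 because their neighbors are equally sparse, whereas the point next to the dense cluster is flagged even though its absolute distances are smaller than those within the sparse cluster.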


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

We require a method combining the strengths of distance-based and density-based approaches: a distance-based approach which can capture density

Figure: two clusters, C1 and C2, with outliers O1 and O2


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$

$c_{ij} = V_G \,(e_i - e_j)^T L^{+} (e_i - e_j)$

$L^{+}$: pseudo-inverse of $L$; $V_G$: graph volume; $e_i$: $i$-th column of the identity matrix

Commute time is a squared Euclidean distance in the space spanned by the eigenvectors of $L$

$c_{ij} = V_G \,[(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$

$V$, $S$: eigenvectors and eigenvalues of $L$
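A minimal NumPy sketch of the pseudo-inverse formula for commute time. It assumes a small, connected, undirected graph supplied as a dense symmetric weight matrix; the name `commute_times` and the use of a dense `numpy.linalg.pinv` are illustrative choices, not the tutorial authors' implementation.

```python
import numpy as np

def commute_times(W):
    """Return the matrix of commute times c_ij for a weighted undirected graph.

    W is a dense symmetric (n, n) weight matrix (hypothetical input format).
    """
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W          # graph Laplacian
    L_plus = np.linalg.pinv(L)        # Moore-Penrose pseudo-inverse of L
    volume = degrees.sum()            # graph volume V_G
    diag = np.diag(L_plus)
    # (e_i - e_j)^T L^+ (e_i - e_j) = L^+_ii + L^+_jj - 2 L^+_ij
    return volume * (diag[:, None] + diag[None, :] - 2.0 * L_plus)

# Path graph 0 - 1 - 2 with unit weights
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
C = commute_times(W)
```

On this path graph the volume is 4 and the effective resistance between the two end nodes is 2, so `C[0, 2]` should come out close to 8.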


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute-time space, with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear-time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers. Also see work by Iannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

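Under certain assumptions about the intrinsic dimensionality ($c$), with high probability

$\dfrac{1-\varepsilon}{1+\varepsilon} \cdot LOF(p) \le \widehat{LOF}(p) \le \dfrac{1+\varepsilon}{1-\varepsilon} \cdot LOF(p)$

where $\widehat{LOF}(p)$ denotes the LOF value estimated by PINN and $LOF(p)$ the exact value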

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms

Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms:

Utilization of device memory

Minimize CPU↔GPU transfer of data

Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model (the host launches a kernel on the device as a grid of thread blocks, each containing threads) and device memory hierarchy (per-SM registers and shared memory; global, texture, and constant memory)]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j into shared memory

Data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
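The block-wise scheme can be prototyped on the CPU before porting it to CUDA. The sketch below is a NumPy stand-in, not GPU code: `db_k_n` and the `block` parameter are hypothetical names, and each loop iteration plays the role of one tile of rows of the pairwise distance matrix.

```python
import numpy as np

def db_k_n(X, k, N, block=256):
    """Return indices and scores of the N points with the largest
    distance to their k-th nearest neighbor (requires k <= n - 1)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    kth = np.empty(n)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Distances from this block of rows to every point
        diff = X[start:stop, None, :] - X[None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=2))
        # Exclude each point's zero distance to itself
        D[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # k-th smallest distance per row, without a full sort
        kth[start:stop] = np.partition(D, k - 1, axis=1)[:, k - 1]
    top = np.argsort(-kth)[:N]
    return top, kth[top]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]])
top, scores = db_k_n(X, k=1, N=1, block=2)
```

Here the isolated point at (10, 10) is the single reported anomaly. A GPU version would assign one thread per pair inside each tile, as the slide describes.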

[Figure: block-wise pairwise distance computation; axis labels n, d, T]


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$)

Choose high-confidence rules ($P(Y|X)$)

For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$

Flag an anomaly when $Y$ is not observed although $X$ is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute $P(Z)$ for test data record $Z$

Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \dfrac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$

$A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$:

For each mutually exclusive pair of attribute sets $A$, $B$, compute $r(a_t, b_t)$

Score $t$ based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets:

Only consider subsets up to size $k$

Ignore subsets with frequency less than a threshold $\alpha$

Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\dfrac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \dfrac{C(a_t, b_t)}{N} \times \dfrac{N}{C(a_t)} \times \dfrac{N}{C(b_t)}$

$C(a_t)$: number of training instances with $A = a_t$

$N$: total number of training instances

Laplace Smoothing

$E(p) = \dfrac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \dfrac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low-arity attributes
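The Laplace-smoothed r-value can be illustrated with plain counting. This is a simplified sketch in which A and B are single attributes rather than attribute sets, records are tuples, and the names `laplace` and `r_value` are hypothetical; it scans the training data directly instead of using AD Trees.

```python
def laplace(count, n):
    """Laplace-smoothed probability estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(train, i, j, a, b):
    """r(a, b) for value a of attribute i and value b of attribute j."""
    n = len(train)
    c_a = sum(1 for rec in train if rec[i] == a)
    c_b = sum(1 for rec in train if rec[j] == b)
    c_ab = sum(1 for rec in train if rec[i] == a and rec[j] == b)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))

# Toy training set: (cap-shape, habitat) pairs that always co-occur
train = [("convex", "urban")] * 4 + [("bell", "meadows")] * 4

r_seen = r_value(train, 0, 1, "convex", "urban")      # common combination
r_unseen = r_value(train, 0, 1, "convex", "meadows")  # never seen together
```

With these counts, the familiar pair gets r = 2.0 and the unseen pair gets r = 0.4, so the unseen combination is the one that lowers a record's score.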


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd webmail ssh webmail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure

Plug into a distance/density/clustering-based method

Simplest: Hamming distance

$h(A_i, B_i) = \begin{cases} 1, & A_i \ne B_i \\ 0, & A_i = B_i \end{cases}$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$D(A, B) = 1 - \dfrac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$

Standard dynamic programming method is slow

Faster versions available (Hunt-Szymanski method [7])

Weaknesses:

Cannot localize anomalies within a sequence

Weak anomaly signals might get lost
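Both sequence distances are short to write down. The sketch below uses the standard quadratic dynamic program for the LCS length (the slow baseline, not the Hunt-Szymanski method); the function names are illustrative, and both sequences are assumed to be non-empty.

```python
import math

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a||b|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = ["login", "pwd", "mail", "ssh", "mail", "web", "logout"]
s2 = ["login", "pwd", "mail", "web", "web", "web", "logout"]
d_hamming = hamming(s1, s2)
d_nlcs = nlcs_distance(s1, s2)
```

Unlike the Hamming distance, `nlcs_distance` accepts sequences of different lengths, which addresses the first issue listed on the slide.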


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences, and their counts

Testing:

For each window, find its frequency in the normal dictionary

Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist:

For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:

Penalizes low-frequency windows in the normal dictionary

Rewards high-frequency windows that might not be relevant

Can construct anomalous sequences that will escape detection
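One way to read the training and testing steps is sketched below. The exact aggregation is an assumption made here: the score is the inverse of the mean relative frequency of a test sequence's windows, with a small `eps` (hypothetical parameter) to avoid division by zero when no window was seen in training.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Normal dictionary: window -> count over all normal sequences."""
    counts = Counter()
    for s in normal_seqs:
        counts.update(windows(s, k))
    return counts

def anomaly_score(seq, counts, k, eps=1e-9):
    """Inverse of the mean relative window frequency (illustrative choice)."""
    ws = windows(seq, k)
    if not ws:
        raise ValueError("sequence shorter than window size")
    total = sum(counts.values())
    mean_freq = sum(counts[w] for w in ws) / (total * len(ws))
    return 1.0 / (mean_freq + eps)

normal = [["login", "pwd", "mail", "web", "logout"]] * 3
counts = build_dictionary(normal, k=2)
score_normal = anomaly_score(normal[0], counts, k=2)
score_odd = anomaly_score(["login", "pwd", "login", "pwd", "logout"], counts, k=2)
```

The repeated-login sequence contains windows that never occur in the dictionary, so it receives the higher score.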


Using Probabilistic Models

Probability of occurrence of sequence $S$

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \dfrac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

$f$ is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace it with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
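A fixed-order version of this model fits in a few lines. The sketch below is not a Probabilistic Suffix Tree: it uses a fixed context length k and handles unseen contexts with add-one smoothing, which is an assumption made here for simplicity. The score is the average negative log-probability per symbol, one possible normalization.

```python
import math
from collections import Counter

def train_counts(normal_seqs, k):
    """Count k-symbol contexts and (k+1)-symbol continuations."""
    ctx, full = Counter(), Counter()
    for s in normal_seqs:
        for i in range(k, len(s)):
            ctx[tuple(s[i - k:i])] += 1
            full[tuple(s[i - k:i + 1])] += 1
    return ctx, full

def neg_log_prob_per_symbol(seq, ctx, full, k, alphabet_size):
    """Average -log P(S_i | S_{i-k} ... S_{i-1}) over the sequence."""
    total, steps = 0.0, 0
    for i in range(k, len(seq)):
        c = ctx[tuple(seq[i - k:i])]
        f = full[tuple(seq[i - k:i + 1])]
        p = (f + 1) / (c + alphabet_size)   # add-one smoothing (assumption)
        total -= math.log(p)
        steps += 1
    return total / max(steps, 1)

normal = [list("abababab")] * 5
ctx, full = train_counts(normal, k=1)
score_normal = neg_log_prob_per_symbol(list("abab"), ctx, full, 1, alphabet_size=2)
score_odd = neg_log_prob_per_symbol(list("aaaa"), ctx, full, 1, alphabet_size=2)
```

The sequence `aaaa` uses a transition never seen in training, so its score is much higher than that of `abab`.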


Finding Contextual Anomalies

Sometimes contextual information is available about data:

Not used directly as a feature

Well understood; no anomalies in the context

Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: number of fever cases versus maximum daily temperature, with points labeled "Clearest Outlier" and "Conditional Anomaly"]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables:

Spatial (latitude, longitude)

Graph context (edges, weights)

Temporal location

Domain specific (demographic, other)

How to incorporate context?

Reduce to traditional anomaly detection (subset on context)

Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

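Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: environmental components $U_1, U_2, U_3$ in the $(u_1, u_2)$ plane mapped to indicator components $V_1, V_2, V_3$ in the $(v_1, v_2)$ plane, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j|U_i)$

4. Score a test instance $d = [u, v]$

$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\, p(V_j|U_i)$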


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

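Find anomalous groups in data [34]

Example: spatial clusters of galaxies

Topics: red, green, emissive

Words: continuous features

Flexible Genre Model (FGM)

For each group:

1. Draw a genre: $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$

2. Draw a topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$

3. Draw $K$ topics: $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$

4. For each point in the group:

4.1 Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$

4.2 Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of FGM with variables $\alpha$, $\pi$, $\eta$, $y_m$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$ and plates $N$, $M$, $K$, $T$]

Model Parameters

$\mathcal{M}(\pi)$: Multinomial

Genre: $Dir(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$: GIW

Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian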


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:

Confusion matrix

Traditional evaluation metrics: class imbalance

ROC curve

Validation set does not exist:

Use domain expertise to find TP and FP

FN is harder to estimate: pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$Acc = \dfrac{TP + TN}{TP + FN + FP + TN}$

$Rec\ (R) = \dfrac{TP}{TP + FN}$

$Prec\ (P) = \dfrac{TP}{TP + FP}$

$F = \dfrac{2 \cdot R \cdot P}{R + P}$
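The metrics follow directly from the four confusion-matrix counts. A small sketch, assuming labels are the strings "a" (anomalous) and "n" (normal); the function names are illustrative, and undefined ratios are returned as 0.0 by convention here.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN) with the anomalous class as positive."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f

actual    = ["a", "a", "a", "n", "n", "n", "n", "n", "n", "n"]
predicted = ["a", "a", "n", "a", "n", "n", "n", "n", "n", "n"]
counts = confusion_counts(actual, predicted)
acc, rec, prec, f = metrics(*counts)
```

In this example accuracy is 0.8 while recall and precision are both 2/3, which illustrates why accuracy alone is a weak metric under class imbalance.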


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:

Use threshold $\delta$ on score (scale issues [21])

Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output:

Does the output "suggest" $x$ or $\delta$?

Which output is better?


Unifying Scores

Different methods assign scores in different ranges:

kNN-based scores: [0, 1]

LOF scores: [1, 6]

ABOD scores: [0, 80000]; anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalous instances

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous

Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization

1. $N(o) = \dfrac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0,\ \operatorname{erf}\left(\dfrac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)

Suited for high dimensional data

3. $N(o) = \max\left(0,\ \dfrac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling)

where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function

Suited for low dimensional data
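Linear and Gaussian scaling need only the standard library. The sketch below estimates the mean and standard deviation from the scores themselves (population standard deviation, an assumption made here) and expects scores that are already regular and not all identical.

```python
import math

def linear_scaling(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0))))
            for s in scores]

lof_like = [1.0, 1.0, 1.0, 1.0, 6.0]
lin = linear_scaling(lof_like)
gau = gaussian_scaling(lof_like)
```

Gaussian scaling maps every score at or below the mean to 0 and pushes the single large score close to 1, whereas linear scaling leaves the normal points at about 0.17.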


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available:

Multivariate continuous data [27]

Multivariate categorical data [5]

Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane, with the injected anomaly marked]
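The perturbation steps as listed on the slide can be sketched as follows. The data layout is an assumption: each record is a pair `(x, y)` of numeric tuples, `inject_anomaly` is a hypothetical name, and Euclidean distance is used via `math.dist` (Python 3.8+).

```python
import math
import random

def inject_anomaly(data, z, sample_size, rng):
    """Return a perturbed copy of z = (x, y) following the slide's steps."""
    x, y = z
    sample = rng.sample(data, sample_size)          # random sample D
    # z' = (x', y') in D whose y' is farthest from y
    x_far, _ = max(sample, key=lambda rec: math.dist(rec[1], y))
    return (x_far, y)                               # replace x with x'

data = [((0.0,), (0.0,)), ((1.0,), (1.0,)), ((9.0,), (9.0,))]
rng = random.Random(0)
injected = inject_anomaly(data, data[0], sample_size=len(data), rng=rng)
```

Here the whole data set is sampled, so the result is deterministic: the record keeps its own `y` but receives the `x` of the most distant record, producing an unusual combination.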


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available?

Feature identification

Defining normal and anomalous behavior

Tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart, values by period]

2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836


Network Topology

A typical network

[Figure: a router connecting nodes 1-4, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

A link measurement is the traffic that enters a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus, a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, flag it as an "anomaly"
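The five steps map onto a short NumPy routine. This sketch assumes observations are stored as rows (one row per time step, one column per link), so the projection is applied from the right; the mean-centering step and the name `subspace_residual_norms` are choices made here, and the threshold in Step 5 is left to the caller.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Norm of each observation's component outside the top-k PC subspace."""
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=0)                       # center the data
    cov = np.cov(Zc, rowvar=False)                # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]] # Step 2: top-k PCs
    normal = Zc @ P @ P.T                         # Step 3: normal subspace part
    residual = Zc - normal                        # Step 4: abnormal subspace part
    return np.linalg.norm(residual, axis=1)       # Step 5: compare to a threshold

# 21 points on the line y = x plus one point off the line (last row)
line = np.array([[t, t] for t in range(-10, 11)], dtype=float)
Z = np.vstack([line, [[1.0, -3.0]]])
norms = subspace_residual_norms(Z, k=1)
suspect = int(np.argmax(norms))
```

The off-line point has by far the largest residual norm, so `suspect` is the index of the last row.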


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

• Port scan anomaly: the traffic feature distribution changes, but the traffic volume remains more or less the same

• We can use entropy to capture the variations in the traffic feature

  » Takes the value 0 when the distribution is maximally concentrated

  » Takes the value $\log_2 N$ when the distribution is maximally dispersed

[Figure: a port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source: [22]]
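The two boundary cases of entropy can be checked with a minimal Python sketch. The function name `sample_entropy` and the port values are illustrative assumptions, not taken from [22]; the entropy is the usual empirical (sample) entropy in bits.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy (in bits) of the observed feature values."""
    counts = Counter(values)
    total = sum(counts.values())
    # p * log2(1/p) summed over the distinct observed values
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical destination ports observed in one time bin.
concentrated_ports = [80] * 8               # all traffic on a single port
dispersed_ports = list(range(1000, 1008))   # N = 8 ports, one packet each

h_concentrated = sample_entropy(concentrated_ports)  # maximally concentrated
h_dispersed = sample_entropy(dispersed_ports)        # maximally dispersed
```

A port scan moves the destination-port distribution toward the dispersed case, so its entropy rises even when the packet count barely changes.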


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

• Scan anomaly: a spike in traffic from a dominant source IP

• Flash crowd anomaly: again a spike in traffic to a dominant destination IP

• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data:

• Transactions between different players in the system

• Meta information about the individuals

• An underlying graph structure

Challenges:

• Track and model human behavior

• Anomalies caused by adaptive human adversaries

• Massive data sizes

Domains and players:

Insurance (auto/health [a]):   Claimant, Provider, Payer
Telecommunications:            Customer, Provider
Credit Cards:                  Customer, Supplier, Bank
Web Advertising:               User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data (e.g., User X makes n calls on average in January)

2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers

2. Identify small clusters or outlying profiles as anomalies

Strengths:

• Anomaly detection is fast (good for real time)

• Results are easy to explain

Weaknesses:

• Need to create and maintain a large number of profiles

• Not dynamic

• Adequate historical data might not be available

• Too many false positives
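The activity-monitoring idea can be sketched as a per-user profile plus a deviation test. This is a minimal sketch, not the method of [13]: the profile (mean and standard deviation), the z-score test, the threshold of 3, and the call counts are all illustrative assumptions.

```python
from statistics import mean, stdev

def build_profile(history):
    """Summarize historic activity as (mean, standard deviation)."""
    return mean(history), stdev(history)

def deviation_score(profile, current):
    """Number of standard deviations between current and historic behavior."""
    mu, sigma = profile
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

# Hypothetical January call counts for one user over five past years.
january_calls = [40, 42, 38, 41, 39]
profile = build_profile(january_calls)

THRESHOLD = 3.0  # assumed alert threshold
usual_month_alert = deviation_score(profile, 41) > THRESHOLD
unusual_month_alert = deviation_score(profile, 95) > THRESHOLD
```

Scoring a new observation is a constant-time lookup, which is why this style of method suits real-time use, while the need to store one profile per user reflects the maintenance cost listed under weaknesses.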


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph

• Communication networks (phone, email, SMS)

• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node i

2. $E_i$: number of edges in the egonet of node i

3. $W_i$: total weight of the egonet of node i

4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet of node i

Each node is thereby transformed into a point in a multi-dimensional feature space
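A minimal Python sketch of the first three egonet features follows. The edge representation (a dict from `frozenset({u, v})` to weight) and the function name are assumptions made for illustration; the principal eigenvalue is omitted to keep the example dependency-free.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for `node` in an undirected weighted graph.

    edges: dict mapping frozenset({u, v}) -> weight (assumed representation).
    """
    neighbors = set()
    for edge in edges:
        if node in edge and len(edge) == 2:
            neighbors |= edge - {node}
    ego = neighbors | {node}

    n_i = len(neighbors)   # degree of the node
    e_i = 0                # edges whose endpoints both lie in the egonet
    w_i = 0.0              # total weight of those edges
    for edge, weight in edges.items():
        if len(edge) == 2 and edge <= ego:
            e_i += 1
            w_i += weight
    return n_i, e_i, w_i

# Hypothetical graphs: a star centered on node 0, and a triangle with a tail.
star = {frozenset({0, leaf}): 1.0 for leaf in (1, 2, 3)}
triangle_with_tail = {
    frozenset({0, 1}): 1.0,
    frozenset({0, 2}): 1.0,
    frozenset({1, 2}): 1.0,
    frozenset({2, 3}): 2.0,
}

star_features = egonet_features(star, 0)
triangle_features = egonet_features(triangle_with_tail, 0)
```

In the star, the number of egonet edges equals the number of neighbors, whereas in a clique it grows quadratically; this contrast is what the feature pair ($N_i$, $E_i$) exposes.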


Identifying Anomalies

• Traditional anomaly detection (e.g., LOF): can be slow, but can identify any type of anomalous structure

• Faster method to identify specific types of anomalous structures:

  » Identify relevant feature pairs and their power-law relationship

  » E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \qquad 1 \le \alpha \le 2$$

Anomaly score for node i with respect to a pair of features related by $y = C x^{\theta}$:

$$S_i = \frac{\max(y_i,\, C x_i^{\theta})}{\min(y_i,\, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
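The score can be evaluated directly once the power-law parameters are known. In this minimal sketch, C and θ are assumed to have been fitted beforehand (for example by a line fit in log-log space), the logarithm is assumed to be natural, and the numeric values are hypothetical.

```python
import math

def power_law_anomaly_score(y, x, C, theta):
    """Distance of an observed (x, y) pair from the fitted law y = C * x**theta.

    Assumes y > 0 and C * x**theta > 0, and uses the natural logarithm.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# Hypothetical fit E = 1.0 * N**2 (clique-like egonets).
on_the_line = power_law_anomaly_score(y=4, x=2, C=1.0, theta=2)
off_the_line = power_law_anomaly_score(y=8, x=2, C=1.0, theta=2)
```

A point that lies exactly on the fitted line scores 0, and the score grows with both the ratio and the absolute gap between the observed and the expected value.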


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph

• Healthcare data (Beneficiaries vs. Providers)

• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a

  » RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

• Use the relevance scores to compute the normality score for a node $t \in V_2$

  » Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$

  » Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$

  » Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by the edge set E]
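The relevance and normality computations can be sketched as follows. This is one possible reading, under stated assumptions: relevance is approximated by a random walk with restart (restart probability 0.15, iterated to convergence), and the similarity matrix entry for (a, b) is taken to be the relevance of b to a. The graph and all names are hypothetical.

```python
def build_graph(edges):
    """Undirected adjacency sets from (a, t) pairs of a bipartite graph."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, set()).add(t)
        adj.setdefault(t, set()).add(a)
    return adj

def relevance(adj, start, restart=0.15, iters=200):
    """Visit probabilities of a random walk with restart from `start`."""
    score = {v: 0.0 for v in adj}
    score[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, nbrs in adj.items():
            share = (1.0 - restart) * score[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        nxt[start] += restart   # jump back to the query node
        score = nxt
    return score

def normality_score(adj, t):
    """Mean pairwise relevance among the V1 nodes linked to t."""
    members = sorted(adj[t])
    if len(members) < 2:
        return None
    rel = {a: relevance(adj, a) for a in members}
    pairs = [(a, b) for a in members for b in members if a != b]
    return sum(rel[a][b] for a, b in pairs) / len(pairs)

# Hypothetical data: t1 and t2 sit inside one community, t4 bridges two.
edges = [("a1", "t1"), ("a2", "t1"), ("a1", "t2"), ("a2", "t2"),
         ("a3", "t3"), ("a4", "t3"), ("a1", "t4"), ("a3", "t4")]
adj = build_graph(edges)
within_community = normality_score(adj, "t1")
bridging = normality_score(adj, "t4")
```

A node in $V_2$ whose neighbors are mutually irrelevant (here t4) receives a lower normality score, which marks its links as candidates for inspection.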


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks

• Anthrax attack

Data:

• Emergency department visits

• Grocery data (Example: [14])

• Clinical visits

• Weather/climate data

Challenges:

• Weak signals in the data

  » ED cases involving cough ⇒ flu or SARS

• Integration of multiple signals (lag analysis)

• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data

   » Environmental and response variables

2. Sample from the BN ($DB_{baseline}$ | current environment)

3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$

   » Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find the p-value for each rule using the $\chi^2$-test

   » Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent

5. Find the rule with the smallest p-value (the most significant deviation). Repeat from Step 2
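The rule-scoring step can be illustrated with a χ² test on a 2×2 contingency table. This is a minimal sketch, not the WSARE implementation: the counts are hypothetical, the table layout (rows for the current and baseline data, columns for records that do and do not match the rule) is an assumption, and the p-value uses the closed form for one degree of freedom.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom).

    table = [[a, b], [c, d]] with rows (current, baseline) and
    columns (rule matches, rule does not match).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0   # degenerate table: no evidence against independence
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical counts for a rule such as "symptom = respiratory".
same_rate = chi_square_2x2([[10, 90], [10, 90]])
raised_rate = chi_square_2x2([[20, 80], [5, 95]])
```

A small p-value means the rule matches recent records at a rate that is hard to explain by the baseline, which is the signal used to rank rules.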


Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model the spatial and temporal relationships

  » What happened yesterday?

  » What happened in the adjoining neighborhood (yesterday)?

• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]

  » Add nodes for temporal and spatial dependencies

  » Needs more data to train


Anomaly Detection in Climate and Weather

Science Questions:

• Identify natural and anthropogenic disasters

• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:

• Ground observations

• Remote sensing data (satellites, air-borne)

• Climate model simulation outputs

• Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:

• Model spatio-temporal relationships across multiple variables

• Explain the cause of anomalies

• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series

  » Difference from a "base period" (too simplistic)

  » Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)

  » Removes seasonality

• Understand climate and weather phenomena

  » Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

[Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk]


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?

• Mean of all data for location i

• Monthly mean values (account for seasonality)

• Monthly z-score values

• Median (more robust)

• Using a shorter "reference period"

  » 30-year moving window

Different methods show statistically significant differences [19]

• What is the right strategy?

• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
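Two of the listed choices for the base, monthly means and monthly z-scores, can be sketched in a few lines. The function name, the assumption that the series starts in January and is sampled monthly, and the toy values are all illustrative.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, zscore=False):
    """Anomaly series v' = v - b, with b the mean of the same calendar month.

    values: monthly observations for one location, assumed to start in January.
    zscore: if True, also divide by the standard deviation of that month.
    """
    anomalies = []
    for idx, v in enumerate(values):
        same_month = values[idx % 12::12]   # all observations of this month
        anomaly = v - mean(same_month)
        if zscore:
            spread = pstdev(same_month)
            anomaly = anomaly / spread if spread > 0 else 0.0
        anomalies.append(anomaly)
    return anomalies

# Hypothetical two-year record: a seasonal cycle, with year 2 warmer by 2 units.
year_one = [float(m) for m in range(12)]
year_two = [m + 2.0 for m in range(12)]
series = year_one + year_two

raw = monthly_anomalies(series)
standardized = monthly_anomalies(series, zscore=True)
```

The seasonal cycle disappears from `raw`, leaving only the year-to-year shift, which is the effect described as removing seasonality.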


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time

   » Matlab: bwlabel / bwlabeln

Follow-up science question: Explain the cause

[Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee]
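The two-step approach can be sketched without Matlab as a threshold followed by a breadth-first search over a (time, row, column) grid. This is a minimal sketch under stated assumptions: 6-connectivity (neighbors differ by one step along exactly one axis), a hypothetical threshold of -1, and toy data.

```python
from collections import deque

def drought_components(precip, threshold):
    """Connected components of cells whose anomaly is below `threshold`.

    precip[t][y][x]: precipitation anomaly at time t and grid cell (y, x).
    Returns components as lists of (t, y, x), largest first.
    """
    dry = {(t, y, x)
           for t, grid in enumerate(precip)
           for y, row in enumerate(grid)
           for x, value in enumerate(row)
           if value < threshold}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        component, queue = [], deque([start])
        while queue:
            t, y, x = queue.popleft()
            component.append((t, y, x))
            for dt, dy, dx in steps:
                nb = (t + dt, y + dy, x + dx)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical anomalies on a 2 x 3 grid over two time steps.
precip = [
    [[-2.0, -2.0, 0.0], [0.0, 0.0, 0.0]],   # t = 0
    [[-2.0, 0.0, 0.0], [0.0, 0.0, -2.0]],   # t = 1
]
components = drought_components(precip, threshold=-1.0)
```

Large components correspond to dry spells that persist across neighboring cells and consecutive time steps; isolated cells can be discarded by a minimum-size filter.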


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

• Water vapor content

• Anomalies using a threshold

• Connected components

• Example source: http://newscenter.lbl.gov

Cold Fronts [23]

• Surface winds and potential temperature fields

• Methodology:

  1. Compute features for every grid cell

  2. Cluster grid cells into K clusters

  3. Label clusters as anomalous or not using thresholds

  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:

• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)

• Anomalies are not actionable

  » Not identified in a timely fashion

  » Resolution is not fine enough

  » Cause not explained

• Anomalies lost among false positives

Solution:

• Good validation data during design

• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Mahalanobis vs Euclidean Distance

Mahalanobis normalizes for variance

[Figure: scatter plot with both axes ranging from -10 to 10; points 14, 16, 29, and 61 are labeled]

Point Pairs    Mahalanobis    Euclidean
(14, 29)       5.07           11.78
(16, 61)       4.83           6.84
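The effect of variance normalization can be reproduced on a small example. This is a minimal 2-D sketch with a hypothetical covariance matrix (variance 4 along the first axis, 1 along the second); it does not use the data behind the table.

```python
import math

def euclidean(p, q):
    return math.dist(p, q)

def mahalanobis_2d(p, q, cov):
    """Mahalanobis distance between 2-D points for a 2 x 2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]   # closed-form 2 x 2 inverse
    dx, dy = p[0] - q[0], p[1] - q[1]
    squared = (dx * (inv[0][0] * dx + inv[0][1] * dy)
               + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(squared)

cov = [[4.0, 0.0], [0.0, 1.0]]   # assumed covariance of the data
origin = (0.0, 0.0)
along_spread = (2.0, 0.0)        # displaced along the high-variance axis
across_spread = (0.0, 2.0)       # displaced along the low-variance axis

euclid = (euclidean(origin, along_spread), euclidean(origin, across_spread))
mahal = (mahalanobis_2d(origin, along_spread, cov),
         mahalanobis_2d(origin, across_spread, cov))
```

Both displacements have the same Euclidean length, but the Mahalanobis distance is smaller along the direction in which the data naturally varies more.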


Distance-based anomalies

• Intuition: a data point that is far away from its nearest neighbors is a candidate anomaly

• Several definitions capture this intuition

• DB(p, D) anomaly [20]: an object o in a data set T is a DB(p, D) anomaly if at least a fraction p of the objects in T lie at a distance greater than D from o

• Generalizes the notion of "three standard deviations from the mean"

• This definition had a huge influence on subsequent developments in outlier detection
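The DB(p, D) definition translates almost directly into code. This is a minimal sketch; the function name, the Euclidean metric, and the sample points are assumptions for illustration.

```python
import math

def is_db_outlier(o, data, p, D):
    """True if at least a fraction p of the objects in data lie farther than D from o."""
    far = sum(1 for x in data if math.dist(o, x) > D)
    return far >= p * len(data)

# Hypothetical data: nine points near the origin and one distant point.
cluster = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
distant = (10.0, 10.0)
data = cluster + [distant]

distant_is_outlier = is_db_outlier(distant, data, p=0.8, D=5.0)
cluster_point_is_outlier = is_db_outlier(cluster[0], data, p=0.8, D=5.0)
```

Any distance function can replace `math.dist`, which is what makes the definition more general than a rule based on standard deviations.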


DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution N(0, 1). Then, if O is a DB(p, D) outlier,

$$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2}\, dx \le 1 - p$$

• Example: if O is 3 (three standard deviations away from the mean), then it is a DB(0.999, 0.1) outlier, i.e., p = 0.999 and D = 0.1

• Thus, for particular settings of D and p, DB(p, D) captures standard outliers

• But it is much more general (e.g., it works with any distance metric)
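The example can be verified numerically: for N(0, 1), the probability mass within distance D = 0.1 of the point O = 3 is roughly 0.0009, which is below 1 − p = 0.001. A minimal check using only the standard library follows (function names are illustrative).

```python
import math

def normal_cdf(x):
    """Cumulative distribution function of N(0, 1)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def mass_within(o, D):
    """Probability that an N(0, 1) sample falls in [o - D, o + D]."""
    return normal_cdf(o + D) - normal_cdf(o - D)

mass_at_three = mass_within(3.0, 0.1)   # tail point
mass_at_zero = mass_within(0.0, 0.1)    # point at the mean

three_is_outlier = mass_at_three <= 1 - 0.999
zero_is_outlier = mass_at_zero <= 1 - 0.999
```

The same test at the mean fails by a wide margin, since about 8% of the mass lies within 0.1 of zero.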


Distance-based methods: DB(k, N)

• DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest

• Several advantages: ranking anomalies is more intuitive, and setting the parameters is generally easier

• A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Its time complexity is $O(n^2 d)$, where n is the database size and d is the dimensionality
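A simple nested loop for the top N DB(k, N) outliers might look as follows. This is a minimal sketch assuming Euclidean distance and more than k data points; the names and the data are illustrative.

```python
import heapq
import math

def kth_nn_distance(i, data, k):
    """Distance from data[i] to its k-th nearest neighbor (inner loop)."""
    dists = sorted(math.dist(data[i], q) for j, q in enumerate(data) if j != i)
    return dists[k - 1]

def top_n_outliers(data, k, n_top):
    """Score every point (outer loop) and keep the n_top largest scores."""
    scores = [(kth_nn_distance(i, data, k), i) for i in range(len(data))]
    return heapq.nlargest(n_top, scores)

# Hypothetical data: a tight cluster plus two isolated points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
        (10.0, 10.0), (5.0, 5.5)]
top_two = top_n_outliers(data, k=2, n_top=2)
top_two_indices = [index for _, index in top_two]
```

Every point is compared with every other point, and each comparison costs O(d), which gives the quadratic cost stated above.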


Pruning rule

• Pruning rule for DB(k, N) anomalies [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far

• A large number of non-anomalies can be pruned without carrying out a full data search

• Complexity: nearly O(n) in practice
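The pruning rule can be added to a nested loop with two heaps. This is a simplified sketch in the spirit of [3], not the published algorithm: it omits the randomization and block processing, assumes Euclidean distance and more than k points, and uses illustrative names and data.

```python
import heapq
import math

def top_n_with_pruning(data, k, n_top):
    """Top n_top points by k-th nearest neighbor distance, with early pruning."""
    top = []        # min-heap of (score, index): the current top anomalies
    cutoff = 0.0    # score of the weakest anomaly found so far
    for i, p in enumerate(data):
        nearest = []    # max-heap (negated) of the k smallest distances so far
        pruned = False
        for j, q in enumerate(data):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The current k-th NN distance only shrinks as more points are seen,
            # so once it drops below the cutoff, p cannot enter the top list.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -nearest[0]
        if len(top) < n_top:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_top:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

# Hypothetical data: a tight cluster plus two isolated points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5),
        (10.0, 10.0), (5.0, 5.5)]
pruned_top_two = top_n_with_pruning(data, k=2, n_top=2)
```

The earlier strong anomalies are found, the higher the cutoff and the sooner ordinary points are abandoned, which is why [3] processes the data in random order.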


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

• Do not make any assumption about the distribution of the data

• Scalable to large datasets (nearly O(n) with pruning)

• Capable of finding only global anomalies

• Can lead to non-intuitive results in top-k situations

[Figure: scatter plot of dimension 1 vs. dimension 2 (both from 0 to 70) with labeled points p1, p2, p3, and p4]


Density-based anomaly

Calculate the density of an object relative to the density of its k nearest neighbours:

$$\text{density}(p) = \frac{1}{\dfrac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|}}$$

$$\text{relative-density}(p) = \frac{\text{density}(p)}{\dfrac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$

LOF indicates a degree of local outlier-ness [6]
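The three formulas can be sketched directly. This is a simplified illustration, not a full LOF implementation: it assumes plain Euclidean distance for $\text{dist}_k$ rather than LOF's reachability distance, assumes no duplicate points (which would give zero distances), and uses hypothetical data.

```python
import math

def knn(i, data, k):
    """The k nearest neighbors of data[i] as (distance, index) pairs."""
    others = sorted((math.dist(data[i], q), j) for j, q in enumerate(data) if j != i)
    return others[:k]

def density(i, data, k):
    """Inverse of the mean distance to the k nearest neighbors."""
    neighbors = knn(i, data, k)
    return 1.0 / (sum(d for d, _ in neighbors) / len(neighbors))

def anomaly_score(i, data, k):
    """Average neighbor density divided by the point's own density."""
    neighbors = knn(i, data, k)
    average = sum(density(j, data, k) for _, j in neighbors) / len(neighbors)
    relative_density = density(i, data, k) / average
    return 1.0 / relative_density

# Hypothetical data: four points on a unit square and one isolated point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = [anomaly_score(i, data, k=2) for i in range(len(data))]
```

Points inside a uniformly dense region score close to 1, while a point whose neighbors are much denser than itself scores well above 1.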


Strengths and Weaknesses

• Can detect global and local anomalies

• Cannot use the pruning technique and has a complexity of $O(n^2)$

• What is required is a method combining the strengths of the distance-based and density-based approaches: a distance-based approach that can capture density

[Figure: two clusters, C1 and C2, with outliers O1 and O2]


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix L

$$c_{ij} = V_G (e_i - e_j)^T L^{+} (e_i - e_j)$$

$L^{+}$: pseudo-inverse of $L$; $V_G$: graph volume; $e_i$: i-th column of the identity matrix

Commute time is a squared Euclidean distance in the space spanned by the eigenvectors of L

$$c_{ij} = V_G \left[ (S^{-1/2} V^T)(e_i - e_j) \right]^T \left[ (S^{-1/2} V^T)(e_i - e_j) \right]$$

$V$, $S$: eigenvectors and eigenvalues of $L$
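The pseudo-inverse formula can be checked on a tiny graph. The sketch below assumes a connected, undirected graph given as a symmetric weight matrix W, and takes the graph volume V_G to be the sum of all node degrees; the function name is hypothetical.

```python
import numpy as np

def commute_times(W):
    """All-pairs commute times for a connected graph with symmetric weights W."""
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W          # graph Laplacian
    L_pinv = np.linalg.pinv(L)        # L^+
    volume = degrees.sum()            # V_G
    d = np.diag(L_pinv)
    # (e_i - e_j)^T L^+ (e_i - e_j) = L^+_ii + L^+_jj - 2 L^+_ij
    return volume * (d[:, None] + d[None, :] - 2.0 * L_pinv)

# Path graph 0 - 1 - 2 with unit weights.
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
C = commute_times(W)
```

On this path graph the random walk needs 1 expected step to go from node 0 to node 1 and 3 expected steps to return, so the commute time between them is 4, which is what the formula yields.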


Anomaly detection using commute time (CDOF)

Construct the mutual k nearest neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time space with the pruning rule

Complexity: O(n^3)

The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure where we can compute the commute time between two nodes in O(log n) time

Complexity of CDOF: O(n^3) → O(n log n)

Uses a near-linear time solver for a linear system of equations Ax = b

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from O(n^2) to nearly O(n log n) by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

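Under certain assumptions about intrinsic dimensionality (c), with high probability

$$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ is the LOF computed after random projection

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust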


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms

Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilize device memory, minimize CPU↔GPU transfer of data, keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model (the host launches a kernel as a grid of thread blocks on the device; each block holds a 2-D array of threads) and memory hierarchy (host memory; device streaming multiprocessors SM 0 to SM 2, each with stream processors SP 0 to SP 7, registers, and shared memory; global, texture, and constant memory)]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to the k-th nearest neighbor are largest (serial DB(k, N) is O(n^2))

Involves computing pairwise distances

Load block i and block j to shared memory; the data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
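For reference, the serial O(n^2) version of DB(k, N) that the GPU design parallelizes can be sketched on the CPU with NumPy. This is only the baseline computation, not the blocked CUDA kernel described on the slide; the function name and toy data are hypothetical.

```python
import numpy as np

def top_n_knn_outliers(X, k, n_top):
    """Return indices of the n_top points with the largest k-th NN distance."""
    X = np.asarray(X, dtype=float)
    sq_norms = (X ** 2).sum(axis=1)
    # Squared pairwise distances via ||a||^2 + ||b||^2 - 2 a.b
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(sq_dist, 0.0, out=sq_dist)      # clip tiny negative round-off
    np.fill_diagonal(sq_dist, np.inf)          # exclude self-distances
    # k-th smallest entry of each row = squared distance to the k-th neighbour
    kth = np.sqrt(np.partition(sq_dist, k - 1, axis=1)[:, k - 1])
    order = np.argsort(-kth)                   # largest k-th NN distance first
    return order[:n_top], kth

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]])
top, kth_dist = top_n_knn_outliers(X, k=2, n_top=1)
```

The pairwise-distance matrix is the part that maps naturally to one GPU thread per pair; the final ranking corresponds to the sorting step.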

[Figure: diagram with dimensions labeled d, n, and T]


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules (X ⇒ Y)

Choose high-confidence rules (P(Y|X))

For test record Z = ⟨X, Y⟩, find rules of the form P(Y|X)

Anomalous if Y is not observed when X is observed

Using Bayesian Networks [33]

Learn the Bayesian network structure and parameters

Compute P(Z) for test data record Z

Flag an anomaly if P(Z) < δ

Using Similarity Metrics [10]

Use a similarity measure (S(X_1, X_2))

Apply a distance/density/clustering-based method (e.g. LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}, \quad A \cap B = \emptyset$$

Assumption: If $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t:

For each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$

Score t based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare exponentially many pairs of subsets:

Only consider subsets up to size k

Ignore subsets with frequency less than a threshold α

Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$

$N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure to query for counts (AD Trees [26])

AD Trees work faster for low-arity attributes
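The Laplace-smoothed ratio can be computed directly from counts. The sketch below handles only a pair of single attributes (A and B each contain one attribute) and uses a small made-up training set in the style of the mushroom table; the helper names and the data are hypothetical, and a real implementation would use a count index such as AD Trees rather than linear scans.

```python
def laplace_estimate(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(records, attr_a, attr_b, test_record):
    """Smoothed ratio r(a_t, b_t) for two single attributes."""
    n = len(records)
    a_t, b_t = test_record[attr_a], test_record[attr_b]
    count_a = sum(1 for r in records if r[attr_a] == a_t)
    count_b = sum(1 for r in records if r[attr_b] == b_t)
    count_ab = sum(1 for r in records
                   if r[attr_a] == a_t and r[attr_b] == b_t)
    return laplace_estimate(count_ab, n) / (
        laplace_estimate(count_a, n) * laplace_estimate(count_b, n))

# Hypothetical training data: smooth caps are always edible, scaly caps poisonous.
train = ([{"cap-surface": "smooth", "type": "edible"}] * 6
         + [{"cap-surface": "scaly", "type": "poisonous"}] * 4)

r_usual = r_value(train, "cap-surface", "type",
                  {"cap-surface": "smooth", "type": "edible"})
r_unusual = r_value(train, "cap-surface", "type",
                    {"cap-surface": "smooth", "type": "poisonous"})
```

A combination never seen together in training (smooth and poisonous) gets a ratio well below 1, while the common combination gets a ratio above 1; smoothing keeps the unseen case from collapsing to exactly zero.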


Finding Anomalies in Discrete Sequences

Many problem formulations

1 Anomalous symbols in a sequence

2 Anomalous subsequence in a sequence

3 Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance/density/clustering-based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A| |B|}}$$

The standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
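Both sequence measures are short to write down. The sketch below uses the standard dynamic programming LCS (not the faster Hunt-Szymanski variant) and assumes non-empty sequences; the function names are hypothetical and the two example sessions are taken from the login sequences shown earlier.

```python
import math

def hamming(a, b):
    """H(A, B): number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|A||B|) dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A||B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
common = lcs_length(s1, s2)
distance = nlcs_distance(s1, s2)
```

Unlike the Hamming distance, the LCS-based distance accepts sequences of different lengths, which is why it can compare the 7-symbol and 8-symbol sessions above.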


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence (n − k + 1 windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find its frequency in the normal dictionary

Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; anomalous sequences can be constructed that will escape detection
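A minimal sketch of the window dictionary and its score follows. It assumes the aggregate is the mean window frequency and adds 1 to the denominator so that a sequence with only unseen windows does not divide by zero; both choices, and the function names, are illustrative assumptions rather than the exact variant used in the cited work.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 sliding windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Normal dictionary: window -> count over all normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, dictionary, k):
    """Inverse of the mean window frequency (the +1 is an assumed smoothing)."""
    wins = windows(seq, k)
    mean_freq = sum(dictionary[w] for w in wins) / len(wins)
    return 1.0 / (mean_freq + 1.0)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspicious = "login pwd login pwd login pwd logout".split()

dictionary = build_dictionary(normal, k=3)
score_normal = anomaly_score(normal[0], dictionary, k=3)
score_suspicious = anomaly_score(suspicious, dictionary, k=3)
```

None of the windows of the repeated-login sequence appear in the dictionary, so it receives the maximum score of 1.0 under this scoring.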


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
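A fixed-order version of this model can be sketched with n-gram counts. The code below is an assumption-laden simplification: it uses a fixed context length k instead of a probabilistic suffix tree, applies add-one smoothing to cope with unseen contexts, and reports the negative mean log-probability (a monotone stand-in for the inverse normalized probability). All names are hypothetical.

```python
import math
from collections import Counter

def ngram_counts(sequences, length):
    """Frequency f of every window of the given length in the normal data."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - length + 1):
            counts[tuple(seq[i:i + length])] += 1
    return counts

def sequence_score(seq, normal_sequences, k, alphabet_size):
    """Negative mean log-probability under an order-k model (higher = more anomalous)."""
    ctx_counts = ngram_counts(normal_sequences, k)
    full_counts = ngram_counts(normal_sequences, k + 1)
    log_p, steps = 0.0, 0
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        full = tuple(seq[i - k:i + 1])
        # f(context + symbol) / f(context), with add-one smoothing (an assumption)
        p = (full_counts[full] + 1) / (ctx_counts[ctx] + alphabet_size)
        log_p += math.log(p)
        steps += 1
    return -log_p / steps

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspicious = "login pwd login pwd login pwd logout".split()

score_normal = sequence_score(normal[0], normal, k=2, alphabet_size=6)
score_suspicious = sequence_score(suspicious, normal, k=2, alphabet_size=6)
```

The repeated-login sequence uses contexts that are rare or absent in the normal data, so every symbol gets a low conditional probability and the overall score is higher than for a normal session.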


Finding Contextual Anomalies

Sometimes contextual information is available about the data

Not used directly as a feature

Well understood; no anomalies in the context

Can reduce false positives and yield interesting anomalies

[Figure: number of fever cases versus maximum daily temperature, with the clearest outlier and a conditional anomaly marked. Example adapted from [30]]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude), graph context (edges, weights), temporal location, domain specific (demographic, other)

How to incorporate context? Reduce to traditional anomaly detection (subset on context), or explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

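Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$

$d_U$: environmental attributes

$d_V$: indicator attributes

[Figure: Gaussian components U1, U2, U3 in the environmental space (u1, u2) and V1, V2, V3 in the indicator space (v1, v2), with example mapping probabilities p(V1|U3) = 0.2, p(V2|U3) = 0.2, p(V3|U3) = 0.6]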

Algorithm [30]

1 Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$

2 Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3 Learn a probabilistic mapping function $p(V_j \mid U_i)$

4 Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1 Break data into groups

2 Compute features for each group

3 Apply traditional anomaly detection

Examples

1 Time series

2 Image

3 Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

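Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group m:

1 Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim M(\pi)$

2 Draw the topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$

3 Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$

4 For each point in the group:

4.1 Draw topic membership $z_{mn} \sim M(\theta_m)$

4.2 Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$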

[Figure: plate diagram of the FGM with variables π, y_m, α, θ_m, z_mn, x_mn, β, η and plates N, M, K, T]

Model Parameters

$M(\pi)$: multinomial

Genre: $\mathrm{Dir}(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$: GIW (Gaussian-inverse-Wishart)

Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: diagram labeled Topic 1, Topic 2, and Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix; traditional evaluation metrics; class imbalance; ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate

Pseudo false negative estimation techniques [25]

                Predicted
                a     n
Actual    a     TP    FN
          n     FP    TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN}$$

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
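These metrics are straightforward to compute from labels. The sketch below assumes labels are the strings "a" (anomalous) and "n" (normal) as in the confusion matrix, and that at least one true and one predicted anomaly exist so no denominator is zero; function names and the toy labels are hypothetical.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Count TP, FN, FP, TN with the anomalous class as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from the confusion matrix."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn)
    prec = tp / (tp + fp)
    f = 2 * rec * prec / (rec + prec)
    return acc, rec, prec, f

# Ten instances, two true anomalies; the detector finds one and raises one false alarm.
actual = list("aannnnnnnn")
predicted = list("annannnnnn")
tp, fn, fp, tn = confusion_counts(actual, predicted)
acc, rec, prec, f = metrics(tp, fn, fp, tn)
```

The example shows the class imbalance issue: accuracy is 0.8 even though the detector finds only half of the anomalies and half of its alarms are false.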


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use a threshold δ on the score (scale issues [21]), or take the top x as anomalies

ROC curve by varying x or δ

Quality of output: Does the output "suggest" x or δ? Which output is better?


Unifying Scores

Different methods assign scores in different ranges

kNN-based scores [0, 1]; LOF scores [1, 6]; ABOD scores [0, 80000]

Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity ⇒ $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1 $R(o) = \max\{0, S(o) - base_S\}$

2 $R(o) = S_{max} - S(o)$

3 $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1 $N(o) = \frac{S(o)}{S_{max}}$

2 $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)

Suited for high-dimensional data

3 $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)

Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function

Suited for low-dimensional data
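Gaussian scaling needs only the error function from the standard library. The sketch below estimates the mean and standard deviation from the scores themselves and assumes the scores are not all identical (otherwise the standard deviation is zero); the function name and the example scores are hypothetical.

```python
import math

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0))))
            for s in scores]

# LOF-like scores: most values close to 1, one clear outlier.
lof_scores = [1.0, 1.1, 0.9, 1.0, 1.2, 6.0]
normalized = gaussian_scaling(lof_scores)
```

Scores at or below the mean map to 0, and scores far above the mean approach 1, so the interesting upper range is stretched while the normal range is collapsed.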


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27], multivariate categorical data [5], discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = [x, y]$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = [x', y'] \in D$ such that the distance between $y$ and $y'$ is maximum

Replace x with $x'$ and add z back to the data set

[Figure: points z and z′ in the (x, y) plane, with the injected anomaly marked]


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available? Feature identification, defining normal and anomalous behavior, tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks corresponding to a predefined pattern data set known as signatures; the system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g. Distributed Denial of Service, DDoS) or unusual network traffic (e.g. flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart by year, y-axis 0 to 3,000,000. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router with links 1 to 4, link measurement y_{1t}, and OD flows x_{1t}, x_{2t}, x_{3t}, x_{4t}]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g. $x_{1t}$)

Link measurement is the traffic that enters a node during an interval (e.g. $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, volume arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
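The five steps translate almost line by line into NumPy. This sketch assumes observations are stored as rows of Z, mean-centers the data before the decomposition (a standard PCA step that the slide does not spell out), and scores each observation by the squared norm of its residual; the function name and synthetic traffic matrix are hypothetical.

```python
import numpy as np

def residual_norms(Z, k):
    """Squared norm of each observation's projection onto the abnormal subspace."""
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=0)                      # center each measurement (assumption)
    cov = np.cov(Zc, rowvar=False)               # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigen-decomposition (ascending order)
    top = np.argsort(eigvals)[::-1][:k]          # Step 2: top-k principal components
    P = eigvecs[:, top]
    C = P @ P.T                                  # Step 3: projector onto normal subspace
    residual = Zc @ (np.eye(Z.shape[1]) - C)     # Step 4: rows are (I - P P^T) z
    return (residual ** 2).sum(axis=1)           # Step 5: large norm => anomaly

# Synthetic measurements on 3 links driven by one common factor, plus one injected anomaly.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
Z = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(200, 3))
Z[17] += np.array([1.0, -1.0, 1.0])
norms = residual_norms(Z, k=1)
```

Normal rows lie almost entirely in the one-dimensional normal subspace, so their residual norms are tiny, while the perturbed row stands out and would be flagged by any reasonable threshold.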


Subspace Analysis Results

Note that during an anomaly, the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting some threshold

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

Most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

DoS attack: multiple source IP addresses concentrated on a single destination IP address

Network scan: dispersed distribution of destination addresses

Most worms/viruses also induce some change in the distribution of certain features

However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]

Port scan anomaly: the traffic features change, but the traffic volume remains more or less the same.

We can use entropy to capture the variations in a traffic feature:

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure.
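
The two extreme values of the entropy can be checked with a short sketch. The function name `sample_entropy` and the port lists are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy (in bits) of the observed values of one traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports observed in one time bin.
concentrated = [80] * 8        # all traffic goes to a single port
dispersed = list(range(8))     # a scan touching N = 8 distinct ports once each

h_concentrated = sample_entropy(concentrated)   # maximally concentrated case
h_dispersed = sample_entropy(dispersed)         # maximally dispersed case, log2(8)
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy differs by three bits.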


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

• Scan anomaly: a spike in traffic from a dominant source IP

• Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question
• Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domains and players
• Insurance (auto/health) [a]: Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
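
A profile comparison of the activity-monitoring kind can be sketched with a z-score. This is only one possible deviation measure; the function name `deviation_score`, the call counts, and the alert threshold of 3 are assumptions made for the example.

```python
import statistics

def deviation_score(history, current):
    """Z-score of the current activity level against a user's historical profile."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return (current - mean) / std if std > 0 else 0.0

# Hypothetical profile: calls made by user X in each of the past six Januaries.
january_calls = [102, 98, 110, 95, 105, 100]

usual = deviation_score(january_calls, 104)        # close to the profile
suspicious = deviation_score(january_calls, 240)   # far above the profile
flagged = abs(suspicious) > 3.0                    # assumed alerting threshold
```

Scoring is a constant-time lookup once the profile exists, which matches the "fast" strength listed above; the cost lies in maintaining one profile per individual.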


Exploiting Graph Structure - Weighted Graphs

• Represent data as a weighted graph
  » Communication networks (phone, email, SMS)
  » Provider referral networks

• Objective: identify anomalous nodes

• For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

• Choose features that can highlight anomalous nodes

1. $N_i$: degree of the node
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

• The data for each node is thus transformed into a point in a multi-dimensional space
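
The four egonet features can be computed from a weighted adjacency matrix as in the sketch below. It assumes an undirected graph stored as a symmetric NumPy array with a zero diagonal; the function name `egonet_features` and the star graph are hypothetical.

```python
import numpy as np

def egonet_features(A, i):
    """Return (N_i, E_i, W_i, lambda_i) for node i of a symmetric weighted adjacency matrix."""
    neighbors = np.flatnonzero(A[i])
    nodes = np.concatenate(([i], neighbors))     # the ego plus its 1-hop neighbors
    sub = A[np.ix_(nodes, nodes)]                # induced sub-graph (egonet)
    upper = np.triu(sub, k=1)                    # count each undirected edge once
    n_i = len(neighbors)                         # degree of the ego
    e_i = int(np.count_nonzero(upper))           # number of edges in the egonet
    w_i = float(upper.sum())                     # total edge weight of the egonet
    lam_i = float(np.linalg.eigvalsh(sub).max()) # principal eigenvalue
    return n_i, e_i, w_i, lam_i

# Hypothetical 5-node graph: node 0 is the center of a star with unit weights.
A = np.zeros((5, 5))
for j in range(1, 5):
    A[0, j] = A[j, 0] = 1.0

star = egonet_features(A, 0)
leaf = egonet_features(A, 1)
```

For the star center the principal eigenvalue equals the square root of the number of leaves, so `star` is approximately `(4, 4, 4.0, 2.0)`.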


Identifying Anomalies

• Traditional anomaly detection (LOF)
  » Can be slow, but can identify any type of anomalous structure

• Faster method to identify specific types of anomalous structures
  » Identify relevant feature pairs and their power-law relationship
  » E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

• Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
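
The score can be evaluated once C and θ are known. The sketch below assumes they are obtained by ordinary least squares in log-log space, which is one simple choice; the helper names and the (N, E) pairs are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    theta = sxy / sxx
    return math.exp(my - theta * mx), theta

def power_law_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical (N_i, E_i) pairs: five nodes follow E = N^1.5 exactly,
# the last one has far more edges than that trend predicts for N = 16.
N = [4, 9, 16, 25, 36, 16]
E = [8, 27, 64, 125, 216, 120]

C, theta = fit_power_law(N, E)
scores = [power_law_score(n, e, C, theta) for n, e in zip(N, E)]
most_anomalous = scores.index(max(scores))
```

The ratio term measures how many times the observation deviates from the fitted line, and the logarithmic term damps the absolute size of that deviation.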


Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  » Healthcare data (Beneficiaries vs. Providers)
  » Insider trading (Traders vs. Stocks)

• Objective: identify anomalous links

• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  » RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

• Use the relevance scores to compute the normality score for a node $t \in V_2$
  » Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  » Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  » Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} connected by edge set E.


Detecting Disease Outbreaks

Domain Question
• Early detection of disease outbreaks
• Anthrax attack

Data
• Emergency department visits
• Grocery data (example: [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data
  » ED cases involving cough ⇒ flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN conditioned on the current environment (DB_baseline | current environment)
3. Compute contingency tables for rules on DB_baseline and DB_current
   • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
   • Null hypothesis: rows and columns of the tables for DB_baseline and DB_current are independent
5. Find the rule with the smallest p-value (the most significant difference); repeat from Step 2
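
The test in step 4 can be illustrated for a single rule with a 2x2 contingency table. The sketch below computes Pearson's chi-square statistic directly and uses the closed form of the chi-square survival function for one degree of freedom; the function name `chi_square_2x2` and the counts are hypothetical.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(stat / 2))   # survival function for 1 dof
    return stat, p_value

# Hypothetical counts for one rule, e.g. "symptom = respiratory".
# Rows: (DB_current, DB_baseline); columns: (rule matches, rule does not match).
table = [[40, 60],
         [200, 800]]
stat, p_value = chi_square_2x2(table)

# A table whose rows have identical proportions gives no evidence of a difference.
null_stat, null_p = chi_square_2x2([[10, 90], [100, 900]])
```

A small p-value means the rule matches a different share of records today than in the baseline sample, which is the kind of rule the procedure is looking for.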


Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model spatial and temporal relationships
  » What happened yesterday?
  » What happened in the adjoining neighborhood (yesterday)?

• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  » Add nodes for temporal and spatial dependencies
  » Needs more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
  » Difference from a "base period" (too simplistic)
  » Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  » Removes seasonality

• Used to understand climate and weather phenomena
  » Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference) value.

How to choose $b_i$?
• Mean of all data for location i
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  » 30-year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
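
Two of the listed choices of base, the monthly mean and the monthly z-score, can be sketched as follows. The function name `monthly_anomalies` and the temperature series are hypothetical; the series is assumed to start in January and to have no missing months.

```python
def monthly_anomalies(values, z_score=False):
    """Subtract each calendar month's mean from a monthly series (index 0 = January)."""
    anomalies = []
    for t, v in enumerate(values):
        same_month = values[t % 12::12]              # all observations of this month
        base = sum(same_month) / len(same_month)     # b_i: monthly mean
        a = v - base                                 # v'_i = v_i - b_i
        if z_score:
            var = sum((x - base) ** 2 for x in same_month) / len(same_month)
            a = a / var ** 0.5 if var > 0 else 0.0
        anomalies.append(a)
    return anomalies

# Hypothetical 3 years of monthly temperatures: a fixed seasonal cycle
# plus an unusually warm July in the final year.
seasonal = [5, 6, 9, 13, 17, 21, 24, 23, 19, 14, 9, 6]
series = [float(x) for x in seasonal * 3]
series[2 * 12 + 6] += 3.0

anoms = monthly_anomalies(series)
```

The seasonal cycle disappears from `anoms`, and only the warm July remains as a positive value; note that it also pulls the July base upward, which is one reason the choice of base matters.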


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both.

Find persistent spatio-temporal anomalies in precipitation data.

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause.

Figure: Video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
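
The two-step approach can be sketched in plain Python as a stand-in for the Matlab labeling functions. It assumes a (time, row, column) grid and 6-connectivity (neighbors in time and in the four compass directions); the function name, the precipitation values, and the threshold are all hypothetical.

```python
from collections import deque

STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def connected_components(mask):
    """Label 6-connected components of True cells in a 3-D (time, row, col) grid."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * C for _ in range(R)] for _ in range(T)]
    sizes = {}
    next_label = 0
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or labels[t][r][c]:
                    continue
                next_label += 1
                labels[t][r][c] = next_label
                queue, size = deque([(t, r, c)]), 0
                while queue:                       # breadth-first flood fill
                    ct, cr, cc = queue.popleft()
                    size += 1
                    for dt, dr, dc in STEPS:
                        nt, nr, nc = ct + dt, cr + dr, cc + dc
                        if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                                and mask[nt][nr][nc] and not labels[nt][nr][nc]):
                            labels[nt][nr][nc] = next_label
                            queue.append((nt, nr, nc))
                sizes[next_label] = size
    return labels, sizes

# Hypothetical precipitation grid: 3 time steps on a 2x3 map; low values mean dry.
precip = [
    [[0.1, 0.2, 0.9], [0.8, 0.9, 0.9]],
    [[0.1, 0.1, 0.9], [0.9, 0.9, 0.1]],
    [[0.2, 0.9, 0.9], [0.9, 0.9, 0.9]],
]
threshold = 0.3                                    # assumed dryness threshold
dry = [[[v < threshold for v in row] for row in grid] for grid in precip]  # step 1
labels, sizes = connected_components(dry)          # step 2
largest = max(sizes.values())
```

The persistent dry patch in the upper-left corner forms one component spanning all three time steps, while the isolated dry cell at the second time step forms a component of size one that a size filter would discard.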


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology
1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  » Not identified in a timely fashion
  » Resolution is not fine enough
  » Cause not explained
• Anomalies are lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29-38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243-254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237-5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151-180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462-471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392-403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13-24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219-230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782-787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601-604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67-91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1-10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631-645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563-568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808-815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Distance-based anomalies

• Intuition: a data point that is far away from its nearest neighbors is a candidate anomaly.

• Several definitions capture this intuition.

• DB(p, D) anomaly [20]: an object o in a data set T is a DB(p, D) anomaly if at least a fraction p of the objects in T lie at a distance greater than D from o.

• Generalizes the notion of "three standard deviations from the mean".

• This definition had a huge influence on subsequent developments in outlier detection.


DB(p, D) outlier

• To build some intuition, consider data generated from the Normal distribution N(0, 1). Then, if O is a DB(p, D) outlier:

$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^{2}/2} \, dx \le 1 - p$

• Example: if O is 3 (3 standard deviations away from the mean), then it is a DB(p, D) outlier with p = 0.999 and D = 0.1.

• Thus, for particular settings of D and p, DB(p, D) captures standard outliers.

• But it is much more general (e.g., any distance metric).
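
The example can be checked numerically with the standard normal CDF. This sketch assumes the parameter reading p = 0.999 and D = 0.1; the function name `normal_mass` is hypothetical.

```python
import math

def normal_mass(lo, hi):
    """Probability that a standard normal variable falls in [lo, hi]."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return cdf(hi) - cdf(lo)

O, D, p = 3.0, 0.1, 0.999
mass_within_D = normal_mass(O - D, O + D)   # fraction of data within distance D of O
is_outlier = mass_within_D <= 1 - p         # the DB(p, D) condition for N(0, 1)
```

Under these assumptions slightly less than 0.1% of the probability mass lies within distance 0.1 of the point 3, so at least 99.9% of the data lies farther away and the condition holds.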


Distance-based methods: DB(k, N)

• DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest.

• Several advantages: ranking of anomalies is more intuitive, and setting the parameters is generally easier.

• A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Time complexity is $O(n^{2} d)$, where n is the database size and d is the dimensionality.
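
A nested-loop baseline can be sketched as below. For brevity it sorts all distances for each point rather than tracking only the k smallest, so it is slightly more expensive than a tuned SNL implementation; the function names and the data are hypothetical.

```python
import heapq
import math

def kth_nn_distance(data, i, k):
    """Distance from data[i] to its k-th nearest neighbor (full scan of the data)."""
    dists = sorted(math.dist(data[i], q) for j, q in enumerate(data) if j != i)
    return dists[k - 1]

def top_n_outliers(data, k, n):
    """Indices of the n points with the largest k-th nearest-neighbor distance."""
    scores = [(kth_nn_distance(data, i, k), i) for i in range(len(data))]
    return [i for _, i in heapq.nlargest(n, scores)]

# Hypothetical 2-D data: a tight cluster plus one distant point (index 5).
data = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
outliers = top_n_outliers(data, k=2, n=1)
```

Every point is compared with every other point, which is the source of the quadratic cost that pruning tries to avoid.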


Pruning rule

• Pruning rule for DB(k, N) anomalies [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far.

• A large number of non-anomalies can be pruned without carrying out a full data search.

• Complexity: nearly O(n).
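
The pruning rule can be sketched as an early exit from the inner loop. This is an illustrative version, not the implementation from [3]: it assumes k is smaller than the number of points, processes the data in the given order (the original work randomizes it), and the function name and data are hypothetical.

```python
import heapq
import math

def top_n_outliers_pruned(data, k, n):
    """Top-n k-th-NN-distance outliers with cutoff pruning; returns (indices, distance count)."""
    top = []          # min-heap of (score, index) holding the current top-n outliers
    cutoff = 0.0      # score of the weakest outlier found so far
    n_dist = 0        # number of distance computations performed
    for i, p in enumerate(data):
        knn = []      # negated distances: a max-heap of the k smallest seen so far
        pruned = False
        for j, q in enumerate(data):
            if j == i:
                continue
            n_dist += 1
            d = math.dist(p, q)
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # The running k-th NN distance can only shrink, so once it falls
            # below the cutoff this point can no longer enter the top n.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-knn[0], i))
        if len(top) > n:
            heapq.heappop(top)
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(i for _, i in top), n_dist

# Hypothetical data: the distant point comes first, so the cutoff is high early on.
data = [(10, 10), (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
outliers, n_dist = top_n_outliers_pruned(data, k=2, n=1)
```

How much work is saved depends on how quickly a high cutoff is established, which is why the processing order matters in practice.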


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

• Do not make any assumption about the distribution of the data

• Scalable to large data sets (nearly O(n) with pruning)

• Capable of finding only global anomalies

• Can lead to non-intuitive results in top-k situations

Figure: scatter plot of dimension 1 versus dimension 2 (both axes from 0 to 70), with points p1, p2, p3, and p4 labeled.


Density-based anomaly

• Calculate the density of an object based on the density of its k nearest neighbours:

$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)$

$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$

$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$

• LOF indicates a degree of local outlier-ness [6].
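
The three formulas can be sketched directly. The code below uses plain Euclidean distance to the k nearest neighbours, which is a simplification of the reachability distance used in the full LOF definition [6]; the function names and the data are hypothetical.

```python
import math

def knn(data, i, k):
    """Indices of the k nearest neighbours of data[i] (ties broken by index)."""
    others = sorted((math.dist(data[i], q), j) for j, q in enumerate(data) if j != i)
    return [j for _, j in others[:k]]

def density(data, i, k):
    """Inverse of the mean distance from data[i] to its k nearest neighbours."""
    nbrs = knn(data, i, k)
    return len(nbrs) / sum(math.dist(data[i], data[j]) for j in nbrs)

def anomaly_score(data, i, k):
    """1 / relative-density: mean neighbour density divided by the point's own density."""
    nbrs = knn(data, i, k)
    mean_nbr_density = sum(density(data, j, k) for j in nbrs) / len(nbrs)
    return mean_nbr_density / density(data, i, k)

# Hypothetical data: a dense cluster, a sparse cluster, and one point (index 8)
# that sits just outside the dense cluster.
dense = [(0, 0), (0, 1), (1, 0), (1, 1)]
sparse = [(20, 20), (20, 30), (30, 20), (30, 30)]
data = dense + sparse + [(4, 4)]
scores = [anomaly_score(data, i, k=2) for i in range(len(data))]
```

Points inside either cluster score about 1 regardless of how spread out their cluster is, while the point near the dense cluster scores well above 1; this is the local behaviour that a single global distance threshold cannot reproduce.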


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use pruning technique and has a complexity of O(n^2)

Require a method combining the strengths of distance-based and density-based approaches: a distance-based approach which can capture density

[Figure: two clusters C1 and C2 with outlying points O1 and O2]

Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities

Computation of commute time

Commute time can be computed using the graph Laplacian matrix L

$$c_{ij} = V_G\,(e_i - e_j)^T L^{+} (e_i - e_j)$$

$L^{+}$: pseudo-inverse of L; $V_G$: graph volume; $e_i$: i-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of L

$$c_{ij} = V_G\,[(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$$

V, S: eigenvectors and eigenvalues of L
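As a small worked example, the commute time on a toy graph can be computed without forming the pseudo-inverse explicitly. The sketch below assumes a connected, undirected graph given as a dense weight matrix. It swaps in a different but equivalent route: it solves L x = e_i - e_j with one node grounded, so that x_i - x_j equals the quadratic form in the pseudo-inverse. The helper names are hypothetical.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def commute_time(w, i, j):
    """c_ij = V_G * (e_i - e_j)^T L^+ (e_i - e_j) for a connected weighted graph w."""
    n = len(w)
    degree = [sum(row) for row in w]
    volume = sum(degree)  # graph volume V_G
    lap = [[(degree[r] if r == c else 0.0) - w[r][c] for c in range(n)]
           for r in range(n)]
    rhs = [0.0] * n
    rhs[i] += 1.0
    rhs[j] -= 1.0
    # Ground the last node (x[n-1] = 0) so the reduced system is non-singular.
    x = solve([row[: n - 1] for row in lap[: n - 1]], rhs[: n - 1]) + [0.0]
    return volume * (x[i] - x[j])

# Path graph 0 - 1 - 2 with unit edge weights.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
c02 = commute_time(path, 0, 2)
c01 = commute_time(path, 0, 1)
```

On this path graph the walk from one end to the other and back takes 8 steps in expectation, and the sketch returns that value for `c02`.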

Anomaly detection using commute time (CDOF)

Construct the mutual k nearest neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time with the pruning rule

Complexity: O(n^3)

Commute time method can detect global, local, and group anomalies

Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure in which the commute time between two nodes can be computed in O(log n) time

Complexity of CDOF: O(n^3) → O(n log n)

Uses a near-linear time solver for a linear system of equations Ax = b

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU

Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from O(n^2) to nearly O(n log n) by using an index

One solution for the curse of high dimensionality is to use random projections
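A random projection can be sketched as follows. This is only an illustration of the general idea, not the PINN algorithm itself: it assumes a dense Gaussian projection matrix with N(0, 1/k) entries, and the dimensions, seeds, and function names are arbitrary choices for the example.

```python
import math
import random

def random_projection_matrix(d, k, seed=0):
    """k x d matrix with i.i.d. N(0, 1/k) entries (one common Gaussian choice)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(matrix, x):
    """Map a d-dimensional point x to k dimensions."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]

d, k = 1000, 50
rng = random.Random(1)
a = [rng.random() for _ in range(d)]
b = [rng.random() for _ in range(d)]
R = random_projection_matrix(d, k)

original = math.dist(a, b)
projected = math.dist(project(R, a), project(R, b))
ratio = projected / original  # expected to be near 1, not exactly 1
```

Nearest-neighbour queries can then be run in the k-dimensional space, where an index is far more effective than in the original d dimensions.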

PINN Algorithm (ICDM 2010)

PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

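Under certain assumptions about intrinsic dimensionality (c), with high probability

$$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ denotes the LOF value estimated by PINN

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust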

Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers

Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers

Examples local density-based outliers

Examples of images ranked by LOF

Addressing Scalability

Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms

Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilization of device memory; minimize CPU ↔ GPU transfer of data; keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model and memory hierarchy. The host launches a kernel on the device as a grid of thread blocks (Block(0,0) to Block(1,2)), each block holding threads (Thread(0,0) to Thread(2,3)). The device contains streaming multiprocessors SM 0 to SM 2, each with registers, scalar processors SP 0 to SP 7, and shared memory, plus global, texture, and constant memory; the host has its own host memory]

Implementing DB(kN) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(kN) is O(n^2))

Involves computing pairwise distances

Load block i and block j to shared memory; data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

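[Figure: schematic of the n × d data matrix, its transpose (T), and the resulting n × n pairwise output matrix]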
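For reference, a serial version of the block-wise DB(kN) computation might look like the sketch below. It is plain Python rather than CUDA, so the tile loops only mimic how pairs of blocks would be loaded and how one GPU thread would handle one (i, j) pair. The block size, function names, and toy data are assumptions for illustration, and the data set is assumed to contain more than k instances.

```python
import heapq
import math

def kth_nn_distances(data, k, block=2):
    """Distance from each instance to its k-th nearest neighbour, tile by tile."""
    n = len(data)
    # Per instance, keep its k smallest distances in a max-heap (negated values).
    heaps = [[] for _ in range(n)]
    for bi in range(0, n, block):
        for bj in range(0, n, block):
            # On a GPU, one thread would handle one (i, j) pair of this tile.
            for i in range(bi, min(bi + block, n)):
                for j in range(bj, min(bj + block, n)):
                    if i == j:
                        continue
                    d = math.dist(data[i], data[j])
                    if len(heaps[i]) < k:
                        heapq.heappush(heaps[i], -d)
                    elif d < -heaps[i][0]:
                        heapq.heapreplace(heaps[i], -d)
    return [-h[0] for h in heaps]

def top_n_anomalies(data, k, n_top):
    """Indices of the n_top instances with the largest k-th neighbour distance."""
    scores = kth_nn_distances(data, k)
    return sorted(range(len(data)), key=lambda i: scores[i], reverse=True)[:n_top]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (0.5, 0.5)]
top = top_n_anomalies(data, k=2, n_top=1)
```

Here `top` contains the index of the isolated point (10, 10), the only instance far from all others.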

Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology

Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]

Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules (X ⇒ Y)

Choose high-confidence rules (P(Y|X))

For test record Z = ⟨X, Y⟩, find rules of the form P(Y|X)

Y is not observed when X is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute P(Z) for test data record Z

Flag anomaly if P(Z) < δ

Using Similarity Metrics [10]

Use a similarity measure (S(X1, X2))

Apply a distance/density/clustering based method (e.g., LOF)

Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t: for each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$

Score t based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare exponentially many pairs of subsets: only consider subsets up to size k; ignore subsets with frequency less than a threshold α; avoid comparing independent subsets of attributes, i.e. require $\mu(A, B) \ge \beta_\mu$

Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$

N: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low-arity attributes
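The Laplace-smoothed r-value can be computed directly from counts, as in the sketch below. It assumes records are tuples of categorical values and counts by a linear scan instead of an AD Tree. The training rows are invented toy data loosely modelled on mushroom-style attributes, and the function names are hypothetical.

```python
def laplace(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(train, test, attrs_a, attrs_b):
    """r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)) for disjoint attribute sets."""
    a_t = tuple(test[i] for i in attrs_a)
    b_t = tuple(test[i] for i in attrs_b)
    n = len(train)
    c_a = c_b = c_ab = 0
    for row in train:
        has_a = tuple(row[i] for i in attrs_a) == a_t
        has_b = tuple(row[i] for i in attrs_b) == b_t
        c_a += has_a
        c_b += has_b
        c_ab += has_a and has_b
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))

# Invented toy records: (cap-shape, cap-surface, habitat).
train = [
    ("convex", "smooth", "urban"),
    ("convex", "smooth", "grasses"),
    ("bell", "smooth", "meadows"),
    ("convex", "scaly", "urban"),
    ("convex", "smooth", "grasses"),
    ("bell", "smooth", "meadows"),
]
common = r_value(train, ("bell", "smooth", "meadows"), [0], [2])
rare = r_value(train, ("bell", "smooth", "urban"), [0], [2])
```

The combination (bell, urban) never co-occurs in the toy training set, so `rare` is lower than `common`, which is the signal the test uses.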

Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout

Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

Standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
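Both measures fit in a short sketch. It uses the standard quadratic dynamic program for the LCS (not the faster Hunt-Szymanski variant) and toy session sequences chosen for illustration; the function names are hypothetical.

```python
import math

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x != y)

def lcs_length(a, b):
    """Standard O(|A||B|) dynamic program, keeping one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
s3 = "login pwd login pwd login pwd logout".split()

d12 = nlcs_distance(s1, s2)  # different lengths are fine for the LCS measure
d13 = nlcs_distance(s1, s3)
```

The resulting pairwise distances can be handed to any distance-based or density-based detector in place of Euclidean distance.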

Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence (n − k + 1)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find the frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; anomalous sequences can be constructed that will escape detection
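A minimal sketch of the window-dictionary approach follows. It assumes the aggregate is the mean window frequency and that every sequence has at least k symbols; the function names and toy sequences are illustrative only.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Training: count every window seen in the normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, dictionary, k):
    """Testing: inverse of the mean dictionary frequency of the windows."""
    wins = windows(seq, k)
    mean_freq = sum(dictionary[w] for w in wins) / len(wins)
    return float("inf") if mean_freq == 0 else 1.0 / mean_freq

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspicious = "login pwd login pwd login pwd logout".split()

dictionary = build_dictionary(normal, k=2)
score_normal = anomaly_score(normal[0], dictionary, k=2)
score_suspicious = anomaly_score(suspicious, dictionary, k=2)
```

The repeated login attempts produce windows such as (pwd, login) that never occur in the normal dictionary, so `score_suspicious` exceeds `score_normal`.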

Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace it with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
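The short-memory model with suffix back-off can be sketched as follows. This is a simplified illustration and not a full Probabilistic Suffix Tree: it stores raw n-gram counts, backs off by dropping the oldest symbol, and scores with the negative mean log-probability (a monotone stand-in for the inverse normalized probability). The probability floor `eps`, `min_count`, and the toy data are assumptions.

```python
import math
from collections import Counter

def ngram_counts(sequences, max_len):
    """f: how often each contiguous subsequence of length 1..max_len occurs."""
    f = Counter()
    for seq in sequences:
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                f[tuple(seq[i:i + n])] += 1
    return f

def cond_prob(f, history, symbol, min_count=1):
    """P(symbol | history), backing off to the longest suffix seen often enough."""
    history = tuple(history)
    while history and f[history] < min_count:
        history = history[1:]  # drop the oldest symbol
    if not history:
        total = sum(c for gram, c in f.items() if len(gram) == 1)
        return f[(symbol,)] / total
    return f[history + (symbol,)] / f[history]

def anomaly_score(seq, f, k):
    """Negative mean log-probability; larger means more anomalous."""
    eps = 1e-6  # assumed floor to avoid log(0) for unseen transitions
    logp = 0.0
    for i, s in enumerate(seq):
        p = cond_prob(f, seq[max(0, i - k):i], s)
        logp += math.log(max(p, eps))
    return -logp / len(seq)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspicious = "login pwd login pwd login pwd logout".split()

f = ngram_counts(normal, max_len=3)  # histories of length k = 2 plus one symbol
score_normal = anomaly_score(normal[0], f, k=2)
score_suspicious = anomaly_score(suspicious, f, k=2)
```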

Finding Contextual Anomalies

Sometimes contextual information is available about data: not used directly as a feature; well understood, with no anomalies in the context; can reduce false positives and yield interesting anomalies

[Figure: Number of Fever Cases against Maximum Daily Temperature, with the points marked "Clearest Outlier" and "Conditional Anomaly"; example adapted from [30]]

Contextual anomalies - anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude); graph context (edges, weights); temporal location; domain specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context); explicitly model contextual information (time series, spatial)

Conditional Anomaly Detection

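Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: environmental space (axes u_1, u_2) with Gaussians U_1, U_2, U_3 mapped to indicator space (axes v_1, v_2) with Gaussians V_1, V_2, V_3; p(V_1|U_3) = 0.2, p(V_2|U_3) = 0.2, p(V_3|U_3) = 0.6]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j \mid U_i)$

4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$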
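The scoring step (step 4) is easy to illustrate once the model is given. The sketch below assumes one-dimensional Gaussians and hand-set, hypothetical parameters in place of the learned GMMs and mapping; it only evaluates the double sum for S.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    total = 0.0
    for i, (mu_u, sd_u) in enumerate(U):
        inner = sum(
            gaussian_pdf(v, mu_v, sd_v) * mapping[i][j]
            for j, (mu_v, sd_v) in enumerate(V)
        )
        total += gaussian_pdf(u, mu_u, sd_u) * inner
    return total

# Hypothetical hand-set parameters (in practice these would be learned).
U = [(10.0, 3.0), (30.0, 3.0)]   # environment: cold days, hot days (temperature)
V = [(5.0, 2.0), (40.0, 5.0)]    # indicator: few fever cases, many fever cases
mapping = [[0.2, 0.8],           # p(V_j | U_1): cold days mostly map to many cases
           [0.9, 0.1]]           # p(V_j | U_2): hot days mostly map to few cases

expected = cad_score(30.0, 5.0, U, V, mapping)    # hot day, few cases
unusual = cad_score(30.0, 40.0, U, V, mapping)    # hot day, many cases
```

A low S marks a conditional anomaly: many fever cases are ordinary on a cold day in this toy model but unlikely on a hot one, so `unusual` is smaller than `expected`.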

Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies

Using Latent Dirichlet Allocation for Group Anomalies

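Find anomalous groups in data [34]

Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$

2. Draw topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$

3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$

4. For each point in the group: (1) draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$; (2) generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

[Figure: plate diagram of the FGM with variables π, y_m, α, θ_m, z_mn, x_mn, β, η and plates N, M, K, T]

Model Parameters

$\mathcal{M}(\pi)$ - Multinomial

Genre - $\mathrm{Dir}(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$ - GIW

Point generators $P(x_{mn} \mid \beta_{m,k})$ - Multivariate Gaussian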

Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (single step Monte Carlo EM)

[Figure: learned topics, labelled Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies

Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix; traditional evaluation metrics; class imbalance; ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate; pseudo false negative estimation techniques [25]

             Predicted a   Predicted n
Actual a     TP            FN
Actual n     FP            TN

$$Acc = \frac{TP + TN}{\sum}$$

$$Rec\,(R) = \frac{TP}{TP + FN}$$

$$Prec\,(P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
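These metrics can be computed from label vectors as in the sketch below. It assumes labels "a" (anomalous) and "n" (normal), takes the sum in the accuracy denominator to be TP + FN + FP + TN, and returns 0 when a denominator is zero; the label vectors are toy data.

```python
def evaluate(actual, predicted, positive="a"):
    """Confusion-matrix counts and the derived metrics for the anomaly class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    tn = sum(1 for y, p in pairs if y != positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f = (2 * recall * precision / (recall + precision)
         if recall + precision else 0.0)
    return {
        "tp": tp, "fn": fn, "fp": fp, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "recall": recall, "precision": precision, "f": f,
    }

actual = list("aannnnnnnn")      # 2 anomalies among 10 instances
predicted = list("annannnnnn")   # one hit, one miss, one false alarm
m = evaluate(actual, predicted)
```

With this toy input the accuracy is 0.8 although only half of the anomalies are found, which illustrates why class imbalance makes accuracy a poor summary.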

Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use a threshold δ on the score (scale issues [21]), or take the top x as anomalies

ROC curve by varying x or δ

Quality of output: does the output “suggest” x or δ? Which output is better?

Unifying Scores

Different methods assign scores in different ranges

kNN-based scores: [0, 1]; LOF scores: [1, 6]; ABOD scores: [0, 80000], where anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0 \;\forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1] \;\forall o$

Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\; S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization

1. $N(o) = S(o) / S_{max}$

2. $N(o) = \max\left(0,\; \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling); suited for high dimensional data

3. $N(o) = \max\left(0,\; \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling), where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function; suited for low dimensional data
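Linear scaling and Gaussian scaling can be compared on a toy score vector. The sketch assumes the mean and standard deviation are estimated from the scores themselves (population standard deviation, which must be non-zero); the LOF-like values are invented.

```python
import math
import statistics

def linear_scaling(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)  # assumed non-zero
    return [
        max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0))))
        for s in scores
    ]

lof_scores = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]  # invented LOF-like scores
linear = linear_scaling(lof_scores)
gaussian = gaussian_scaling(lof_scores)
```

Under linear scaling the normal instances keep small but non-zero values, whereas Gaussian scaling maps every below-average score to 0 and leaves the single large score close to 1.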

Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27]; multivariate categorical data [5]; discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = \{x, y\}$; x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between y and y' is maximum

Replace x with x' and add z back to the data set

[Figure: point z and sampled point z′ plotted over partitions x and y, with the injected anomaly marked]
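The perturbation steps can be sketched as below. The sketch assumes each record is stored as a pair of numeric tuples (x, y) and that distance means Euclidean distance on the y partition; the function name, the sample size, and the toy data are illustrative assumptions.

```python
import math
import random

def inject_anomaly(data, z, sample_size, rng):
    """Return a perturbed copy of z = (x, y) with x swapped for x' from a sample."""
    x, y = z
    sample = rng.sample(data, sample_size)                       # random sample D
    x_prime, _ = max(sample, key=lambda p: math.dist(y, p[1]))   # y' farthest from y
    return (x_prime, y)                                          # x replaced, y kept

# Toy data in which y is perfectly determined by x (y = 2x).
data = [((float(i),), (2.0 * i,)) for i in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data, data[0], sample_size=len(data), rng=rng)
```

The injected record keeps a valid x value and a valid y value, but their combination breaks the relationship that holds in the rest of the data.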

Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available: feature identification; defining normal and anomalous behavior; tuning parameters

Available ground truth (training, validation)

What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)

Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart, vertical axis from 0 to 3,000,000. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]

Network Topology

[Figure: a typical network, showing a router and nodes 1 to 4 with OD flows x_{1t}, x_{2t}, x_{3t}, x_{4t} and link measurement y_{1t}]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

Link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A

Network Anomalies Detection Problem

Why care about OD Flows

Volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source [22]

PCA and Subspace Method

Subspace Method Algorithm

Step 1: Determine the PCs based on eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a residual vector in $\tilde{Z}$ is large, then it is an "anomaly"
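Steps 3 to 5 can be illustrated for a single traffic vector once P is known. The sketch below assumes P is already available as a d × k matrix with orthonormal columns (the eigendecomposition of steps 1 and 2 is not shown), and the threshold value is an arbitrary placeholder.

```python
import math

def split_traffic(z, P):
    """Return (normal, residual) parts of vector z for a d x k orthonormal basis P."""
    d, k = len(P), len(P[0])
    coeffs = [sum(P[r][c] * z[r] for r in range(d)) for c in range(k)]       # P^T z
    normal = [sum(P[r][c] * coeffs[c] for c in range(k)) for r in range(d)]  # P P^T z
    residual = [zi - ni for zi, ni in zip(z, normal)]                        # (I - P P^T) z
    return normal, residual

def is_anomaly(z, P, threshold):
    """Flag z when the norm of its residual component exceeds the threshold."""
    _, residual = split_traffic(z, P)
    return math.sqrt(sum(r * r for r in residual)) > threshold

# One principal direction along the first coordinate axis (k = 1, d = 3).
P = [[1.0], [0.0], [0.0]]
z = [3.0, 4.0, 0.0]
normal, residual = split_traffic(z, P)
```

The normal part keeps only the component along the principal direction; everything else lands in the residual, whose norm is compared with the threshold.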

Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes quite a lot

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address
» Network scan: dispersed distribution of destination addresses
» Most worms/viruses also induce some change in the distribution of certain features
» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source [22]

Port scan anomaly: the traffic feature changes, but the traffic volume remains more or less the same

We can use entropy to capture the variations in the traffic feature
• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure callouts: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure
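The two boundary values of the entropy can be checked with a small sketch. The function name `feature_entropy` and the toy port lists are hypothetical; the entropy is the usual empirical Shannon entropy in bits.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical entropy (in bits) of a traffic feature, e.g. destination ports."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

concentrated = [80] * 8            # every packet targets the same port
dispersed = list(range(1, 9))      # a scan touching 8 distinct ports once each
h_concentrated = feature_entropy(concentrated)   # maximally concentrated case
h_dispersed = feature_entropy(dispersed)         # maximally dispersed case, N = 8
```

Both lists carry the same volume (8 packets), yet their entropies sit at opposite extremes, which is the point of the slide.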


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question
• Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Players by domain
• Insurance (auto/health) (a): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

(a) http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g., user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
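A minimal sketch of the activity monitoring idea, assuming a profile is just the mean and standard deviation of a user's historical counts and that "significant deviation" means a z-score above a chosen threshold. The threshold of 3, the call counts, and the function names are illustrative assumptions, not the method of [13].

```python
from statistics import mean, stdev

def build_profile(history):
    """Summarize a user's historical activity counts as (mean, standard deviation)."""
    return mean(history), stdev(history)

def is_significant_deviation(profile, current, z_threshold=3.0):
    """Flag current behavior lying more than z_threshold deviations from the profile."""
    mu, sigma = profile
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Hypothetical call counts for one user over past Januaries.
profile = build_profile([40, 42, 38, 41, 39])
normal_month = is_significant_deviation(profile, 43)
suspicious_month = is_significant_deviation(profile, 120)
```

The check itself is a constant-time comparison, which matches the "fast, easy to explain" strengths, while the need to store one profile per user reflects the first weakness.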


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
1. $N_i$: degree of the node
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space
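The first three egonet features can be computed directly from an adjacency map. The sketch below assumes an undirected graph stored as a dict of dicts with comparable node labels; the helper names and the toy star and clique graphs are hypothetical, and the eigenvalue feature is omitted for brevity.

```python
def make_graph(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node` (the node plus its neighbors)."""
    members = set(adj[node]) | {node}
    n_i = len(adj[node])                      # degree of the ego
    e_i = 0
    w_i = 0.0
    for u in members:
        for v, w in adj[u].items():
            if v in members and u < v:        # count each undirected edge once
                e_i += 1
                w_i += w
    return n_i, e_i, w_i

star = make_graph([(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0)])
clique = make_graph([(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0),
                     (1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0)])
star_features = egonet_features(star, 0)
clique_features = egonet_features(clique, 0)
```

Both egos have the same degree, but the clique's egonet has twice as many edges, which is the kind of contrast the feature pairs are meant to expose.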


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
• Identify relevant feature pairs and power law relationships
• E.g., Egonet Density Power Law ($N_i$ vs. $E_i$): detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
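The score is easy to evaluate once a power law has been fitted. In the sketch below the fit parameters `C = 1` and `theta = 1.5` are made-up values chosen for illustration (in practice they would be estimated from the data), and the function name is hypothetical.

```python
import math

def power_law_score(x, y, C, theta):
    """Distance of the observed y from the fitted power law y = C * x**theta."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

C, theta = 1.0, 1.5                                      # hypothetical fitted parameters
score_on_line = power_law_score(4, 8.0, C, theta)        # 4**1.5 == 8, so no deviation
score_star = power_law_score(10, 10, C, theta)           # star: E_i equals N_i
score_near_clique = power_law_score(10, 55, C, theta)    # clique: E_i = N_i (N_i + 1) / 2
```

A node that lies exactly on the fitted line scores 0, while nodes whose egonets are much sparser (stars) or much denser (near-cliques) than the fit predicts receive positive scores.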


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (beneficiaries vs. providers)
• Insider trading (traders vs. stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
• RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality scores for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
• Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, \dots, a_n\}$ and $V_2 = \{t_1, \dots, t_n\}$ connected by edges $E$
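One way to make the relevance and normality scores concrete is sketched below. It assumes relevance is approximated by a random walk with restart (restart probability 0.15, an arbitrary choice) and that similarity between relevance vectors is cosine similarity over the nodes of V1; neither choice is stated on the slide, and the toy graph and function names are hypothetical.

```python
import math

def relevance_vector(neighbors, query, restart=0.15, iters=100):
    """Approximate random-walk-with-restart visit probabilities from `query`."""
    nodes = list(neighbors)
    p = {v: 0.0 for v in nodes}
    p[query] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for u in nodes:
            share = (1 - restart) * p[u] / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        nxt[query] += restart                 # restart mass returns to the query node
        p = nxt
    return p

def cosine(u, v, keys):
    dot = sum(u[k] * v[k] for k in keys)
    nu = math.sqrt(sum(u[k] ** 2 for k in keys))
    nv = math.sqrt(sum(v[k] ** 2 for k in keys))
    return dot / (nu * nv)

def normality_score(neighbors, t, v1):
    """Mean off-diagonal similarity between relevance vectors of t's neighbors S_t."""
    s_t = sorted(neighbors[t])
    vecs = [relevance_vector(neighbors, a) for a in s_t]
    sims = [cosine(vecs[i], vecs[j], v1)
            for i in range(len(s_t)) for j in range(len(s_t)) if i != j]
    return sum(sims) / len(sims)

# Two communities {a1, a2} and {a3, a4}; t5 links one member of each.
edges = [("a1", "t1"), ("a2", "t1"), ("a1", "t2"), ("a2", "t2"),
         ("a3", "t3"), ("a4", "t3"), ("a3", "t4"), ("a4", "t4"),
         ("a1", "t5"), ("a3", "t5")]
neighbors = {}
for a, t in edges:
    neighbors.setdefault(a, []).append(t)
    neighbors.setdefault(t, []).append(a)
v1 = ["a1", "a2", "a3", "a4"]
within = normality_score(neighbors, "t1", v1)   # t1 connects same-community nodes
across = normality_score(neighbors, "t5", v1)   # t5 bridges the two communities
```

The bridging node `t5` receives the lower normality score because the nodes it connects have dissimilar relevance vectors. The sketch requires every node to have at least one neighbor and `S_t` to contain at least two nodes.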


Detecting Disease Outbreaks

Domain Question
• Early detection of disease outbreaks
• Anthrax attack

Data
• Emergency department visits
• Grocery data (Example [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data: ED cases involving cough $\Rightarrow$ Flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$
   • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
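Steps 3 and 4 can be illustrated for a single rule with a 2x2 contingency table. The sketch assumes the rows are `DB_current` and `DB_baseline` and the columns are the records that do and do not match the rule. The counts are invented, and the p-value uses the closed form of the chi-square survival function for one degree of freedom.

```python
import math

def chi_square_p_value(table):
    """Pearson chi-square test of independence for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, observed in enumerate(obs_row):
            expected = row[i] * col[j] / n
            stat += (observed - expected) ** 2 / expected
    # With one degree of freedom the survival function is erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: [matches rule, does not match] for current and baseline data.
stat_shift, p_shift = chi_square_p_value([[30, 70], [100, 900]])
stat_same, p_same = chi_square_p_value([[10, 90], [100, 900]])
```

When the rule matches the same fraction of records in both data sets the statistic is 0 and the p-value is 1, whereas a tripled match rate in the current data yields a very small p-value.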


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
• Mean of all data for location $i$
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
• 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
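Two of the listed choices for the base, monthly means and monthly z-scores, can be sketched as follows. The sketch assumes a monthly series that starts in January and uses the whole series as the base period; the function name and the toy two-year series are hypothetical.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, use_zscore=False):
    """Subtract each calendar month's mean from a monthly series starting in January."""
    by_month = {m: values[m::12] for m in range(12)}
    anomalies = []
    for t, v in enumerate(values):
        month_values = by_month[t % 12]
        a = v - mean(month_values)            # v'_i = v_i - b_i with a monthly base
        if use_zscore:
            a /= pstdev(month_values)         # assumes the month is not constant
        anomalies.append(a)
    return anomalies

# Two years of a purely seasonal signal; the second year is uniformly 2 units higher.
series = [m for m in range(12)] + [m + 2 for m in range(12)]
anoms = monthly_anomalies(series)
z_anoms = monthly_anomalies(series, use_zscore=True)
```

After subtracting the monthly base, the seasonal ramp disappears and only the year-to-year shift remains in the anomaly series.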


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel/bwlabeln

Followup science question: Explain cause

Figure: Video (drought_1901_2006_1.avi) courtesy Dr. Arindam Banerjee
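The two-step approach can be sketched without any image-processing library. The example below assumes precipitation is stored as a nested list indexed `[time][row][col]`, that an anomaly is any cell below a fixed threshold, and that cells are connected through shared faces in space or time (6-connectivity, which may differ from the connectivity used with `bwlabeln` in the original work). All names and values are hypothetical.

```python
from collections import deque

def largest_anomalous_component(precip, threshold):
    """Size of the largest 6-connected set of cells with precip[t][r][c] < threshold."""
    T, R, C = len(precip), len(precip[0]), len(precip[0][0])
    # Step 1: threshold to find anomalous (dry) cells.
    dry = {(t, r, c) for t in range(T) for r in range(R) for c in range(C)
           if precip[t][r][c] < threshold}
    # Step 2: breadth-first search for connected components across space and time.
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, best = set(), 0
    for start in dry:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            t, r, c = queue.popleft()
            size += 1
            for dt, dr, dc in steps:
                nb = (t + dt, r + dr, c + dc)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        best = max(best, size)
    return best

# Three time steps of a 2x2 grid: cell (0, 0) stays dry, cell (1, 1) is dry only once.
precip = [[[0.1, 5.0], [5.0, 0.2]],
          [[0.1, 5.0], [5.0, 5.0]],
          [[0.2, 5.0], [5.0, 5.0]]]
largest = largest_anomalous_component(precip, threshold=1.0)
```

The persistent dry cell forms a component spanning all three time steps, while the one-off dry cell stays isolated, so a minimum component size would separate droughts from short dry spells.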


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example src: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology
1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  » Not identified in timely fashion
  » Resolution is not fine enough
  » Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of the International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of the NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


DB(p, D) outlier

To build some intuition, consider data generated from the Normal distribution $N(0, 1)$. Then, if $O$ is a DB(p, D) outlier:

$\frac{1}{(2\pi)^{1/2}} \int_{O-D}^{O+D} e^{-x^2/2} \, dx \le 1 - p$

Example: If $O$ is 3 (3 standard deviations away from the mean), then it is a DB(p, D) outlier for $p = 0.999$ and $D = 0.1$

Thus, for particular settings of $D$ and $p$, DB(p, D) captures standard outliers

But it is much more general (e.g., any distance metric)
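The example can be checked numerically with the standard normal CDF written in terms of the error function. The sketch evaluates the probability mass within distance D = 0.1 of O = 3 and compares it with 1 - p for p = 0.999; the function name is hypothetical.

```python
import math

def normal_mass_within(o, d):
    """Probability that a N(0, 1) sample falls in the interval [o - d, o + d]."""
    def cdf(x):
        return 0.5 * (1 + math.erf(x / math.sqrt(2)))
    return cdf(o + d) - cdf(o - d)

mass = normal_mass_within(3.0, 0.1)     # about 0.0009
is_outlier = mass <= 1 - 0.999          # the DB(p, D) condition with p = 0.999
```

Less than 0.1% of the probability mass lies within 0.1 of the point at 3, so the inequality holds for these settings.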


Distance-based methods (DB(k, N))

DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbors are largest

Several advantages
• Ranking for anomalies is more intuitive
• Setting of parameters is generally easier

A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers. Time complexity is $O(n^2 d)$, where $n$ is the database size and $d$ is the dimensionality
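A simple nested loop for this definition might look as follows. The sketch assumes Euclidean distance and more than k points; the function name and the toy data are hypothetical.

```python
import math

def top_n_outliers(points, k, n):
    """Rank points by distance to their k-th nearest neighbor; return the top n indices."""
    scores = []
    for i, p in enumerate(points):
        # Inner loop: distances from p to every other point (n * d work per point).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:n]]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top = top_n_outliers(points, k=2, n=1)
```

Every point is compared with every other point, which is where the quadratic cost in the database size comes from.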


Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far

A large number of non-anomalies can be pruned without carrying out a full data search

Complexity: nearly $O(n)$
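The pruning rule can be added to a nested loop as sketched below. The sketch assumes Euclidean distance and more than k points, and it scans the data in the given order, whereas [3] also randomizes the order. The function name and the toy data are hypothetical.

```python
import heapq
import math

def top_n_outliers_pruned(points, k, n):
    """Nested loop that abandons a point once its running k-NN distance falls
    below the weakest score among the current top n anomalies."""
    top = []          # min-heap of (score, index); top[0] is the weakest anomaly
    cutoff = 0.0
    for i, p in enumerate(points):
        knn = []      # negated distances: -knn[0] is the current k-th nearest distance
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(knn) < k:
                heapq.heappush(knn, -d)
            elif d < -knn[0]:
                heapq.heapreplace(knn, -d)
            # The running k-NN distance can only shrink, so p cannot reach the top n.
            if len(knn) == k and -knn[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        heapq.heappush(top, (-knn[0], i))
        if len(top) > n:
            heapq.heappop(top)
        if len(top) == n:
            cutoff = top[0][0]
    return [i for _, i in sorted(top, reverse=True)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top = top_n_outliers_pruned(points, k=2, n=1)
```

The result matches the plain nested loop; the saving comes from the early `break`, which fires more often once a strong anomaly has raised the cutoff.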


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable for large datasets (nearly $O(n)$ with pruning)

Capable of finding only global anomalies

Can lead to non-intuitive results in Top-k situations

Figure: scatter plot of dimension 1 vs. dimension 2 (both axes from 0 to 70) with labeled points p1, p2, p3, and p4


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours

$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)$

$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$

$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$

LOF indicates a degree of local outlier-ness [6]
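The three formulas translate almost directly into code. The sketch below makes one simplifying assumption: it uses the plain Euclidean distance for dist_k, whereas LOF [6] uses a reachability distance. It also assumes no duplicate points. Function names and data are hypothetical.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of point i."""
    order = sorted((j for j in range(len(points)) if j != i),
                   key=lambda j: math.dist(points[i], points[j]))
    return order[:k]

def density(points, i, k):
    """Inverse of the mean distance from point i to its k nearest neighbors."""
    nbrs = knn(points, i, k)
    return len(nbrs) / sum(math.dist(points[i], points[j]) for j in nbrs)

def anomaly_score(points, i, k):
    """Inverse relative density: mean neighbor density divided by own density."""
    nbrs = knn(points, i, k)
    mean_nbr_density = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    relative_density = density(points, i, k) / mean_nbr_density
    return 1 / relative_density

points = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3)]
score_inlier = anomaly_score(points, 0, k=2)
score_outlier = anomaly_score(points, 4, k=2)
```

A point inside the cluster has the same density as its neighbors and scores about 1, while the isolated point is much sparser than its neighbors and scores well above 1.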


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

Requires a method combining the strengths of distance-based and density-based approaches: a distance-based approach which can capture density

[Figure: two clusters, C1 and C2, with outliers O1 and O2]


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between pointsand the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$

$c_{ij} = V_G (e_i - e_j)^T L^+ (e_i - e_j)$, where $L^+$ is the pseudo-inverse of $L$, $V_G$ is the graph volume, and $e_i$ is the $i$-th column of the identity matrix

Commute time is a squared Euclidean distance in the space spanned by the eigenvectors of $L$

$c_{ij} = V_G [(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$

$V$, $S$: eigenvectors and eigenvalues of $L$
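As a small worked example of the pseudo-inverse formula, the sketch below computes all pairwise commute times for a weighted undirected graph. It assumes a dense symmetric weight matrix `W` and takes the graph volume to be the sum of all degrees; the function name `commute_times` is hypothetical, and the dense pseudo-inverse makes this the cubic-cost variant.

```python
import numpy as np

def commute_times(W):
    """All-pairs commute times c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j).

    Assumption: W is a symmetric, non-negative weight matrix of a
    connected undirected graph; V_G is the sum of all node degrees.
    """
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    L = np.diag(deg) - W               # graph Laplacian
    L_pinv = np.linalg.pinv(L)         # Moore-Penrose pseudo-inverse, O(n^3)
    volume = deg.sum()
    d = np.diag(L_pinv)
    # (e_i - e_j)^T L^+ (e_i - e_j) = L^+_ii + L^+_jj - 2 L^+_ij
    return volume * (d[:, None] + d[None, :] - 2.0 * L_pinv)
```

For the unweighted path graph 0 - 1 - 2 the volume is 4, so the commute time between the two end nodes is 8 and between adjacent nodes is 4.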


Anomaly detection using commute time (CDOF)

Construct the mutual k nearest neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a near-linear time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers; also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections
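To make the random projection idea concrete, here is a minimal sketch of a Gaussian random projection. It is an illustration only and not the PINN implementation: the function name `random_project`, the Gaussian projection matrix and the scaling by $1/\sqrt{\text{target\_dim}}$ are assumptions chosen so that pairwise distances are approximately preserved.

```python
import numpy as np

def random_project(X, target_dim, seed=0):
    """Project the rows of X to target_dim dimensions with a Gaussian matrix.

    Assumption: entries are drawn i.i.d. from N(0, 1 / target_dim), so that
    Euclidean distances are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R
```

A nearest-neighbour index can then be built on the projected data, where it is far cheaper to query than in the original high-dimensional space.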


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

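Under certain assumptions about the intrinsic dimensionality ($c$), with high probability

$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot \text{LOF}(p) \le \widehat{\text{LOF}}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \text{LOF}(p)$

where $\widehat{\text{LOF}}(p)$ denotes the LOF computed in the randomly projected space

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust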


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms

Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilization of device memory, minimizing CPU↔GPU transfer of data, keeping threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA programming and memory model. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2); each thread block contains threads Thread(0,0) to Thread(2,3). The device holds streaming multiprocessors SM 0 to SM 2, each with stream processors SP 0 to SP 7, registers and shared memory, together with global, texture and constant memory; the host has its own host memory.]


Implementing DB(kN) on GPUs

Return the top N data instances whose distances to the k-th nearest neighbor are largest (serial DB(kN) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j into shared memory; the data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
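The block-wise structure described above can be illustrated on the CPU. The sketch below is not CUDA code and makes no claim about the authors' GPU implementation; it only mirrors the idea of processing the pairwise distance matrix one block of rows at a time so that the full $n \times n$ matrix never has to be held at once. The function name `top_n_knn_outliers` and the `block` parameter are hypothetical.

```python
import numpy as np

def top_n_knn_outliers(X, k, n_top, block=256):
    """DB(kN)-style scoring: rank points by distance to their k-th nearest neighbour.

    Assumption: Euclidean distance and k <= n - 1. Rows are processed in
    blocks, mimicking the tiling a GPU kernel would use.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    kth = np.empty(n)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Squared distances from this block of rows to every point.
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * X[start:stop] @ X.T
        np.maximum(d2, 0.0, out=d2)                 # clip tiny negative round-off
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf  # skip self
        kth[start:stop] = np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])
    order = np.argsort(-kth)[:n_top]                # largest k-th NN distance first
    return order, kth[order]
```

On a GPU each block pair would map to a thread block and each distance to a thread; here the matrix product plays that role.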

[Figure: the n × d data matrix, its transpose, and the n × n pairwise distance matrix]


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$)

Choose high confidence rules ($P(Y|X)$)

For a test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$

Flag an anomaly when $Y$ is not observed although $X$ is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute $P(Z)$ for test data record $Z$

Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$

$A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$: for each mutually exclusive pair of attribute sets $A$, $B$, compute $r(a_t, b_t)$

Score $t$ based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets: only consider subsets up to size $k$; ignore subsets with frequency less than a threshold $\alpha$; avoid comparing independent subsets of attributes ($\mu(A, B) \ge \beta_\mu$)


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

$C(a_t)$: number of training instances with $A = a_t$

$N$: total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low arity attributes
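The Laplace-smoothed r-value can be computed directly from counts. The sketch below is a simplified illustration: it assumes that $A$ and $B$ are single attributes rather than attribute sets, that records are plain dictionaries, and that counts are obtained by a linear scan instead of an AD Tree. The function name `r_value` is hypothetical.

```python
def r_value(records, attr_a, attr_b, test_record):
    """Laplace-smoothed r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)).

    Assumption: A and B are single attributes; counts come from a linear
    scan over the training records (an AD Tree would replace this scan).
    """
    n = len(records)
    a_t, b_t = test_record[attr_a], test_record[attr_b]
    c_a = sum(1 for r in records if r[attr_a] == a_t)
    c_b = sum(1 for r in records if r[attr_b] == b_t)
    c_ab = sum(1 for r in records if r[attr_a] == a_t and r[attr_b] == b_t)

    def smoothed(count):
        # E(p) = (C(p) + 1) / (N + 2)
        return (count + 1) / (n + 2)

    return smoothed(c_ab) / (smoothed(c_a) * smoothed(c_b))
```

An r-value well below 1 means that the two values co-occur less often than independence would predict, which is the signal the test looks for.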


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance/density/clustering-based method

Simplest: Hamming distance

$h(A_i, B_i) = 1$ if $A_i \ne B_i$, and $0$ if $A_i = B_i$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$

The standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
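The normalized LCS distance can be sketched with the standard dynamic program mentioned above (the slow quadratic version, not the Hunt-Szymanski method). The function names `lcs_length` and `nlcs_distance` are hypothetical, and both sequences are assumed to be non-empty.

```python
from math import sqrt

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a| * |b|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match on equal symbols, otherwise keep the best so far.
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|); assumes non-empty sequences."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))
```

Unlike the Hamming distance, this measure handles sequences of unequal length, for example two sessions such as `login pwd mail logout` and `login pwd mail web logout`.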


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find its frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequency over all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; one can construct anomalous sequences that will escape detection
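A minimal sketch of the window-based scheme follows. The slide leaves the exact aggregation open, so the score used here, $1 / (1 + \text{mean window frequency})$, is one possible reading and an assumption; the added 1 only avoids division by zero for sequences whose windows were never seen. All function names are hypothetical, and test sequences are assumed to have at least $k$ symbols.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 sliding windows of size k, as hashable tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Training: count every window occurring in the normal sequences."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts

def window_score(seq, dictionary, k):
    """Testing: inverse of the length-normalized aggregate window frequency.

    Assumption: score = 1 / (1 + mean frequency); higher means more anomalous.
    """
    ws = windows(seq, k)
    total = sum(dictionary[w] for w in ws)   # a Counter returns 0 for unseen windows
    return 1.0 / (1.0 + total / len(ws))
```

A sequence made only of unseen windows gets the maximum score of 1, and the score decreases as its windows become more frequent in the normal dictionary.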


Using Probabilistic Models

Probability of occurrence of sequence $S$

$P(S) = \prod_{i=1}^{n} P(S_i | S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i | S_1 \ldots S_{i-1}) = P(S_i | S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i | S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

$f$ is estimated from the normal dictionary

The anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

Significantly reduces the size of the model
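The fixed-order version of this model can be sketched as follows. This is a plain order-$k$ Markov model, not a Probabilistic Suffix Tree: instead of backing off to a shorter suffix, it uses add-one smoothing for rare contexts, which is an assumption of this sketch. The score is the average negative log probability per symbol, a monotone transform of the inverse normalized probability. Function names are hypothetical, and test sequences must be longer than $k$.

```python
from collections import Counter
from math import log

def train_counts(normal_seqs, k):
    """Count length-k contexts and the length-(k+1) windows that extend them."""
    ctx, full = Counter(), Counter()
    for s in normal_seqs:
        for i in range(k, len(s)):
            ctx[tuple(s[i - k:i])] += 1
            full[tuple(s[i - k:i + 1])] += 1
    return ctx, full

def markov_score(seq, ctx, full, k, alphabet_size):
    """Average negative log of P(S_i | S_{i-k} ... S_{i-1}) over the sequence.

    Assumption: add-one smoothing replaces the suffix back-off of [32].
    """
    nll = 0.0
    steps = 0
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        window = tuple(seq[i - k:i + 1])
        p = (full[window] + 1) / (ctx[context] + alphabet_size)
        nll -= log(p)
        steps += 1
    return nll / steps
```

Higher scores indicate sequences that the normal model finds less probable.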


Finding Contextual Anomalies

Sometimes contextual information is available about data

Not used directly as a feature

Is well understood; no anomalies in the context

Can reduce false positives and yield interesting anomalies

[Figure: example adapted from [30]. Number of fever cases plotted against maximum daily temperature, with the clearest outlier and a conditional anomaly marked]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude), graph context (edges, weights), temporal location, domain specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context), or explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

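Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussian components $U_1$, $U_2$, $U_3$ over the environmental attributes ($u_1$, $u_2$) and $V_1$, $V_2$, $V_3$ over the indicator attributes ($v_1$, $v_2$), with example mapping probabilities $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j | U_i)$

4. Score a test instance $d = [u, v]$

$S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j) \, p(V_j | U_i)$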
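Step 4 can be illustrated in isolation. The sketch below assumes that the Gaussians and the mapping $p(V_j | U_i)$ have already been learned and are passed in as plain Python structures; it covers only the scoring formula, not the learning procedure of [30]. The names `gaussian_pdf` and `conditional_score` and the argument layout are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at x (scalars are treated as 1-D)."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** x.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    Assumption: U and V are lists of (mean, cov) pairs and mapping[i][j]
    holds p(V_j | U_i). A low S suggests a conditional anomaly.
    """
    score = 0.0
    for i, (mean_u, cov_u) in enumerate(U):
        inner = sum(
            gaussian_pdf(v, mean_v, cov_v) * mapping[i][j]
            for j, (mean_v, cov_v) in enumerate(V)
        )
        score += gaussian_pdf(u, mean_u, cov_u) * inner
    return score
```

With one environmental component mapped entirely to the first indicator component, an indicator value far from that component receives a score close to zero.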


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

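Find anomalous groups in data [34]

Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$

2. Draw the topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \text{Dir}(\alpha_{y_m})$

3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} | \eta_k)$, $k = 1, 2, \ldots, K$

4. For each point in the group

4.1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$

4.2. Generate point $x_{mn} \sim P(x_{mn} | \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM over $\pi$, $\alpha$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$ and $x_{mn}$, with plates labelled $N$, $M$, $K$ and $T$]

Model Parameters

$\mathcal{M}(\pi)$: Multinomial

Genre: $\text{Dir}(\alpha_t)$

Topic generators $P(\cdot | \eta_k)$: GIW

Point generators $P(x_{mn} | \beta_{m,k})$: Multivariate Gaussian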


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

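Labeled validation data set exists: confusion matrix, traditional evaluation metrics

Class imbalance

ROC Curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate

Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$\text{Acc} = \frac{TP + TN}{\sum}$, where $\sum$ is the total of all four cells

$\text{Rec}\ (R) = \frac{TP}{TP + FN}$

$\text{Prec}\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$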
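A short worked example shows why accuracy alone is misleading under class imbalance. The helper below simply evaluates the four formulas; its name `detection_metrics` and the convention of returning 0 for an undefined ratio are assumptions of this sketch.

```python
def detection_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion matrix counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"acc": acc, "rec": rec, "prec": prec, "f": f}

# 10 true anomalies among 1000 instances; the detector raises 20 alarms, 5 correct.
print(detection_metrics(tp=5, fn=5, fp=15, tn=975))
```

With these counts the accuracy is 0.98 even though only half of the anomalies are found (recall 0.5) and three out of four alarms are false (precision 0.25).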


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use a threshold $\delta$ on the score (scale issues [21]), or take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output: does the output "suggest" $x$ or $\delta$? Which output is better?


Unifying Scores

Different methods assign scores in different ranges

kNN-based scores: [0, 1]; LOF scores: [1, 6]; ABOD scores: [0, 80000] (for ABOD, anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if $o$ is normal and $S(o) \gg 0$ if $o$ is anomalous

Normality: $S$ is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - \text{base}_S\}$

2. $R(o) = S_{\max} - S(o)$

3. $R(o) = -\log \frac{S(o)}{S_{\max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{\max}}$

2. $N(o) = \max\left(0, \text{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)

Suited for high dimensional data

3. $N(o) = \max\left(0, \frac{\text{cdf}^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling)

Where $\text{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function

Suited for low dimensional data
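Gaussian scaling is straightforward to reproduce with the standard library. The sketch below assumes that $\mu_S$ and $\sigma_S$ are estimated as the mean and population standard deviation of the scores being normalized, and that the scores are already regular (larger means more anomalous) and not all identical. The function name `gaussian_scaling` is hypothetical.

```python
from math import erf, sqrt
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2)))).

    Assumption: mu_S and sigma_S are estimated from the given scores, which
    must not all be identical (sigma_S would be zero).
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]

# LOF-like scores: four inliers near 1 and one clear outlier.
print(gaussian_scaling([1, 1, 1, 1, 6]))
```

Scores at or below the mean are mapped to 0, while the outlier, two standard deviations above the mean, is mapped to roughly 0.95.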


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27], multivariate categorical data [5], discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given a data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ plotted against $x$ and $y$, with the injected anomaly marked]
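The perturbation steps can be sketched as follows. The function name `inject_anomaly`, the use of Euclidean distance on the $y$ partition and the array layout (one row per data point, `X` and `Y` holding the two partitions) are assumptions of this sketch rather than details taken from [30].

```python
import numpy as np

def inject_anomaly(X, Y, idx, sample_size, seed=0):
    """Build an injected anomaly from data point z = {x, y} at row idx.

    Assumption: distances between y partitions are Euclidean. A random
    sample D is drawn, z' is the sampled point whose y' is farthest from y,
    and the returned point pairs x' with the original y.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    sample = rng.choice(len(X), size=sample_size, replace=False)
    dists = np.linalg.norm(Y[sample] - Y[idx], axis=1)
    far = sample[np.argmax(dists)]          # index of z' within the full data set
    return np.concatenate([X[far], Y[idx]])
```

The resulting point has plausible values in each partition on its own, but the combination is unusual, which is what makes it useful for testing contextual detectors.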


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available? Feature identification, defining normal and anomalous behavior, tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that correspond to a predefined set of patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 to June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connecting nodes 1 to 4, with link measurement $y$ and OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$ and $x_{4t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components with the highest eigenvalues as matrix $P$

Step 3: Normal traffic subspace, called $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
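The five steps can be sketched with NumPy as below. The sketch assumes one observation per row (for example one time bin per row and one link per column), so the projection is written as a right-multiplication by the symmetric matrix $I - P P^T$; it also centers the data first. The function name `subspace_residual_norms` is hypothetical, and choosing the threshold on the returned norms is left to the caller.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Norm of each observation's projection onto the abnormal subspace.

    Assumption: rows of Z are observations and columns are features; the
    data are mean-centered before the eigen-decomposition.
    """
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=0)
    cov = np.cov(Zc, rowvar=False)                       # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]        # Step 2: top-k principal components
    C = P @ P.T                                          # Step 3: normal-subspace projector
    residual = Zc @ (np.eye(Z.shape[1]) - C)             # Step 4: abnormal-subspace part
    return np.linalg.norm(residual, axis=1)              # Step 5: large norm => anomaly
```

Observations that follow the dominant correlation structure have a small residual norm, while an observation that breaks it stands out.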


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes quite a lot

Thus anomalies can be detected by setting some threshold

Figure source [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source & destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack – multiple source IP addresses concentrated on a single destination IP address

» Network scan – dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source [22]

Port scan anomaly (traffic feature changes; however, traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure annotations: Port scan dwarfed in volume metrics... But stands out in feature entropy, which also reveals its structure
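The two boundary values of the entropy can be checked with a small sketch. The function name `feature_entropy` and the port values are hypothetical; the formula is the usual sample entropy of the empirical distribution of a feature.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Sample entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination-port observations in one time bin
concentrated = ["80"] * 8                    # all traffic goes to one port
dispersed = [str(p) for p in range(8)]       # eight distinct ports, as in a port scan

h_low = feature_entropy(concentrated)        # maximally concentrated
h_high = feature_entropy(dispersed)          # maximally dispersed over N = 8 values
```

Both bins contain the same number of packets, so a volume metric cannot separate them, while the entropy moves from its minimum to its maximum.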


Entropy Based versus Volume Based

DoS/DDoS attacks - a spike in traffic data toward a dominant destination IP

Scan anomaly - a spike in traffic data from a dominant source IP

Flash crowd anomaly - again a spike in traffic data to a dominant destination IP

Worm anomaly - a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Example domains and their players
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
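A minimal sketch of the activity-monitoring idea: a profile is summarized by the mean and standard deviation of past behavior, and current behavior is scored by its deviation from that profile. The z-score, the cutoff of 3.0, and the call counts are assumptions for illustration; [13] does not prescribe this particular statistic.

```python
from statistics import mean, stdev

def deviation_score(history, current):
    """Z-score of the current value against the historical profile."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma

# Hypothetical profile: calls made by user X in past Januaries
january_calls = [42, 38, 45, 40, 41]

score_normal = deviation_score(january_calls, 43)    # close to the profile
score_spike = deviation_score(january_calls, 120)    # far from the profile
flagged = abs(score_spike) > 3.0                     # assumed cutoff
```

Scoring is a constant-time lookup per user, which matches the "fast" strength above, but one profile per user must be stored and refreshed, which matches the first weakness.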


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
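The first three egonet features can be computed directly from an edge list. The sketch below is an assumption-laden illustration: the edge representation (a dict keyed by `frozenset` pairs), the function name `egonet_features`, and the toy graph are invented for the example, and the eigenvalue feature is left out to keep the code dependency-free.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for `node`.

    edges: dict mapping frozenset({u, v}) -> weight for an undirected graph.
    """
    neighbors = {v for e in edges if node in e for v in e if v != node}
    ego = neighbors | {node}                              # node plus its neighbors
    ego_edges = {e: w for e, w in edges.items() if e <= ego}   # induced sub-graph
    return len(neighbors), len(ego_edges), sum(ego_edges.values())

# Hypothetical weighted graph
edges = {
    frozenset({"a", "b"}): 1.0,
    frozenset({"a", "c"}): 2.0,
    frozenset({"b", "c"}): 0.5,
    frozenset({"c", "d"}): 4.0,
}
N_a, E_a, W_a = egonet_features(edges, "a")
```

The edge between `c` and `d` is excluded for node `a` because `d` is not one of its neighbors.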


Identifying Anomalies

Traditional anomaly detection (LOF)
• Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
• Identify relevant feature pairs and their power law relationship
• E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ - detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$)

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
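The score can be evaluated directly once the power law parameters are known. In this sketch the fitted values `C` and `theta` are simply assumed (in practice they would come from a fit over all nodes), the natural logarithm is an assumption since the slide does not state the base, and both `y` and the fitted value are assumed to be positive.

```python
import math

def power_law_score(x, y, C, theta):
    """Distance of (x, y) from the fitted power law y = C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)          # multiplicative deviation
    return ratio * math.log(abs(y - expected) + 1)       # damped absolute deviation

# Hypothetical fit E = 1.0 * N**1.0, i.e. star-like egonets lie on the line
on_line = power_law_score(x=10, y=10.0, C=1.0, theta=1.0)

# A node with 10 neighbors whose egonet is a full clique has 55 edges
near_clique = power_law_score(x=10, y=55.0, C=1.0, theta=1.0)
```

A node exactly on the fitted line scores 0; the clique-like egonet scores far higher because both the ratio and the absolute gap are large.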


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
• RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
• Normality Score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges E
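The relevance and normality computations can be sketched on a toy graph. Several choices here are assumptions rather than details from the slide: the random walk is modeled as a random walk with restart solved by power iteration, the restart probability is 0.15, the similarity between two nodes is taken to be the relevance score itself, and the graph and function names are hypothetical. Every node is assumed to have at least one neighbor.

```python
def relevance(adj, start, restart=0.15, iters=100):
    """Visit probabilities of a random walk with restart from `start`."""
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v in adj:
            share = (1 - restart) * p[v] / len(adj[v])   # spread mass over neighbors
            for u in adj[v]:
                nxt[u] += share
        nxt[start] += restart                            # jump back to the query node
        p = nxt
    return p

def normality(adj, t):
    """Mean pairwise relevance among the V1 nodes linked to t (off-diagonal only)."""
    S = adj[t]
    rel = {a: relevance(adj, a) for a in S}
    pairs = [rel[a][b] for a in S for b in S if a != b]
    return sum(pairs) / len(pairs)

# Hypothetical bipartite graph: a1 and a2 share t1 and t2; a3 mostly uses t3
adj = {
    "a1": ["t1", "t2", "t4"], "a2": ["t1", "t2"], "a3": ["t3", "t4"],
    "t1": ["a1", "a2"], "t2": ["a1", "a2"], "t3": ["a3"], "t4": ["a1", "a3"],
}
score_t1 = normality(adj, "t1")   # linked by two closely related nodes
score_t4 = normality(adj, "t4")   # linked by two weakly related nodes
```

Node `t4` gets the lower normality score because `a1` and `a3` are only weakly relevant to each other, which is the kind of link the method is meant to surface.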


Detecting Disease Outbreaks

Domain Question
• Early detection of disease outbreaks
• Anthrax attack

Data
• Emergency department visits
• Grocery data (Example [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data
  • ED cases involving cough ⇒ Flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | Current Environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for rules using a $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one). Repeat Step 2
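The rule-scoring step can be illustrated with a Pearson $\chi^2$ test on a single 2x2 contingency table. This is a sketch under stated assumptions: the table layout (rows are current vs. baseline records, columns are records that match vs. do not match the rule), the counts, and the function name are hypothetical, and all expected cell counts are assumed to be non-zero. For one degree of freedom the $\chi^2$ tail probability equals `erfc(sqrt(stat / 2))`, which avoids any external library.

```python
import math

def chi2_p_value(table):
    """Pearson chi-square test for a 2x2 table [[a, b], [c, d]], 1 degree of freedom."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n        # count expected under independence
            stat += (observed - expected) ** 2 / expected
    return math.erfc(math.sqrt(stat / 2))           # chi-square tail probability, 1 dof

# Hypothetical rule "symptom = cough": [matching, not matching] per dataset
p_same = chi2_p_value([[50, 950], [50, 950]])       # current looks like baseline
p_shift = chi2_p_value([[120, 880], [50, 950]])     # current has many more matches
```

A very small p-value, as for `p_shift`, means the rule's frequency today is hard to explain by the baseline, so that rule is the candidate to report.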


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events - droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI)
  • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
• Mean of all data for location i
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
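One of the listed choices, the monthly mean base, can be sketched as follows. The function name `monthly_anomalies` and the two-year series are hypothetical, and the series is assumed to start in January and to cover at least one full year.

```python
from statistics import mean

def monthly_anomalies(values):
    """Subtract from each monthly value the mean of the same calendar month."""
    base = [mean(values[m::12]) for m in range(12)]       # b_i per calendar month
    return [v - base[i % 12] for i, v in enumerate(values)]

# Hypothetical two-year series: a repeating seasonal cycle with one +2.0 excursion
seasonal = [10, 12, 15, 18, 22, 25, 27, 26, 22, 18, 14, 11]
series = [float(s) for s in seasonal * 2]
series[18] += 2.0                                         # July of the second year

anoms = monthly_anomalies(series)
```

The seasonal cycle disappears from `anoms`, and only the two July values are non-zero. Note that the excursion also shifts the July base, so it shows up as +1.0 in one year and -1.0 in the other; this sensitivity of the result to the base is one reason the choice of $b_i$ matters.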


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel / bwlabeln

Follow-up science question: Explain cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
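The two-step approach can be sketched without any image-processing library. The grid values, the threshold of 10, and the use of 6-connectivity (neighbors differ by one step in time, row, or column) are assumptions for the example; Matlab's bwlabeln may use a different default connectivity.

```python
from collections import deque

def connected_components(cells):
    """Group (t, row, col) cells into 6-connected components, largest first."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for start in cells:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:                                   # breadth-first flood fill
            t, r, c = queue.popleft()
            comp.append((t, r, c))
            for dt, dr, dc in steps:
                nb = (t + dt, r + dr, c + dc)
                if nb in cells and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return sorted(components, key=len, reverse=True)

# Hypothetical precipitation values indexed by (time, row, col)
precip = {
    (0, 0, 0): 5, (1, 0, 0): 4, (1, 0, 1): 6, (2, 0, 1): 3,
    (0, 2, 2): 2, (1, 2, 2): 50, (2, 2, 2): 60,
}
dry = {cell for cell, value in precip.items() if value < 10}   # step 1: threshold
comps = connected_components(dry)                              # step 2: components
```

The largest component spans three time steps and two locations, which is the kind of persistent pattern the slide describes; the isolated dry cell forms a component of size one and would be discarded.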


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example src: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology
1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Distance-based methods (DB(k, N))

DB(k, N) anomaly [28]: the top N data instances whose distances to their k-th nearest neighbor are largest

Several advantages
• Ranking for anomalies is more intuitive
• Setting of parameters is generally easier

A Simple Nested Loop (SNL) algorithm can be used to select the top N DB(k, N) outliers
• Time complexity is $O(n^2 d)$, where n is the database size and d is the dimensionality
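A minimal sketch of the nested-loop idea: for every instance, compute the distance to its k-th nearest neighbor, then keep the N largest. The function names and the toy data are hypothetical, Euclidean distance is assumed, and the dataset is assumed to contain more than k points.

```python
import math

def knn_distance(data, i, k):
    """Distance from data[i] to its k-th nearest neighbor (inner loop over all points)."""
    dists = sorted(math.dist(data[i], q) for j, q in enumerate(data) if j != i)
    return dists[k - 1]

def top_n_outliers(data, k, n_top):
    """Return the n_top (score, index) pairs with the largest k-NN distance."""
    scores = [(knn_distance(data, i, k), i) for i in range(len(data))]   # outer loop
    return sorted(scores, reverse=True)[:n_top]

# Hypothetical 2-D data: a tight cluster and one far-away point
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top = top_n_outliers(data, k=2, n_top=1)
```

Each of the n points is compared against all other points in d dimensions, which is where the quadratic cost comes from.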


Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far

A large number of non-anomalies can be pruned without carrying out a full data search

Complexity: nearly O(n)
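The pruning rule can be added to a nested loop as sketched below. This is an illustrative version, not the algorithm of [3] in full: the randomized data ordering and block processing described there are omitted, the function name and data are hypothetical, and the dataset is assumed to contain more than k points.

```python
import heapq
import math

def top_n_with_pruning(data, k, n_top):
    """Top-n_top k-NN-distance outliers, abandoning points that cannot qualify."""
    top = []        # min-heap of (score, index); top[0] is the weakest anomaly so far
    cutoff = 0.0    # score of the weakest anomaly once n_top candidates exist
    for i, p in enumerate(data):
        neighbors = []          # negated max-heap of the k smallest distances so far
        pruned = False
        for j, q in enumerate(data):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(neighbors) < k:
                heapq.heappush(neighbors, -d)
            elif d < -neighbors[0]:
                heapq.heapreplace(neighbors, -d)
            # The current k-th NN distance only shrinks, so it bounds the final score
            if len(neighbors) == k and -neighbors[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -neighbors[0]
        if len(top) < n_top:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n_top:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

# Hypothetical data: a cluster plus one isolated point
data = [(10, 10), (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
pruned_top = top_n_with_pruning(data, k=2, n_top=1)
```

Once the isolated point sets a high cutoff, every cluster point is dropped after seeing only two close neighbors, which is why most of the inner loop is skipped in practice.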


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable for large datasets (nearly O(n) with pruning)

Capable of finding only global anomalies

Can lead to non-intuitive results in Top-k situations

Figure: scatter plot of dimension 1 vs. dimension 2 (both axes from 0 to 70) with labeled points p1, p2, p3, and p4


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours

$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)$

$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$

$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$

LOF indicates a degree of local outlier-ness [6]
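The three formulas translate almost line by line into code. This sketch makes simplifying assumptions: $\text{dist}_k(p, q)$ is taken to be the plain Euclidean distance (LOF in [6] uses a reachability distance instead), all points are assumed to be distinct so no density is infinite, and the function names and data are hypothetical.

```python
import math

def k_nearest(data, i, k):
    """The k nearest neighbors of data[i] as (distance, index) pairs."""
    order = sorted((math.dist(data[i], q), j) for j, q in enumerate(data) if j != i)
    return order[:k]

def density(data, i, k):
    """Inverse of the mean distance to the k nearest neighbors."""
    nn = k_nearest(data, i, k)
    return 1.0 / (sum(d for d, _ in nn) / len(nn))

def anomaly_score(data, i, k):
    """1 / relative-density: mean neighbor density divided by own density."""
    nn = k_nearest(data, i, k)
    mean_nb_density = sum(density(data, j, k) for _, j in nn) / len(nn)
    relative = density(data, i, k) / mean_nb_density
    return 1.0 / relative

# Hypothetical data: a unit square of points and one point away from it
data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [anomaly_score(data, i, k=2) for i in range(len(data))]
```

Points inside the cluster score close to 1 because their density matches that of their neighbors, while the isolated point scores well above 1.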


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

Require a method combining the strengths of distance and density based approaches: a distance based approach which can capture density

Figure: two clusters C1 and C2 with outliers O1 and O2


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix L

$$c_{ij} = V_G \,(e_i - e_j)^T L^{+} (e_i - e_j)$$

L^+: pseudo-inverse of L; V_G: graph volume; e_i: i-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of L

$$c_{ij} = V_G \left[(S^{-1/2} V^T)(e_i - e_j)\right]^T \left[(S^{-1/2} V^T)(e_i - e_j)\right]$$

V, S: eigenvectors and eigenvalues of L
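As a small illustration of the pseudo-inverse formula, the sketch below computes all-pairs commute times for a tiny graph. It assumes a connected, undirected graph given as a symmetric weight matrix W and takes the graph volume to be the sum of all degrees; the function name `commute_times` is hypothetical, and a dense pseudo-inverse is only practical for small n.

```python
import numpy as np

def commute_times(W):
    """All-pairs commute times from a symmetric, non-negative weight matrix W."""
    W = np.asarray(W, dtype=float)
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W          # graph Laplacian
    L_plus = np.linalg.pinv(L)        # Moore-Penrose pseudo-inverse of L
    volume = degrees.sum()            # V_G, assumed here to be the sum of degrees
    d = np.diag(L_plus)
    # (e_i - e_j)^T L+ (e_i - e_j) = L+_ii + L+_jj - 2 L+_ij
    return volume * (d[:, None] + d[None, :] - 2.0 * L_plus)

# Path graph 0 - 1 - 2 with unit weights.
C = commute_times([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
```

For two nodes joined by a single edge the result is 2 (one step there, one step back), which matches the random-walk definition of commute time.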


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time with the pruning rule

Complexity: O(n^3)

The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear-time solver to build a structure from which the commute time between two nodes can be computed in O(log n) time

Complexity of CDOF: O(n^3) → O(n log n)

Uses a near-linear-time solver for a linear system of equations Ax = b

Spielman and Teng solvers. Also see work by Iannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from O(n^2) to nearly O(n log n) by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

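Under certain assumptions about intrinsic dimensionality (c), with high probability

$$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ denotes the LOF value estimated by PINN

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust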


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms: using CUDA, the Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilize device memory, minimize CPU↔GPU transfer of data, keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution and memory model. The host launches a kernel on the device as a grid of thread blocks (Block(0,0) to Block(1,2)), each block holding threads (Thread(0,0) to Thread(2,3)). The device has streaming multiprocessors (SM 0 to SM 2), each with registers, stream processors (SP 0 to SP 7), and shared memory, plus global, texture, and constant memory; the host has host memory.]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is O(n^2))

Involves computing pairwise distances

Load block i and block j into shared memory; the data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

[Figure: diagram of the pairwise distance computation, with dimensions labeled n and d]
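For reference, the serial DB(k, N) computation that the slide maps onto GPU thread blocks can be sketched on the CPU with NumPy. This is not CUDA code and not the authors' implementation; the function name `top_n_knn_outliers` is hypothetical and the sketch assumes k < n. The n × n distance matrix built here is the part that a GPU version would compute block by block.

```python
import numpy as np

def top_n_knn_outliers(X, k, n_top):
    """Return indices and scores of the n_top points with the largest k-th NN distance."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    # Squared pairwise distances via the Gram matrix: |a|^2 + |b|^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.maximum(d2, 0.0, out=d2)        # clip tiny negatives from round-off
    np.fill_diagonal(d2, np.inf)       # exclude each point from its own neighbours
    # k-th smallest entry in every row = squared distance to the k-th nearest neighbour
    kth = np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])
    order = np.argsort(-kth)[:n_top]   # largest k-th NN distance first
    return order, kth[order]

X = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
idx, dist = top_n_knn_outliers(X, k=2, n_top=1)
```

Using `np.partition` instead of a full sort keeps the per-row work linear, which mirrors the observation that only the k-th neighbour distance is needed.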


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules (X ⇒ Y)

Choose high-confidence rules (P(Y|X))

For a test record Z = ⟨X, Y⟩, find rules of the form P(Y|X)

Flag an anomaly when Y is not observed although X is observed

Using Bayesian Networks [33]

Learn the Bayesian network structure and parameters

Compute P(Z) for test data record Z

Flag an anomaly if P(Z) < δ

Using Similarity Metrics [10]

Use a similarity measure (S(X1, X2))

Apply a distance/density/clustering based method (e.g. LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if r(a_t, b_t) is low and (a_t, b_t) is observed in test record t, then t is anomalous

For a test record t: for each mutually exclusive pair of attribute sets A, B, compute r(a_t, b_t)

Score t based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets:

Only consider subsets up to size k

Ignore subsets with frequency less than a threshold α

Avoid comparing independent subsets of attributes: μ(A, B) ≥ β_μ


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

C(a_t): number of training instances with A = a_t

N: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute value (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low-arity attributes
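The Laplace-smoothed r-value can be sketched with plain counting. This is a minimal illustration under stated assumptions: records are Python dicts, attribute sets are tuples of keys, counts are taken with a `Counter` rather than an AD Tree, and the function name `laplace_r_value` is hypothetical.

```python
from collections import Counter

def laplace_r_value(records, attrs_a, attrs_b, test_record):
    """r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)) with E(p) = (C(p) + 1) / (N + 2)."""
    n = len(records)

    def project(rec, attrs):
        return tuple(rec[a] for a in attrs)

    count_a = Counter(project(r, attrs_a) for r in records)
    count_b = Counter(project(r, attrs_b) for r in records)
    count_ab = Counter(project(r, attrs_a + attrs_b) for r in records)

    def e(count):
        return (count + 1) / (n + 2)   # Laplace-smoothed probability estimate

    a_t = project(test_record, attrs_a)
    b_t = project(test_record, attrs_b)
    return e(count_ab[a_t + b_t]) / (e(count_a[a_t]) * e(count_b[b_t]))

# Toy training data in the spirit of the mushroom table.
train = ([{"habitat": "urban", "type": "poisonous"}] * 4
         + [{"habitat": "grasses", "type": "edible"}] * 4)
rare = laplace_r_value(train, ("habitat",), ("type",),
                       {"habitat": "urban", "type": "edible"})
common = laplace_r_value(train, ("habitat",), ("type",),
                         {"habitat": "urban", "type": "poisonous"})
```

The unseen combination (urban, edible) receives a low r-value, while the frequently co-occurring pair receives a value above 1. Smoothing keeps the ratio defined even when a count is zero.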


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure

Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

Standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
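Both distance measures are short enough to sketch directly. The code below uses the standard dynamic programming recurrence for the LCS length (the slow variant mentioned on the slide, not Hunt-Szymanski); the function names are hypothetical and the sequences are assumed to be non-empty.

```python
from math import sqrt

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a||b|) dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d = nlcs_distance(s1, s2)
```

Unlike the Hamming distance, the normalized LCS distance accepts sequences of different lengths, which is why it suits the login-session examples above.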


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence (n − k + 1 windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find the frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; one can construct anomalous sequences that will escape detection
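A minimal sketch of the dictionary-based variant follows. The aggregation is an assumption: the score is taken as the inverse of the mean relative window frequency, which is one plausible reading of "aggregate frequencies normalized by length". All function names are hypothetical, and test sequences are assumed to have at least k symbols.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Training: count every window seen in the normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def window_anomaly_score(seq, dictionary, k):
    """Testing: inverse of the mean relative frequency of the sequence's windows."""
    wins = windows(seq, k)
    total = sum(dictionary.values())
    mean_freq = sum(dictionary[w] / total for w in wins) / len(wins)
    return 1.0 / mean_freq if mean_freq > 0 else float("inf")

normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
    "login pwd web mail ssh web mail logout",
)]
dictionary = build_dictionary(normal, k=3)
suspicious = "login pwd login pwd login pwd logout".split()
score = window_anomaly_score(suspicious, dictionary, k=3)
```

The repeated login/pwd sequence shares no length-3 window with the normal sessions, so its score is infinite under this scoring choice, while any training sequence scores finitely.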


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol S_i

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

Significantly reduces the size of the model


Finding Contextual Anomalies

Sometimes contextual information is available about the data: it is not used directly as a feature; it is well understood, with no anomalies in the context; it can reduce false positives and yield interesting anomalies

[Figure: number of fever cases vs. maximum daily temperature, with the clearest outlier and a conditional anomaly marked. Example adapted from [30]]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude), graph context (edges, weights), temporal location, domain specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context), or explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

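Data instance d ⇒ (u_1, u_2, …, u_{d_U}, v_1, v_2, …, v_{d_V})

d_U environmental attributes

d_V indicator attributes

[Figure: Gaussian components U1, U2, U3 over the environmental attributes (u1, u2) mapped to components V1, V2, V3 over the indicator attributes (v1, v2), with p(V1|U3) = 0.2, p(V2|U3) = 0.2, p(V3|U3) = 0.6]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) U = {U_1, U_2, …, U_{n_U}}, each with dimensionality d_U

2. Learn a set of Gaussians V = {V_1, V_2, …, V_{n_V}}, each with dimensionality d_V

3. Learn a probabilistic mapping function p(V_j|U_i)

4. Score a test instance d = [u, v]

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$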


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

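Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group m:

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$: $y_m \sim \mathcal{M}(\pi)$

2. Draw a topic distribution for $y_m$, $\theta_m \in S^K$: $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$

3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$

4. For each point n in the group:

4.1 Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$

4.2 Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables π, y_m, α, θ_m, z_mn, x_mn, β, η and plates N, M, K, T]

Model Parameters

M(π): Multinomial

Genre: Dir(α_t)

Topic generators P(·|η_k): GIW

Point generators P(x_mn|β_nk): Multivariate Gaussian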


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: plot with axes labeled Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution θ_m

Compute the negative log likelihood w.r.t. α_t

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix, traditional evaluation metrics (class imbalance), ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate (pseudo false negative estimation techniques [25])

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\mathrm{Rec}\ (R) = \frac{TP}{TP + FN}$$

$$\mathrm{Prec}\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
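The effect of class imbalance on these metrics is easy to see with a small worked example. The helper below simply evaluates the four formulas; the function name `confusion_metrics` and the counts are made up for illustration, and the sketch assumes the denominators are non-zero.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * recall * precision / (recall + precision),
    }

# 10 true anomalies among 1000 instances; the detector finds 5 and raises 10 false alarms.
m = confusion_metrics(tp=5, fn=5, fp=10, tn=980)
```

Here accuracy is 0.985 even though only half of the anomalies are found and two thirds of the alarms are false (F is 0.4), which is why accuracy alone is a poor guide for rare-class problems.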


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use a threshold δ on the score (scale issues [21]), or take the top x as anomalies

ROC curve by varying x or δ

Quality of output: does the output "suggest" x or δ? Which output is better?


Unifying Scores

Different methods assign scores in different ranges

kNN-based scores: [0, 1]; LOF scores: [1, 6]; ABOD scores: [0, 80000]

Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity ⇒ S(o) ≥ 0 ∀o, S(o) ≈ 0 if o is normal, and S(o) ≫ 0 if o is anomalous

Normality: S is regular and S(o) ∈ [0, 1] ∀o


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$

2. $R(o) = S_{\max} - S(o)$

3. $R(o) = -\log\left(S(o) / S_{\max}\right)$

Normalization

1. $N(o) = S(o) / S_{\max}$

2. $N(o) = \max\left(0,\ \operatorname{erf}\left(\dfrac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling); suited for high dimensional data

3. $N(o) = \max\left(0,\ \dfrac{\mathrm{cdf}^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling), where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function; suited for low dimensional data
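Gaussian scaling needs only the error function from the standard library. The sketch below is an assumption-laden illustration: it estimates μ_S and σ_S from the scores themselves using the population standard deviation, assumes the scores are not all identical, and uses the hypothetical name `gaussian_scaling`.

```python
from math import erf, sqrt
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = mean(scores)
    sigma = pstdev(scores)   # population standard deviation (an assumption)
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]

# LOF-like raw scores: four ordinary values and one clear outlier.
normalized = gaussian_scaling([1.0, 1.1, 0.9, 1.0, 6.0])
```

Scores at or below the mean map to 0, and only scores well above the mean approach 1, so the interesting upper range is stretched while the normal range is collapsed.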


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27], multivariate categorical data [5], discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies: Random Perturbation [30]

Given a data point z = {x, y}, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let z′ = {x′, y′} ∈ D such that the distance between y and y′ is maximum

Replace x with x′ and add z back to the data set

[Figure: points z and z′ in the (x, y) plane, with the injected anomaly marked]
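The perturbation procedure can be sketched as follows. This is an illustrative reading of the steps, not the implementation from [30]: the feature partition is given as column index lists, Euclidean distance is assumed for comparing y and y′, and the name `inject_anomaly` and its parameters are hypothetical.

```python
import numpy as np

def inject_anomaly(data, z_index, x_cols, y_cols, sample_size, rng):
    """Return a perturbed copy of data[z_index] built by swapping in x' from z'."""
    data = np.asarray(data, dtype=float)
    z = data[z_index].copy()
    # Random sample D of the data set (without replacement).
    sample = data[rng.choice(len(data), size=sample_size, replace=False)]
    # z' is the sampled point whose y-part is farthest from z's y-part.
    gaps = np.linalg.norm(sample[:, y_cols] - z[y_cols], axis=1)
    z_prime = sample[np.argmax(gaps)]
    z[x_cols] = z_prime[x_cols]      # replace x with x'; y stays untouched
    return z

rng = np.random.default_rng(0)
data = [[0, 0], [1, 1], [2, 2], [10, 10]]
anomaly = inject_anomaly(data, z_index=0, x_cols=[0], y_cols=[1],
                         sample_size=4, rng=rng)
```

The returned point keeps its original y but takes x from a point in a very different region, so it violates the usual relationship between x and y. The caller adds it back to the data set as a labeled anomaly.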


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available: feature identification, defining normal and anomalous behavior, tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature; they also profile the normal behavior of the network and address suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g. Distributed Denial of Service, DDoS) or unusual network traffic (e.g. flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart, vertical axis 0 to 3,000,000. Values by period: 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 to June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connected to nodes 1 to 4, with link measurement y_{1t} and OD flows x_{1t}, x_{2t}, x_{3t}, x_{4t}]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g. x_{1t})

A link measurement is the traffic that enters a node during an interval (e.g. y_{1t})

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, volume arising from an OD flow may not be noticeable. Thus the naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
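The five steps translate almost line for line into NumPy. The sketch below is a minimal illustration on synthetic data: it assumes Z stores one measurement vector per column, centers each row before the decomposition (a step the slide does not spell out), and uses the hypothetical name `subspace_residual_norms`.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Norm of each column's projection onto the abnormal (residual) subspace."""
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=1, keepdims=True)         # center each feature (assumption)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Zc))  # Step 1: eigen-decomposition
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # Step 2: top-k principal components
    Z_hat = P @ P.T @ Zc                           # Step 3: normal subspace
    Z_tilde = Zc - Z_hat                           # Step 4: (I - P P^T) Zc
    return np.linalg.norm(Z_tilde, axis=0)         # Step 5: residual norm per column

# Synthetic traffic: three links that move together, plus one disturbance at t = 25.
t = np.linspace(-5, 5, 50)
Z = np.outer([1.0, 1.0, 1.0], t)
Z[:, 25] += [3.0, -3.0, 0.0]
norms = subspace_residual_norms(Z, k=1)
suspect = int(np.argmax(norms))
```

The shared trend is absorbed by the first principal component, so the disturbed time step is the only one with a large residual norm; thresholding `norms` then flags it.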


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure: port scan anomaly; the traffic feature distribution changes while the traffic volume remains more or less the same. Annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure. Source: [22]

We can use entropy to capture the variations in a traffic feature.

  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed
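The two boundary values of the entropy can be checked with a small sketch. The function name and the port numbers are illustrative assumptions; the entropy is the standard Shannon entropy of the empirical distribution of one traffic feature.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of a feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports seen in one time bin.
normal = [80] * 90 + [443] * 10      # traffic concentrated on two ports
scan = list(range(1, 101))           # port scan: 100 distinct ports, same volume

h_normal = feature_entropy(normal)
h_scan = feature_entropy(scan)       # equals log2(100), the maximally dispersed case
```

Both bins contain 100 packets, so a volume metric cannot separate them, while the entropy of the destination port differs sharply.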


Entropy Based versus Volume Based

  • DoS/DDoS attack: a spike in traffic toward a dominant destination IP
  • Scan anomaly: a spike in traffic from a dominant source IP
  • Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
  • Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data.

Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Example domains and their players
  • Insurance (auto/health; see note a): Claimant, Provider, Payer
  • Telecommunications: Customer, Provider
  • Credit cards: Customer, Supplier, Bank
  • Web advertising: User, Advertiser, Publisher

a. http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g. "User X makes n calls on average in January"
  2. Compare current behavior with the historical profile to find significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
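The activity-monitoring idea can be sketched as a per-user profile plus a deviation test. The mean/standard-deviation profile, the z-score rule, and the threshold of 3 are assumptions chosen for illustration; [13] does not prescribe them.

```python
import statistics

def build_profile(history):
    """Step 1: summarise a user's historical activity (e.g. calls per January)."""
    return statistics.mean(history), statistics.stdev(history)

def is_significant_deviation(current, profile, z_threshold=3.0):
    """Step 2: compare current behaviour with the profile (hypothetical z-score rule)."""
    mean, std = profile
    if std == 0:                      # constant history: any change is a deviation
        return current != mean
    return abs(current - mean) / std > z_threshold

# Hypothetical number of calls made by user X in five past Januaries.
profile_x = build_profile([40, 42, 38, 41, 39])
flag_typical = is_significant_deviation(41, profile_x)
flag_unusual = is_significant_deviation(90, profile_x)
```

The check is a single comparison per user, which is why this style of detection is fast, but one profile must be stored and refreshed for every individual.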


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of its neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional feature space
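A minimal sketch of the four egonet features, assuming the graph is given as a dense symmetric weighted adjacency matrix with a zero diagonal. The function name and the toy star and triangle graphs are hypothetical.

```python
import numpy as np

def egonet_features(W, i):
    """Return (N_i, E_i, W_i, lambda_i) for node i of a symmetric weighted graph W."""
    neighbours = np.flatnonzero(W[i])
    nodes = np.concatenate(([i], neighbours))       # the egonet: node i plus its neighbours
    sub = W[np.ix_(nodes, nodes)]                   # induced weighted adjacency matrix
    upper = np.triu(sub, k=1)                       # count each undirected edge once
    n_i = len(neighbours)                           # degree of node i
    e_i = int(np.count_nonzero(upper))              # number of edges in the egonet
    w_i = float(upper.sum())                        # total weight of the egonet
    lam_i = float(np.linalg.eigvalsh(sub)[-1])      # principal (largest) eigenvalue
    return n_i, e_i, w_i, lam_i

# Star: node 0 linked to nodes 1, 2, 3 with unit weights.
star = np.zeros((4, 4))
star[0, 1:] = star[1:, 0] = 1.0
# Triangle: every pair of the three nodes linked with unit weight.
triangle = np.ones((3, 3)) - np.eye(3)

star_features = egonet_features(star, 0)
triangle_features = egonet_features(triangle, 0)
```

For the star, the edge count equals the degree, while for the triangle (a clique) it grows faster than the degree, which is the contrast the egonet features are meant to expose.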


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g. the Egonet Density Power Law, $N_i$ vs $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
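The score can be evaluated directly once the power-law parameters are known. In this sketch C and θ are passed in as given (fitting them is not shown), the natural logarithm is an assumption about the base, and the numeric values are made up.

```python
import math

def power_law_anomaly_score(x, y, C, theta):
    """S_i for one node, given feature pair (x, y) and a fitted law y = C * x**theta.

    Assumes y > 0 and C * x**theta > 0 so that the ratio is defined.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)       # how many times off the fit
    return ratio * math.log(abs(y - expected) + 1)    # damped absolute deviation

# Hypothetical fit E = 1.0 * N**1.0, i.e. star-like egonets are the norm.
score_on_fit = power_law_anomaly_score(x=4, y=4, C=1.0, theta=1.0)
score_clique_like = power_law_anomaly_score(x=4, y=10, C=1.0, theta=1.0)
```

A node that lies exactly on the fitted line scores 0, and the score grows with both the multiplicative and the absolute distance from the line.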


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs providers)
  • Insider trading (traders vs stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node set $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$, node set $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$, and edge set $E$ between them.
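One way to realise the relevance and normality scores is a random walk with restart, sketched below. The restart formulation, the restart probability of 0.15, the iteration count, and the toy graph are assumptions; the slide only states that relevance is proportional to how often a random walk from a reaches b.

```python
import numpy as np

def relevance_scores(M, a, restart=0.15, iters=100):
    """Relevance of every V1 node to query node a (random walk with restart).

    M: (|V1|, |V2|) biadjacency matrix; M[a, t] > 0 means edge <a, t>.
    """
    n1, n2 = M.shape
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, n1:] = M
    A[n1:, :n1] = M.T
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0            # avoid division by zero for isolated nodes
    P = A / col_sums                         # column-stochastic transition matrix
    e = np.zeros(n1 + n2)
    e[a] = 1.0
    r = e.copy()
    for _ in range(iters):                   # power iteration; converges since restart > 0
        r = (1 - restart) * (P @ r) + restart * e
    return r[:n1]

def normality_score(M, t):
    """Mean off-diagonal relevance among the V1 neighbours of node t in V2."""
    S_t = np.flatnonzero(M[:, t])
    if len(S_t) < 2:
        return float("nan")                  # undefined with fewer than two neighbours
    sim = np.array([relevance_scores(M, a)[S_t] for a in S_t])
    return float(sim[~np.eye(len(S_t), dtype=bool)].mean())

# Toy graph: two communities plus one V2 node (index 4) that bridges them.
M = np.zeros((6, 5))
M[0:3, 0:2] = 1.0                            # a0..a2 linked to t0, t1
M[3:6, 2:4] = 1.0                            # a3..a5 linked to t2, t3
M[0, 4] = M[3, 4] = 1.0                      # t4 linked to a0 and a3 only
score_inside = normality_score(M, 0)
score_bridge = normality_score(M, 4)
```

The bridging node has neighbours that are barely relevant to each other, so its normality score is lower than that of a node embedded in one community.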


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department (ED) visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data, e.g. ED cases involving cough $\Rightarrow$ flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

  1. Learn a Bayesian network (BN) from historical data
       • Environmental and response variables
  2. Sample from the BN to obtain $DB_{baseline}$, conditioned on the current environment
  3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$
       • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
  4. Find the p-value of each rule using a $\chi^2$ test
       • Null hypothesis: rows and columns of the table for $DB_{baseline}$ and $DB_{current}$ are independent
  5. Find the rule with the smallest p-value (the most significant deviation from the baseline); repeat from Step 2
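Steps 3 and 4 for a single rule can be sketched as a χ² test on a 2×2 contingency table. The table layout (rows: current vs baseline, columns: rule matches vs does not match) and the counts are illustrative assumptions; for one degree of freedom the p-value has the closed form erfc(√(χ²/2)).

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table.

    Assumes every row and column total is positive.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n   # count expected under independence
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2.0))             # survival function of chi-square(1)
    return chi2, p_value

# Hypothetical counts for the rule "symptom = cough":
# rows are (DB_current, DB_baseline), columns are (matches rule, does not match).
chi2_same, p_same = chi_square_2x2([[10, 10], [10, 10]])
chi2_diff, p_diff = chi_square_2x2([[30, 70], [5, 95]])
```

A rule whose match rate is the same in both databases gets a p-value of 1, while a rule that matches far more often today than in the baseline gets a very small one.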


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Needs more data to train


Anomaly Detection in Climate and Weather

Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g. a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
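Two of the listed choices for the base, monthly means and monthly z-scores, can be sketched as follows. The function name and the assumption that the series starts in January and covers whole years are illustrative.

```python
import numpy as np

def monthly_anomalies(v, zscore=False):
    """Anomaly series v' = v - b, with b the mean of the same calendar month.

    v: 1-D monthly series starting in January, length a multiple of 12 (assumed).
    zscore: also divide by the per-month standard deviation (must be non-zero).
    """
    years = np.asarray(v, dtype=float).reshape(-1, 12)   # one row per year
    base = years.mean(axis=0)                            # b: monthly mean values
    anomalies = years - base
    if zscore:
        anomalies = anomalies / years.std(axis=0)
    return anomalies.ravel()

# Hypothetical 3-year temperature record: a seasonal cycle, with year 3 warmer by 3 degrees.
season = 10 + 8 * np.sin(2 * np.pi * np.arange(12) / 12)
series = np.concatenate([season, season, season + 3.0])
anoms = monthly_anomalies(series)
z_anoms = monthly_anomalies(series, zscore=True)
```

Subtracting the monthly base removes the seasonal cycle entirely, so only the year-to-year departure remains in the anomaly series.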


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
       • Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
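The two-step approach can be sketched in Python as a stand-in for the Matlab labelling functions. The threshold, the (time, lat, lon) layout, the 6-connectivity, and the toy data are assumptions; Matlab's bwlabeln may use a different default connectivity.

```python
import numpy as np
from collections import deque

def label_components(mask):
    """Label 6-connected components of a 3-D boolean array (time, lat, lon)."""
    labels = np.zeros(mask.shape, dtype=int)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                          # already part of an earlier component
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:                          # breadth-first flood fill
            t, y, x = queue.popleft()
            for dt, dy, dx in offsets:
                nb = (t + dt, y + dy, x + dx)
                inside = all(0 <= nb[d] < mask.shape[d] for d in range(3))
                if inside and mask[nb] and not labels[nb]:
                    labels[nb] = current
                    queue.append(nb)
    return labels, current

# Hypothetical precipitation cube: 4 time steps on a 5 x 5 grid.
precip = np.ones((4, 5, 5))
precip[0:3, 1:3, 1:3] = 0.1                   # persistent dry region (12 cells)
precip[3, 4, 4] = 0.1                         # isolated dry cell
dry = precip < 0.5                            # step 1: threshold the anomalies
labels, n_components = label_components(dry)  # step 2: connected components
sizes = np.bincount(labels.ravel())[1:]       # cells per component
```

Keeping only components above a size cut-off would retain the persistent region and discard the isolated cell.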


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
       • Not identified in a timely fashion
       • Resolution is not fine enough
       • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and its distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Pruning rule

DB(k, N) anomaly [3]: a data instance is not an anomaly if its distance to its current k-th nearest neighbor is less than the score of the weakest anomaly among the top N anomalies found so far

A large number of non-anomalies can be pruned without carrying out a full data search

Complexity: nearly $O(n)$
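The pruning rule can be sketched as a nested loop that abandons a point as soon as its running k-th nearest neighbor distance drops below the current cutoff. This is a simplified illustration in the spirit of [3], not that paper's implementation: it omits the randomized ordering and block processing, and the function name and toy data are hypothetical.

```python
import heapq
import numpy as np

def top_n_outliers(X, k, N):
    """Top N points by distance to their k-th nearest neighbour, with pruning.

    Assumes X has more than k rows. Returns (score, index) pairs, strongest first.
    """
    top = []            # min-heap of (score, index); top[0] is the weakest anomaly
    cutoff = 0.0        # score of the weakest anomaly once N have been found
    for i, x in enumerate(X):
        nearest = []    # negated distances: max-heap of the k smallest seen so far
        pruned = False
        for j, y in enumerate(X):
            if i == j:
                continue
            d = float(np.linalg.norm(x - y))
            if len(nearest) < k:
                heapq.heappush(nearest, -d)
            elif d < -nearest[0]:
                heapq.heapreplace(nearest, -d)
            # The running k-th NN distance can only shrink, so stop early.
            if len(nearest) == k and -nearest[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -nearest[0]
        if len(top) < N:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == N:
            cutoff = top[0][0]
    return sorted(top, reverse=True)

# Toy data: a 5 x 5 unit grid plus two far-away points (indices 25 and 26).
grid = np.array([[a, b] for a in range(5) for b in range(5)], dtype=float)
X = np.vstack([grid, [[100.0, 100.0], [-50.0, -50.0]]])
result = top_n_outliers(X, k=2, N=2)
outlier_indices = [idx for _, idx in result]
```

Pruning becomes more effective once strong anomalies have raised the cutoff, which is why processing the data in random order helps in practice.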


Examples of pruning technique

Non-anomalies are pruned earlier


Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable to large datasets ($O(n)$)

Capable of finding only global anomalies

Can lead to non-intuitive results in top-k situations

Figure: scatter plot of dimension 1 vs dimension 2 (both axes 0 to 70) with points p1, p2, p3, and p4 marked.


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours

$\text{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)$

$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$

$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$

LOF indicates a degree of local outlier-ness [6]
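The three formulas translate almost line by line into NumPy. This is a simplified sketch: it uses the plain Euclidean distance for $\text{dist}_k(p, q)$, whereas LOF [6] uses a reachability distance, and it assumes no duplicate points (which would give a zero mean distance). The function name and toy data are hypothetical.

```python
import numpy as np

def density_anomaly_scores(X, k):
    """anomaly-score(p) = 1 / relative-density(p) for every row p of X."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    np.fill_diagonal(D, np.inf)                   # a point is not its own neighbour
    nn_idx = np.argsort(D, axis=1)[:, :k]         # N_k(p): indices of the k nearest
    nn_dist = np.take_along_axis(D, nn_idx, axis=1)
    density = 1.0 / nn_dist.mean(axis=1)          # inverse mean distance to N_k(p)
    relative = density / density[nn_idx].mean(axis=1)
    return 1.0 / relative

# Toy data: a 5 x 5 unit grid plus one distant point (index 25).
grid = np.array([[a, b] for a in range(5) for b in range(5)], dtype=float)
X = np.vstack([grid, [[20.0, 20.0]]])
scores = density_anomaly_scores(X, k=3)
most_anomalous = int(np.argmax(scores))
```

Points inside the grid score close to 1 because their density matches that of their neighbours, while the distant point scores far above 1.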


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

We require a method that combines the strengths of distance-based and density-based approaches: a distance-based approach that can capture density

Figure: clusters C1 and C2 with outlying points O1 and O2.


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and then return to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$:

$c_{ij} = V_G (e_i - e_j)^T L^+ (e_i - e_j)$

  • $L^+$: pseudo-inverse of $L$
  • $V_G$: graph volume
  • $e_i$: i-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of $L$:

$c_{ij} = V_G [(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$

  • $V$, $S$: eigenvectors and eigenvalues of $L$
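The first formula can be evaluated for all pairs at once with a dense pseudo-inverse. This sketch assumes a small, connected, undirected graph given as a symmetric weight matrix, and takes the graph volume to be the sum of the node degrees; the function name and the path graph are illustrative.

```python
import numpy as np

def commute_times(W):
    """Matrix of commute times c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j)."""
    degrees = W.sum(axis=1)
    L = np.diag(degrees) - W                 # graph Laplacian
    L_pinv = np.linalg.pinv(L)               # L^+, the Moore-Penrose pseudo-inverse
    volume = degrees.sum()                   # V_G, the sum of all degrees
    diag = np.diag(L_pinv)
    # (e_i - e_j)^T L^+ (e_i - e_j) = L^+_ii + L^+_jj - 2 L^+_ij since L^+ is symmetric
    return volume * (diag[:, None] + diag[None, :] - 2 * L_pinv)

# Path graph 0 - 1 - 2 with unit weights.
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
C = commute_times(W)
```

On this path graph a walk from node 0 needs 4 expected steps to reach node 2 and 4 more to return, which matches the value of 8 that the formula gives for that pair.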


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time, with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local and group anomalies
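The steps above can be sketched end to end as follows. This is a rough illustration, not the authors' implementation: the pruning rule is omitted, edge weights of 1/distance are an assumption, and adding minimum-spanning-tree edges to keep the mutual k-NN graph connected is also an assumption of this sketch. All function names are hypothetical.

```python
import numpy as np

def mutual_knn(X, k):
    """Boolean adjacency: i ~ j iff each point is in the other's k-NN list."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    nn = np.argsort(D, axis=1)[:, :k]
    knn = np.zeros(D.shape, dtype=bool)
    knn[np.repeat(np.arange(len(X)), k), nn.ravel()] = True
    return knn & knn.T, D

def add_mst_edges(A, D):
    """Assumption: add minimum-spanning-tree edges (Prim) so the graph is connected."""
    n = len(A)
    A = A.copy()
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = D[0].copy()                  # cheapest distance from the tree to each node
    parent = np.zeros(n, dtype=int)
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        A[j, parent[j]] = A[parent[j], j] = True
        in_tree[j] = True
        closer = D[j] < best
        best[closer] = D[j][closer]
        parent[closer] = j
    return A

def cdof_scores(X, k_graph=5, k_score=3):
    """Score = commute time to the k_score-th nearest neighbour in commute-time space."""
    A, D = mutual_knn(X, k_graph)
    A = add_mst_edges(A, D)
    W = np.where(A, 1.0 / D, 0.0)       # assumed edge weights: inverse distance
    deg = W.sum(axis=1)
    Lp = np.linalg.pinv(np.diag(deg) - W)
    d = np.diag(Lp)
    C = deg.sum() * (d[:, None] + d[None, :] - 2.0 * Lp)
    np.fill_diagonal(C, np.inf)
    return np.sort(C, axis=1)[:, k_score - 1]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(40, 2)), [[10.0, 10.0]]])
scores = cdof_scores(X)
top = int(np.argmax(scores))            # index of the highest-scoring point
```

On this toy data the isolated point (index 40) is expected to receive the largest score, because it hangs off the graph by a single weak edge.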


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure in which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU
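The idea can be illustrated with a small NumPy sketch, under loud assumptions: `np.linalg.pinv` stands in for the near-linear-time solver (so this code is not fast), the projection uses k random rows where the real method needs only $O(\log n)$ of them, and the function name `approx_commute_embedding` is hypothetical. It relies on the identity that the effective resistance between i and j equals $\|W^{1/2} B L^+ (e_i - e_j)\|^2$, where B is the edge-node incidence matrix and $L = B^T W B$.

```python
import numpy as np

def approx_commute_embedding(edges, weights, n, k, rng):
    """Return (Z, V_G) with V_G * ||Z[:, i] - Z[:, j]||^2 approximating c_ij."""
    m = len(edges)
    B = np.zeros((m, n))                            # signed edge-node incidence matrix
    for e, (u, v) in enumerate(edges):
        B[e, u], B[e, v] = 1.0, -1.0
    w = np.asarray(weights, dtype=float)
    L = B.T @ (w[:, None] * B)                      # L = B^T W B
    Q = rng.choice([-1.0, 1.0], size=(k, m)) / np.sqrt(k)   # random projection
    Y = Q @ (np.sqrt(w)[:, None] * B)               # k x n right-hand sides
    # Each row z of Z solves L z = y; pinv stands in for the fast solver.
    Z = Y @ np.linalg.pinv(L)
    return Z, 2.0 * w.sum()                         # graph volume = sum of degrees

# 4-cycle with unit weights: exact commute time between opposite nodes is 8
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
rng = np.random.default_rng(0)
Z, volume = approx_commute_embedding(edges, [1, 1, 1, 1], n=4, k=2000, rng=rng)
estimate = volume * np.sum((Z[:, 0] - Z[:, 2]) ** 2)
```

Once Z is built, each commute-time query only touches two columns of Z, which is where the per-query cost saving comes from.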


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN algorithm provides probabilistic guarantees

Under certain assumptions about the intrinsic dimensionality (c), with high probability

$\frac{1-\epsilon}{1+\epsilon} \cdot LOF(p) \le \widehat{LOF}(p) \le \frac{1+\epsilon}{1-\epsilon} \cdot LOF(p)$

where $\widehat{LOF}(p)$ denotes the LOF estimate computed by PINN

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust


Examples: high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples: high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples: local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms: using CUDA, the Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilization of device memory; minimize CPU↔GPU transfer of data; keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model and memory hierarchy. The host launches a kernel as a grid of thread blocks, Block(0,0) to Block(1,2), on the device; each thread block contains threads Thread(0,0) to Thread(2,3). The host has host memory; the device has streaming multiprocessors (SM 0 to SM 2), each with stream processors (SP 0 to SP 7), registers and shared memory, plus global, texture and constant memory.]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory (data layout in memory should be optimized)

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
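The blocked structure of the computation can be illustrated on the CPU. The sketch below is plain NumPy, not CUDA: it processes the pairwise distance matrix one block of rows at a time, which mirrors the tiling idea but none of the shared-memory or transfer-overlap details. The function name `db_kn_outliers` and the `block` parameter are illustrative.

```python
import numpy as np

def db_kn_outliers(X, k, N, block=256):
    """Top-N points ranked by distance to their k-th nearest neighbour (brute force)."""
    n = len(X)
    kth = np.empty(n)
    sq = np.sum(X * X, axis=1)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Squared distances from this block of rows to every point
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * X[start:stop] @ X.T
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf   # ignore self
        kth_sq = np.partition(d2, k - 1, axis=1)[:, k - 1]
        kth[start:stop] = np.sqrt(np.maximum(kth_sq, 0.0))
    order = np.argsort(-kth)[:N]        # sort descending, keep the top N
    return order, kth[order]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(500, 3)), [[10.0, 10.0, 10.0]]])
idx, dist = db_kn_outliers(X, k=5, N=3, block=128)
```

Each block is independent of the others, which is what makes the computation a natural fit for one GPU thread per pair of instances.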

[Figure: blocked computation of the n × n pairwise distance matrix from the n × d data matrix]


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$)

Choose high-confidence rules ($P(Y|X)$)

For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$

Y is not observed when X is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute $P(Z)$ for test data record Z

Flag an anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$, with $A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t: for each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$

Score t based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets: only consider subsets up to size k; ignore subsets with frequency less than a threshold α; avoid comparing independent subsets of attributes, i.e. require $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

$C(a_t)$: number of training instances with $A = a_t$

N: total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low-arity attributes
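As a worked illustration of the Laplace-smoothed r-value, the sketch below counts by brute force over a list of tuples (no AD Trees) and uses five rows in the style of the mushroom table as toy training data. The function names and the test record are made up for the example.

```python
def laplace(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(train, record, A, B):
    """r(a_t, b_t) for two disjoint tuples of attribute indices A and B."""
    n = len(train)

    def proj(rec, idx):
        return tuple(rec[i] for i in idx)

    a_t, b_t = proj(record, A), proj(record, B)
    c_a = sum(proj(r, A) == a_t for r in train)
    c_b = sum(proj(r, B) == b_t for r in train)
    c_ab = sum(proj(r, A) == a_t and proj(r, B) == b_t for r in train)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))

# Columns: cap-shape, cap-surface, habitat, type
train = [
    ("convex", "smooth", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
    ("bell", "smooth", "meadows", "edible"),
    ("convex", "scaly", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
]
# habitat (index 2) versus type (index 3)
r_unusual = r_value(train, ("bell", "smooth", "urban", "edible"), (2,), (3,))
r_typical = r_value(train, ("convex", "smooth", "grasses", "edible"), (2,), (3,))
print(r_unusual, r_typical)
```

The combination urban and edible never occurs in the toy training data, so its r-value (7/12) is well below that of the frequent combination grasses and edible (7/4).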


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance/density/clustering-based method

Simplest: Hamming distance

$h(A_i, B_i) = 1$ if $A_i \ne B_i$, and $0$ if $A_i = B_i$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$

The standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
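Both measures are short to write down. The sketch below uses the standard quadratic dynamic program for the LCS length (not the faster Hunt-Szymanski method) and assumes non-empty sequences; the function names are illustrative.

```python
from math import sqrt

def hamming(a, b):
    """H(A, B): number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence (standard O(|A||B|) DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
print(lcs_length(s1, s2), nlcs_distance(s1, s2))
```

The resulting distance can be handed to any distance-based detector, for example as the metric behind a k-th nearest neighbor score.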


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find the frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; one can construct anomalous sequences that will escape detection
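A minimal sketch of the training and testing steps, assuming sequences are lists of tokens at least k long. Averaging the relative window frequencies and adding a small epsilon before inverting are choices made for this illustration, and the function names are hypothetical.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 sliding windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Training: counts of every unique window seen in the normal sequences."""
    counts = Counter()
    for s in normal_sequences:
        counts.update(windows(s, k))
    return counts

def anomaly_score(seq, dictionary, k):
    """Testing: inverse of the mean relative frequency of the test windows."""
    w = windows(seq, k)
    total = sum(dictionary.values())
    mean_freq = sum(dictionary[x] / total for x in w) / len(w)
    return 1.0 / (mean_freq + 1e-9)     # epsilon (assumption) avoids division by zero

normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
)]
dictionary = build_dictionary(normal, 3)
score_normal = anomaly_score("login pwd mail ssh mail web logout".split(), dictionary, 3)
score_attack = anomaly_score("login pwd login pwd login pwd logout".split(), dictionary, 3)
```

None of the windows of the repeated-login sequence occur in the dictionary, so its score is far higher than that of a sequence seen during training.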


Using Probabilistic Models

Probability of occurrence of sequence S

$P(S) = \prod_{i=1}^{n} P(S_i | S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i | S_1 \ldots S_{i-1}) = P(S_i | S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i | S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

Significantly reduces the size of the model
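A fixed-order version of this model fits in a few lines. The sketch below is not a probabilistic suffix tree: it handles rare or unseen contexts with additive smoothing (an assumption of this example), and it reports the negative average log-probability, which grows as the normalized probability of occurrence shrinks. Function names are illustrative.

```python
from collections import Counter
from math import log

def ngram_counts(sequences, k):
    """Frequencies f(.) of all length-k contexts and length-(k+1) windows."""
    ctx, full = Counter(), Counter()
    for s in sequences:
        for i in range(len(s) - k):
            ctx[tuple(s[i:i + k])] += 1
            full[tuple(s[i:i + k + 1])] += 1
    return ctx, full

def markov_score(seq, ctx, full, k, vocab_size, alpha=1.0):
    """Negative average log-probability under an order-k model (higher = more anomalous)."""
    logp, n = 0.0, 0
    for i in range(k, len(seq)):
        c = tuple(seq[i - k:i])
        # Smoothed estimate of f(S_{i-k}..S_i) / f(S_{i-k}..S_{i-1})
        p = (full[c + (seq[i],)] + alpha) / (ctx[c] + alpha * vocab_size)
        logp += log(p)
        n += 1
    return -logp / max(n, 1)

normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
)]
vocab = {sym for s in normal for sym in s}
ctx, full = ngram_counts(normal, 2)
score_normal = markov_score(normal[0], ctx, full, 2, len(vocab))
score_attack = markov_score("login pwd login pwd login pwd logout".split(), ctx, full, 2, len(vocab))
```

Backing off to the longest sufficiently frequent suffix, as in the slide, would replace the smoothing term and keep the model small.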


Finding Contextual Anomalies

Sometimes contextual information is available about data: not used directly as a feature; well understood, with no anomalies in the context; can reduce false positives and yield interesting anomalies

[Figure, example adapted from [30]: Number of Fever Cases versus Maximum Daily Temperature, with the "Clearest Outlier" and a "Conditional Anomaly" marked]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude); graph context (edges, weights); temporal location; domain specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context); explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

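Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j | U_i)$

4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j) \, p(V_j | U_i)$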
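The scoring step (step 4) can be sketched directly from the formula. In this illustration the Gaussians and the mapping matrix are set by hand rather than learned, and the temperature and fever numbers are invented for the example; only the score computation follows the slide.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate normal at x."""
    x = np.atleast_1d(x).astype(float)
    mean = np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    diff = x - mean
    norm = np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) sum_j p(v|V_j) p(V_j|U_i); a low S suggests an anomaly."""
    pu = np.array([gaussian_pdf(u, m, c) for m, c in U])
    pv = np.array([gaussian_pdf(v, m, c) for m, c in V])
    return float(pu @ mapping @ pv)     # mapping[i, j] = p(V_j | U_i)

# Hand-set toy model: context = max daily temperature, indicator = fever cases
U = [(5.0, 9.0), (30.0, 9.0)]           # cold days, hot days (mean, variance)
V = [(10.0, 25.0), (50.0, 25.0)]        # few cases, many cases
mapping = np.array([[0.1, 0.9],         # cold days mostly map to many cases
                    [0.9, 0.1]])        # hot days mostly map to few cases
s_cold_many = conditional_score(5.0, 50.0, U, V, mapping)
s_hot_many = conditional_score(30.0, 50.0, U, V, mapping)
```

Many fever cases on a hot day receives a much lower score than the same count on a cold day, although the count itself is not unusual overall.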


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies
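The three-step simple solution can be illustrated on a time series. In this sketch the groups are fixed-width windows, the features are the window mean and standard deviation, and the detector is a plain z-score; all three choices, and the function name `group_scores`, are assumptions made for the example.

```python
import numpy as np

def group_scores(series, width):
    """Score fixed-width windows of a series by their most extreme feature z-score."""
    usable = len(series) // width * width
    groups = series[:usable].reshape(-1, width)                            # 1. break into groups
    feats = np.column_stack([groups.mean(axis=1), groups.std(axis=1)])     # 2. features per group
    z = np.abs(feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)   # 3. simple detector
    return z.max(axis=1)

t = np.arange(1000)
series = np.sin(2 * np.pi * t / 50)
series[400:450] = 0.0                   # every value is ordinary, the flat stretch is not
scores = group_scores(series, width=50)
worst = int(np.argmax(scores))
```

Each zero in the flat stretch lies well inside the normal value range, so a point detector would miss it, while the window-level standard deviation exposes the group.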


Using Latent Dirichlet Allocation for Group Anomalies

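Find anomalous groups in data [34]

Example: spatial clusters of galaxies (topics: red, green, emissive; words: continuous features)

Flexible Genre Model (FGM)

For each group m:

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim M(\pi)$

2. Draw the topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim Dir(\alpha_{y_m})$

3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} | \eta_k)$, $k = 1, 2, \ldots, K$

4. For each point n in the group:
   (a) Draw topic membership $z_{mn} \sim M(\theta_m)$
   (b) Generate point $x_{mn} \sim P(x_{mn} | \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates N, M, K, T]

Model Parameters

$M(\pi)$: multinomial

Genre: $Dir(\alpha_t)$

Topic generators $P(\cdot | \eta_k)$: GIW

Point generators $P(x_{mn} | \beta_{m,k})$: multivariate Gaussian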


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: learned topics (Topic 1, Topic 2, Topic 3)]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix; traditional evaluation metrics (class imbalance); ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate (pseudo false negative estimation techniques [25])

              Predicted a   Predicted n
Actual a      TP            FN
Actual n      FP            TN

$Acc = \frac{TP + TN}{TP + FN + FP + TN}$

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
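A small worked example shows why class imbalance matters for these metrics. The labels "a" and "n" follow the confusion matrix above; the two toy detectors and the function names are invented for the illustration, and undefined ratios are reported as 0.0 by convention in this sketch.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN) with the anomalous class as positive."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Return (accuracy, recall, precision, F-measure)."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f

actual = ["a"] * 5 + ["n"] * 95
all_normal = ["n"] * 100                                     # never raises an alarm
detector = ["a"] * 4 + ["n"] + ["a"] * 6 + ["n"] * 89        # 4 hits, 1 miss, 6 false alarms
m_lazy = metrics(*confusion_counts(actual, all_normal))
m_det = metrics(*confusion_counts(actual, detector))
print(m_lazy, m_det)
```

The detector that never raises an alarm reaches 95% accuracy with zero recall, while the real detector has slightly lower accuracy but recall 0.8 and precision 0.4.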


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use a threshold δ on the score (scale issues [21]), or take the top x as anomalies

ROC curve by varying x or δ

Quality of output: does the output "suggest" x or δ? Which output is better?


Unifying Scores

Different methods assign scores in different ranges: kNN-based scores [0, 1]; LOF scores [1, 6]; ABOD scores [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized procedure for normalizing outlier scores [21]

Regularity: $S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling); suited for high-dimensional data

3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma scaling), where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function; suited for low-dimensional data
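Gaussian scaling is easy to try with the standard library alone. The sketch below estimates $\mu_S$ and $\sigma_S$ from the scores themselves (population standard deviation) and assumes the scores are not all identical; the LOF-like input values are invented.

```python
from math import erf, sqrt
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        raise ValueError("scores are constant; nothing to normalize")
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]

lof_like = [1.0, 1.1, 0.9, 1.0, 1.05, 6.0]
normalized = gaussian_scaling(lof_like)
print(normalized)
```

Scores at or below the mean collapse to 0, and only the one clearly elevated score is mapped close to 1, which is the stretching and shrinking behavior asked for above.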


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators are available: multivariate continuous data [27]; multivariate categorical data [5]; discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies: Random Perturbation [30]

Given a data point $z = (x, y)$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = (x', y') \in D$ such that the distance between y and $y'$ is maximum

Replace x with $x'$ and add z back to the data set

[Figure: points z and z' in the (x, y) plane and the resulting injected anomaly]
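The perturbation procedure can be sketched in a few lines of NumPy. Column index lists stand in for the two partitions of the feature space, Euclidean distance is assumed for comparing y parts, and the function name and toy data are illustrative.

```python
import numpy as np

def inject_anomaly(data, idx, x_cols, y_cols, sample_size, rng):
    """Return a perturbed copy of data[idx] following the random perturbation scheme."""
    z = data[idx].copy()
    sample = data[rng.choice(len(data), size=sample_size, replace=False)]   # random sample D
    far = np.argmax(np.linalg.norm(sample[:, y_cols] - z[y_cols], axis=1))  # z' with farthest y'
    z[x_cols] = sample[far, x_cols]                                         # replace x with x'
    return z

rng = np.random.default_rng(1)
x = rng.normal(size=(100, 2))
data = np.hstack([x, x + 0.05 * rng.normal(size=(100, 2))])   # y closely tracks x
z_new = inject_anomaly(data, idx=0, x_cols=[0, 1], y_cols=[2, 3], sample_size=20, rng=rng)
```

The injected record keeps its own y values but borrows x values from a record whose y is far away, so each part looks ordinary on its own while the combination breaks the usual relationship.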


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available? Feature identification; defining normal and anomalous behavior; tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g. Distributed Denial of Service, DDoS) or unusual network traffic (e.g. flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period, y-axis from 0 to 3000000]

Period                 Count
2002                   20547
2003                   18827
2004                   69107
2005                   113025
2006                   140690
2007                   624267
2008                   1656227
July2009-June2010      2704836


Network Topology

A typical network

[Figure: a router network with nodes 1 to 4, link measurement $y_{1t}$ and OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g. $x_{1t}$)

Link measurement is the traffic that enters at a node during an interval (e.g. $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
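The five steps translate almost line by line into NumPy. In this sketch observations are stored as rows (so the projection is written `Xc @ P @ P.T` rather than $P P^T Z$), the data are synthetic, and no threshold is chosen: the code only returns the residual norms. The function name is illustrative.

```python
import numpy as np

def residual_norms(X, k):
    """Norm of each observation's projection onto the residual (abnormal) subspace."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)                    # Step 1: covariance and its eigensystem
    eigvals, eigvecs = np.linalg.eigh(cov)
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]     # Step 2: top-k principal components
    normal = Xc @ P @ P.T                             # Step 3: normal subspace part
    residual = Xc - normal                            # Step 4: residual part
    return np.linalg.norm(residual, axis=1)           # Step 5: large norm suggests an anomaly

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
M = np.array([[1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0, 0.0]])
X = latent @ M + 0.01 * rng.normal(size=(200, 5))     # traffic living near a 2-D subspace
X[120, 4] += 1.0                                      # inject an anomaly off that subspace
norms = residual_norms(X, k=2)
suspect = int(np.argmax(norms))
```

The injected observation barely changes the normal component but dominates the residual norm, which is the behavior a threshold on the residual is meant to exploit.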


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

DoS attack: multiple source IP addresses concentrated on a single destination IP address

Network scan: dispersed distribution of destination addresses

Most worms/viruses also induce some change in the distribution of certain features

However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure, source [22]: port scan anomaly (the traffic feature changes, but the traffic volume remains more or less the same). The port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure.]

We can use entropy to capture the variations in the traffic feature

Takes value 0 when the distribution is maximally concentrated

Takes value $\log_2 N$ when the distribution is maximally dispersed
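The two extreme values of the entropy can be verified with a short empirical estimate. The sketch below computes the sample entropy of one traffic feature over a time bin; the port lists are invented, and the function name is illustrative.

```python
from collections import Counter
from math import log2

def feature_entropy(values):
    """Empirical entropy (in bits) of one traffic feature within a time bin."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

web_traffic = [80] * 8                     # all packets go to one destination port
port_scan = list(range(1000, 1008))        # eight packets, eight distinct ports
h_concentrated = feature_entropy(web_traffic)
h_dispersed = feature_entropy(port_scan)
print(h_concentrated, h_dispersed)
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the destination-port entropy jumps from 0 to $\log_2 8 = 3$ bits.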


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic toward a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data:
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges:
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Examples (domain: players):
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses:
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
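The activity-monitoring idea (build a profile, then compare current behavior against it) can be sketched as follows. This is a minimal illustration and not the method of [13]: the profile is assumed to be a mean and standard deviation, and the names `build_profile`, `deviation_score`, the call counts, and the threshold are all hypothetical.

```python
import statistics

def build_profile(history):
    """Summarize a user's historical activity counts as (mean, standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)

def deviation_score(profile, current):
    """Number of standard deviations by which `current` departs from the profile."""
    mean, std = profile
    if std == 0:
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / std

# Hypothetical data: calls made by user X in each past January.
january_calls = [42, 38, 45, 40, 41]
profile = build_profile(january_calls)

THRESHOLD = 3.0  # assumed cut-off; would need tuning on real data
for current in (43, 120):
    score = deviation_score(profile, current)
    print(current, round(score, 2), "ANOMALY" if score > THRESHOLD else "ok")
```

The sketch also makes the listed weaknesses concrete: one profile per user must be stored and refreshed, and a short history gives an unreliable standard deviation.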


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node $i$
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional feature space
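The four egonet features can be computed with a short sketch. The graph representation (a dict of dicts) and the function names are assumptions made for illustration; the principal eigenvalue is obtained by power iteration on A + I, a shift chosen here so that the iteration does not oscillate on star-shaped egonets.

```python
def principal_eigenvalue(members, edges, iters=200):
    """Power iteration on (A + I) for the egonet's weighted adjacency matrix A."""
    x = {u: 1.0 for u in members}
    lam = 0.0
    for _ in range(iters):
        y = dict(x)  # the "+ I" term
        for edge, w in edges.items():
            u, v = tuple(edge)
            y[u] += w * x[v]
            y[v] += w * x[u]
        norm = max(abs(val) for val in y.values())
        x = {u: val / norm for u, val in y.items()}
        lam = norm - 1.0  # undo the shift
    return lam

def egonet_features(adj, node):
    """Return (N, E, W, lambda) for the egonet of `node`.

    `adj` maps each node to {neighbor: weight} for an undirected graph
    without self-loops.
    """
    members = {node} | set(adj[node])
    edges = {}
    for u in members:
        for v, w in adj[u].items():
            if v in members:
                edges[frozenset((u, v))] = w  # each undirected edge stored once
    return len(adj[node]), len(edges), sum(edges.values()), principal_eigenvalue(members, edges)

# Hypothetical unit-weight graphs: a star and a 4-clique.
star = {0: {1: 1.0, 2: 1.0, 3: 1.0}, 1: {0: 1.0}, 2: {0: 1.0}, 3: {0: 1.0}}
clique = {i: {j: 1.0 for j in range(4) if j != i} for i in range(4)}
print(egonet_features(star, 0))
print(egonet_features(clique, 0))
```

Both centers have three neighbors, but the star has 3 egonet edges and the clique has 6, which is the kind of contrast the feature pairs are meant to expose.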


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power law relationship

E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ (detects near-cliques and stars)

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
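The score above can be evaluated directly once $C$ and $\theta$ are known. In the sketch below they are estimated by least squares in log-log space, which is an assumption of this example; the function names and the data points are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return math.exp(my - theta * mx), theta

def power_law_score(x, y, c, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = c * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical (N_i, E_i) pairs that follow E = 2 * N^1.5 exactly.
xs = [1, 2, 4, 8, 16]
ys = [2 * x ** 1.5 for x in xs]
c, theta = fit_power_law(xs, ys)

print(round(c, 3), round(theta, 3))
print(power_law_score(8, ys[3], c, theta))  # on the fitted law: score near 0
print(power_law_score(8, 8.0, c, theta))    # far below the law (star-like): larger score
```

The score grows both with the ratio and with the absolute gap between the observed value and the fitted law, so nodes on the fitted line score close to zero.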


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
• RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
• Normality Score = mean of the non-diagonal entries of the similarity matrix

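Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$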
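A rough sketch of the relevance and normality scores follows. Several choices here are assumptions rather than details from the slides: relevance is computed by a random walk with restart (restart probability 0.15), the similarity matrix entry for (a, b) is taken to be the relevance of b with respect to a, and the node names and edges are hypothetical.

```python
def relevance(adj, start, restart=0.15, iters=100):
    """Random walk with restart from `start`; returns visit probabilities per node.

    `adj` maps every node (both sides of the bipartite graph) to its neighbor list.
    """
    prob = {u: 0.0 for u in adj}
    prob[start] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        for u, p in prob.items():
            for v in adj[u]:
                nxt[v] += (1 - restart) * p / len(adj[u])
        nxt[start] += restart
        prob = nxt
    return prob

def normality_score(adj, t):
    """Mean pairwise relevance among S_t, the V1 nodes linked to t."""
    members = adj[t]
    if len(members) < 2:
        return 0.0  # assumed convention when no off-diagonal entries exist
    rel = {a: relevance(adj, a) for a in members}
    pairs = [(a, b) for a in members for b in members if a != b]
    return sum(rel[a][b] for a, b in pairs) / len(pairs)

# Hypothetical graph: two tight groups, plus t5 bridging a1 and a4.
edges = [("a1", "t1"), ("a2", "t1"), ("a3", "t1"), ("a1", "t2"), ("a2", "t2"), ("a3", "t2"),
         ("a4", "t3"), ("a5", "t3"), ("a6", "t3"), ("a4", "t4"), ("a5", "t4"), ("a6", "t4"),
         ("a1", "t5"), ("a4", "t5")]
adj = {}
for a, t in edges:
    adj.setdefault(a, []).append(t)
    adj.setdefault(t, []).append(a)

print(round(normality_score(adj, "t1"), 3))  # neighbors belong to one group
print(round(normality_score(adj, "t5"), 3))  # neighbors come from different groups: lower
```

A low normality score for t5 flags its links as unusual, since the nodes it connects are otherwise unrelated.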


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
• Anthrax attack

Data:
• Emergency department visits
• Grocery data (Example [14])
• Clinical visits
• Weather/climate data

Challenges:
• Weak signals in the data
  • ED cases involving cough ⇒ Flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant deviation). Repeat from Step 2
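The rule-scoring step can be illustrated with a Pearson $\chi^2$ test on a 2x2 contingency table. This is only a sketch of that one step, with hypothetical rule names and counts; it does not learn or sample a Bayesian network. For one degree of freedom, the tail probability is $\mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def chi2_p_value(table):
    """Chi-square test of independence for a 2x2 table [[a, b], [c, d]].

    Rows: current vs. baseline data; columns: record matches the rule or not.
    Returns (statistic, p-value) with one degree of freedom.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = rows[i] * cols[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: [current matches, current non-matches], [baseline ..., ...].
rules = {
    "symptom=respiratory": [[30, 70], [100, 900]],   # 30% today vs. 10% in baseline
    "gender=female":       [[50, 50], [500, 500]],   # same proportion in both
}
for name, table in rules.items():
    print(name, chi2_p_value(table))

best_rule = min(rules, key=lambda r: chi2_p_value(rules[r])[1])
print("most significant rule:", best_rule)
```

A small p-value means the rule's frequency today is unlikely under the baseline, so the rule with the smallest p-value is the candidate anomaly pattern.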


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions:
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges:
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
• Mean of all data for location $i$
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
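Two of the base choices listed above (monthly mean and monthly median) can be sketched as follows. The function name `monthly_anomalies` and the temperature values are hypothetical.

```python
import statistics
from collections import defaultdict

def monthly_anomalies(series, use_median=False):
    """Compute v' = v - b for one location.

    `series` is a list of (month, value) pairs; the base b is the mean
    (or median) of all values observed in the same calendar month.
    """
    by_month = defaultdict(list)
    for month, value in series:
        by_month[month].append(value)
    center = statistics.median if use_median else statistics.mean
    base = {m: center(vals) for m, vals in by_month.items()}
    return [value - base[month] for month, value in series]

# Hypothetical temperatures for January (1) and July (7) over three years.
temps = [(1, 0.0), (7, 20.0), (1, 2.0), (7, 22.0), (1, 4.0), (7, 24.0)]
print(monthly_anomalies(temps))                   # [-2.0, -2.0, 0.0, 0.0, 2.0, 2.0]
print(monthly_anomalies(temps, use_median=True))  # [-2.0, -2.0, 0.0, 0.0, 2.0, 2.0]
```

The 20-degree seasonal swing disappears from the anomaly series, leaving only the year-to-year warming in this toy data.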


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach:
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
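The two-step approach can be sketched in pure Python as a stand-in for Matlab's bwlabeln. The grid values, the threshold of -1.0, the minimum component size, and the use of 6-connectivity (face neighbors only) are assumptions made for this illustration.

```python
from collections import deque

def label_components(mask):
    """Label 6-connected components of True cells in a (time, row, col) grid.

    Returns a dict mapping component id -> list of (t, r, c) cells.
    """
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), {}
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or (t, r, c) in seen:
                    continue
                queue, cells = deque([(t, r, c)]), []
                seen.add((t, r, c))
                while queue:  # breadth-first flood fill
                    ct, cr, cc = queue.popleft()
                    cells.append((ct, cr, cc))
                    for dt, dr, dc in steps:
                        nt, nr, nc = ct + dt, cr + dr, cc + dc
                        if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                                and mask[nt][nr][nc] and (nt, nr, nc) not in seen):
                            seen.add((nt, nr, nc))
                            queue.append((nt, nr, nc))
                components[len(components) + 1] = cells
    return components

# Hypothetical precipitation anomalies: 3 time steps on a 2 x 3 grid.
anoms = [
    [[-2.0, -1.5, 0.3], [0.1, 0.2, 0.0]],
    [[-1.8, 0.4, 0.1], [0.0, 0.3, -1.2]],
    [[-1.1, 0.2, 0.5], [0.2, 0.1, 0.4]],
]
THRESHOLD = -1.0  # step 1: cells drier than the threshold
mask = [[[v < THRESHOLD for v in row] for row in grid] for grid in anoms]

components = label_components(mask)  # step 2: connect across space and time
droughts = [cells for cells in components.values() if len(cells) >= 3]
print(sorted(len(cells) for cells in components.values()))  # [1, 4]
print(droughts)
```

Only the component that persists over all three time steps survives the size filter; the isolated dry cell is discarded.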


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology:
1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution:
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Examples of pruning technique

Non-anomalies are pruned earlier
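A simplified sketch in the spirit of the pruning rule of [3] shows why non-anomalies drop out early. The scoring function (distance to the k-th nearest neighbor), the function name, and the data are assumptions of this example, and the random shuffling of the data used in [3] is omitted.

```python
import math

def top_n_outliers(points, k, n):
    """Top-n distance-based outliers, scored by distance to the k-th nearest neighbor.

    Pruning: while scanning neighbors of a point, its running score can only
    shrink; once it falls below the weakest score in the current top-n
    (the cutoff), the point cannot be a top-n outlier and the scan stops.
    """
    top = []       # (score, index) pairs, best first
    cutoff = 0.0
    for i, p in enumerate(points):
        nearest = []   # the k smallest distances seen so far
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            nearest.append(math.dist(p, q))
            nearest.sort()
            del nearest[k:]
            if len(nearest) == k and nearest[-1] < cutoff:
                pruned = True
                break
        if not pruned and len(nearest) == k:
            top.append((nearest[-1], i))
            top.sort(reverse=True)
            del top[n:]
            if len(top) == n:
                cutoff = top[-1][0]
    return top

# Hypothetical data: a tight cluster and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
print(top_n_outliers(points, k=2, n=1))  # the distant point, index 5
```

Points inside the cluster find k close neighbors after a few comparisons and are pruned, while only true outlier candidates need a full scan.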


Strengths and weaknesses - Distance-based techniques

• Do not make any assumption about the distribution of the data
• Scalable to large datasets ($O(n)$)
• Capable of finding only global anomalies
• Can lead to non-intuitive results in Top-k situations

Figure: scatter plot of dimension 1 versus dimension 2 (both axes from 0 to 70), with points p1, p2, p3, and p4 marked


Density-based anomaly

Calculate the density of an object based on the density of its k nearest neighbours

$$\mathrm{density}(p) = 1 \Big/ \left( \frac{\sum_{q \in N_k(p)} \mathrm{dist}_k(p, q)}{|N_k(p)|} \right)$$

$$\text{relative-density}(p) = \frac{\mathrm{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \mathrm{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$

LOF indicates a degree of local outlier-ness [6]
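The three formulas can be evaluated with a brute-force sketch. It assumes that $\mathrm{dist}_k(p, q)$ is the plain Euclidean distance (LOF in [6] uses a reachability distance instead) and that all points are distinct; the function names and data are hypothetical.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]

def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    nbrs = knn(points, i, k)
    return 1.0 / (sum(math.dist(points[i], points[j]) for j in nbrs) / len(nbrs))

def anomaly_score(points, i, k):
    """1 / relative-density: average neighbor density over own density."""
    nbrs = knn(points, i, k)
    avg_nbr_density = sum(density(points, j, k) for j in nbrs) / len(nbrs)
    relative_density = density(points, i, k) / avg_nbr_density
    return 1.0 / relative_density

# Hypothetical data: a dense cluster, a sparse cluster, and a point near the dense one.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (20, 20), (20, 30), (30, 20), (30, 30),
          (4, 4)]
for i in (0, 4, 8):
    print(points[i], round(anomaly_score(points, i, k=3), 2))
```

Members of both clusters score about 1 even though the sparse cluster has large distances, while the point at (4, 4) scores well above 1 because it is far less dense than its neighbors.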


Strengths and Weaknesses

• Can detect global and local anomalies
• Cannot use the pruning technique and has a complexity of $O(n^2)$
• We need a method that combines the strengths of the distance-based and density-based approaches: a distance-based approach that can capture density

Figure: two clusters, C1 and C2, with outliers O1 and O2


Commute time

• Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time
• Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$:

$$c_{ij} = V_G \, (e_i - e_j)^T L^{+} (e_i - e_j)$$

• $L^{+}$: pseudo-inverse of $L$
• $V_G$: graph volume
• $e_i$: $i$-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of $L$:

$$c_{ij} = V_G \, \left[ (S^{-1/2} V^T)(e_i - e_j) \right]^T \left[ (S^{-1/2} V^T)(e_i - e_j) \right]$$

• $V$, $S$: eigenvectors and eigenvalues of $L$
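The first formula can be checked on a tiny graph with a dependency-free sketch. It assumes a connected, undirected graph without self-loops and obtains the pseudo-inverse through the identity $L^{+} = (L + J/n)^{-1} - J/n$, where $J$ is the all-ones matrix; the function names are hypothetical, and the dense inversion is only practical for small graphs.

```python
def invert(m):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(m)
    a = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        pivot_value = a[col][col]
        a[col] = [v / pivot_value for v in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [row[n:] for row in a]

def commute_times(weights):
    """All-pairs commute times from a symmetric weighted adjacency matrix."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    volume = sum(degree)  # V_G
    # L + J/n, where L = D - W is the graph Laplacian
    shifted = [[(degree[i] if i == j else 0.0) - weights[i][j] + 1.0 / n
                for j in range(n)] for i in range(n)]
    l_plus = [[v - 1.0 / n for v in row] for row in invert(shifted)]
    # (e_i - e_j)^T L+ (e_i - e_j) = L+_ii + L+_jj - 2 L+_ij
    return [[volume * (l_plus[i][i] + l_plus[j][j] - 2 * l_plus[i][j])
             for j in range(n)] for i in range(n)]

# Path graph 0 - 1 - 2 with unit weights.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
c = commute_times(path)
print(round(c[0][1], 6), round(c[0][2], 6))  # 4.0 8.0
```

On the path graph the result matches a direct random-walk argument: reaching the far end and returning takes 8 steps on average, and the adjacent pair takes 4.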


Anomaly detection using commute time (CDOF)

• Construct the mutual k nearest neighbor graph G from the dataset
• Compute the Laplacian matrix L of G and its eigensystem
• Find the top N anomalies using the distance-based technique in commute time with the pruning rule
• Complexity: $O(n^3)$
• The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

• Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time
• Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$
• Uses a near linear time solver for a linear system of equations $Ax = b$
• Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

• The pruning rule for distance-based methods does not apply to density-based approaches
• We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index
• One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN algorithm provides probabilistic guarantees

Under certain assumptions about intrinsic dimensionality (c), with high probability

$$\frac{1-\epsilon}{1+\epsilon} \cdot LOF(p) \;\le\; \widehat{LOF}(p) \;\le\; \frac{1+\epsilon}{1-\epsilon} \cdot LOF(p)$$

where $\widehat{LOF}(p)$ denotes the LOF value estimated by PINN

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data parallel algorithms, using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilization of device memory; minimize CPU↔GPU transfer of data; keep threads homogeneous

Most model based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model and memory hierarchy. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2), each block holding threads Thread(0,0) to Thread(2,3). The host has host memory; the device has streaming multiprocessors SM 0 to SM 2, each with registers, stream processors SP 0 to SP 7 and shared memory, plus global, texture and constant memory.]


Implementing DB(k,N) on GPUs

Return the top N data instances whose distances to the k-th nearest neighbor are largest (serial DB(k,N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory; data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
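
The block-wise structure above can be sketched serially in Python. This is only a CPU illustration of the DB(k,N) definition, not a GPU implementation: the nested block loops stand in for the block pairs a kernel would load into shared memory, and `db_k_n` and `block_size` are names invented for the example. It assumes k is at most n - 1.

```python
import heapq
import math

def db_k_n(points, k, top_n, block_size=2):
    """Indices of the top_n points with the largest k-th nearest neighbor distance."""
    n = len(points)
    # knn[i] keeps the k smallest distances seen so far for point i, negated,
    # so the largest of them is at the heap root.
    knn = [[] for _ in range(n)]
    for bi in range(0, n, block_size):          # block i
        for bj in range(0, n, block_size):      # block j
            for i in range(bi, min(bi + block_size, n)):
                for j in range(bj, min(bj + block_size, n)):
                    if i == j:
                        continue
                    dist = math.dist(points[i], points[j])
                    if len(knn[i]) < k:
                        heapq.heappush(knn[i], -dist)
                    elif dist < -knn[i][0]:
                        heapq.heapreplace(knn[i], -dist)
    kth = [-heap[0] for heap in knn]            # distance to the k-th neighbor
    return heapq.nlargest(top_n, range(n), key=lambda i: kth[i])
```

For example, with four points on a unit square and one far-away point, the far point is returned as the top outlier.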


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$)

Choose high confidence rules ($P(Y|X)$)

For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$

Z is anomalous when Y is not observed although X is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute $P(Z)$ for test data record Z

Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t: for each mutually exclusive pair of attribute sets A, B compute $r(a_t, b_t)$

Score t based on all r-values: assign the minimum r-value as score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets: only consider subsets up to size k; ignore subsets with frequency less than a threshold α; avoid comparing independent subsets of attributes, i.e. require $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$

$N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}, \qquad r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low arity attributes
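
A minimal sketch of the Laplace-smoothed r-value, assuming records are dictionaries from attribute name to value. It counts by scanning the training set, so it does not reflect the AD Tree speedup; the function name and record layout are invented for the example.

```python
def r_value(records, attrs_a, attrs_b, test):
    """Laplace-smoothed r(a_t, b_t) for disjoint attribute tuples attrs_a, attrs_b."""
    n = len(records)

    def estimate(attrs):
        # E(p) = (C(p) + 1) / (N + 2), where C(p) counts training records that
        # agree with the test record on every attribute in attrs.
        target = tuple(test[a] for a in attrs)
        count = sum(1 for rec in records
                    if tuple(rec[a] for a in attrs) == target)
        return (count + 1) / (n + 2)

    return estimate(attrs_a + attrs_b) / (estimate(attrs_a) * estimate(attrs_b))
```

A value combination that never co-occurs in training yields r well below 1, while a frequently co-occurring combination yields r above 1.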


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure; plug it into a distance/density/clustering based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

Standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
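
The normalized LCS distance can be sketched with the standard dynamic program (the slow variant, not the Hunt-Szymanski method). The function names are chosen for the example, and both sequences are assumed to be non-empty.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|); assumes non-empty sequences."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))
```

Unlike the Hamming distance, this works for sequences of unequal length and tolerates misalignment.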


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$ windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find the frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low frequency windows in the normal dictionary; rewards high frequency windows that might not be relevant; can construct anomalous sequences that will escape detection
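
The training and testing steps might look as follows. The exact aggregation is an assumption: here the score is the inverse of one plus the average window frequency, where the added one is not from the slides and only avoids division by zero. Test sequences are assumed to have at least k symbols.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Normal dictionary: every window seen in normal sequences, with its count."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, dictionary, k):
    """Inverse of (1 + average dictionary frequency of the windows in seq)."""
    ws = windows(seq, k)
    average = sum(dictionary[w] for w in ws) / len(ws)  # unseen windows count 0
    return 1.0 / (1.0 + average)
```

A sequence made of frequently seen windows gets a low score; one with unseen windows scores higher.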


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
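
A fixed-order sketch of this model is given below. Two simplifications are assumptions of the example rather than part of the method above: add-one smoothing replaces the suffix fallback of Probabilistic Suffix Trees, and the score is the negative average log-probability, which orders sequences the same way as the inverse of the length-normalized probability. Test sequences must be longer than k.

```python
import math
from collections import Counter

def train_counts(normal_seqs, k):
    """Count length-k contexts and the (context, next symbol) windows following them."""
    context, full = Counter(), Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            context[tuple(seq[i - k:i])] += 1
            full[tuple(seq[i - k:i + 1])] += 1
    return context, full

def anomaly_score(seq, context, full, k, alphabet_size):
    """Negative average log of the smoothed P(S_i | S_{i-k} ... S_{i-1})."""
    log_p, steps = 0.0, 0
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        # Add-one smoothing (assumption) keeps unseen windows from giving log(0).
        p = (full[ctx + (seq[i],)] + 1) / (context[ctx] + alphabet_size)
        log_p += math.log(p)
        steps += 1
    return -log_p / steps
```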


Finding Contextual Anomalies

Sometimes contextual information is available about data: not used directly as a feature; well understood, no anomalies in the context; can reduce false positives and yield interesting anomalies

[Figure: example adapted from [30]. Number of Fever Cases plotted against Maximum Daily Temperature, with the "Clearest Outlier" and a "Conditional Anomaly" marked.]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude); graph context (edges, weights); temporal location; domain specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context); explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

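Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussians $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ and Gaussians $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with mapping $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j|U_i)$

4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\, p(V_j|U_i)$$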
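
The scoring step can be sketched as below, assuming the Gaussians and the mapping have already been learned. For brevity the example uses axis-aligned Gaussians given as (mean, variance) tuples and made-up parameters; reading a lower S as more anomalous is also an assumption, based on S being a likelihood-like quantity.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of an axis-aligned Gaussian with per-dimension variances."""
    p = 1.0
    for xi, m, v in zip(x, mean, var):
        p *= math.exp(-(xi - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    return p

def cad_score(u, v, env_comps, ind_comps, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i), with mapping[i][j] = p(V_j|U_i)."""
    score = 0.0
    for i, (mean_u, var_u) in enumerate(env_comps):
        inner = sum(gaussian_pdf(v, mean_v, var_v) * mapping[i][j]
                    for j, (mean_v, var_v) in enumerate(ind_comps))
        score += gaussian_pdf(u, mean_u, var_u) * inner
    return score
```

With hypothetical "cold" and "hot" temperature components and "low" and "high" fever-count components, a high fever count scores much lower on a hot day than on a cold day if the mapping says hot days rarely produce high counts.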


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

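Find anomalous groups in data [34]

Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$

2. Draw topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$

3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k}|\eta_k)$, $k = 1, 2, \ldots, K$

4. For each point in the group

   1. Draw topic membership $z_{mn} \sim M(\theta_m)$

   2. Generate point $x_{mn} \sim P(x_{mn}|\beta_{m,z_{mn}})$

[Figure: plate diagram of FGM with nodes $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\eta$, $\beta$ and plates N, M, K, T]

Model Parameters

$M(\pi)$: Multinomial

Genre: $Dir(\alpha_t)$

Topic generators $P(\cdot|\eta_k)$: GIW

Point generators $P(x_{mn}|\beta_{m,k})$: Multivariate Gaussian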


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (single step Monte Carlo EM)

[Figure: axes Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix; traditional evaluation metrics (class imbalance); ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate; pseudo false negative estimation techniques [25]

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$$Acc = \frac{TP + TN}{\sum}, \qquad Rec\,(R) = \frac{TP}{TP + FN}, \qquad Prec\,(P) = \frac{TP}{TP + FP}, \qquad F = \frac{2 \cdot R \cdot P}{R + P}$$
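
A short sketch of these metrics, assuming labels "a" (anomalous) and "n" (normal) as in the confusion matrix. Returning 0.0 when a denominator is zero is a convention chosen for the example.

```python
def evaluate(actual, predicted, positive="a"):
    """Accuracy, recall, precision and F-measure with `positive` as the anomaly label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = len(pairs) - tp - fn - fp
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"acc": (tp + tn) / len(pairs), "rec": rec, "prec": prec, "f": f}
```

With two anomalies among ten records, one caught and one false alarm, accuracy is 0.8 while recall and precision are only 0.5, which illustrates why accuracy is misleading under class imbalance.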


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use threshold δ on score (scale issues [21]); take top x as anomalies

ROC curve by varying x or δ

Quality of output: does the output "suggest" x or δ? Which output is better?


Unifying Scores

Different methods assign scores in different ranges

kNN-based scores: [0, 1]; LOF scores: [1, 6]; ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0 \;\forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1] \;\forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\; S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0,\; \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling). Suited for high dimensional data

3. $N(o) = \max\left(0,\; \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling), where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function. Suited for low dimensional data
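
Gaussian scaling is simple enough to sketch directly. The example assumes the scores are already regular (higher means more anomalous) and estimates the mean and standard deviation from the scores themselves; using the population standard deviation and returning all zeros when the scores are constant are choices made for the example.

```python
import math
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2)))) for each score."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0.0:
        return [0.0 for _ in scores]  # constant scores: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
            for s in scores]
```

Scores at or below the mean map to 0, and only scores well above the mean approach 1, which stretches the interesting range.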


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27]; multivariate categorical data [5]; discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = (x, y)$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = (x', y') \in D$ such that the distance between y and $y'$ is maximum

Replace x with $x'$ and add z back to the data set

[Figure: points z and $z'$ in the (x, y) plane, with the injected anomaly marked]
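
The perturbation steps can be sketched as follows, assuming each record is stored as a pair (x, y) of numeric tuples and that Euclidean distance is used on y (the slides do not name the distance). The function name and the `seed` parameter are illustrative.

```python
import math
import random

def inject_anomaly(data, index, sample_size, seed=0):
    """Return data[index] with its x part swapped for that of a distant sample."""
    rng = random.Random(seed)
    _, y = data[index]
    sample = rng.sample(data, sample_size)      # random sample D
    # z' = (x', y') in D whose y' is farthest from y.
    x_far, _ = max(sample, key=lambda z: math.dist(y, z[1]))
    return (x_far, y)                            # injected point (x', y)
```

The injected point keeps its original y but carries an x that normally goes with a very different y, so each part looks plausible alone while the combination is unusual.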


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available? Feature identification; defining normal and anomalous behavior; tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature; in addition, profile the normal behavior of the network and address suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connected to nodes 1 to 4, with OD flows x1t, x2t, x3t, x4t and link measurement y1t]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., x1t)

A link measurement is the traffic that enters at a node during an interval (e.g., y1t)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
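
The five steps can be sketched in plain Python. To stay dependency-free, the example approximates the top-k eigenvectors by power iteration with deflation instead of a full eigenvalue decomposition, and it mean-centers the data before projecting; both are assumptions of this sketch, as are the function names.

```python
import math

def covariance(rows):
    """Column means and sample covariance matrix of the rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return mean, cov

def top_components(cov, k, iters=200):
    """Approximate top-k eigenvectors (columns of P) by power iteration with deflation."""
    d = len(cov)
    work = [row[:] for row in cov]
    comps = []
    for _ in range(k):
        v = [1.0 + 0.1 * i for i in range(d)]   # arbitrary start vector
        for _ in range(iters):
            w = [sum(work[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        norm_v = math.sqrt(sum(x * x for x in v))
        v = [x / norm_v for x in v]
        lam = sum(v[a] * work[a][b] * v[b] for a in range(d) for b in range(d))
        for a in range(d):                       # deflate the found component
            for b in range(d):
                work[a][b] -= lam * v[a] * v[b]
        comps.append(v)
    return comps

def residual_norm(z, mean, comps):
    """Norm of (I - P P^T) applied to the mean-centered vector z."""
    c = [zi - m for zi, m in zip(z, mean)]
    resid = c[:]
    for p in comps:
        coef = sum(pi * ci for pi, ci in zip(p, c))
        resid = [r - coef * pi for r, pi in zip(resid, p)]
    return math.sqrt(sum(r * r for r in resid))
```

A measurement is then flagged when `residual_norm` exceeds a chosen threshold; how to set that threshold is outside this sketch.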


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting some threshold

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

Most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

DoS attack: multiple source IP addresses concentrated on a single destination IP address

Network scan: dispersed distribution of destination addresses

Most worms/viruses also induce some change in the distribution of certain features

However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure, source [22]: port scan anomaly. The traffic feature changes while the traffic volume remains more or less the same: the port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure.]

We can use entropy to capture the variations in the traffic feature

Takes value 0 when the distribution is maximally concentrated

Takes value $\log_2 N$ when the distribution is maximally dispersed
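
A minimal sketch of the sample entropy of one traffic feature (for instance the destination ports seen in a time bin), computed from the empirical distribution of observed values. The function name and the choice of feature are illustrative.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A bin where every packet targets the same port has entropy 0, while a port scan touching N distinct ports once each has entropy log2 N, even if the packet count is unchanged.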


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP

Scan anomaly: a spike in traffic data from a dominant source IP

Flash crowd anomaly: again a spike in traffic data to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data: transactions between different players in the system; meta information about the individuals; an underlying graph structure

Challenges: track and model human behavior; anomalies caused by adaptive human adversaries; massive data sizes

Insurance (auto/health [a]): Claimant, Provider, Payer

Telecommunications: Customer, Provider

Credit Cards: Customer, Supplier, Bank

Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
       Example: User X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives
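
The activity-monitoring idea above can be sketched in a few lines. This is only an illustration, not the method of [13]: the profile is assumed to be a per-user mean and standard deviation of a single activity count, and the function names and the threshold of 3 standard deviations are hypothetical choices.

```python
from statistics import mean, stdev

def build_profile(history):
    """Summarise a user's historical activity counts as (mean, std deviation)."""
    return mean(history), stdev(history)

def deviation_score(profile, current):
    """Number of standard deviations between the current count and the profile mean."""
    mu, sigma = profile
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

def flag_users(histories, current_counts, threshold=3.0):
    """Return users whose current activity deviates strongly from their own profile."""
    flagged = []
    for user, history in histories.items():
        profile = build_profile(history)
        if deviation_score(profile, current_counts[user]) > threshold:
            flagged.append(user)
    return flagged

# Hypothetical monthly call counts per user (assumed data, for illustration only).
histories = {"user_x": [40, 42, 38, 41, 39], "user_y": [10, 12, 11, 9, 13]}
current = {"user_x": 41, "user_y": 60}
flagged = flag_users(histories, current)
```

Each user is compared only against their own history, which is why the approach is fast and easy to explain, and also why it needs one profile per individual.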


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. N_i: degree of node i
  2. E_i: number of edges in the egonet
  3. W_i: total weight of the egonet
  4. λ_i: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional feature space
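
As a rough sketch of the feature extraction, the first three egonet features can be computed from an adjacency dictionary. The graph representation (a dict of dicts holding edge weights, with every undirected edge stored in both directions) and the function name are assumptions made for this example; the principal eigenvalue is left out to keep the sketch dependency-free.

```python
def egonet_features(graph, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    `graph` maps each node to {neighbor: edge_weight}; the graph is assumed
    undirected, so each edge appears under both of its endpoints.
    """
    members = set(graph[node]) | {node}
    edge_count = 0
    total_weight = 0.0
    for u in members:
        for v, w in graph[u].items():
            # Count each undirected edge inside the egonet exactly once.
            if v in members and u < v:
                edge_count += 1
                total_weight += w
    return len(graph[node]), edge_count, total_weight

# Hypothetical toy graph: hub "c" with three neighbours, plus one edge a-b.
graph = {
    "c": {"a": 1.0, "b": 2.0, "d": 1.0},
    "a": {"c": 1.0, "b": 5.0},
    "b": {"c": 2.0, "a": 5.0},
    "d": {"c": 1.0},
}
features_c = egonet_features(graph, "c")
features_d = egonet_features(graph, "d")
```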


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power-law relationship

E.g., Egonet Density Power Law, N_i vs. E_i: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
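
A minimal sketch of this score, assuming the power law is fitted by ordinary least squares in log-log space (the fitting procedure is an assumption of this example; the slide only gives the score itself). Function names are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return math.exp(my - theta * mx), theta

def outline_score(x, y, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical "normal" nodes whose egonets are stars (E_i = N_i).
C, theta = fit_power_law([2, 4, 8, 16], [2, 4, 8, 16])
score_star = outline_score(8, 8, C, theta)     # lies on the fitted line
score_clique = outline_score(8, 28, C, theta)  # 8 neighbours forming a near-clique
```

A node on the fitted line scores close to zero, while the near-clique is penalised both by the ratio and by the logarithm of its absolute deviation.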


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (Beneficiaries vs. Providers)
  Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  Normality Score = mean of the non-diagonal entries of the similarity matrix

Figure: Bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn}, connected by edges E
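
One way to make the relevance and normality scores concrete is a random walk with restart, iterated to convergence. This is a sketch under assumptions: the restart probability of 0.15, the fixed iteration count, and the choice to use the pairwise relevance score directly as the similarity-matrix entry are all choices made for this example, and the function names are hypothetical.

```python
def relevance_scores(adj, query, restart=0.15, iters=100):
    """Random walk with restart from `query`: steady-state visit probabilities."""
    scores = {v: 0.0 for v in adj}
    scores[query] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, neighbours in adj.items():
            share = (1.0 - restart) * scores[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        nxt[query] += restart  # the walker jumps back to the query node
        scores = nxt
    return scores

def normality_score(adj, t):
    """Mean off-diagonal relevance among the V1 neighbours of node t in V2."""
    s_t = sorted(adj[t])
    rel = {a: relevance_scores(adj, a) for a in s_t}
    pairs = [(a, b) for a in s_t for b in s_t if a != b]
    return sum(rel[a][b] for a, b in pairs) / len(pairs)

def build_graph(edges):
    """Undirected adjacency lists from (v1_node, v2_node) pairs."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, []).append(t)
        adj.setdefault(t, []).append(a)
    return adj

# Hypothetical data: two tight communities, bridged only through node "tx".
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2", "t3")]
edges += [(a, t) for a in ("a4", "a5", "a6") for t in ("t4", "t5", "t6")]
edges += [("a1", "tx"), ("a4", "tx")]
adj = build_graph(edges)
normal_t1 = normality_score(adj, "t1")
normal_tx = normality_score(adj, "tx")
```

The neighbours of `t1` are closely related to each other, whereas `tx` connects nodes from two otherwise unrelated groups, so `tx` receives the lower normality score and its links are the more suspicious ones.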


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department visits
  Grocery data (Example [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ Flu or SARS
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
     Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
     Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
     Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant deviation from the baseline); repeat from Step 2
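
Steps 3 to 5 can be illustrated with a small sketch. It assumes each rule yields a 2x2 contingency table (rows: current vs. baseline; columns: records matching vs. not matching the rule) and uses Pearson's chi-square test with one degree of freedom. The rule names and counts are made up for the example.

```python
import math

def chi_square_p_value(table):
    """p-value of Pearson's chi-square test on a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    chi2 = 0.0
    for i, r in enumerate(rows):
        for j, col in enumerate(cols):
            expected = r * col / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical rules: [[current match, current no match],
#                      [baseline match, baseline no match]]
rules = {
    "symptom = respiratory": [[30, 70], [100, 900]],
    "symptom = rash": [[10, 90], [100, 900]],
}
p_values = {rule: chi_square_p_value(t) for rule, t in rules.items()}
best_rule = min(p_values, key=p_values.get)
```

The rule whose table departs most from independence has the smallest p-value and is reported as the most surprising pattern of the day.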


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science Questions
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI)
    Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
    30 year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
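
As a small illustration of one of these choices, the sketch below uses the monthly mean as the base $b_i$, so that the anomaly series has the seasonal cycle removed. The function name and the toy data are assumptions of this example.

```python
from statistics import mean

def monthly_anomalies(values, months):
    """v'_i = v_i - b_i, with b_i the long-term mean of the same calendar month."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]

# Hypothetical temperatures for January (1) and July (7) over two years.
values = [0.0, 20.0, 2.0, 22.0]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

The raw series swings by 20 degrees between seasons, while the anomaly series only shows that the second year was one degree warmer than the monthly base in both months.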


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
       Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
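
The two-step approach can be sketched without Matlab as a threshold followed by a breadth-first labelling of connected cells. The grid layout `anomaly[t][y][x]`, the use of 6-connectivity (neighbours in time, latitude, and longitude), and the function name are assumptions of this example.

```python
from collections import deque

def drought_components(anomaly, threshold, min_size):
    """Return space-time components (sets of (t, y, x)) with anomaly < threshold."""
    T, H, W = len(anomaly), len(anomaly[0]), len(anomaly[0][0])
    # Step 1: threshold the precipitation anomalies.
    dry = {(t, y, x) for t in range(T) for y in range(H) for x in range(W)
           if anomaly[t][y][x] < threshold}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    # Step 2: group dry cells that touch in space or time.
    seen, components = set(), []
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), set()
        while queue:
            t, y, x = queue.popleft()
            comp.add((t, y, x))
            for dt, dy, dx in steps:
                nb = (t + dt, y + dy, x + dx)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(comp) >= min_size:
            components.append(comp)
    return components

# Hypothetical 3 time steps on a 1 x 3 spatial grid.
anomaly = [[[-2, 0, 0]], [[-2, -2, 0]], [[0, 0, -2]]]
droughts = drought_components(anomaly, threshold=-1, min_size=2)
```

The isolated dry cell at the last time step is discarded by `min_size`, leaving only the persistent event that spans two time steps.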


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Strengths and weaknesses - Distance-based techniques

Do not make any assumption about the distribution of the data

Scalable for large datasets (O(n))

Capable of finding only global anomalies

Can lead to non-intuitive results in Top-k situations

Figure: Scatter plot over dimension 1 and dimension 2 (both axes 0 to 70) with labelled points p1, p2, p3, p4


Density-based anomaly

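Calculate the density of an object based on the density of its k nearest neighbours

$$\text{density}(p) = \left( \frac{\sum_{q \in N_k(p)} \text{dist}_k(p, q)}{|N_k(p)|} \right)^{-1}$$

$$\text{relative-density}(p) = \frac{\text{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \text{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$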

LOF indicates a degree of local outlier-ness [6]
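
The three formulas translate almost directly into code. The sketch below is a simplification and not the full LOF of [6]: it assumes $\text{dist}_k(p, q)$ is the plain Euclidean distance rather than LOF's reachability distance, it uses a brute-force O(n^2) neighbour search, and it assumes no duplicate points (which would give a zero distance sum).

```python
import math

def knn(points, i, k):
    """(distance, index) pairs for the k nearest neighbours of points[i]."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]

def density(points, i, k):
    """Inverse of the mean distance from points[i] to its k nearest neighbours."""
    neigh = knn(points, i, k)
    return len(neigh) / sum(d for d, _ in neigh)

def anomaly_score(points, i, k):
    """Mean density of the neighbours divided by the point's own density."""
    neigh = knn(points, i, k)
    neighbour_density = sum(density(points, j, k) for _, j in neigh) / len(neigh)
    return neighbour_density / density(points, i, k)

# Hypothetical data: a unit square of four points plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = [anomaly_score(points, i, k=2) for i in range(len(points))]
```

Points inside the cluster score close to 1, because their density matches that of their neighbours, while the distant point scores far above 1.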


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of O(n^2)

We require a method combining the strengths of distance and density based approaches: a distance based approach which can capture density

Figure: Two clusters, C1 and C2, with outliers O1 and O2


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix L

$$c_{ij} = V_G (e_i - e_j)^T L^{+} (e_i - e_j)$$

  $L^{+}$: pseudo-inverse of L
  $V_G$: graph volume
  $e_i$: i-th column of the identity matrix

Commute time is a squared Euclidean distance in the space spanned by the eigenvectors of L

$$c_{ij} = V_G \left[ (S^{-1/2} V^T)(e_i - e_j) \right]^T \left[ (S^{-1/2} V^T)(e_i - e_j) \right]$$

  V, S: eigenvectors and eigenvalues of L
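
A small worked example of the first formula, written without external libraries. It assumes a connected graph, for which the pseudo-inverse can be obtained as $L^{+} = (L + \frac{1}{n} \mathbf{1}\mathbf{1}^T)^{-1} - \frac{1}{n} \mathbf{1}\mathbf{1}^T$; the dense Gauss-Jordan inversion is only suitable for tiny graphs, and the function names are hypothetical.

```python
def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small dense matrices only)."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def commute_times(weights):
    """All-pairs commute times c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j)."""
    n = len(weights)
    degree = [sum(row) for row in weights]
    volume = sum(degree)  # V_G: sum of all node degrees
    # Laplacian L = D - W, shifted by (1/n) * ones so that it becomes invertible.
    shifted = [[(degree[i] if i == j else 0.0) - weights[i][j] + 1.0 / n
                for j in range(n)] for i in range(n)]
    l_plus = [[v - 1.0 / n for v in row] for row in invert(shifted)]
    return [[volume * (l_plus[i][i] + l_plus[j][j] - 2.0 * l_plus[i][j])
             for j in range(n)] for i in range(n)]

# Path graph 0 - 1 - 2 with unit edge weights.
weights = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
c = commute_times(weights)
```

For this path graph the volume is 4 and the effective resistance between the two end nodes is 2, so their commute time is 8, twice that of adjacent nodes.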


Anomaly detection using commute time (CDOF)

Construct the mutual k nearest neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time with the pruning rule

Complexity: O(n^3)

Commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure where we can compute the commute time between two nodes in O(log n) time

Complexity of CDOF: O(n^3) → O(n log n)

Uses a near linear time solver for a linear system of equations Ax = b

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from O(n^2) to nearly O(n log n) by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

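The PINN Algorithm provides probabilistic guarantees

Under certain assumptions about intrinsic dimensionality (c), with high probability

$$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ denotes the LOF score that PINN computes using the random projection

In practice we do not know the intrinsic dimensionality of data. However, random projections are quite robust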


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
  - Using CUDA, the Compute Unified Device Architecture (Nvidia)
Need to re-engineer existing algorithms
  - Utilization of device memory
  - Minimize CPU ↔ GPU transfer of data
  - Keep threads homogeneous
Most model-based algorithms are naturally set up for the testing phase
Model building needs careful redesign
What about unsupervised algorithms?

[Figure: CUDA execution model. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2); each thread block contains threads Thread(0,0) to Thread(2,3).]

[Figure: CUDA memory hierarchy. Host memory on the host; on the device, SM 0 to SM 2, each with registers, processors SP 0 to SP 7, and shared memory, plus global, texture, and constant memory.]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)
Involves computing pairwise distances
  - Load block i and block j to shared memory
  - Data layout in memory should be optimized
  - Each thread computes the distance between a pair of instances
  - Can utilize this time to load the next chunk of data from host to device memory
  - Writes results to the corresponding output block
Sorting can be done efficiently in CUDA [29]
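Before any GPU re-engineering, the DB(k, N) definition can be written as a short serial sketch. The function name `kth_nn_distance_outliers` and the toy points are illustrative assumptions, not part of the tutorial; the nested distance computation is the $O(n^2)$ step that the block-wise GPU scheme parallelizes.

```python
import math

def kth_nn_distance_outliers(points, k, top_n):
    """Return the top_n (index, score) pairs; score = distance to the k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from point i to every other point: the O(n^2) pairwise step.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((i, dists[k - 1]))
    # Largest k-th neighbor distance first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]

# Hypothetical toy data: four clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
top = kth_nn_distance_outliers(points, k=2, top_n=1)
```

On this toy data the isolated point (index 4) is returned, because its second-nearest neighbor is far away while every clustered point has one within 0.1.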

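[Figure: block-wise computation of the n × n pairwise distance matrix from the n × d data matrix and its transpose.]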


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks
Time Series: Sensor Networks, Healthcare
Discrete Sequences: Genomic, System Calls
Spatial: GIS, Image analysis
Spatio-temporal: Remote sensing, Climate
Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  - Binarize data
  - Learn rules ($X \Rightarrow Y$)
  - Choose high confidence rules ($P(Y \mid X)$)
  - For test record $Z = \langle X, Y \rangle$ find rules of the form $P(Y \mid X)$
  - Y is not observed when X is observed

Using Bayesian Networks [33]
  - Learn Bayesian network structure and parameters
  - Compute $P(Z)$ for test data record Z
  - Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  - Use a similarity measure ($S(X_1, X_2)$)
  - Apply distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: If $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous
For a test record t
  - For each mutually exclusive pair of attribute sets A, B compute $r(a_t, b_t)$
  - Score t based on all r-values
      - Assign minimum r-value as score
      - Take product of all r-values
Need to compare exponential pairs of subsets
  - Only consider subsets up to size k
  - Ignore subsets with frequency less than a threshold α
  - Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  - $C(a_t)$: Number of training instances with $A = a_t$
  - N: Total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks
  - Replace rare attribute values with a generic attribute (reduce arity)
  - Use an efficient data structure to query for counts (AD Trees [26])
  - AD Trees work faster for low arity attributes
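The Laplace-smoothed r-value can be illustrated with a small sketch. The helper names `laplace_estimate` and `r_value` are assumptions for illustration, counts are taken by brute force rather than with AD Trees, and the records reuse the rows visible in the mushroom table (cap-shape, cap-surface, habitat, type).

```python
def laplace_estimate(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(records, attrs_a, attrs_b, test):
    """r(a_t, b_t) for disjoint tuples of attribute indices attrs_a and attrs_b."""
    n = len(records)
    a_t = tuple(test[i] for i in attrs_a)
    b_t = tuple(test[i] for i in attrs_b)
    # Brute-force counts C(a_t), C(b_t), C(a_t, b_t) over the training records.
    match_a = [tuple(r[i] for i in attrs_a) == a_t for r in records]
    match_b = [tuple(r[i] for i in attrs_b) == b_t for r in records]
    c_a = sum(match_a)
    c_b = sum(match_b)
    c_ab = sum(1 for ma, mb in zip(match_a, match_b) if ma and mb)
    return laplace_estimate(c_ab, n) / (laplace_estimate(c_a, n) * laplace_estimate(c_b, n))

train = [
    ("convex", "smooth", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
    ("bell", "smooth", "meadows", "edible"),
    ("convex", "scaly", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
]
# A = {habitat}, B = {type}; the first test record pairs values never seen together.
r_unseen = r_value(train, (2,), (3,), ("convex", "smooth", "urban", "edible"))
r_seen = r_value(train, (2,), (3,), ("convex", "smooth", "urban", "poisonous"))
```

Here the unseen combination (urban, edible) gets an r-value below 1, while the frequently co-occurring (urban, poisonous) gets a value above 1, so the first record would score as more anomalous.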


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure
  - Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: Unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

  - Standard Dynamic Programming method is slow
  - Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  - Cannot localize anomalies within a sequence
  - Weak anomaly signals might get lost
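Both distances can be sketched in a few lines. The function names are illustrative assumptions, and `lcs_length` uses the standard dynamic program, not the faster Hunt-Szymanski variant mentioned above.

```python
import math

def hamming(a, b):
    """H(A, B): number of positions where equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x != y)

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a||b|) dynamic program."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
s3 = "login pwd login pwd login pwd logout".split()

d_close = nlcs_distance(s1, s2)  # different lengths are handled
d_far = nlcs_distance(s1, s3)
```

The repeated-login sequence `s3` ends up farther from `s1` than `s2` does, which is what a distance-based detector would then exploit.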


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): Store all unique windows in all normal sequences and their counts

Testing
  - For each window find the frequency in the normal dictionary
  - Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  - For each window find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  - Penalizes low frequency windows in the normal dictionary
  - Rewards high frequency windows that might not be relevant
  - Can construct anomalous sequences that will escape detection
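A minimal sketch of the basic window-frequency scheme follows. The aggregation used here (mean relative frequency, then its inverse) is one plausible reading of "inverse of the aggregate frequencies", and the small constant guarding against division by zero is an added assumption.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 sliding windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Normal dictionary: every window seen in normal sequences, with its count."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, counts, k):
    wins = windows(seq, k)
    total = sum(counts.values())
    # Mean relative frequency of the test windows (normalizes by sequence length).
    avg_freq = sum(counts[w] / total for w in wins) / len(wins)
    # Inverse of the aggregate frequency; 1e-9 is an assumed guard for unseen windows.
    return 1.0 / (avg_freq + 1e-9)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd web mail ssh web mail logout".split(),
]
counts = build_dictionary(normal, k=3)
score_normal = anomaly_score(normal[0], counts, k=3)
score_attack = anomaly_score("login pwd login pwd login pwd logout".split(), counts, k=3)
```

None of the windows in the repeated-login sequence occur in the dictionary, so its score is far larger than that of a training sequence.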


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is the frequency of a window, estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?
  - Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  - Significantly reduces the size of the model
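The fixed-order model can be sketched as follows. This is not a probabilistic suffix tree: unseen contexts are handled with a probability floor, which is an assumption made here for brevity, and the score is the negative mean log-probability, one way to invert a length-normalized probability.

```python
import math
from collections import Counter

def train_counts(normal_seqs, k):
    """Frequencies f of length-k contexts and of contexts extended by the next symbol."""
    ctx, full = Counter(), Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            ctx[tuple(seq[i - k:i])] += 1
            full[tuple(seq[i - k:i + 1])] += 1
    return ctx, full

def anomaly_score(seq, ctx, full, k, floor=1e-6):
    """Negative mean log P(S_i | S_{i-k} .. S_{i-1}); higher means more anomalous."""
    log_p, steps = 0.0, 0
    for i in range(k, len(seq)):
        c = ctx[tuple(seq[i - k:i])]
        f = full[tuple(seq[i - k:i + 1])]
        p = f / c if c > 0 else 0.0
        # Assumed floor for zero estimates, standing in for suffix back-off.
        log_p += math.log(max(p, floor))
        steps += 1
    return -log_p / steps

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd web mail ssh web mail logout".split(),
]
ctx, full = train_counts(normal, k=2)
score_normal = anomaly_score(normal[0], ctx, full, k=2)
score_attack = anomaly_score("login pwd login pwd login pwd logout".split(), ctx, full, k=2)
```

Sequences must be longer than k for the score to be defined; the repeated-login sequence scores higher because most of its transitions were never observed in the normal data.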


Finding Contextual Anomalies

Sometimes contextual information is available about data
  - Not used directly as a feature
  - Are well understood, no anomalies in the context
  - Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: Number of Fever Cases versus Maximum Daily Temperature, with one point marked "Clearest Outlier" and one marked "Conditional Anomaly"]

Contextual anomalies: Anomalous with respect to a context

Context is defined using environmental variables
  - Spatial (Latitude, Longitude)
  - Graph context (Edges, Weights)
  - Temporal location
  - Domain specific (Demographic, other)

How to incorporate context
  - Reduce to traditional anomaly detection (subset on context)
  - Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

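Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$
  - $d_U$ environmental attributes
  - $d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space ($u_1$, $u_2$) mapped to components $V_1, V_2, V_3$ in the indicator space ($v_1$, $v_2$), with $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$]

Algorithm [30]
1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$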
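The scoring step (step 4) can be sketched for one-dimensional attributes. The component parameters and the mapping table below are made-up illustrative values, not learned ones, and the learning steps 1 to 3 are not shown; a low S is read here as more anomalous, since S behaves like a likelihood.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def conditional_score(u, v, u_comps, v_comps, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i), with 1-D Gaussian components."""
    total = 0.0
    for i, (mu_u, sd_u) in enumerate(u_comps):
        inner = sum(gaussian_pdf(v, mu_v, sd_v) * mapping[i][j]
                    for j, (mu_v, sd_v) in enumerate(v_comps))
        total += gaussian_pdf(u, mu_u, sd_u) * inner
    return total

# Hypothetical model: u = maximum daily temperature, v = number of fever cases.
u_comps = [(5.0, 3.0), (30.0, 3.0)]    # U_1 = cold days, U_2 = hot days
v_comps = [(10.0, 3.0), (50.0, 5.0)]   # V_1 = few cases, V_2 = many cases
mapping = [[0.1, 0.9],                 # p(V_j | U_1): cold days mostly have many cases
           [0.9, 0.1]]                 # p(V_j | U_2): hot days mostly have few cases

s_cold_many = conditional_score(5.0, 50.0, u_comps, v_comps, mapping)
s_hot_many = conditional_score(30.0, 50.0, u_comps, v_comps, mapping)
```

Fifty fever cases is unremarkable on a cold day but receives a much lower score on a hot day, which is the kind of conditional anomaly the fever example illustrates.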


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure, and

When the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples
1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

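Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
  - topics: red, green, emissive
  - words: continuous features

Flexible Genre Model (FGM)

For each group
1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$
2. Draw topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$
3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
4. For each point in group
   1. Draw topic membership $z_{mn} \sim M(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

[Figure: plate diagram of FGM with nodes $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates T, K, N, M]

Model Parameters
  - $M(\pi)$: Multinomial
  - Genre: $Dir(\alpha_t)$
  - Topic generators $P(\cdot \mid \eta_k)$: GIW
  - Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian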


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (Single step Monte Carlo EM)

[Figure: axes labeled Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood with respect to $\alpha_t$

Rationale: An anomalous group will be unlikely to be generated from any genre

Geometric interpretation: Mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  - Confusion matrix
  - Traditional evaluation metrics
  - Class imbalance
  - ROC Curve

Validation set does not exist
  - Use domain expertise to find TP and FP
  - FN is harder to estimate
  - Pseudo false negative estimation techniques [25]

             Predicted a   Predicted n
Actual a     TP            FN
Actual n     FP            TN

$$Acc = \frac{TP + TN}{\sum}$$

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
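The metrics, and the effect of class imbalance on accuracy, can be checked with a small sketch. The function names and the 2-in-100 toy labels are illustrative assumptions; undefined ratios are returned as 0.0 by convention here.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN), treating `positive` as the anomaly label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return tp, fn, fp, tn

def metrics(tp, fn, fp, tn):
    """Return (accuracy, recall, precision, F) following the formulas above."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f

# Hypothetical imbalanced validation set: 2 anomalies among 100 records.
actual = ["a"] * 2 + ["n"] * 98
always_normal = ["n"] * 100                     # a detector that never flags anything
one_hit = ["a", "n"] + ["a"] + ["n"] * 97       # finds 1 anomaly, raises 1 false alarm

m_trivial = metrics(*confusion_counts(actual, always_normal))
m_detector = metrics(*confusion_counts(actual, one_hit))
```

The trivial detector reaches 98% accuracy with zero recall, which is why accuracy alone is a poor guide under class imbalance; the second detector has the same accuracy but recall, precision, and F of 0.5.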


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  - Use threshold δ on score (scale issues [21])
  - Take top x as anomalies

ROC curve by varying x or δ

Quality of output
  - Does the output "suggest" x or δ?
  - Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  - kNN-based scores: [0, 1]
  - LOF scores: [1, 6]
  - ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal and anomalies

Desired scaling: Stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  - Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  - Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
   Suited for high dimensional data
3. $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling)
   Where $cdf^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and P is the regularized Gamma function
   Suited for low dimensional data
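Gaussian scaling is straightforward to try with the standard library. In this sketch $\mu_S$ and $\sigma_S$ are estimated from the scores themselves using the population standard deviation, which is an assumption; the LOF-like score list is made up for illustration and must not be constant.

```python
import math
import statistics

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # assumed estimator for sigma_S
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0)))) for s in scores]

# Hypothetical LOF-style scores: five inliers near 1 and one clear outlier.
lof_scores = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]
normalized = gaussian_scaling(lof_scores)
```

Scores at or below the mean collapse to 0 and the outlier is stretched toward 1, which matches the stated goal of shrinking irrelevant ranges and stretching interesting ones.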


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
  - Multivariate continuous data [27]
  - Multivariate categorical data [5]
  - Discrete sequences using HMM [12]

Drawbacks: Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = \{x, y\}$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between y and $y'$ is maximum

Replace x with $x'$ and add z back to the data set

[Figure: points z and $z'$ in the (x, y) plane, with the injected anomaly marked]
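The random perturbation steps can be sketched as below. The function name `inject_anomaly`, the representation of each record as an `(x, y)` pair of tuples, and the toy data are assumptions for illustration.

```python
import math
import random

def inject_anomaly(z, data, sample_size, rng):
    """Swap the x-part of z with that of the sampled point whose y-part is farthest from z's."""
    x, y = z
    sample = rng.sample(data, sample_size)
    # z' = (x', y') in the sample D maximizing the distance between y and y'.
    x_prime, _ = max(sample, key=lambda zp: math.dist(y, zp[1]))
    return (x_prime, y)

# Hypothetical data set in which y is strongly tied to x.
data = [((float(t),), (2.0 * t,)) for t in range(10)]
rng = random.Random(0)
# Sampling the whole data set makes this example deterministic.
injected = inject_anomaly(data[0], data, sample_size=len(data), rng=rng)
```

The injected record keeps the original y but carries an x that normally goes with a very different y, so it breaks the relationship between the two partitions while each part on its own still looks ordinary.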


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available?
  - Feature identification
  - Defining normal and anomalous behavior
  - Tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart, vertical axis from 0 to 3000000]

Period               Value
2002                 20547
2003                 18827
2004                 69107
2005                 113025
2006                 140690
2007                 624267
2008                 1656227
July2009-June2010    2704836


Network Topology

A Typical network

[Figure: a router with links 1 to 4, a link measurement $y_{1t}$, and OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach will not work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a residual vector in $\tilde{Z}$ is large, then it is an "anomaly"
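The five steps can be sketched without external libraries for the special case k = 1. Power iteration stands in for a full eigenvalue decomposition here, which is a simplification, and the three-link traffic values are made up for illustration.

```python
import math

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def center(row, means):
    return [x - m for x, m in zip(row, means)]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def top_component(cov, iters=200):
    """Steps 1-2 with k = 1: leading eigenvector of the covariance via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def residual_norm(z, p):
    """Steps 3-5: norm of z_tilde = (I - p p^T) z for a single unit vector p."""
    proj = sum(zi * pi for zi, pi in zip(z, p))
    return math.sqrt(sum((zi - proj * pi) ** 2 for zi, pi in zip(z, p)))

# Hypothetical traffic on three links that normally rise and fall together.
train = [[10.0 + t, 20.0 + 2 * t, 30.0 + 3 * t] for t in range(10)]
means = column_means(train)
p = top_component(covariance([center(r, means) for r in train]))

normal_res = residual_norm(center(train[3], means), p)
anomaly_res = residual_norm(center([15.0, 45.0, 45.0], means), p)  # spike on link 2 only
```

Normal measurements lie almost entirely in the normal subspace and leave a residual near zero, while the single-link spike leaves a large residual that a threshold would flag.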


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes substantially

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this approach aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source [22]

Port scan anomaly (the traffic feature changes, however traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure]
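The two extreme values of the entropy are easy to verify with a short sketch. The function name `sample_entropy` and the port lists are illustrative assumptions; the entropy is computed from the empirical distribution of a single traffic feature.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Entropy (in bits) of the empirical distribution of one traffic feature."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports observed in one time bin.
concentrated = [80] * 8        # all traffic to a single port
dispersed = list(range(8))     # a scan touching 8 distinct ports once each

h_low = sample_entropy(concentrated)
h_high = sample_entropy(dispersed)
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$ bits.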


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
  - Transactions between different players in the system
  - Meta information about the individuals
  - An underlying graph structure

Challenges
  - Track and model human behavior
  - Anomalies caused by adaptive human adversaries
  - Massive data sizes

Domain                        Players
Insurance (auto/health [a])   Claimant, Provider, Payer
Telecommunications            Customer, Provider
Credit Cards                  Customer, Supplier, Bank
Web Advertising               User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   - User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
  - Anomaly detection is fast (good for real time)
  - Results are easy to explain

Weaknesses
  - Need to create and maintain a large number of profiles
  - Not dynamic
  - Adequate historical data might not be available
  - Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  - Communication networks (phone, email, SMS)
  - Provider referral networks

Objective: Identify anomalous nodes

For each node extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF)
  - Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and power law relationship

E.g., Egonet Density Power Law: $N_i$ vs $E_i$, detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \qquad 1 \le \alpha \le 2$$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$)

$$S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
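The score is simple to evaluate once C and θ have been fitted. In this sketch the fit is not shown: C = 1 and θ = 1.5 are assumed values, and the egonet sizes are made up to contrast a star with a typical node.

```python
import math

def power_law_score(x, y, c, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1); needs y > 0."""
    expected = c * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1.0)

# Hypothetical fitted law E = 1.0 * N^1.5 and three egonets with N = number of neighbors.
on_the_line = power_law_score(4, 8.0, 1.0, 1.5)    # E matches the fitted value exactly
typical = power_law_score(10, 32.0, 1.0, 1.5)      # close to the fitted value of about 31.6
star = power_law_score(10, 10.0, 1.0, 1.5)         # star: only the N edges to the ego
```

A node exactly on the fitted line scores 0, a node near it scores close to 0, and the star, with far fewer edges than the law predicts, scores much higher.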


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph: healthcare data (beneficiaries vs. providers), insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$:
  - Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  - Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  - Normality score = mean of the non-diagonal entries of the similarity matrix
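One way to make the relevance and normality scores concrete is a random walk with restart solved by power iteration. The sketch below is an assumption-laden illustration: the restart probability, the iteration count, the use of the relevance scores themselves as similarity entries, and the function names are all choices made here, not details given on the slide. It also assumes every node has at least one edge.

```python
import numpy as np

def relevance_scores(B, a, restart=0.15, iters=200):
    """Random walk with restart from node a of V1.

    B is the |V1| x |V2| biadjacency matrix. Returns the visiting
    probability of every V1 node, used here as RelevanceScore(a, .).
    """
    B = np.asarray(B, dtype=float)
    n1, n2 = B.shape
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, n1:] = B
    A[n1:, :n1] = B.T
    P = A / A.sum(axis=0, keepdims=True)   # column-stochastic transition matrix
    q = np.zeros(n1 + n2)
    q[a] = 1.0                             # the walk restarts at the query node
    u = q.copy()
    for _ in range(iters):
        u = (1.0 - restart) * (P @ u) + restart * q
    return u[:n1]

def normality_score(B, t, **kwargs):
    """Mean off-diagonal relevance among the V1 neighbours of node t in V2."""
    B = np.asarray(B, dtype=float)
    S_t = np.flatnonzero(B[:, t])          # S_t = {a | <a, t> in E}
    if len(S_t) < 2:
        return float("nan")                # no pairs to compare
    R = np.array([relevance_scores(B, a, **kwargs)[S_t] for a in S_t])
    off_diagonal = ~np.eye(len(S_t), dtype=bool)
    return float(R[off_diagonal].mean())
```

A node $t$ whose neighbours come from different communities receives a lower normality score than one whose neighbours are mutually relevant, which is what makes its links candidates for being anomalous.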

[Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} joined by the edge set E]


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks (anthrax attack)

Data: emergency department visits, grocery data (example: [14]), clinical visits, weather/climate data

Challenges:
  - Weak signals in the data: ED cases involving cough ⇒ flu or SARS
  - Integration of multiple signals (lag analysis)
  - Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, over environmental and response variables
2. Sample from the BN ($DB_{baseline} \mid$ current environment)
3. Compute contingency tables for rules on $DB_{baseline}$ and $DB_{current}$; rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$-test; null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value. Repeat Step 2
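Steps 3 to 5 can be illustrated for a single rule with a 2×2 contingency table. This is a sketch under stated assumptions: it uses Pearson's $\chi^2$ statistic with one degree of freedom (whose tail probability is `erfc(sqrt(x / 2))`), it ignores small-count corrections, and the function names and the rule names in the usage note are hypothetical.

```python
import math

def chi2_pvalue_2x2(a, b, c, d):
    """p-value of Pearson's chi-square test for the table [[a, b], [c, d]].

    Rows: DB_current and DB_baseline.
    Columns: records that match the rule, records that do not.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0                       # degenerate table: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def most_significant_rule(tables):
    """tables maps a rule name to its (a, b, c, d) counts; returns (rule, p-value)."""
    pvalues = {rule: chi2_pvalue_2x2(*counts) for rule, counts in tables.items()}
    best = min(pvalues, key=pvalues.get)  # smallest p-value = most surprising rule
    return best, pvalues[best]
```

With a rule that matches 30 of 100 current records but only 10 of 100 baseline records, the statistic is 12.5 and the p-value falls well below 0.001, so that rule would be reported ahead of a rule whose counts are identical in both databases.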


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships:
  - What happened yesterday?
  - What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
  - Add nodes for temporal and spatial dependencies
  - Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:
  - Identify natural and anthropogenic disasters
  - Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  - Ground observations
  - Remote sensing data (satellites, air-borne)
  - Climate model simulation outputs
  - Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
  - Model spatio-temporal relationships across multiple variables
  - Explain the cause of anomalies
  - Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  - Difference from a "base period" (too simplistic)
  - Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  - Removes seasonality

Understand climate and weather phenomena
  - Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  - Mean of all data for location $i$
  - Monthly mean values (account for seasonality)
  - Monthly z-score values
  - Median (more robust)
  - Using a shorter "reference period"
  - 30 year moving window

Different methods show statistically significant differences [19]
  - What is the right strategy?
  - Weighted mean of different strategies (pick weights using Monte Carlo sampling)
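Two of the listed choices for $b_i$, monthly means and monthly z-scores, can be sketched as follows. The example assumes a single location, a monthly series that starts in January and covers at least one full year, and a hypothetical function name `monthly_anomalies`.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, zscore=False):
    """Anomaly series v'_i = v_i - b_i, with b_i the mean of the same calendar month.

    values: monthly observations for one location, starting in January.
    With zscore=True each anomaly is also divided by that month's
    standard deviation (a month with zero spread is left unscaled).
    """
    base = [mean(values[m::12]) for m in range(12)]
    if zscore:
        scale = [pstdev(values[m::12]) or 1.0 for m in range(12)]
    else:
        scale = [1.0] * 12
    return [(v - base[k % 12]) / scale[k % 12] for k, v in enumerate(values)]
```

Subtracting the calendar-month mean removes the seasonal cycle, so a series that is purely seasonal maps to an all-zero anomaly series.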


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatio-temporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time

Matlab: bwlabel, bwlabeln
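The two-step approach can be sketched without Matlab using a breadth-first search over a space-time grid. This is an illustration only: it assumes a nested-list array indexed as `[time][row][col]`, face adjacency (6-connectivity) between cells, and hypothetical names and thresholds.

```python
from collections import deque

def label_components(mask):
    """Group True cells of a [time][row][col] array into 6-connected components.

    Returns a dict mapping each label (1, 2, ...) to its list of (t, r, c) cells.
    """
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), {}
    for start in ((t, r, c) for t in range(T) for r in range(R) for c in range(C)):
        if not mask[start[0]][start[1]][start[2]] or start in seen:
            continue
        seen.add(start)
        queue, cells = deque([start]), []
        while queue:                      # breadth-first flood fill
            t, r, c = queue.popleft()
            cells.append((t, r, c))
            for dt, dr, dc in steps:
                nt, nr, nc = t + dt, r + dr, c + dc
                if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                        and mask[nt][nr][nc] and (nt, nr, nc) not in seen):
                    seen.add((nt, nr, nc))
                    queue.append((nt, nr, nc))
        components[len(components) + 1] = cells
    return components

def drought_candidates(precip, threshold, min_size):
    """Step 1: threshold precipitation. Step 2: keep large space-time components."""
    mask = [[[v < threshold for v in row] for row in grid] for grid in precip]
    return [cells for cells in label_components(mask).values() if len(cells) >= min_size]
```

The `min_size` filter is what turns isolated dry cells into candidate droughts: only components that persist across neighbouring cells or consecutive time steps survive.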

Follow-up science question: explain the cause

Figure: drought video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  - Water vapor content
  - Anomalies using a threshold
  - Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  - Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  - Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  - Anomalies are not actionable:
      - Not identified in a timely fashion
      - Resolution is not fine enough
      - Cause not explained
  - Anomalies lost among false positives

Solution:
  - Good validation data during design
  - Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Density-based anomaly

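Calculate the density of an object based on the density of its k nearest neighbours:

$$\mathrm{density}(p) = \frac{1}{\left( \sum_{q \in N_k(p)} \mathrm{dist}_k(p, q) \right) / |N_k(p)|}$$

$$\text{relative-density}(p) = \frac{\mathrm{density}(p)}{\frac{1}{|N_k(p)|} \sum_{q \in N_k(p)} \mathrm{density}(q)}$$

$$\text{anomaly-score}(p) = \frac{1}{\text{relative-density}(p)}$$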

LOF indicates a degree of local outlier-ness [6]
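The three formulas translate almost line by line into code. The sketch below follows the simplified definitions on this slide, using plain Euclidean distances to the k nearest neighbours; the original LOF in [6] uses reachability distances instead. It is a brute-force $O(n^2)$ illustration with hypothetical function names, and it assumes there are no duplicate points.

```python
import math

def knn(points, i, k):
    """(distance, index) pairs of the k nearest neighbours of points[i]."""
    dists = sorted((math.dist(points[i], q), j)
                   for j, q in enumerate(points) if j != i)
    return dists[:k]

def density(points, i, k):
    """Inverse of the mean distance from points[i] to its k nearest neighbours."""
    nn = knn(points, i, k)
    return len(nn) / sum(d for d, _ in nn)

def anomaly_score(points, i, k):
    """1 / relative-density: mean neighbour density divided by own density."""
    nn = knn(points, i, k)
    neighbour_density = sum(density(points, j, k) for _, j in nn) / len(nn)
    return neighbour_density / density(points, i, k)
```

A point inside a uniform cluster scores close to 1, because its density matches that of its neighbours; a point far from the cluster scores much higher.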


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

We require a method combining the strengths of distance-based and density-based approaches: a distance-based approach which can capture density

[Figure: two clusters, C1 and C2, with outliers O1 and O2]


Commute time

Commute time between $i$ and $j$ is the expected number of steps that a random walk starting at $i$ will take to reach $j$ once and go back to $i$ for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$:

$$c_{ij} = V_G (e_i - e_j)^T L^{+} (e_i - e_j)$$

  - $L^{+}$: pseudo-inverse of $L$
  - $V_G$: graph volume
  - $e_i$: $i$-th column of the identity matrix

Commute time is a squared Euclidean distance in the space spanned by the eigenvectors of $L$:

$$c_{ij} = V_G \left[ (S^{-1/2} V^T)(e_i - e_j) \right]^T \left[ (S^{-1/2} V^T)(e_i - e_j) \right]$$

  - $V$, $S$: eigenvectors and eigenvalues of $L$
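The pseudo-inverse formula can be evaluated for all pairs at once. This is a dense, cubic-time sketch for small graphs, assuming a connected undirected graph given as a symmetric adjacency matrix and taking the graph volume to be the sum of all degrees; the function name is hypothetical.

```python
import numpy as np

def commute_times(A):
    """All-pairs commute times c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j).

    Assumption: A is the symmetric weighted adjacency matrix of a
    connected graph.
    """
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A               # graph Laplacian
    L_plus = np.linalg.pinv(L)             # Moore-Penrose pseudo-inverse
    volume = degrees.sum()                 # V_G: sum of all degrees
    d = np.diag(L_plus)
    # (e_i - e_j)^T L^+ (e_i - e_j) = L^+_ii + L^+_jj - 2 L^+_ij
    return volume * (d[:, None] + d[None, :] - 2.0 * L_plus)
```

On the three-node path 0-1-2 with unit weights, the volume is 4, so adjacent nodes have commute time 4 and the two end nodes have commute time 8.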


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph $G$ from the dataset

Compute the Laplacian matrix $L$ of $G$ and its eigensystem

Find the top $N$ anomalies using the distance-based technique in commute time, with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure where we can compute the commute time between two nodes in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections
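A random projection can be sketched with a Gaussian matrix scaled by $1/\sqrt{k}$, where $k$ is the target dimension. This is a generic Johnson-Lindenstrauss style illustration in pure Python, not the specific projection used by PINN; the function name and the seed parameter are hypothetical.

```python
import math
import random

def random_projection(points, target_dim, seed=0):
    """Project points to target_dim dimensions with a Gaussian random matrix.

    Entries are drawn from N(0, 1 / target_dim), so squared distances
    are preserved in expectation.
    """
    rng = random.Random(seed)
    d = len(points[0])
    scale = 1.0 / math.sqrt(target_dim)
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(d)]
         for _ in range(target_dim)]
    return [[sum(r * x for r, x in zip(row, p)) for row in R] for p in points]
```

Pairwise distances in the projected space stay close to the original ones with high probability, which is what allows a nearest-neighbour index to be built in the lower-dimensional space.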


PINN Algorithm (ICDM 2010)


PINN Guarantee

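The PINN algorithm provides probabilistic guarantees

Under certain assumptions about the intrinsic dimensionality ($c$), with high probability:

$$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ denotes the LOF score of $p$ computed in the randomly projected space

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust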


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
  - Using CUDA: Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  - Utilization of device memory
  - Minimize CPU⇔GPU transfer of data
  - Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA programming model and memory hierarchy. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2), each containing threads Thread(0,0) to Thread(2,3). The device has streaming multiprocessors (SM 0 to SM 2), each with stream processors (SP 0 to SP 7), registers, and shared memory, plus global, texture, and constant memory; the host has its own host memory.]


Implementing DB(k, N) on GPUs

Return the top $N$ data instances whose distances to their $k$-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block $i$ and block $j$ to shared memory
  - Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  - Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
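For reference, the serial $O(n^2)$ baseline that the GPU version parallelizes can be sketched as below. This is a CPU-only illustration with a hypothetical function name; it does not model blocks, shared memory, or host-to-device transfers.

```python
import heapq
import math

def db_k_n(points, k, top_n):
    """Return the top_n (score, index) pairs, largest score first.

    The score of a point is the distance to its k-th nearest neighbour.
    Assumption: k is at most len(points) - 1.
    """
    scores = []
    for i, p in enumerate(points):
        # One row of the pairwise distance matrix.
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        kth = heapq.nsmallest(k, dists)[-1]
        scores.append((kth, i))
    return heapq.nlargest(top_n, scores)
```

Each iteration of the outer loop is independent of the others, which is the property that makes the pairwise-distance step a natural fit for one thread per pair of instances.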

[Figure: data matrix of n instances with d dimensions and the n × n pairwise distance matrix]


Moving Beyond Multi-dimensional Record Data

Categorical (mixed): fraud detection, cyber networks

Time series: sensor networks, healthcare

Discrete sequences: genomic, system calls

Spatial: GIS, image analysis

Spatio-temporal: remote sensing, climate

Graphs: social networks, epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

- Binarize data
- Learn rules ($X \Rightarrow Y$)
- Choose high-confidence rules ($P(Y \mid X)$)
- For a test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y \mid X)$
- Anomalous if $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]

- Learn Bayesian network structure and parameters
- Compute $P(Z)$ for a test data record $Z$
- Flag an anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

- Use a similarity measure $S(X_1, X_2)$
- Apply a distance-, density-, or clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$, with $A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and the combination is observed in test record $t$, then $t$ is anomalous

For a test record $t$:
- For each mutually exclusive pair of attribute sets $A$, $B$, compute $r(a_t, b_t)$
- Score $t$ based on all r-values
  - Assign the minimum r-value as the score
  - Take the product of all r-values

Need to compare an exponential number of pairs of subsets:
- Only consider subsets up to size $k$
- Ignore subsets with frequency less than a threshold $\alpha$
- Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

- $C(a_t)$: number of training instances with $A = a_t$
- $N$: total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

- Replace rare attribute values with a generic attribute value (reduces arity)
- Use an efficient data structure for querying counts (AD Trees [26])
- AD Trees work faster for low-arity attributes
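
The Laplace-smoothed ratio above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [18]: it only compares pairs of single attributes (subset size 1) and uses the minimum r-value as the score. The names `laplace` and `cpt_score` and the tuple-per-record layout are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def cpt_score(train, record):
    """Minimum r-value over all pairs of single attributes (hypothetical helper).

    train  : list of tuples of categorical values (training instances)
    record : tuple of categorical values (test record t)
    """
    n = len(train)
    d = len(record)
    # C(a_t): per-attribute value counts in the training data
    single = [Counter(row[i] for row in train) for i in range(d)]
    r_values = []
    for i, j in combinations(range(d), 2):
        # C(a_t, b_t): training rows that match the record on both attributes
        joint = sum(1 for row in train
                    if row[i] == record[i] and row[j] == record[j])
        r = laplace(joint, n) / (
            laplace(single[i][record[i]], n) * laplace(single[j][record[j]], n))
        r_values.append(r)
    return min(r_values)  # low score = unusual combination
```

A record whose attribute values are individually common but never co-occur in training (for example, an urban habitat together with an edible type in the mushroom rows above) receives a lower score than a record seen in training.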


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

Example sequences:
login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
- Plug into a distance-, density-, or clustering-based method

Simplest: Hamming distance

$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

- Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$

- The standard dynamic programming method is slow
- Faster versions are available (Hunt-Szymanski method [7])

Weaknesses
- Cannot localize anomalies within a sequence
- Weak anomaly signals might get lost
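
Both distances above can be sketched as follows. This is an illustrative version using the standard quadratic dynamic program for the LCS, not the faster Hunt-Szymanski method of [7]; the function names are chosen for the example, and both sequences are assumed to be non-empty.

```python
import math


def hamming(a, b):
    """H(A, B): number of positions where equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """|LCS(A, B)| via the standard O(|A||B|) dynamic program."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|); assumes non-empty inputs."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))
```

The resulting distance can be passed to any distance-based detector, for instance as a precomputed distance matrix for a nearest-neighbor method.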


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$ windows)

Training (creating a normal dictionary)
- Store all unique windows in all normal sequences and their counts

Testing
- For each window, find its frequency in the normal dictionary
- Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
- For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
- Penalizes low-frequency windows in the normal dictionary
- Rewards high-frequency windows that might not be relevant
- Can construct anomalous sequences that will escape detection
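
The training and testing steps above can be sketched as below. The helper names are illustrative, and the exact scoring form `1 / (1 + mean frequency)` is an assumption made here to avoid division by zero when no test window occurs in the dictionary; the slides only state that the score is the inverse of the length-normalized aggregate frequency.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Normal dictionary: unique windows and their counts."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k):
    """Inverse of the mean window frequency (assumed form: 1 / (1 + mean))."""
    ws = windows(seq, k)
    if not ws:
        raise ValueError("sequence is shorter than the window size")
    mean_freq = sum(dictionary[w] for w in ws) / len(ws)
    return 1.0 / (1.0 + mean_freq)
```

With the login sequences shown earlier as training data and `k = 3`, a sequence that repeats `login pwd` without any activity contains only unseen windows and receives the maximum score of 1.0.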


Using Probabilistic Models

Probability of occurrence of sequence S

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

- $f$ is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?
- Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
- Significantly reduces the size of the model
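
A rough sketch of the short-memory model with back-off to the longest sufficiently frequent suffix is given below. It is a simplification and not a probabilistic suffix tree as in [32]: counts are kept in a flat dictionary, the `min_count` and `floor` parameters are hypothetical, and the score is the negative mean log-probability, used here as one monotone stand-in for "inverse of the normalized probability".

```python
import math
from collections import Counter


def ngram_counts(seqs, max_len):
    """f(.): counts of all contiguous n-grams with 1 <= n <= max_len."""
    counts = Counter()
    for s in seqs:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[tuple(s[i:i + n])] += 1
    return counts


def cond_prob(counts, context, symbol, min_count=1, floor=1e-6):
    """Estimate P(symbol | context), backing off to shorter suffixes."""
    ctx = tuple(context)
    # Drop the oldest symbol until the suffix is frequent enough.
    while ctx and counts[ctx] < min_count:
        ctx = ctx[1:]
    if ctx:
        p = counts[ctx + (symbol,)] / counts[ctx]
    else:
        total = sum(v for key, v in counts.items() if len(key) == 1)
        p = counts[(symbol,)] / total
    return max(p, floor)  # floor keeps the logarithm finite


def anomaly_score(seq, counts, k):
    """Negative mean log-probability of the sequence under the order-k model."""
    logp = 0.0
    for i, sym in enumerate(seq):
        logp += math.log(cond_prob(counts, seq[max(0, i - k):i], sym))
    return -logp / len(seq)
```

The counts must be built with `max_len = k + 1` so that both the context and the context extended by the next symbol are available.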


Finding Contextual Anomalies

Sometimes contextual information is available about data
- Not used directly as a feature
- Well understood, no anomalies in the context
- Can reduce false positives and yield interesting anomalies

Example adapted from [30]

Figure: number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly.

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables
- Spatial (latitude, longitude)
- Graph context (edges, weights)
- Temporal location
- Domain specific (demographic, other)

How to incorporate context
- Reduce to traditional anomaly detection (subset on context)
- Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

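Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$
- $d_U$ environmental attributes
- $d_V$ indicator attributes

Figure: environmental attributes ($u_1$, $u_2$) modeled by Gaussians $U_1$, $U_2$, $U_3$ and indicator attributes ($v_1$, $v_2$) modeled by Gaussians $V_1$, $V_2$, $V_3$, with the example mapping $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$.

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$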
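
The scoring step (step 4) can be illustrated with one-dimensional Gaussians. The sketch below assumes that the Gaussians and the mapping $p(V_j \mid U_i)$ have already been learned (steps 1 to 3 are not shown); the parameter layout and the temperature/fever numbers in the usage note are invented for illustration and do not come from [30].

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def conditional_score(u, v, env, ind, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    env     : list of (mean, std) for the environmental Gaussians U_i
    ind     : list of (mean, std) for the indicator Gaussians V_j
    mapping : mapping[i][j] = p(V_j | U_i)
    """
    total = 0.0
    for i, (mu_u, sd_u) in enumerate(env):
        inner = sum(gauss_pdf(v, mu_v, sd_v) * mapping[i][j]
                    for j, (mu_v, sd_v) in enumerate(ind))
        total += gauss_pdf(u, mu_u, sd_u) * inner
    return total
```

For example, with a "cold" and a "hot" temperature component and a mapping in which many fever cases are likely only in cold weather, a day with many fever cases scores much lower (more anomalous) when the temperature is high than when it is low.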


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure, and

When the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples

1. Time series
2. Image
3. Spatial clusters of galaxies
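
The three-step simple solution can be sketched for a time series as follows. The choices made here are assumptions for illustration only: groups are fixed-size, non-overlapping windows, the group feature is the within-window standard deviation, and the "traditional" detector is a plain z-score threshold.

```python
import statistics


def group_features(series, size):
    """Steps 1 and 2: split into windows and compute one feature per window."""
    groups = [series[i:i + size]
              for i in range(0, len(series) - size + 1, size)]
    return [statistics.pstdev(g) for g in groups]


def flag_groups(series, size, threshold):
    """Step 3: flag windows whose feature has a large z-score."""
    feats = group_features(series, size)
    mu = statistics.mean(feats)
    sigma = statistics.pstdev(feats)
    if sigma == 0:
        return []  # all windows look alike
    return [i for i, f in enumerate(feats) if abs(f - mu) / sigma > threshold]
```

In an alternating 0/1 series where one window is replaced by a flat run of 0.5, every individual value lies inside the normal range, yet the flat window is flagged because its variability differs from that of the other windows.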


Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

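Example: spatial clusters of galaxies
- topics: red, green, emissive
- words: continuous features

Flexible Genre Model (FGM)

For each group $m$:

1. Draw a genre $y_m \sim \mathcal{M}(\pi)$, $y_m \in \{1, 2, \ldots, T\}$
2. Draw the topic distribution for genre $y_m$: $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$, $\theta_m \in S^K$
3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point $n$ in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$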

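Figure: plate diagram of the FGM with variables $\alpha$, $\pi$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$ and plates over $T$, $K$, $M$, $N$.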

Model Parameters

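- $\mathcal{M}(\pi)$: Multinomial
- Genre: $\mathrm{Dir}(\alpha_t)$
- Topic generators $P(\cdot \mid \eta_k)$: GIW (Gaussian-Inverse-Wishart)
- Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian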


Inference and Testing for FGM

Inference and Learning Parameters

- Approximate inference of latent variables (Gibbs sampling)
- Use samples to learn parameters (single-step Monte Carlo EM)

Figure: illustration with Topic 1, Topic 2, and Topic 3.

Anomaly Detection

- Infer the topic distribution $\theta_m$
- Compute the negative log likelihood w.r.t. $\alpha_t$
- Rationale: an anomalous group will be unlikely to be generated from any genre
- Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
- Confusion matrix
- Traditional evaluation metrics
  - Class imbalance
- ROC curve

Validation set does not exist
- Use domain expertise to find TP and FP
- FN is harder to estimate
  - Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
Actual a      TP            FN
Actual n      FP            TN

$Acc = \frac{TP + TN}{\sum}$

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
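
As a small worked example of why class imbalance matters for these metrics, the sketch below computes all four from confusion-matrix counts. The function name and the counts used in the usage note are made up for illustration.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    f = 2 * rec * prec / (rec + prec) if (rec + prec) else 0.0
    return {"accuracy": acc, "recall": rec, "precision": prec, "f": f}
```

With 10 true anomalies among 1000 records and counts `tp=5, fn=5, fp=10, tn=980`, accuracy is 0.985 although only half of the anomalies are found (recall 0.5) and two thirds of the alarms are false (precision 1/3, F = 0.4).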


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
- Use a threshold $\delta$ on the score (scale issues [21])
- Take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output
- Does the output "suggest" $x$ or $\delta$?
- Which output is better?


Unifying Scores

Different methods assign scores in different ranges
- kNN-based scores: [0, 1]
- LOF scores: [1, 6]
- ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
- Regularity $\Rightarrow$ $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
- Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
   - Suited for high dimensional data
3. $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
   - Where $cdf^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and $P$ is the regularized Gamma function
   - Suited for low dimensional data
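
Two of the transformations above can be sketched with the standard library alone: log inversion as a regularization for scores where lower means more anomalous, and Gaussian scaling as a normalization. Estimating $\mu_S$ and $\sigma_S$ from the scores themselves, and requiring strictly positive scores for the logarithm, are assumptions of this sketch.

```python
import math
import statistics


def log_inversion(scores):
    """R(o) = -log(S(o) / S_max); assumes all scores are strictly positive."""
    s_max = max(scores)
    return [-math.log(s / s_max) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # no score stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0))))
            for s in scores]
```

Scores at or below the mean are mapped to 0 by Gaussian scaling, so only the upper tail is stretched over (0, 1]; an ABOD-like score would first be regularized with `log_inversion` so that larger values indicate anomalies.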


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
- Multivariate continuous data [27]
- Multivariate categorical data [5]
- Discrete sequences using HMM [12]

Drawbacks
- Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

- Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
- Take a random sample $D$ of the entire data set
- Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
- Replace $x$ with $x'$ and add $z$ back to the data set

Figure: points $z$ and $z'$ in the $x$-$y$ plane and the resulting injected anomaly.
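
The random perturbation steps can be sketched as below, following the slide's wording. Representing each record as a pair of tuples `(x, y)`, using Euclidean distance on `y`, and passing an explicit `random.Random` instance are assumptions made for the example.

```python
import math
import random


def inject_anomaly(z, data, sample_size, rng):
    """Return a perturbed copy of z = (x, y).

    A random sample D is drawn from `data`; the record z' = (x', y') whose
    y' is farthest from y donates its x', giving the injected point (x', y).
    """
    x, y = z
    sample = rng.sample(data, min(sample_size, len(data)))
    x_far, _ = max(sample, key=lambda rec: math.dist(y, rec[1]))
    return (x_far, y)
```

The injected point pairs an `x` part and a `y` part that are each drawn from real records, so it looks normal in either partition alone and is unusual only in combination.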


Applications Overview

How to set up an anomaly detection solution for a given application domain?

- Available data
- Define anomalies, define normal behavior
- Identify requirements and constraints (online, real-time, limited resources)
- What domain knowledge is available
  - Feature identification
  - Defining normal and anomalous behavior
  - Tuning parameters
- Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; the system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: bar chart of counts per period (values recovered from the chart labels)

Period                  Count
2002                    20,547
2003                    18,827
2004                    69,107
2005                    113,025
2006                    140,690
2007                    624,267
2008                    1,656,227
July 2009 - June 2010   2,704,836


Network Topology

A typical network

Figure: a router connected to nodes 1 to 4, showing the link measurement $y_{1t}$ and the OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$.

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from a single OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$\hat{Z} = PP^{T}Z = CZ$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$\tilde{Z} = (I - PP^{T})Z = \tilde{C}Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
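
The five steps can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation from [22]: rows of `Y` are time bins, columns are link measurements, the data is mean-centered before projection, and choosing `k` and the threshold on the residual norm is left to the caller.

```python
import numpy as np


def subspace_residual_norms(Y, k):
    """Norm of the residual (abnormal-subspace) component for each row of Y."""
    Yc = Y - Y.mean(axis=0)                      # center each column
    cov = np.cov(Yc, rowvar=False)               # covariance of the columns
    eigvals, eigvecs = np.linalg.eigh(cov)       # Step 1: eigendecomposition
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                          # Step 2: top-k PCs
    normal = Yc @ P @ P.T                        # Step 3: normal component
    residual = Yc - normal                       # Step 4: (I - P P^T) applied
    return np.linalg.norm(residual, axis=1)      # Step 5: size of the residual
```

Time bins whose residual norm exceeds a chosen threshold are reported as anomalous; the normal component absorbs the dominant, regular traffic pattern.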


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

- DoS attack: multiple source IP addresses concentrated on a single destination IP address
- Network scan: dispersed distribution of destination addresses
- Most worms/viruses also induce some change in the distribution of certain features
- However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this work aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]

Figure: a port scan anomaly. The traffic feature distribution changes, while the traffic volume remains more or less the same. The port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure.

We can use entropy to capture the variations in the traffic feature
- Takes value 0 when the distribution is maximally concentrated
- Takes value $\log_2 N$ when the distribution is maximally dispersed
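
The two boundary cases of entropy mentioned above are easy to check numerically. The sketch below computes the empirical (sample) entropy of one traffic feature, for instance the destination ports observed in a time bin; the function name and the toy port lists in the usage note are illustrative.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Empirical entropy (in bits) of the distribution of observed values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A bin in which every packet goes to port 80 has entropy 0, while a scan that touches 8 distinct ports once each has entropy $\log_2 8 = 3$ bits, even though the packet count may be similar.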


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data
- Transactions between different players in the system
- Meta information about the individuals
- An underlying graph structure

Challenges
- Track and model human behavior
- Anomalies caused by adaptive human adversaries
- Massive data sizes

Domain                        Players
Insurance (auto/health) (a)   Claimant, Provider, Payer
Telecommunications            Customer, Provider
Credit Cards                  Customer, Supplier, Bank
Web Advertising               User, Advertiser, Publisher

(a) http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   - Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
- Anomaly detection is fast (good for real time)
- Results are easy to explain

Weaknesses
- Need to create and maintain a large number of profiles
- Not dynamic
- Adequate historical data might not be available
- Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
- Communication networks (phone, email, SMS)
- Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of the node
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF)
- Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power law relationship

E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ (detects near-cliques and stars)

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ w.r.t. a pair of features ($y = Cx^{\theta}$)

$S_i = \frac{\max(y_i,\ Cx_i^{\theta})}{\min(y_i,\ Cx_i^{\theta})} \cdot \log\left(|y_i - Cx_i^{\theta}| + 1\right)$
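
The egonet features and the score above can be sketched for an unweighted graph as follows. The adjacency representation (a dict mapping each node to the set of its neighbors, with comparable node labels) is an assumption of this example, and the power law parameters `C` and `theta` are taken as given; in practice they would be fitted to the data, for example by regression on the log-log feature values.

```python
import math


def egonet_features(adj, node):
    """Return (N_i, E_i): degree of the node and number of edges in its egonet."""
    members = adj[node] | {node}
    # Count each undirected edge inside the egonet once (u < v).
    edges = sum(1 for u in members for v in adj[u] if v in members and u < v)
    return len(adj[node]), edges


def power_law_score(y, x, C, theta):
    """S = max(y, Cx^theta) / min(y, Cx^theta) * log(|y - Cx^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)
```

A node that lies exactly on the fitted power law gets a score of 0; the center of a star (too few egonet edges for its degree) and a member of a clique (too many) both move away from the fitted line and receive positive scores.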


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
- Healthcare data (beneficiaries vs. providers)
- Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
- RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality scores for a node $t \in V_2$
- Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
- Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
- Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node set $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$, node set $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$, and edges $E$ between them.


Detecting Disease Outbreaks

Domain Question
- Early detection of disease outbreaks
- Anthrax attack

Data
- Emergency department visits
- Grocery data (example: [14])
- Clinical visits
- Weather/climate data

Challenges
- Weak signals in the data
  - ED cases involving cough $\Rightarrow$ flu or SARS?
- Integration of multiple signals (lag analysis)
- Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   - Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules on $DB_{baseline}$ and $DB_{current}$
   - Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
   - Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat from Step 2
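
Step 4 can be illustrated for a single rule with a 2x2 contingency table. The sketch below is a plain Pearson $\chi^2$ test with one degree of freedom and no continuity correction, written with the standard library; the table layout and the counts in the usage note are invented for illustration and are not taken from [33].

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the table

                      rule matches   rule does not match
        DB_current         a                 b
        DB_baseline        c                 d
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

If a rule matches 30 of 100 current records but only 10 of 100 baseline records, the statistic is 12.5 and the p-value is well below 0.001, so the rule would rank as highly significant.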


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
- What happened yesterday?
- What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
- Add nodes for temporal and spatial dependencies
- Needs more data to train


Anomaly Detection in Climate and Weather

Science Questions
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location $i$
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
  30 year moving window

Different methods show statistically significant differences [19]

What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
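
As a minimal sketch of two of the base choices listed on this slide (monthly mean and monthly z-score), the following Python function subtracts a per-calendar-month base from each observation. The function name, the `use_zscore` flag and the toy data are assumptions made for illustration.

```python
from statistics import mean, pstdev


def monthly_anomalies(values, months, use_zscore=False):
    """Return v'_i = v_i - b_i, with b_i the mean of the same calendar month.

    values: observations for one location, in time order.
    months: calendar month (1..12) of each observation.
    use_zscore: if True, also divide by the per-month standard deviation.
    """
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    spread = {m: pstdev(vs) for m, vs in by_month.items()}

    anomalies = []
    for v, m in zip(values, months):
        a = v - base[m]
        if use_zscore:
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies


# Toy data: two Januaries and two Julys.
values = [10.0, 20.0, 12.0, 24.0]
months = [1, 7, 1, 7]
print(monthly_anomalies(values, months))
print(monthly_anomalies(values, months, use_zscore=True))
```

Subtracting the monthly mean removes the seasonal cycle; the z-score variant additionally puts months with different variability on a common scale.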


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
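
The two step approach can be sketched in plain Python as a threshold followed by a breadth-first labelling of connected cells in a (time, y, x) grid. This is only an illustration under stated assumptions: the function name, the `min_size` parameter, the use of 6-connectivity and the toy grid are not from the tutorial, and the connectivity may differ from what is used with `bwlabeln`.

```python
from collections import deque


def drought_components(precip, threshold, min_size=1):
    """Label connected 'dry' regions in precip[t][y][x].

    A cell is dry when its precipitation anomaly is below `threshold`.
    Cells are connected when they are adjacent in time, y or x.
    Returns a list of components, each a list of (t, y, x) cells.
    """
    T, H, W = len(precip), len(precip[0]), len(precip[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    components = []
    for t in range(T):
        for y in range(H):
            for x in range(W):
                if precip[t][y][x] >= threshold or (t, y, x) in seen:
                    continue
                seen.add((t, y, x))
                queue = deque([(t, y, x)])
                comp = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    comp.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        inside = 0 <= nt < T and 0 <= ny < H and 0 <= nx < W
                        if (inside and (nt, ny, nx) not in seen
                                and precip[nt][ny][nx] < threshold):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                if len(comp) >= min_size:
                    components.append(comp)
    return components


# Toy anomalies: 2 time steps on a 2 x 3 grid.
precip = [
    [[-2, -2, 0], [0, 0, 0]],
    [[-2, 0, 0], [0, 0, -2]],
]
print(drought_components(precip, threshold=-1, min_size=2))
```

Keeping only components above a minimum size is one simple way to favour persistent, spatially extended events over isolated dry cells.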


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example src: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Strengths and Weaknesses

Can detect global and local anomalies

Cannot use the pruning technique and has a complexity of $O(n^2)$

Requires a method combining the strengths of distance-based and density-based approaches: a distance-based approach which can capture density

[Figure: clusters C1 and C2 with outliers O1 and O2]


Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$

$$c_{ij} = V_G (e_i - e_j)^T L^+ (e_i - e_j)$$

  $L^+$: pseudo-inverse of $L$
  $V_G$: graph volume
  $e_i$: i-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of $L$

$$c_{ij} = V_G [(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$$

  $V$, $S$: eigenvectors and eigenvalues of $L$
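
A small Python sketch can make the first formula concrete. Instead of forming the pseudo-inverse explicitly, it uses the equivalent effective-resistance view: ground node j, solve the reduced Laplacian system for a unit current injected at i, and scale by the graph volume. The function name and the dense Gaussian elimination are assumptions of this sketch, suitable only for tiny connected graphs without self-loops.

```python
def commute_time(weights, i, j):
    """Commute time c_ij = V_G * (e_i - e_j)^T L^+ (e_i - e_j).

    weights: symmetric adjacency matrix (list of lists) of a connected graph.
    The quadratic form equals the effective resistance between i and j,
    computed here by grounding node j and solving L_reduced x = e_i.
    """
    n = len(weights)
    if i == j:
        return 0.0
    degree = [sum(row) for row in weights]
    volume = sum(degree)  # V_G: sum of all degrees
    keep = [v for v in range(n) if v != j]
    m = len(keep)
    # Augmented matrix [L_reduced | e_i], with L = D - W.
    A = [
        [(degree[r] if r == c else 0.0) - weights[r][c] for c in keep]
        + [1.0 if r == i else 0.0]
        for r in keep
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m + 1):
                A[r][c] -= f * A[col][c]
    # Back substitution.
    x = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = A[r][m] - sum(A[r][c] * x[c] for c in range(r + 1, m))
        x[r] = s / A[r][r]
    return volume * x[keep.index(i)]


# Path graph 0 - 1 - 2 with unit weights.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(commute_time(path, 0, 2))
print(commute_time(path, 0, 1))
```

On the path graph the two end nodes are twice as far apart as adjacent nodes, and the commute time reflects that.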


Anomaly detection using commute time (CDOF)

Construct the mutual k nearest neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find top N anomalies using the distance-based technique in commute time with pruning rule

Complexity: $O(n^3)$

Commute time method can detect global, local and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure where we can compute the commute time between two nodes in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near linear time solver for a system of linear equations $Ax = b$

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

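Under certain assumptions about the intrinsic dimensionality (c), with high probability

$$\frac{1-\epsilon}{1+\epsilon} \cdot LOF(p) \le \widehat{LOF}(p) \le \frac{1+\epsilon}{1-\epsilon} \cdot LOF(p)$$

(here $\widehat{LOF}(p)$ is taken to denote the LOF estimate computed in the projected space)

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust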


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data parallel algorithms
  Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  Utilization of device memory
  Minimize CPU ⇔ GPU transfer of data
  Keep threads homogeneous

Most model based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution and memory model. The host launches a kernel on the device as a grid of thread blocks, each block containing threads; the device has streaming multiprocessors (SM) with registers, stream processors (SP) and shared memory, plus global, texture and constant memory.]


Implementing DB(k, N) on GPUs

Return top N data instances whose distances to the k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances
  Load block i and block j to shared memory
  Data layout in memory should be optimized
  Each thread computes distance between a pair of instances
  Can utilize this time to load next chunk of data from host to device memory
  Writes results to corresponding output block

Sorting can be done efficiently in CUDA [29]
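
For reference, the serial $O(n^2)$ baseline that the GPU version parallelizes can be sketched in a few lines of Python. This is a CPU illustration only, not the CUDA implementation discussed on the slide; the function name and toy points are assumptions.

```python
import heapq
import math


def top_n_outliers(points, k, n_top):
    """Serial DB(k, N): rank points by distance to their k-th nearest neighbor.

    Returns a list of (k-th neighbor distance, point index), largest first.
    Requires len(points) > k.
    """
    scores = []
    for i, p in enumerate(points):
        # All pairwise distances from p; this loop is what a GPU tiles into blocks.
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        kth = heapq.nsmallest(k, dists)[-1]
        scores.append((kth, i))
    return heapq.nlargest(n_top, scores)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(points, k=2, n_top=1))
```

Each row of the pairwise distance computation is independent of the others, which is why the problem maps naturally onto data parallel hardware.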


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape  cap-surface  ···  habitat  type
convex     smooth       ···  urban    poisonous
convex     smooth       ···  grasses  edible
bell       smooth       ···  meadows  edible
convex     scaly        ···  urban    poisonous
convex     smooth       ···  grasses  edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  Binarize data
  Learn rules ($X \Rightarrow Y$)
  Choose high confidence rules ($P(Y|X)$)
  For test record $Z = \langle X, Y \rangle$ find rules of the form $P(Y|X)$
    Y is not observed when X is observed

Using Bayesian Networks [33]
  Learn Bayesian network structure and parameters
  Compute $P(Z)$ for test data record Z
  Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  Use a similarity measure ($S(X_1, X_2)$)
  Apply distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: If $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t
  For each mutually exclusive pair of attribute sets {A, B}, compute $r(a_t, b_t)$
  Score t based on all r-values
    Assign minimum r-value as score
    Take product of all r-values

Need to compare exponential pairs of subsets
  Only consider subsets up to size k
  Ignore subsets with frequency less than a threshold $\alpha$
  Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  $C(a_t)$: number of training instances with $A = a_t$
  $N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks
  Replace rare attribute values with a generic attribute (reduce arity)
  Use an efficient data structure for querying counts (AD Trees [26])
  ADTrees work faster for low arity attributes
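
The Laplace-smoothed r-value can be sketched directly from the formulas on this slide. The Python below counts by scanning the training records, which is the slow path that AD Trees are meant to replace; the function names, attribute index tuples and toy records are assumptions for illustration.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(train, test, attrs_a, attrs_b):
    """r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)).

    train: list of records (tuples of categorical values).
    test: the test record t.
    attrs_a, attrs_b: disjoint tuples of attribute indices (sets A and B).
    """
    a_t = tuple(test[i] for i in attrs_a)
    b_t = tuple(test[i] for i in attrs_b)
    n = len(train)
    c_a = c_b = c_ab = 0
    for rec in train:
        match_a = tuple(rec[i] for i in attrs_a) == a_t
        match_b = tuple(rec[i] for i in attrs_b) == b_t
        c_a += match_a
        c_b += match_b
        c_ab += match_a and match_b
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Toy training set in the spirit of the mushroom table: (cap-shape, cap-surface, type).
train = [("convex", "smooth", "poisonous")] * 4 + [("bell", "smooth", "edible")] * 4
seen_combo = ("convex", "smooth", "poisonous")
odd_combo = ("convex", "smooth", "edible")
print(r_value(train, seen_combo, (0,), (2,)))
print(r_value(train, odd_combo, (0,), (2,)))
```

A combination that never co-occurs in training gets an r-value well below 1, while a common combination scores above 1; a record would then be scored by the minimum or product of its r-values.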


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure
  Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

  Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$$

  Standard Dynamic Programming method is slow
  Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
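
Both measures are short enough to sketch in Python. The LCS length below uses the standard dynamic programming recurrence (the slow method mentioned above, not Hunt-Szymanski); the function names are assumptions, and sequences are assumed non-empty for the normalized distance.

```python
import math


def hamming(a, b):
    """H(A, B): number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence, O(|A||B|) dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A||B|), for non-empty sequences."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
print(lcs_length(s1, s2), nlcs_distance(s1, s2))
```

Unlike the Hamming distance, the LCS-based distance handles the unequal lengths of `s1` and `s2` without any alignment step.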


Using Sliding Windows

Slide a window of size k and extract all windows from a sequence (n - k + 1 windows for a sequence of length n).

Training (creating a normal dictionary):
- Store all unique windows in all normal sequences, together with their counts

Testing:
- For each window, find its frequency in the normal dictionary
- The anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist:
- For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:
- Penalizes low-frequency windows in the normal dictionary
- Rewards high-frequency windows that might not be relevant
- One can construct anomalous sequences that will escape detection
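The training and testing steps above could look roughly like this in Python. The function names are hypothetical, and the exact score `1 / (1 + mean count)` is an assumption: the slide only says "inverse of the aggregate frequencies, normalized by length", and the added 1 is there to avoid division by zero for unseen windows.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of size k, as hashable tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_sequences, k):
    """Normal dictionary: every window seen in normal data with its count."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k):
    """Inverse of the length-normalized aggregate window frequency (assumed form)."""
    wins = windows(seq, k)
    if not wins:
        raise ValueError("sequence is shorter than the window size")
    mean_count = sum(dictionary[w] for w in wins) / len(wins)
    return 1.0 / (1.0 + mean_count)
```

A sequence whose windows never occur in the normal dictionary receives the maximum score of 1.0 under this choice.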


Using Probabilistic Models

Probability of occurrence of a sequence S:

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences:

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$:

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

where f, the frequency of a subsequence, is estimated from the normal dictionary.

The anomaly score for a test sequence is the inverse of the normalized probability of occurrence.

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?
- Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
- This significantly reduces the size of the model
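A minimal sketch of the frequency-based estimate, including the back-off to a shorter suffix when the context is too rare. It is not a Probabilistic Suffix Tree: counts are kept in a flat dictionary, and the names `min_count` and `floor` are assumptions introduced here. The score is the inverse of the geometric-mean symbol probability, which is one way to read "inverse of the normalized probability".

```python
from collections import Counter
from math import exp, log


def count_ngrams(sequences, max_len):
    """Frequencies f(.) of all subsequences of length 1..max_len in normal data."""
    counts = Counter()
    for seq in sequences:
        for n in range(1, max_len + 1):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts


def cond_prob(counts, context, symbol, min_count=1, floor=1e-6):
    """f(context + symbol) / f(context), backing off to the longest frequent suffix."""
    ctx = tuple(context)
    while ctx and counts[ctx] < min_count:
        ctx = ctx[1:]  # drop the oldest symbol of the context
    if ctx:
        denom = counts[ctx]
    else:
        denom = sum(c for gram, c in counts.items() if len(gram) == 1)
    p = counts[ctx + (symbol,)] / denom if denom else 0.0
    return max(p, floor)  # floor keeps the logarithm finite (assumption)


def anomaly_score(seq, counts, k):
    """Inverse of the geometric mean of the conditional symbol probabilities."""
    logs = [log(cond_prob(counts, seq[max(0, i - k):i], s)) for i, s in enumerate(seq)]
    return exp(-sum(logs) / len(logs))
```

For a memory length k, the counts need to cover subsequences up to length k + 1, for example `count_ngrams(normal, k + 1)`.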


Finding Contextual Anomalies

Sometimes contextual information is available about the data:
- Not used directly as a feature
- Is well understood; there are no anomalies in the context
- Can reduce false positives and yield interesting anomalies

[Figure: Number of Fever Cases plotted against Maximum Daily Temperature, with the "Clearest Outlier" and the "Conditional Anomaly" marked. Example adapted from [30].]

Contextual anomalies: anomalous with respect to a context.

Context is defined using environmental variables:
- Spatial (latitude, longitude)
- Graph context (edges, weights)
- Temporal location
- Domain specific (demographic, other)

How to incorporate context:
- Reduce to traditional anomaly detection (subset on context)
- Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

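Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$
- $d_U$ environmental attributes
- $d_V$ indicator attributes

[Figure: environmental attribute space (axes u1, u2) with Gaussian components U1, U2, U3, and indicator attribute space (axes v1, v2) with components V1, V2, V3. Mapping probabilities from U3: p(V1|U3) = 0.2, p(V2|U3) = 0.2, p(V3|U3) = 0.6.]

Algorithm [30]:

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$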
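Step 4 can be illustrated with a small Python sketch. It only evaluates the scoring formula; learning the mixtures and the mapping (steps 1 to 3) is not shown. For brevity it assumes one-dimensional Gaussians given as hypothetical (mean, standard deviation) pairs and a mapping matrix with `mapping[i][j]` standing for p(V_j | U_i). A low S marks an instance whose indicator value is unlikely given its environment.

```python
from math import exp, pi, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (std * sqrt(2.0 * pi))


def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    total = 0.0
    for i, (mean_u, std_u) in enumerate(U):
        inner = sum(
            gaussian_pdf(v, mean_v, std_v) * mapping[i][j]
            for j, (mean_v, std_v) in enumerate(V)
        )
        total += gaussian_pdf(u, mean_u, std_u) * inner
    return total
```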


Finding Collective Anomalies

Find a collection of data points:
- Each point by itself is normal
- The collection as a whole is anomalous

Relevant when the data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies.

A Simple Solution

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples

1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

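Find anomalous groups in data [34].

Example: spatial clusters of galaxies
- topics: red, green, emissive
- words: continuous features

Flexible Genre Model (FGM)

For each group m:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$
2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
4. For each point n in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with nodes π, y_m, α, θ_m, z_mn, x_mn, β, η and plates N, M, K, T.]

Model Parameters
- $\mathcal{M}(\pi)$: multinomial
- Genre: $\mathrm{Dir}(\alpha_t)$
- Topic generators $P(\cdot \mid \eta_k)$: GIW
- Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian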


Inference and Testing for FGM

Inference and Learning Parameters
- Approximate inference of latent variables (Gibbs sampling)
- Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: groups plotted in topic space with axes Topic 1, Topic 2, Topic 3.]

Anomaly Detection
- Infer the topic distribution $\theta_m$
- Compute the negative log likelihood with respect to $\alpha_t$
- Rationale: an anomalous group will be unlikely to be generated from any genre
- Geometric interpretation: map each group into a T-dimensional space and find anomalies there


Evaluating Anomaly Detection Methods - Labels

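Labeled validation data set exists:
- Confusion matrix
- Traditional evaluation metrics
  - Class imbalance
- ROC curve

Validation set does not exist:
- Use domain expertise to find TP and FP
- FN is harder to estimate
  - Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$$Acc = \frac{TP + TN}{\sum}, \qquad Rec\ (R) = \frac{TP}{TP + FN}, \qquad Prec\ (P) = \frac{TP}{TP + FP}, \qquad F = \frac{2 \cdot R \cdot P}{R + P}$$

where $\sum$ is the total number of instances (TP + FN + FP + TN).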
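The four metrics translate directly into code. The sketch below uses a hypothetical function name and returns 0.0 when a denominator is zero, which is a convention chosen here rather than something the slide specifies. The usage note shows why accuracy alone is misleading under class imbalance.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if recall + precision:
        f_measure = 2 * recall * precision / (recall + precision)
    else:
        f_measure = 0.0
    return {"acc": accuracy, "rec": recall, "prec": precision, "f": f_measure}
```

For example, a detector that flags nothing on a data set with 10 anomalies and 990 normal instances, `evaluation_metrics(0, 10, 0, 990)`, reaches an accuracy of 0.99 while its recall is 0.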


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
- Use a threshold δ on the score (scale issues [21])
- Take the top x as anomalies
- ROC curve by varying x or δ

Quality of output:
- Does the output "suggest" x or δ?
- Which output is better?
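Varying the threshold δ (or the cut-off x) traces out an ROC curve. The sketch below lowers the threshold one instance at a time and records the resulting (false positive rate, true positive rate) pairs. The function names are hypothetical, labels are assumed to be 1 for anomalies and 0 for normal instances, higher scores are assumed to mean "more anomalous", and tied scores are not given any special treatment.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by admitting instances in order of decreasing score."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one anomaly and one normal instance")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```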


Unifying Scores

Different methods assign scores in different ranges:
- kNN-based scores: [0, 1]
- LOF scores: [1, 6]
- ABOD scores: [0, 80000]
  - Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies.

Desired scaling: stretch interesting ranges and shrink irrelevant ones.

Generalized Procedure for Normalizing Outlier Scores [21]
- Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
- Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization

1. $N(o) = S(o) / S_{max}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling). Suited for high dimensional data.
3. $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling), where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function. Suited for low dimensional data.
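As an illustration, the linear inversion and the Gaussian scaling can be written with the Python standard library alone. The function names are hypothetical, and estimating μ_S and σ_S as the mean and population standard deviation of the observed scores is an assumption of this sketch.

```python
from math import erf, sqrt
from statistics import mean, pstdev


def linear_inversion(scores):
    """R(o) = S_max - S(o): turns 'low score = anomalous' into 'high = anomalous'."""
    s_max = max(scores)
    return [s_max - s for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores identical: nothing stands out
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2.0)))) for s in scores]
```

Scores at or below the mean map to 0, so only the upper tail is stretched over (0, 1), which is the "stretch interesting ranges and shrink irrelevant ones" behavior asked for earlier.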


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
- Use generative models for normal and anomalous behavior
- Several generators are available:
  - Multivariate continuous data [27]
  - Multivariate categorical data [5]
  - Discrete sequences using HMM [12]
- Drawback: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
- Given a data point z = (x, y), where x and y are partitions of the feature space
- Take a random sample D of the entire data set
- Let z' = (x', y') ∈ D be such that the distance between y and y' is maximum
- Replace x with x' and add z back to the data set

[Figure: x-y plot showing z, z' and the injected anomaly.]
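The random perturbation procedure can be sketched as below. The representation is an assumption made for illustration: each data point is a pair `(x, y)` of numeric tuples, distance is Euclidean, and `sample_size` and `rng` are hypothetical parameters controlling the random sample D.

```python
import random
from math import dist


def inject_anomaly(z, data, sample_size, rng):
    """Swap the x part of z for the x' of the sampled point whose y' is farthest from y."""
    x, y = z
    sample = rng.sample(data, min(sample_size, len(data)))
    x_far, _ = max(sample, key=lambda point: dist(point[1], y))
    return (x_far, y)  # injected anomaly: x' paired with the original y
```

The returned point keeps plausible values in each partition taken separately, but the combination of x' and y is unusual, which is what makes it a useful synthetic anomaly.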


Applications Overview

How to set up an anomaly detection solution for a given application domain?
- Available data
- Define anomalies, define normal behavior
- Identify requirements and constraints (online, real-time, limited resources)
- What domain knowledge is available?
  - Feature identification
  - Defining normal and anomalous behavior
  - Tuning parameters
- Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that correspond to a predefined set of patterns, known as signatures. Such a system is therefore vulnerable to zero-day attacks.

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents.

Network Anomaly Detection: finding unusual and large changes in the network traffic.

Examples: intentional attacks (e.g. Distributed Denial of Service, DDoS) or unusual network traffic (e.g. flash crowds).


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart of yearly counts (vertical axis from 0 to 3,000,000): 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836.]


Network Topology

A typical network

[Figure: a router connected to nodes 1 to 4, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$.]

- An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$).
- A link measurement is the traffic that enters at a node during an interval ($y_{1t}$).
- The relationship between link traffic and OD flow traffic is captured by the routing matrix A.


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available.

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set.

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P.

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly".
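The five steps can be sketched with NumPy as follows. This is an illustrative reading of the slide, not the original authors' code: the function name is hypothetical, measurements are stored as rows of `Z` (the slide treats them as column vectors), and the data is mean-centered before the decomposition, which is an assumption.

```python
import numpy as np


def residual_norms(Z, k):
    """Norm of each measurement's projection onto the abnormal (residual) subspace."""
    Zc = Z - Z.mean(axis=0)                       # center each feature
    cov = np.cov(Zc, rowvar=False)                # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigendecomposition (symmetric)
    top = np.argsort(eigvals)[::-1][:k]           # Step 2: k largest eigenvalues
    P = eigvecs[:, top]
    C = P @ P.T                                   # Step 3: normal-subspace projector
    C_res = np.eye(C.shape[0]) - C                # Step 4: residual projector
    residual = Zc @ C_res                         # rows are projected measurements
    return np.linalg.norm(residual, axis=1)       # Step 5: compare against a threshold
```

Measurements whose residual norm exceeds a chosen threshold would then be reported as anomalies. How to pick that threshold is left open here.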


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably.

Thus anomalies can be detected by applying a threshold to the residual component.

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features).

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.
- DoS attack: multiple source IP addresses concentrated on a single destination IP address
- Network scan: dispersed distribution of destination addresses
- Most worms/viruses also induce some change in the distribution of certain features
- However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

[Figure: port scan anomaly; the traffic feature changes while the traffic volume remains more or less the same. The port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source: [22].]

We can use entropy to capture the variations in the traffic feature:
- Takes value 0 when the distribution is maximally concentrated
- Takes value $\log_2 N$ when the distribution is maximally dispersed
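The two boundary cases can be checked with a few lines of Python. The sketch computes the empirical (sample) entropy of one traffic feature, for instance the destination ports observed in a time bin. The function name and the choice of feature are illustrative assumptions.

```python
from collections import Counter
from math import log2


def sample_entropy(values):
    """Empirical entropy, in bits, of the observed values of one traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())
```

Normal traffic concentrated on a handful of ports gives a low entropy, whereas a port scan touching many distinct ports pushes the entropy toward $\log_2 N$, even when the packet count barely changes.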


Entropy Based versus Volume Based

- DoS/DDoS attack: a spike in traffic toward a dominant destination IP
- Scan anomaly: a spike in traffic from a dominant source IP
- Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
- Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data.

Data:
- Transactions between different players in the system
- Meta information about the individuals
- An underlying graph structure

Challenges:
- Track and model human behavior
- Anomalies caused by adaptive human adversaries
- Massive data sizes

Domains and players:
- Insurance (auto, health [a]): Claimant, Provider, Payer
- Telecommunications: Customer, Provider
- Credit cards: Customer, Supplier, Bank
- Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historical data
   - Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
- Anomaly detection is fast (good for real time)
- Results are easy to explain

Weaknesses:
- Need to create and maintain a large number of profiles
- Not dynamic
- Adequate historical data might not be available
- Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
- Communication networks (phone, email, SMS)
- Provider referral networks

Objective: identify anomalous nodes.

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1].

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node i
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space.
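The first three egonet features can be computed with plain Python. The sketch assumes an undirected weighted graph stored as a hypothetical dict of dicts, `adj[u][v] = weight`, with every edge listed in both directions. The principal eigenvalue is left out to keep the example dependency-free.

```python
def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`."""
    neighbors = set(adj[node])
    members = neighbors | {node}          # the node plus its direct neighbors
    seen = set()
    n_edges = 0
    total_weight = 0.0
    for u in members:
        for v, w in adj[u].items():
            # count each undirected edge inside the egonet exactly once
            if v in members and (v, u) not in seen:
                seen.add((u, v))
                n_edges += 1
                total_weight += w
    return len(neighbors), n_edges, total_weight
```

A star-like egonet has about as many edges as neighbors, while a near-clique has on the order of N_i squared edges, which is why the pair (N_i, E_i) is informative.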


Identifying Anomalies

Traditional anomaly detection (LOF):
- Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:
- Identify relevant feature pairs and their power law relationship
- E.g. Egonet Density Power Law, $N_i$ vs $E_i$: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
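The score is straightforward to evaluate once C and θ have been fitted. The sketch below assumes they are already known (fitting the power law is not shown), that both the observed and the expected value are positive, and that the logarithm is natural, which the slide does not specify.

```python
from math import log


def power_law_score(x, y, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * log(abs(y - expected) + 1.0)
```

A node lying exactly on the fitted line scores 0, and the score grows with both the ratio and the absolute gap between the observed and the expected value.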


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
- Healthcare data (beneficiaries vs providers)
- Insider trading (traders vs stocks)

Objective: identify anomalous links.

- Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  - RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b
- Use the relevance scores to compute the normality scores for a node $t \in V_2$
  - Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  - Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  - Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets V_1 = {a_1, a_2, a_3, ..., a_n} and V_2 = {t_1, t_2, t_3, ..., t_n} connected by edges E.]


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
- Anthrax attack

Data:
- Emergency department visits
- Grocery data (example: [14])
- Clinical visits
- Weather/climate data

Challenges:
- Weak signals in the data
  - ED cases involving cough ⇒ flu or SARS?
- Integration of multiple signals (lag analysis)
- Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   - Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$
   - Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
   - Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value, i.e. the most significant one. Repeat from Step 2.


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
- What happened yesterday?
- What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
- Add nodes for temporal and spatial dependencies
- Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:
- Identify natural and anthropogenic disasters
- Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
- Ground observations
- Remote sensing data (satellites, air-borne)
- Climate model simulation outputs
- Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
- Model spatio-temporal relationships across multiple variables
- Explain the cause of anomalies
- Massive data sizes

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:
- Difference from a "base period" (too simplistic)
- Brings spatial smoothness (e.g. a mountain top and a nearby valley can have very different temperatures)
- Removes seasonality

Understand climate and weather phenomena:
- Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

[Figure: Global Average Temperature Anomaly (1975-2007). Source: www.metoffice.gov.uk]


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference).

How to choose $b_i$?
- Mean of all data for location i
- Monthly mean values (account for seasonality)
- Monthly z-score values
- Median (more robust)
- Using a shorter "reference period"
  - 30-year moving window

Different methods show statistically significant differences [19].
- What is the right strategy?
- Weighted mean of different strategies (pick weights using Monte Carlo sampling)
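One of the listed choices, the monthly mean base, can be sketched as follows for a single location. The function name and the parallel-list input format (one value and one month number per observation) are assumptions made for illustration.

```python
from statistics import mean


def monthly_anomalies(values, months):
    """v' = v - b, where b is the mean of all observations in the same calendar month."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]
```

Because each month is compared only against its own long-term mean, the seasonal cycle is removed from the resulting anomaly series.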


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both.

Find persistent spatiotemporal anomalies in precipitation data.

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   - Matlab: bwlabel, bwlabeln

Follow-up science question: explain the cause.

[Video: drought_1901_2006_1.avi, courtesy of Dr. Arindam Banerjee]
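The two-step approach can be imitated in plain Python as a stand-in for the Matlab labeling functions named above. The sketch is built on assumptions: precipitation is a nested list indexed as `precip[t][i][j]`, cells below `threshold` count as dry, and two cells are connected only if they differ by one step along a single axis. That connectivity choice is made here for simplicity and may differ from the defaults of `bwlabeln`.

```python
from collections import deque


def connected_components(cells):
    """Group coordinate tuples into face-connected components, largest first."""
    remaining = set(cells)
    components = []
    while remaining:
        start = remaining.pop()
        queue = deque([start])
        component = [start]
        while queue:
            cur = queue.popleft()
            for axis in range(len(cur)):
                for step in (-1, 1):
                    nb = cur[:axis] + (cur[axis] + step,) + cur[axis + 1:]
                    if nb in remaining:
                        remaining.remove(nb)
                        queue.append(nb)
                        component.append(nb)
        components.append(component)
    return sorted(components, key=len, reverse=True)


def drought_components(precip, threshold, min_size):
    """Step 1: threshold the grid; step 2: keep large space-time components."""
    dry = [
        (t, i, j)
        for t, grid in enumerate(precip)
        for i, row in enumerate(grid)
        for j, value in enumerate(row)
        if value < threshold
    ]
    return [c for c in connected_components(dry) if len(c) >= min_size]
```

Because time is one of the axes, a component that spans several time steps corresponds to a persistent dry region rather than a one-off dry cell.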


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water Vapor Content

Anomalies using a threshold

Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]

Surface winds and Potential temperature fields

Methodology

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Commute time

Commute time between i and j is the expected number of steps that a random walk starting at i will take to reach j once and go back to i for the first time

Commute time can capture both the distance between points and the data densities


Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$:

$c_{ij} = V_G (e_i - e_j)^T L^+ (e_i - e_j)$

• $L^+$: pseudo-inverse of $L$
• $V_G$: graph volume
• $e_i$: $i$-th column of the identity matrix

Commute time is a (squared) Euclidean distance in the space spanned by the eigenvectors of $L$:

$c_{ij} = V_G [(S^{-1/2} V^T)(e_i - e_j)]^T [(S^{-1/2} V^T)(e_i - e_j)]$

• $V$, $S$: eigenvectors and eigenvalues of $L$
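The formula $c_{ij} = V_G (e_i - e_j)^T L^+ (e_i - e_j)$ can be checked on a small graph. The sketch below is pure Python and, as a substitution for an explicit pseudo-inverse, grounds node j and solves the reduced Laplacian system by Gauss-Jordan elimination; on a connected graph this gives the same quadratic form. The function name `commute_time` and the adjacency-matrix input are assumptions for illustration.

```python
def commute_time(weights, i, j):
    """Commute time c_ij = V_G * R_ij for a connected weighted graph.

    weights is a symmetric n x n adjacency matrix (list of lists).
    R_ij is obtained by grounding node j and solving the reduced Laplacian
    system; on a connected graph it equals (e_i - e_j)^T L^+ (e_i - e_j).
    """
    n = len(weights)
    if i == j:
        return 0.0
    degree = [sum(row) for row in weights]
    volume = sum(degree)                                   # V_G
    keep = [u for u in range(n) if u != j]                 # drop grounded node j
    # Augmented reduced system [L_reduced | b] with b = e_i.
    A = [[(degree[u] if u == v else 0.0) - weights[u][v] for v in keep]
         + [1.0 if u == i else 0.0] for u in keep]
    m = len(keep)
    for col in range(m):                                   # Gauss-Jordan elimination
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    ci = keep.index(i)
    return volume * A[ci][m] / A[ci][ci]

# Path graph 0 - 1 - 2 with unit weights (toy example).
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
c_ends = commute_time(path, 0, 2)
c_adjacent = commute_time(path, 0, 1)
```

On this path graph the end-to-end commute time is twice the adjacent one, matching the intuition that commute time grows with distance along sparse connections.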


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a near-linear-time solver to build a structure in which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections
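To illustrate why random projections help, here is a minimal Gaussian random projection in pure Python. It is a generic sketch, not the PINN algorithm itself; the function names, the choice of entries drawn from N(0, 1/k), the seeds, and the dimensions are all assumptions made for the example.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian matrix.

    Entries are drawn from N(0, 1/k), so squared distances are preserved
    in expectation.
    """
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0.0, 1.0 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in R] for p in points]

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Three random 500-dimensional points (assumed toy data), projected to 200 dimensions.
rng = random.Random(1)
points = [[rng.random() for _ in range(500)] for _ in range(3)]
low = random_projection(points, k=200)
ratio = dist(low[0], low[1]) / dist(points[0], points[1])
```

The ratio of projected to original distance stays close to 1, which is the property that lets a nearest-neighbor index be built in the lower-dimensional space.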


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN algorithm provides probabilistic guarantees

Under certain assumptions about intrinsic dimensionality ($c$), with high probability:

$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot LOF(p) \le \widehat{LOF}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot LOF(p)$

where $\widehat{LOF}(p)$ denotes the LOF score estimated by PINN

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
• Using CUDA (Compute Unified Device Architecture, Nvidia)

Need to re-engineer existing algorithms
• Utilization of device memory
• Minimize CPU ⇔ GPU transfer of data
• Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

Figure: CUDA execution model (Host, Kernel, Grid, Blocks, Threads) and memory hierarchy (Host Memory; Device with SMs, SPs, Registers, Shared Memory, Global Memory, Texture Memory, and Constant Memory)


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances
• Load block i and block j to shared memory
  • Data layout in memory should be optimized
• Each thread computes the distance between a pair of instances
  • Can utilize this time to load the next chunk of data from host to device memory
• Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
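For reference, the serial $O(n^2)$ version of DB(k, N) that the GPU implementation parallelizes can be sketched in a few lines of Python. This is a CPU illustration only, not a CUDA kernel; the function name `db_k_n` and the toy data are assumptions. Each pairwise distance computed in the inner loop corresponds to the unit of work one GPU thread would take.

```python
import heapq
import math

def db_k_n(data, k, n_top):
    """Return indices of the n_top points with the largest k-th NN distance."""
    scores = []
    for i, p in enumerate(data):
        # Distances from point i to every other point (the pairwise step).
        dists = sorted(math.dist(p, q) for j, q in enumerate(data) if j != i)
        scores.append((dists[k - 1], i))     # distance to the k-th nearest neighbor
    return [i for _, i in heapq.nlargest(n_top, scores)]

# Four clustered points and one isolated point (assumed toy data).
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
outliers = db_k_n(data, k=2, n_top=1)
```

The per-point sort mirrors the sorting step on the slide; a tiled GPU version would compute the same distances block by block.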



Moving Beyond Multi-dimensional Record Data

• Categorical (Mixed): Fraud Detection, Cyber Networks
• Time Series: Sensor Networks, Healthcare
• Discrete Sequences: Genomic, System Calls
• Spatial: GIS, Image analysis
• Spatio-temporal: Remote sensing, Climate
• Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
• Binarize data
• Learn rules ($X \Rightarrow Y$)
  • Choose high-confidence rules ($P(Y|X)$)
• For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
  • $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]
• Learn Bayesian network structure and parameters
• Compute $P(Z)$ for test data record $Z$
• Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
• Use a similarity measure ($S(X_1, X_2)$)
• Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$

• $A \cap B = \emptyset$
• Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$:
• For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$
• Score $t$ based on all r-values
  • Assign the minimum r-value as score
  • Take the product of all r-values

Need to compare exponentially many pairs of subsets
• Only consider subsets up to size $k$
• Ignore subsets with frequency less than a threshold $\alpha$
• Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

• $C(a_t)$: number of training instances with $A = a_t$
• $N$: total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks
• Replace rare attribute values with a generic attribute (reduce arity)
• Use an efficient data structure for querying counts (AD Trees [26])
  • AD Trees work faster for low-arity attributes
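The Laplace-smoothed ratio can be computed directly from counts. The sketch below handles only the simplest case, where A and B are single attributes identified by column index; the function names and the toy training set (loosely modeled on the mushroom table) are assumptions, and no AD Tree is used.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(train, a_idx, b_idx, record):
    """r(a_t, b_t) for single attributes A (column a_idx) and B (column b_idx)."""
    n = len(train)
    a, b = record[a_idx], record[b_idx]
    c_a = sum(1 for row in train if row[a_idx] == a)
    c_b = sum(1 for row in train if row[b_idx] == b)
    c_ab = sum(1 for row in train if row[a_idx] == a and row[b_idx] == b)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))

# Toy training data (assumed): columns are cap-shape, habitat, type.
train = ([("convex", "urban", "poisonous")] * 4
         + [("convex", "grasses", "edible")] * 4)
seen = r_value(train, 1, 2, ("convex", "urban", "poisonous"))
unseen = r_value(train, 1, 2, ("convex", "urban", "edible"))
```

The combination (urban, edible) never occurs in training although both values are common, so its r-value is much lower than that of (urban, poisonous), which is the signal the test looks for.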


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure
• Plug into a distance/density/clustering based method

Simplest: Hamming distance

$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

• Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$

• Standard dynamic programming method is slow
• Faster versions available (Hunt-Szymanski method [7])

Weaknesses
• Cannot localize anomalies within a sequence
• Weak anomaly signals might get lost
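Both distances can be written compactly in Python. This sketch uses the standard dynamic programming LCS (the slow baseline mentioned above, not the Hunt-Szymanski variant); the function names and example sequences are illustrative assumptions.

```python
import math

def hamming(a, b):
    """H(A, B): number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence via standard O(|A||B|) DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = nlcs_distance(s1, s2)
```

Unlike the Hamming distance, the LCS-based distance is defined for sequences of unequal length, as in the pair above (7 and 8 symbols).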


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing
• For each window, find the frequency in the normal dictionary
• Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
• For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
• Penalizes low-frequency windows in the normal dictionary
• Rewards high-frequency windows that might not be relevant
• Can construct anomalous sequences that will escape detection
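A minimal sketch of the training and testing steps follows. The aggregate used here is the mean window frequency, and the score is taken as 1 / (1 + mean frequency); the "+1" is an assumption added to avoid division by zero, and the function names and toy sequences are hypothetical.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def train_dictionary(normal_sequences, k):
    """Normal dictionary: counts of every window seen in the normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, counts, k):
    """Inverse of the mean window frequency (the +1 avoids division by zero)."""
    ws = windows(seq, k)
    mean_freq = sum(counts[w] for w in ws) / len(ws)
    return 1.0 / (1.0 + mean_freq)

normal = ["login pwd mail web logout".split(),
          "login pwd mail ssh mail web logout".split()]
counts = train_dictionary(normal, k=2)
normal_score = anomaly_score("login pwd mail web logout".split(), counts, k=2)
attack_score = anomaly_score("login pwd login pwd login pwd logout".split(), counts, k=2)
```

The repeated-login sequence scores higher because half of its windows never occur in the normal dictionary, while its frequent (login, pwd) windows pull the score down, which illustrates the "rewards high-frequency windows" issue.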


Using Probabilistic Models

Probability of occurrence of sequence $S$:

$P(S) = \prod_{i=1}^{n} P(S_i | S_1 \ldots S_{i-1})$

Short memory property of sequences:

$P(S_i | S_1 \ldots S_{i-1}) = P(S_i | S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$:

$P(S_i | S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

• $f$ is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?
• Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
• Significantly reduces the size of the model
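The fixed-order estimate can be sketched with plain n-gram counts. This example scores a sequence by its length-normalized negative log probability, a monotone stand-in for the inverse normalized probability; the `floor` value used for unseen contexts is an assumption and is not the suffix-tree fallback of [32]. All names and data are illustrative.

```python
import math
from collections import Counter

def ngram_counts(sequences, k):
    """Frequencies f of all contiguous subsequences of length k and k + 1."""
    f = Counter()
    for s in sequences:
        for n in (k, k + 1):
            f.update(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    return f

def neg_log_likelihood(seq, f, k, floor=1e-6):
    """Length-normalized negative log probability under a fixed order-k model.

    Only symbols with a full k-symbol history are scored; `floor` stands in
    for unseen contexts (an assumption made for this sketch).
    """
    total, scored = 0.0, 0
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        num, den = f[ctx + (seq[i],)], f[ctx]
        p = num / den if den and num else floor
        total -= math.log(p)
        scored += 1
    return total / max(scored, 1)

f = ngram_counts([list("abababab")], k=1)
score_normal = neg_log_likelihood(list("abab"), f, k=1)
score_odd = neg_log_likelihood(list("abba"), f, k=1)
```

The transition b to b never occurs in the normal data, so "abba" receives a far larger score than "abab".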


Finding Contextual Anomalies

Sometimes contextual information is available about data
• Not used directly as a feature
• Is well understood; no anomalies in the context
• Can reduce false positives and yield interesting anomalies

Example adapted from [30]

Figure: Number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly

Contextual anomalies: anomalous with respect to a context
• Context is defined using environmental variables
  • Spatial (latitude, longitude)
  • Graph context (edges, weights)
  • Temporal location
  • Domain specific (demographic, other)

How to incorporate context?
• Reduce to traditional anomaly detection (subset on context)
• Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

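Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$
• $d_U$ environmental attributes
• $d_V$ indicator attributes

Figure: Gaussians $U_1, U_2, U_3$ over the environmental attributes $(u_1, u_2)$ mapped to Gaussians $V_1, V_2, V_3$ over the indicator attributes $(v_1, v_2)$, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j|U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j) \, p(V_j|U_i)$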
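The scoring step (step 4) can be illustrated with one-dimensional Gaussians. In this sketch the Gaussian parameters and the mapping $p(V_j|U_i)$ are set by hand rather than learned, so steps 1 to 3 are skipped; all names and numbers are hypothetical and loosely follow the temperature and fever-cases example.

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i); a low S is anomalous."""
    return sum(
        gauss_pdf(u, *U[i])
        * sum(gauss_pdf(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )

# Hand-set (mean, std) parameters, assumed for illustration only.
U = [(5.0, 3.0), (30.0, 3.0)]      # environment: cold days, hot days
V = [(10.0, 3.0), (100.0, 10.0)]   # indicator: few fever cases, many fever cases
mapping = [[0.10, 0.90],           # p(V_j | cold): many cases are expected
           [0.99, 0.01]]           # p(V_j | hot): many cases are unexpected

cold_many = cad_score(5.0, 100.0, U, V, mapping)
hot_many = cad_score(30.0, 100.0, U, V, mapping)
```

The same indicator value (100 cases) gets a much lower score on a hot day than on a cold day, which is what makes it a conditional anomaly rather than a global outlier.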


Finding Collective Anomalies

Find a collection of data points
• Each point by itself is normal
• The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies
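The three-step simple solution can be sketched on a time series. Here the groups are consecutive fixed-size chunks, the feature is each chunk's standard deviation, and the "traditional" detector is an absolute z-score on that feature; these three choices, the function name, and the data are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_scores(series, size):
    """Split a series into consecutive groups, use each group's standard
    deviation as its feature, and score groups by the |z-score| of the feature."""
    groups = [series[i:i + size] for i in range(0, len(series) - size + 1, size)]
    feats = [pstdev(g) for g in groups]
    mu, sigma = mean(feats), pstdev(feats)
    return [abs(f - mu) / sigma if sigma else 0.0 for f in feats]

# Every value lies in the normal range [0, 1]; the third group is a flat line.
series = [0, 1, 0, 1,
          1, 0, 1, 0,
          0.5, 0.5, 0.5, 0.5,
          0, 1, 0, 1,
          1, 0, 1, 0]
scores = group_scores(series, size=4)
```

No single value in the flat segment is unusual, yet the group stands out once it is summarized by a feature, which is the defining property of a collective anomaly.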


Using Latent Dirichlet Allocation for Group Anomalies

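Find anomalous groups in data [34]

Example: spatial clusters of galaxies
• Topics: red, green, emissive
• Words: continuous features

Flexible Genre Model (FGM)

For each group:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$
2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$
3. Draw $K$ topics $\{\beta_{m,k} \sim P(\beta_{m,k}|\eta_k)\}_{k=1,2,\ldots,K}$
4. For each point in the group:
   1. Draw topic membership $z_{mn} \sim M(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn}|\beta_{m,z_{mn}})$

Figure: Plate diagram of FGM with variables $\alpha$, $\pi$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$ and plates $N$, $K$, $T$, $M$

Model Parameters
• $M(\pi)$: Multinomial
• Genre: $Dir(\alpha_t)$
• Topic generators $P(\cdot|\eta_k)$: GIW (Gaussian-Inverse-Wishart)
• Point generators $P(x_{mn}|\beta_{m,k})$: Multivariate Gaussian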


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: learned topics, labeled Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood with respect to $\alpha_t$

Rationale: An anomalous group will be unlikely to be generated from any genre

Geometric interpretation: Mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
  Confusion matrix
  Traditional evaluation metrics (class imbalance)
  ROC curve

Validation set does not exist:
  Use domain expertise to find TP and FP
  FN is harder to estimate (pseudo false negative estimation techniques [25])

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
  Actual a    TP            FN
  Actual n    FP            TN

$$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{Rec}\ (R) = \frac{TP}{TP + FN}, \quad \mathrm{Prec}\ (P) = \frac{TP}{TP + FP}, \quad F = \frac{2 \cdot R \cdot P}{R + P}$$
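
The metrics above are easy to compute directly from the four confusion-matrix counts. The sketch below is illustrative; the function name `metrics` and the zero-division guards are choices made here, not part of the tutorial.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"acc": acc, "rec": rec, "prec": prec, "f": f}

# 10 instances, 2 of them anomalous.
detector = metrics(tp=1, fn=1, fp=1, tn=7)      # finds one of the two anomalies
flag_nothing = metrics(tp=0, fn=2, fp=0, tn=8)  # labels everything normal
```

Both cases reach the same accuracy, but the detector that flags nothing has zero recall. This is why accuracy alone is misleading under class imbalance.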


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
  Use threshold $\delta$ on score (scale issues [21])
  Take top $x$% as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output:
  Does the output "suggest" $x$ or $\delta$?
  Which output is better?
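
Sweeping the threshold over the sorted scores traces out the ROC curve. The following sketch assumes higher scores mean "more anomalous", that labels are 1 for anomalies and 0 for normal points, and that there are no tied scores; the helper names are illustrative.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold one instance at a time."""
    pos = sum(labels)
    neg = len(labels) - pos
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under the piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

Here 5 of the 6 anomaly/normal pairs are ranked correctly, so the area under the curve is 5/6.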


Unifying Scores

Different methods assign scores in different ranges:
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity $\Rightarrow$ $S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous

Normality: $S$ is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$

2. $R(o) = S_{\max} - S(o)$

3. $R(o) = -\log\left(\frac{S(o)}{S_{\max}}\right)$

Normalization

1. $N(o) = \frac{S(o)}{S_{\max}}$

2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)

Suited for high dimensional data

3. $N(o) = \max\left(0,\ \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)

Where $\mathrm{cdf}^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and $P$ is the regularized Gamma function. Suited for low dimensional data
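
As a small illustration, baseline regularization followed by Gaussian scaling can be written with the standard library alone. The sample scores and the baseline of 1 (the value LOF assigns to clearly normal points) are assumptions for this sketch, and $\mu_S$, $\sigma_S$ are estimated from the regularized scores themselves.

```python
import math
from statistics import mean, pstdev

def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base)."""
    return [max(0.0, s - base) for s in scores]

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0)))) for s in scores]

lof_like = [1.0, 1.1, 0.9, 1.0, 6.0]          # hypothetical LOF-style scores
regular = regularize_baseline(lof_like, base=1.0)
normalized = gaussian_scaling(regular)
```

The four unremarkable points map to 0 and the single high score maps close to 1, so the result can be read as a rough outlier probability.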


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available:
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawbacks: Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: scatter plot over $x$ and $y$ showing $z$, $z'$, and the injected anomaly]
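
The perturbation procedure can be sketched as below. The representation is an assumption: each data point is a pair of tuples `(x, y)`, distance is Euclidean, and the function name `inject_anomaly` is illustrative.

```python
import math
import random

def inject_anomaly(z, data, sample_size, rng):
    """Swap in the x-part of the sampled point whose y-part is farthest from z's."""
    x, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: math.dist(y, point[1]))
    return (x_far, y)

# Toy data in which y is a deterministic function of x (y = 2x).
data = [((float(i),), (2.0 * i,)) for i in range(10)]
rng = random.Random(0)

# Sampling the whole data set makes this example deterministic.
injected = inject_anomaly(data[0], data, sample_size=len(data), rng=rng)
```

The injected point pairs `x = 9` with `y = 0`, a combination that never occurs in the original data even though each value on its own is ordinary.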


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available?
  Feature identification
  Defining normal and anomalous behavior
  Tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period]

  Period                 Count
  2002                   20547
  2003                   18827
  2004                   69107
  2005                   113025
  2006                   140690
  2007                   624267
  2008                   1656227
  July 2009-June 2010    2704836


Network Topology

A typical network

[Figure: a router with nodes 1 to 4, OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$, and link measurement $y_{1t}$]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

Link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach will not work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components with the highest eigenvalues as matrix $P$

Step 3: Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
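
The five steps can be sketched with NumPy as follows. The layout is an assumption: rows of `Z` are time steps and columns are links, so each row is projected rather than each column as in the column-vector notation of the slide. The synthetic traffic and the injected anomaly at row 50 are made up for illustration.

```python
import numpy as np

def residual_norms(Z, k):
    """Squared norm of each row's projection onto the abnormal subspace."""
    Zc = Z - Z.mean(axis=0)                        # center each link
    cov = np.cov(Zc, rowvar=False)                 # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]            # Step 2: top-k components
    P = eigvecs[:, top]
    C = P @ P.T                                    # Step 3: normal projector
    residual = Zc @ (np.eye(Z.shape[1]) - C)       # Step 4: abnormal part
    return np.sum(residual ** 2, axis=1)           # Step 5: squared norm per row

rng = np.random.default_rng(0)
load = rng.normal(size=(200, 1))
# Normal traffic: three links driven by one shared load, plus small noise.
Z = load @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(200, 3))
Z[50] += np.array([5.0, -5.0, 0.0])                # hypothetical anomaly

norms = residual_norms(Z, k=1)
most_anomalous = int(np.argmax(norms))
```

Because the anomaly does not follow the dominant traffic pattern, most of it lands in the residual subspace and row 50 has by far the largest residual norm.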


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes considerably

Thus anomalies can be detected by thresholding the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure: port scan anomaly in volume metrics versus feature entropy. Figure source: [22]]

Port scan anomaly (the traffic feature changes; however, traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

The port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure
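
A sample entropy over the observed values of one traffic feature can be computed as below. The function name and the toy port lists are illustrative assumptions; the feature could equally be source address, destination address, or source port.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of a feature."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Same traffic volume (8 packets) in both cases.
normal_ports = [80] * 8                              # concentrated on one port
scan_ports = [21, 22, 23, 25, 80, 110, 143, 443]     # dispersed, as in a port scan

h_normal = feature_entropy(normal_ports)
h_scan = feature_entropy(scan_ports)
```

Both windows carry the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$.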


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data:
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges:
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

  Domain                        Players
  Insurance (auto/health [a])   Claimant, Provider, Payer
  Telecommunications            Customer, Provider
  Credit Cards                  Customer, Supplier, Bank
  Web Advertising               User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g., User X makes n calls on average in January

2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers

2. Identify small clusters or outlying profiles as anomalies

Strengths:
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses:
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives
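
A minimal sketch of activity monitoring keeps one profile per user and compares new behavior against it. The profile here is just the mean and standard deviation of a historical count, and the deviation is measured in standard deviations; both choices, and the sample call counts, are assumptions for illustration.

```python
from statistics import mean, stdev

def build_profiles(history):
    """Step 1: one (mean, std) profile per user from historical counts."""
    return {user: (mean(counts), stdev(counts)) for user, counts in history.items()}

def deviation(profiles, user, current):
    """Step 2: how many standard deviations the current value is from the profile."""
    mu, sigma = profiles[user]
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

# Hypothetical January call counts over five past years.
history = {"X": [20, 22, 19, 21, 18], "Y": [5, 6, 5, 4, 5]}
profiles = build_profiles(history)

typical = deviation(profiles, "X", 21)
suspicious = deviation(profiles, "X", 60)
```

Scoring a new observation is a single lookup and division, which reflects the speed and explainability listed under strengths. The per-user dictionary also shows the maintenance cost listed under weaknesses.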


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node

2. $E_i$: number of edges in egonet

3. $W_i$: total weight of egonet

4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
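
The first three egonet features can be extracted with plain Python. The sketch assumes an undirected graph given as a list of `(u, v, weight)` tuples with no self-loops or duplicate edges; the eigenvalue feature is left out because it needs a linear algebra routine.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`."""
    neighbors = set()
    for u, v, _ in edges:
        if u == node:
            neighbors.add(v)
        elif v == node:
            neighbors.add(u)
    members = neighbors | {node}
    # The egonet keeps every edge whose two endpoints are both members.
    ego_edges = [(u, v, w) for u, v, w in edges if u in members and v in members]
    return len(neighbors), len(ego_edges), sum(w for _, _, w in ego_edges)

edges = [
    ("c", "a", 1.0), ("c", "b", 2.0), ("c", "d", 3.0),  # c is a hub
    ("a", "b", 4.0),                                     # edge between two neighbors of c
    ("d", "e", 5.0),                                     # leaves c's egonet
]
features_c = egonet_features(edges, "c")
features_e = egonet_features(edges, "e")
```

Node `c` has three neighbors, four egonet edges (its own three plus `a`-`b`), and total egonet weight 10.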


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and power law relationship

E.g., Egonet Density Power Law: $N_i$ vs $E_i$, detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
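
The score is a direct transcription of the formula once the power-law parameters are known. In this sketch $C$ and $\theta$ are passed in as given (in practice they would be fitted to the data), and both the observed and the expected value are assumed to be positive.

```python
import math

def power_law_score(x, y, C, theta):
    """Distance of the observation (x, y) from the fitted law y = C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1.0)

# Hypothetical fitted law E = N (a star-like egonet has about N edges).
on_the_line = power_law_score(x=10, y=10, C=1.0, theta=1.0)
dense_egonet = power_law_score(x=10, y=45, C=1.0, theta=1.0)
```

A node that follows the law scores 0. A node with many more egonet edges than its degree predicts (a near-clique) gets a large score, here $4.5 \cdot \ln 36$.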


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
  Healthcare data (Beneficiaries vs Providers)
  Insider trading (Traders vs Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$:
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  Normality Score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$]
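
One way to approximate the relevance scores is a random walk with restart, computed by power iteration instead of by simulating walks. This is a sketch under stated assumptions: the restart probability of 0.15, the adjacency-dictionary layout, and the toy graph are all chosen here, and every node is assumed to have at least one neighbor.

```python
def relevance(adj, start, restart=0.15, iters=200):
    """Visit probabilities of a walk that jumps back to `start` with
    probability `restart` at every step."""
    prob = {n: 0.0 for n in adj}
    prob[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in prob.items():
            share = (1.0 - restart) * mass / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        nxt[start] += restart
        prob = nxt
    return prob

def normality(adj, t, **kwargs):
    """Mean off-diagonal entry of the similarity matrix over S_t (neighbors of t)."""
    s_t = adj[t]
    rel = {a: relevance(adj, a, **kwargs) for a in s_t}
    entries = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(entries) / len(entries)

# Two communities {a1, a2} and {a3, a4}; t3 links members of different communities.
adj = {
    "a1": ["t1", "t1b", "t3"], "a2": ["t1", "t1b"],
    "a3": ["t2", "t2b", "t3"], "a4": ["t2", "t2b"],
    "t1": ["a1", "a2"], "t1b": ["a1", "a2"],
    "t2": ["a3", "a4"], "t2b": ["a3", "a4"],
    "t3": ["a1", "a3"],
}
score_t1 = normality(adj, "t1")
score_t3 = normality(adj, "t3")
```

The neighbors of `t1` belong to the same community and are highly relevant to each other, so `t1` looks normal. The neighbors of `t3` come from different communities, which gives it a lower normality score.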


Detecting Disease Outbreaks

Domain Question:
  Early detection of disease outbreaks
  Anthrax attack

Data:
  Emergency department visits
  Grocery data (Example [14])
  Clinical visits
  Weather/climate data

Challenges:
  Weak signals in the data (ED cases involving cough $\Rightarrow$ Flu or SARS?)
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data (environmental and response variables)

2. Sample from the BN ($DB_{baseline}$ | Current Environment)

3. Compute the contingency table for rules for $DB_{baseline}$ and $DB_{current}$. Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find the p-value for rules using the $\chi^2$-test. Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent

5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
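
For a single rule, the test in step 4 reduces to a $\chi^2$ test on a 2x2 contingency table (rule matched or not, current or baseline). The sketch below uses the closed form for a 2x2 table without continuity correction and the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$; the counts are made up for illustration.

```python
import math

def chi_square_2x2(a, b, c, d):
    """a, b: records matching / not matching the rule in DB_current;
    c, d: the same counts in DB_baseline. Returns (chi2, p_value)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square with 1 degree of freedom
    return chi2, p_value

# Hypothetical rule "symptom = cough": 30% of today's cases vs 10% of baseline.
chi2_hot, p_hot = chi_square_2x2(30, 70, 100, 900)
# Same proportion today as in the baseline.
chi2_flat, p_flat = chi_square_2x2(10, 90, 100, 900)
```

A rule whose frequency jumps relative to the baseline gets a very small p-value, while a rule with unchanged frequency gets a p-value of 1.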


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
  Add nodes for temporal and spatial dependencies
  Need more data to train


Anomaly Detection in Climate and Weather

Science Questions:
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges:
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:
  Difference from a "base period" (too simplistic?)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena:
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location $i$
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period" (30 year moving window)

Different methods show statistically significant differences [19]:
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
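
Taking the monthly mean as the base $b_i$ gives the following sketch for one location. The input layout (parallel lists of values and month numbers) and the tiny sample are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(values, months):
    """v' = v - b, where b is the mean of all values observed in the same month."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]

# Two years of January (1) and July (7) temperatures at one location.
values = [0.0, 20.0, 2.0, 26.0]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

The 20-degree gap between January and July disappears from the anomaly series, leaving only the year-to-year departures from each month's normal.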


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time (Matlab: bwlabel/bwlabeln)

Follow-up science question: Explain the cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy of Dr. Arindam Banerjee
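
The two-step approach can be sketched without any image-processing library. The assumptions here are a nested-list grid indexed as `[time][row][col]`, a fixed dryness threshold, and 6-connectivity (face neighbors in space and time), which is stricter than the default 3-D connectivity of `bwlabeln`.

```python
from collections import deque

def dry_mask(precip, threshold):
    """Step 1: mark cells whose precipitation falls below the threshold."""
    return [[[v < threshold for v in row] for row in grid] for grid in precip]

def connected_components(mask):
    """Step 2: group marked cells that touch in space or time (breadth-first search)."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or (t, r, c) in seen:
                    continue
                queue, comp = deque([(t, r, c)]), []
                seen.add((t, r, c))
                while queue:
                    ct, cr, cc = queue.popleft()
                    comp.append((ct, cr, cc))
                    for dt, dr, dc in steps:
                        nt, nr, nc = ct + dt, cr + dr, cc + dc
                        if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                                and mask[nt][nr][nc] and (nt, nr, nc) not in seen):
                            seen.add((nt, nr, nc))
                            queue.append((nt, nr, nc))
                components.append(comp)
    return components

WET, DRY = 5.0, 0.2
precip = [[[WET] * 3 for _ in range(3)] for _ in range(3)]
for t in range(3):
    precip[t][0][0] = DRY        # one cell stays dry for all three time steps
precip[1][0][1] = DRY            # and spreads to a neighbor at t = 1
precip[2][2][2] = DRY            # an isolated dry cell

components = connected_components(dry_mask(precip, threshold=1.0))
persistent = [comp for comp in components if len(comp) >= 3]
```

Filtering by component size keeps the four-cell dry region that persists over time and drops the isolated dry cell.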


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water vapor content

Anomalies using a threshold

Connected components

Example Src: http://newscenter.lbl.gov

Cold Fronts [23]

Surface winds and potential temperature fields

Methodology

1. Compute features for every grid

2. Cluster grids into $K$ clusters

3. Label clusters as anomalous or not using thresholds

4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable:
    Not identified in timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution:
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

V. Chandola, A. Banerjee, and V. Kumar.

Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

V. Chandola, S. Boriah, and V. Kumar.

A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

V. Chandola, V. Mithal, and V. Kumar.

A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

V. Chandola, V. Mithal, and V. Kumar.

Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

T. Fawcett and F. Provost.

Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg.

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

D. Hawkins.

Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

S. A. Hofmeyr, S. Forrest, and A. Somayaji.

Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

X. Jiang and G. F. Cooper.

A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

K. Das and J. Schneider.

Detecting anomalous records in categorical datasets. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2007.

J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar.

Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

E. M. Knorr and R. T. Ng.

Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek.

Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

A. Lakhina, M. Crovella, and C. Diot.

Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem.

Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


M. V. Mahoney and P. K. Chan.

Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

S. V. Mane.

False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

A. Moore and M. S. Lee.

Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

Y. Pei and O. Zaïane.

A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

S. Ramaswamy, R. Rastogi, and K. Shim.

Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

N. Satish, M. Harris, and M. Garland.

Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

X. Song, M. Wu, C. Jermaine, and S. Ranka.

Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

D. A. Spielman and N. Srivastava.

Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


P. Sun, S. Chawla, and B. Arunasalam.

Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner.

Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Póczos, and J. Schneider.

Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Computation of commute time

Commute time can be computed using the graph Laplacian matrix $L$:

$$c_{ij} = V_G\,(e_i - e_j)^T L^{+} (e_i - e_j)$$

where $L^{+}$ is the pseudo-inverse of $L$, $V_G$ is the graph volume, and $e_i$ is the $i$-th column of the identity matrix.

Commute time is the squared Euclidean distance in the space spanned by the eigenvectors of $L$:

$$c_{ij} = V_G\,[(S^{-1/2} V^T)(e_i - e_j)]^T\,[(S^{-1/2} V^T)(e_i - e_j)]$$

where $V$ and $S$ are the eigenvectors and eigenvalues of $L$.
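The two formulas on this slide can be checked against each other numerically. The following is a minimal sketch, assuming NumPy, an undirected connected graph given as a dense adjacency matrix, and graph volume taken as the sum of degrees; the function names are illustrative and not from the tutorial.

```python
import numpy as np

def commute_times_pinv(A):
    """Commute times c_ij = V_G (e_i - e_j)^T L^+ (e_i - e_j) via the pseudo-inverse."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                 # graph Laplacian
    vol = deg.sum()                      # graph volume (assumed: sum of degrees)
    Lp = np.linalg.pinv(L)
    d = np.diag(Lp)
    # (e_i - e_j)^T L^+ (e_i - e_j) = L^+_ii + L^+_jj - 2 L^+_ij
    return vol * (d[:, None] + d[None, :] - 2.0 * Lp)

def commute_times_spectral(A, tol=1e-9):
    """Same quantity as a squared Euclidean distance in the eigenvector embedding."""
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    L = np.diag(deg) - A
    vol = deg.sum()
    w, V = np.linalg.eigh(L)
    keep = w > tol                       # drop the zero eigenvalue(s)
    emb = V[:, keep] / np.sqrt(w[keep])  # row i is the embedding of node i
    diff = emb[:, None, :] - emb[None, :, :]
    return vol * (diff ** 2).sum(axis=2)

# Path graph 0 - 1 - 2 with unit weights.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
C1 = commute_times_pinv(A)
C2 = commute_times_spectral(A)
```

Both functions form dense $n \times n$ matrices, so this only illustrates the definition; it does not address the cubic cost discussed on the following slides.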


Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph $G$ from the dataset

Compute the Laplacian matrix $L$ of $G$ and its eigensystem

Find the top N anomalies using the distance-based technique in commute-time space, with the pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear-time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers. Also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections


PINN Algorithm (ICDM 2010)


PINN Guarantee

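The PINN algorithm provides probabilistic guarantees

Under certain assumptions about the intrinsic dimensionality ($c$), with high probability

$$\frac{1 - \epsilon}{1 + \epsilon} \cdot \mathrm{LOF}(p) \;\le\; \widehat{\mathrm{LOF}}(p) \;\le\; \frac{1 + \epsilon}{1 - \epsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ is the LOF score that PINN computes for $p$

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust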


Examples: high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples: high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms

Using CUDA, the Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms:

Utilization of device memory

Minimize CPU↔GPU transfer of data

Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model and memory hierarchy. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2), each made of threads, Thread(0,0) to Thread(2,3). Streaming multiprocessors (SM 0 to SM 2) contain scalar processors (SP 0 to SP 7) with registers and shared memory; the device also provides global, texture, and constant memory, and the host has its own host memory.]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory

Data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
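The block-wise structure described above can be illustrated on the CPU. This is a minimal sketch in NumPy, not a CUDA kernel: each (block i, block j) tile of the pairwise distance matrix stands in for the work one thread block would do from shared memory. The function name and the `block` parameter are assumptions made for illustration.

```python
import numpy as np

def top_n_knn_outliers(X, k, N, block=256):
    """Return indices and scores of the N points with the largest k-th NN distance."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    kth = np.empty(n)
    for i0 in range(0, n, block):
        Xi = X[i0:i0 + block]
        # One tile per (block i, block j) pair; on a GPU each tile maps to a thread block.
        tiles = [
            np.linalg.norm(Xi[:, None, :] - X[j0:j0 + block][None, :, :], axis=2)
            for j0 in range(0, n, block)
        ]
        D = np.hstack(tiles)                  # distances from block i to all points
        rows = np.arange(Xi.shape[0])
        D[rows, i0 + rows] = np.inf           # ignore each point's distance to itself
        # k-th smallest distance per row without a full sort
        kth[i0:i0 + block] = np.partition(D, k - 1, axis=1)[:, k - 1]
    order = np.argsort(-kth)[:N]
    return order, kth[order]

X = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
idx, scores = top_n_knn_outliers(X, k=2, N=1, block=2)
```

The sketch still performs $O(n^2)$ distance computations; the tiling only bounds how much of the distance matrix is held in memory at once.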



Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth              urban     poisonous
convex      smooth              grasses   edible
bell        smooth              meadows   edible
convex      scaly               urban     poisonous
convex      smooth              grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$)

Choose high-confidence rules ($P(Y|X)$)

For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$

$Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute $P(Z)$ for test data record $Z$

Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$:

For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$

Score $t$ based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare exponentially many pairs of subsets:

Only consider subsets up to size $k$

Ignore subsets with frequency less than a threshold $\alpha$

Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$

$N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low-arity attributes
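The Laplace-smoothed r-value can be sketched directly from counts. The example below is a simplified illustration, assuming records are tuples of categorical values, that only pairs of single attributes are compared (subsets of size 1), and that the minimum r-value is used as the score. Plain counting replaces the AD Tree; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def cpt_score(train, record):
    """Minimum r-value over all pairs of single attributes of `record`."""
    n = len(train)
    d = len(record)
    # C(a_t) for every attribute position
    single = [Counter(row[i] for row in train) for i in range(d)]
    score = float("inf")
    for i, j in combinations(range(d), 2):
        # C(a_t, b_t): training rows matching the record on both attributes
        joint = sum(1 for row in train if row[i] == record[i] and row[j] == record[j])
        r = laplace(joint, n) / (
            laplace(single[i][record[i]], n) * laplace(single[j][record[j]], n)
        )
        score = min(score, r)
    return score

train = [("a", "x")] * 8 + [("b", "y")] * 8
seen = cpt_score(train, ("a", "x"))      # combination seen in training
unseen = cpt_score(train, ("a", "y"))    # each value is common, the pair is not
```

A low score marks a combination of values that co-occurs far less often than independence would predict, which is the case for the second record.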


Finding Anomalies in Discrete Sequences

Many problem formulations:

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure

Plug into a distance/density/clustering-based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence:

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

The standard dynamic programming method is slow

Faster versions are available (Hunt-Szymanski method [7])

Weaknesses:

Cannot localize anomalies within a sequence

Weak anomaly signals might get lost
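Both measures are short to write down. The sketch below uses the standard dynamic programming recurrence for the LCS length (the slow variant mentioned above, not the Hunt-Szymanski method); function names are illustrative.

```python
from math import sqrt

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a| * |b|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = nlcs_distance(s1, s2)
```

The resulting distance can be passed to any distance-based detector, since it is defined for sequences of unequal length.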


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing:

For each window, find the frequency in the normal dictionary

Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist:

For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:

Penalizes low-frequency windows in the normal dictionary

Rewards high-frequency windows that might not be relevant

Can construct anomalous sequences that will escape detection
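The training and testing steps can be sketched as follows. This is one possible reading of "inverse of the aggregate frequencies (normalized by length)": the score is the reciprocal of the mean relative frequency of the test windows. The small constant `eps`, which avoids division by zero, and the function names are assumptions.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Normal dictionary: every unique window with its count."""
    counts = Counter()
    for s in normal_seqs:
        counts.update(windows(s, k))
    return counts

def window_score(seq, counts, k, eps=1e-9):
    """Inverse of the mean relative frequency of the sequence's windows."""
    ws = windows(seq, k)
    total = sum(counts.values())
    mean_freq = sum(counts[w] / total for w in ws) / len(ws)
    return 1.0 / (mean_freq + eps)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
    "login pwd web mail ssh web mail logout".split(),
]
test_seq = "login pwd login pwd login pwd logout".split()
counts = build_dictionary(normal, k=3)
```

With `k = 3`, none of the windows of `test_seq` occur in the normal dictionary, so its score is far larger than that of any training sequence.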


Using Probabilistic Models

Probability of occurrence of sequence $S$:

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences:

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$:

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

$f$ is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32]: Probabilistic Suffix Trees

Significantly reduces the size of the model
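The frequency-ratio estimate and the suffix fallback can be sketched with plain n-gram counts. This is a simplified stand-in and not the Probabilistic Suffix Tree of [32]: when a context is seen fewer than `min_count` times it is shortened from the left. The probability floor for unseen continuations, the geometric-mean normalization, and all function names are assumptions.

```python
from collections import Counter
from math import exp, log

def count_ngrams(seqs, k):
    """Frequencies f(.) of all contiguous subsequences of length 1 .. k + 1."""
    f = Counter()
    for s in seqs:
        for n in range(1, k + 2):
            for i in range(len(s) - n + 1):
                f[tuple(s[i:i + n])] += 1
    return f

def cond_prob(f, context, symbol, min_count=1, floor=1e-6):
    """f(context + symbol) / f(context), backing off to shorter suffixes."""
    context = tuple(context)
    while context and f[context] < min_count:
        context = context[1:]            # drop the oldest symbol
    if context:
        denom = f[context]
    else:
        denom = sum(c for gram, c in f.items() if len(gram) == 1)
    return max(f[context + (symbol,)] / denom, floor)

def sequence_score(seq, f, k):
    """Inverse of the per-symbol geometric mean probability."""
    logp = sum(log(cond_prob(f, seq[max(0, i - k):i], sym)) for i, sym in enumerate(seq))
    return 1.0 / exp(logp / len(seq))

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
    "login pwd web mail ssh web mail logout".split(),
]
f = count_ngrams(normal, k=2)
odd = "login pwd login pwd login pwd logout".split()
```

Here `login` never follows `login pwd` in the normal data, so the repeated-login sequence receives a much higher score than the training sequences.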


Finding Contextual Anomalies

Sometimes contextual information is available about data:

Not used directly as a feature

Are well understood, no anomalies in the context

Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables:

Spatial (latitude, longitude)

Graph context (edges, weights)

Temporal location

Domain specific (demographic, other)

How to incorporate context:

Reduce to traditional anomaly detection (subset on context)

Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

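Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space ($u_1$, $u_2$) and $V_1, V_2, V_3$ in the indicator space ($v_1$, $v_2$), with example mapping probabilities $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j|U_i)$

4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\,p(V_j|U_i)$$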
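Step 4 can be evaluated directly once the components and the mapping are known. The sketch below covers only the scoring formula, under strong simplifying assumptions: one-dimensional Gaussians ($d_U = d_V = 1$), hand-picked parameters instead of learned ones, and a low $S$ read as "more anomalous". The function names and numbers are illustrative.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) p(V_j|U_i).

    U and V are lists of (mean, std); mapping[i][j] holds p(V_j | U_i).
    """
    score = 0.0
    for i, (mu_u, sd_u) in enumerate(U):
        inner = sum(
            normal_pdf(v, mu_v, sd_v) * mapping[i][j]
            for j, (mu_v, sd_v) in enumerate(V)
        )
        score += normal_pdf(u, mu_u, sd_u) * inner
    return score

U = [(0.0, 1.0), (10.0, 1.0)]            # environmental components
V = [(0.0, 1.0), (5.0, 1.0)]             # indicator components
mapping = [[0.9, 0.1],                   # p(V_j | U_1)
           [0.1, 0.9]]                   # p(V_j | U_2)
expected = cad_score(0.0, 0.0, U, V, mapping)    # v fits the context u
unexpected = cad_score(0.0, 5.0, U, V, mapping)  # v is unusual for this u
```

The value `v = 5` is perfectly normal on its own (it sits on the mean of $V_2$), yet it scores lower than `v = 0` because the context `u = 0` rarely maps to $V_2$.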


Finding Collective Anomalies

Find a collection of data points:

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

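Find anomalous groups in data [34]

Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group $m$:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$

2. Draw a topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$

3. Draw $K$ topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$

4. For each point in the group:

   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$

   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

[Figure: plate diagram of FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\eta$, $\beta$ and plates $N$, $M$, $K$, $T$]

Model Parameters

$\mathcal{M}(\pi)$: Multinomial

Genre: $\mathrm{Dir}(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$: GIW

Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian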


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: groups shown over Topic 1, Topic 2, and Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:

Confusion matrix

Traditional evaluation metrics (class imbalance)

ROC curve

Validation set does not exist:

Use domain expertise to find TP and FP

FN is harder to estimate: pseudo false negative estimation techniques [25]

              Predicted
              a     n
Actual   a    TP    FN
         n    FP    TN

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\mathrm{Rec}\ (R) = \frac{TP}{TP + FN}$$

$$\mathrm{Prec}\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
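The four metrics follow directly from the confusion matrix. The sketch below also shows why class imbalance matters: a detector that labels everything as normal still reaches high accuracy. The function name and the example counts are made up for illustration.

```python
def evaluate(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (
        2 * recall * precision / (recall + precision) if recall + precision else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f": f_measure,
    }

# 10 anomalies among 100 instances.
detector = evaluate(tp=8, fn=2, fp=4, tn=86)
all_normal = evaluate(tp=0, fn=10, fp=0, tn=90)   # never flags anything
```

The second detector has 90% accuracy but zero recall, which is why recall, precision, and F are preferred over accuracy when anomalies are rare.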


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:

Use a threshold $\delta$ on the score (scale issues [21])

Take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output:

Does the output "suggest" $x$ or $\delta$?

Which output is better?


Unifying Scores

Different methods assign scores in different ranges:

kNN-based scores: $[0, 1]$

LOF scores: $[1, 6]$

ABOD scores: $[0, 80000]$ (anomalies have lower scores)

Direct scaling to $[0, 1]$ might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity $\Rightarrow$ $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous

Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log\left(\frac{S(o)}{S_{max}}\right)$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling). Suited for high-dimensional data

3. $N(o) = \max\left(0,\ \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma scaling), where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function. Suited for low-dimensional data
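Baseline regularization, linear normalization, and Gaussian scaling can be sketched with the standard library alone. The sketch assumes that $\mu_S$ and $\sigma_S$ are estimated from the scores themselves (here with the population standard deviation) and that not all scores are equal; the function names and the LOF-like example values are illustrative.

```python
from math import erf, sqrt
from statistics import mean, pstdev

def baseline_regularize(scores, base):
    """R(o) = max(0, S(o) - base), e.g. base = 1 for LOF-like scores."""
    return [max(0.0, s - base) for s in scores]

def linear_normalize(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]

def gaussian_scale(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)   # sigma must be non-zero
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]

lof_like = [1.0, 1.1, 0.9, 1.0, 6.0]
regular = baseline_regularize(lof_like, base=1.0)
scaled = gaussian_scale(regular)
```

After scaling, the four inliers map to 0 and the single high score maps close to 1, so the result can be compared with normalized scores from other detectors.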


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available:

Multivariate continuous data [27]

Multivariate categorical data [5]

Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies: Random Perturbation [30]

Given a data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the ($x$, $y$) plane and the resulting injected anomaly]
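The perturbation steps can be sketched as below. The sketch assumes numeric features stored as tuples, with the first `split` features forming $x$ and the rest forming $y$, and Euclidean distance on $y$; these choices and the function name are assumptions, not details given on the slide.

```python
import random

def inject_anomaly(z, data, split, sample_size, rng=random):
    """Return z with its x part swapped for that of the sampled point farthest in y."""
    y = tuple(z[split:])
    sample = rng.sample(data, sample_size)       # random sample D of the data set

    def y_distance(row):
        return sum((p - q) ** 2 for p, q in zip(row[split:], y)) ** 0.5

    z_far = max(sample, key=y_distance)          # z' with maximal distance in y
    return tuple(z_far[:split]) + y              # x replaced by x', y kept

data = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
# Sampling the whole (tiny) data set makes the example deterministic.
perturbed = inject_anomaly((0.0, 0.0), data, split=1, sample_size=len(data))
```

The perturbed point keeps a `y` value that is common in the data but pairs it with an `x` value taken from a very different region, which is what makes it a conditional anomaly rather than an obvious outlier in either partition.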


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available:

Feature identification

Defining normal and anomalous behavior

Tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that correspond to a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)

Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836. Vertical axis from 0 to 3,000,000]

Network Topology

A typical network

[Figure: a router connecting nodes 1-4, with OD flows x1t, x2t, x3t, x4t and link measurement y1t]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., x1t)

A link measurement is the traffic that enters at a node during an interval (e.g., y1t)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
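To make the role of the routing matrix concrete, the link measurements at time t can be written as y_t = A x_t, where entry A[i][j] is 1 if OD flow j traverses link i. The sketch below uses a made-up 3-link, 4-flow routing matrix; the values are purely illustrative and do not come from the figure.

```python
# Hypothetical routing matrix A: rows are links, columns are OD flows.
# A[i][j] = 1 means OD flow j is routed over link i.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]

# OD flow volumes x_t at one time interval (illustrative numbers).
x_t = [5.0, 3.0, 2.0, 1.0]

# Link measurements y_t = A x_t: each link carries the sum of the flows routed over it.
y_t = [sum(a * x for a, x in zip(row, x_t)) for row in A]
```

Because each link aggregates several OD flows, a change in a single flow shows up only as part of a larger link total.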

Network Anomalies Detection Problem

Why care about OD flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]

PCA and Subspace Method

Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$\hat{Z} = P P^{T} Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
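The five steps can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the exact procedure of [22]: rows of `Z` are taken to be measurement vectors at successive time steps, the data is mean-centered before PCA, and the synthetic data and function name `subspace_residual_norms` are made up for the example.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Squared norm of the residual (abnormal-subspace) component of each row of Z."""
    Zc = Z - Z.mean(axis=0)                      # center each column
    cov = np.cov(Zc, rowvar=False)               # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # Step 2: indices of the top-k PCs
    P = eigvecs[:, top]
    C_tilde = np.eye(Z.shape[1]) - P @ P.T       # Step 4: projector onto the residual subspace
    residual = Zc @ C_tilde                      # C_tilde is symmetric, so rows are projected
    return np.sum(residual ** 2, axis=1)         # Step 5: large values suggest anomalies

# Synthetic traffic: three links that move together, plus one injected anomaly at row 50.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
Z = np.outer(t, [1.0, 1.0, 1.0]) + 0.01 * rng.normal(size=(200, 3))
Z[50] += np.array([3.0, -3.0, 0.0])

spe = subspace_residual_norms(Z, k=1)
most_anomalous = int(np.argmax(spe))
```

In practice a threshold on the residual norm has to be chosen; the sketch only ranks time steps by that norm.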

Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]

Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume

Limitation of Volume

Port scan anomaly: the traffic feature distribution changes, but traffic volume remains more or less the same

We can use entropy to capture the variations in a traffic feature
• Takes value 0 when the distribution is maximally concentrated
• Takes value log2 N when the distribution is maximally dispersed

[Figure: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure. Source: [22]]
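The two boundary values of the entropy can be checked with a short sketch. It assumes the usual sample entropy H = -Σ (n_i / S) log2(n_i / S), where n_i is the number of packets with feature value i and S is the total packet count; the function name and the toy port lists are illustrative.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Sample entropy (in bits) of the empirical distribution of a traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# All packets hit the same destination port: maximally concentrated.
concentrated = feature_entropy([80] * 8)

# Every packet hits a different port (as in a port scan): maximally dispersed, N = 8.
dispersed = feature_entropy(list(range(8)))
```

Both lists contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to log2 8 = 3.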

Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port

Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data: transactions between different players in the system; meta information about the individuals; an underlying graph structure

Challenges: track and model human behavior; anomalies caused by adaptive human adversaries; massive data sizes

Domains and players:
Insurance (auto, health [a]): claimant, provider, payer
Telecommunications: customer, provider
Credit cards: customer, supplier, bank
Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud

A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g., "User X makes n calls on average in January"
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths: anomaly detection is fast (good for real time); results are easy to explain

Weaknesses: need to create and maintain a large number of profiles; not dynamic; adequate historical data might not be available; too many false positives
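The activity-monitoring idea can be illustrated with a very small profile: the mean and standard deviation of a user's historical call counts, against which the current count is compared. This is a simplified sketch and not the method of [13]; the z-score test, the threshold of 3, and the numbers are assumptions chosen for the example.

```python
from statistics import mean, stdev

def deviation_score(history, current):
    """Absolute z-score of the current count against the user's historical profile."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # A constant history: any change at all is an infinite deviation.
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

# Hypothetical profile: calls made by user X in January of previous years.
history = [40, 42, 38, 41, 39]
THRESHOLD = 3.0  # assumed cut-off for a "significant" deviation

normal_month = deviation_score(history, 41)
suspicious_month = deviation_score(history, 120)
flagged = suspicious_month > THRESHOLD
```

Scoring one user is a constant-time lookup, which reflects the speed strength listed above; the cost is in keeping one such profile per individual up to date.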

Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph: communication networks (phone, email, SMS), provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. Ni: degree of node i
2. Ei: number of edges in the egonet
3. Wi: total weight of the egonet
4. λi: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thus transformed into a point in a multi-dimensional feature space
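The four egonet features can be computed directly from a weighted adjacency matrix. The sketch below is an illustration rather than the code of [1]; it assumes an undirected graph stored as a symmetric NumPy matrix with zeros for missing edges and no self-loops, and the function name `egonet_features` is made up.

```python
import numpy as np

def egonet_features(W, i):
    """Return (N_i, E_i, W_i, lambda_i) for node i of a symmetric weighted adjacency matrix W."""
    neighbors = np.flatnonzero(W[i])
    ego = np.append(neighbors, i)              # egonet = node i plus its neighbors
    sub = W[np.ix_(ego, ego)]                  # induced sub-graph
    upper = np.triu(sub, k=1)                  # count each undirected edge once
    n_i = len(neighbors)                       # degree of node i
    e_i = int(np.count_nonzero(upper))         # number of edges in the egonet
    w_i = float(upper.sum())                   # total weight of the egonet
    lam_i = float(np.linalg.eigvalsh(sub)[-1]) # principal (largest) eigenvalue
    return n_i, e_i, w_i, lam_i

# Star graph: node 0 connected to three leaves with unit weights.
W = np.zeros((4, 4))
for leaf in (1, 2, 3):
    W[0, leaf] = W[leaf, 0] = 1.0

features = egonet_features(W, 0)
```

For a star the egonet has exactly as many edges as neighbors, whereas a clique on the same nodes would have many more, which is the contrast the feature pairs are meant to expose.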

Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power law relationship

E.g., Egonet Density Power Law ($N_i$ vs. $E_i$): detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
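The score combines how many times y_i deviates from the fitted power law with how large the deviation is in absolute terms. A minimal sketch follows, assuming the natural logarithm, strictly positive values, and hypothetical fitted parameters C and θ.

```python
import math

def power_law_score(y_i, x_i, C, theta):
    """Anomaly score of node i against the fitted power law y = C * x**theta."""
    expected = C * x_i ** theta
    ratio = max(y_i, expected) / min(y_i, expected)   # multiplicative deviation, always >= 1
    return ratio * math.log(abs(y_i - expected) + 1)  # scaled by the (log) absolute deviation

# Hypothetical fit: E_i = N_i ** 1.0 (star-like egonets), C = 1.
on_the_line = power_law_score(y_i=4, x_i=4, C=1.0, theta=1.0)
far_above = power_law_score(y_i=16, x_i=4, C=1.0, theta=1.0)
```

A node that sits exactly on the fitted line scores 0, and the score grows as the node moves away from the line in either direction.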

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph: healthcare data (beneficiaries vs. providers), insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node a ∈ V1, find the "relevance" of all other nodes in V1 to a

RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node t ∈ V2

Find the set S_t = {a | ⟨a, t⟩ ∈ E}
Compute the |S_t| × |S_t| similarity matrix using relevance vectors for a ∈ S_t
Normality score = mean of non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} connected by edges E]
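One way to realize the relevance and normality scores is a random walk with restart, sketched below. Several choices here are assumptions rather than details given above: the restart probability of 0.15, the fixed number of power iterations, taking entry (a, b) of the similarity matrix to be the relevance of b to a, and the small example graph. Every node is assumed to have at least one edge.

```python
import numpy as np

def relevance_vector(A, a, restart=0.15, iters=200):
    """Random walk with restart from node a; A is a symmetric adjacency matrix."""
    P = A / A.sum(axis=0, keepdims=True)       # column-normalized transition matrix
    e = np.zeros(A.shape[0])
    e[a] = 1.0
    r = e.copy()
    for _ in range(iters):                     # power iteration towards the steady state
        r = (1 - restart) * (P @ r) + restart * e
    return r

def normality_score(A, t, n1):
    """Normality of V2 node t; V1 nodes are assumed to be indices 0 .. n1-1."""
    S_t = [a for a in range(n1) if A[a, t] > 0]
    if len(S_t) < 2:
        return None                            # no off-diagonal entries to average
    sim = np.array([relevance_vector(A, a)[S_t] for a in S_t])
    off_diagonal = sim[~np.eye(len(S_t), dtype=bool)]
    return float(off_diagonal.mean())

# V1 = {0, 1, 2, 3}, V2 = {4, 5, 6, 7}. Node 7 bridges two otherwise separate groups.
edges = [(0, 4), (1, 4), (0, 5), (1, 5), (2, 6), (3, 6), (0, 7), (2, 7)]
A = np.zeros((8, 8))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0

typical = normality_score(A, 4, n1=4)
bridging = normality_score(A, 7, n1=4)
```

Node 4 links two V1 nodes that are close to each other, while node 7 links V1 nodes from different groups, so node 7 receives the lower normality score.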

Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks, e.g., an anthrax attack

Data: emergency department visits; grocery data (example: [14]); clinical visits; weather/climate data

Challenges:
Weak signals in the data (ED cases involving cough ⇒ flu or SARS?)
Integration of multiple signals (lag analysis)
Accounting for spatial and temporal correlations

What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, with environmental and response variables

2. Sample from the BN (DB_baseline | current environment)

3. Compute contingency tables for rules for DB_baseline and DB_current. Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find the p-value for each rule using the χ² test. Null hypothesis: rows and columns of the tables for DB_baseline and DB_current are independent

5. Find the rule with the smallest p-value (the most significant rule). Repeat from Step 2
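Step 4 can be illustrated for a single rule with a 2 × 2 contingency table. The sketch assumes Pearson's χ² statistic with one degree of freedom, for which the p-value is erfc(sqrt(χ²/2)); the table layout, the counts, and the function name are illustrative, and all expected counts are assumed to be non-zero.

```python
import math

def chi2_p_value(table):
    """p-value of Pearson's chi-square test of independence for a 2x2 table.

    table = [[a, b], [c, d]]: rows are DB_current and DB_baseline,
    columns are "record matches the rule" and "record does not match".
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# The rule matches 40 of 100 current records but only 10 of 100 baseline records.
p_unusual = chi2_p_value([[40, 60], [10, 90]])

# The rule matches equally often in both databases.
p_ordinary = chi2_p_value([[50, 50], [50, 50]])
```

A small p-value means the rule's match rate today differs from the baseline by more than chance would explain, which is why the rule with the smallest p-value is the one reported.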

Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships

What happened yesterday? What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]

Add nodes for temporal and spatial dependencies; needs more data to train

Anomaly Detection in Climate and Weather

Science questions: identify natural and anthropogenic disasters; identify long time scale events (droughts, atmospheric rivers, cold fronts, etc.)

Data: ground observations; remote sensing data (satellites, air-borne); climate model simulation outputs; multiple variables, spatio-temporal (often with a height dimension as well)

Challenges: model spatio-temporal relationships across multiple variables; explain the cause of anomalies; massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain the cause

Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series

Difference from a "base period" (too simplistic)

Brings spatial smoothness (e.g., a mountaintop and a nearby valley can have very different temperatures); removes seasonality

Understand climate and weather phenomena

Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975-2007). Source: www.metoffice.gov.uk

Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$: mean of all data for location i; monthly mean values (account for seasonality); monthly z-score values; median (more robust); using a shorter "reference period"; 30-year moving window

Different methods show statistically significant differences [19]. What is the right strategy? One option is a weighted mean of different strategies (pick weights using Monte Carlo sampling)
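As an illustration of one of the choices above, the sketch below uses the monthly mean as the base b_i for a single location, so that each observation is compared with the average for its own calendar month. The function name and the four data points are made up for the example.

```python
def monthly_anomalies(values, months):
    """Return v' = v - b, where b is the mean of all observations in the same calendar month."""
    sums, counts = {}, {}
    for v, m in zip(values, months):
        sums[m] = sums.get(m, 0.0) + v
        counts[m] = counts.get(m, 0) + 1
    base = {m: sums[m] / counts[m] for m in sums}   # one base value per calendar month
    return [v - base[m] for v, m in zip(values, months)]

# Two Januaries and two Julys at one location (illustrative temperatures).
values = [10.0, 20.0, 12.0, 26.0]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

Subtracting a per-month base removes the seasonal cycle: the large January-to-July difference disappears and only the year-to-year departures remain.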

Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time (MATLAB: bwlabel, bwlabeln)

Follow-up science question: explain the cause

Figure: video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
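The two-step approach can be sketched without MATLAB as a threshold followed by a breadth-first labeling of connected cells. The sketch assumes face connectivity in (time, row, column), a hypothetical threshold and minimum component size, and a tiny hand-made precipitation anomaly grid.

```python
from collections import deque

def connected_components(cells):
    """Group (t, row, col) cells that touch along time or space; returns a list of sets."""
    remaining = set(cells)
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    components = []
    while remaining:
        start = remaining.pop()
        component, queue = {start}, deque([start])
        while queue:
            t, r, c = queue.popleft()
            for dt, dr, dc in steps:
                neighbor = (t + dt, r + dr, c + dc)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components

def drought_candidates(precip, threshold, min_size):
    """precip[t][row][col] holds precipitation anomalies; keep large dry components."""
    dry = [(t, r, c)
           for t, grid in enumerate(precip)
           for r, row in enumerate(grid)
           for c, value in enumerate(row)
           if value < threshold]                       # step 1: threshold
    return [comp for comp in connected_components(dry)
            if len(comp) >= min_size]                  # step 2: large components

precip = [
    [[-2.0, -2.0, 0.0], [0.0, 0.0, 0.0]],   # t = 0
    [[-2.0, 0.0, 0.0], [0.0, 0.0, -2.0]],   # t = 1
]
droughts = drought_candidates(precip, threshold=-1.0, min_size=2)
```

The isolated dry cell at t = 1 is discarded, while the three cells that persist across space and time form one candidate drought.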

More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water vapor content

Anomalies using a threshold

Connected components. Example source: http://newscenter.lbl.gov

Cold Fronts [23]

Surface winds and potential temperature fields

Methodology

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge

Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
Anomalies are not actionable: not identified in a timely fashion; resolution is not fine enough; cause not explained
Anomalies are lost among false positives

Solution:
Good validation data during design
Clear definition of a domain anomaly and its distinction from other potential competitors

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29-38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243-254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237-5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151-180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462-471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392-403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13-24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219-230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782-787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601-604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67-91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1-10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631-645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563-568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808-815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Anomaly detection using commute time (CDOF)

Construct the mutual k-nearest-neighbor graph G from the dataset

Compute the Laplacian matrix L of G and its eigensystem

Find the top N anomalies using the distance-based technique in commute time, with a pruning rule

Complexity: $O(n^3)$

The commute time method can detect global, local, and group anomalies
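As a rough illustration of the Laplacian step, the sketch below computes all pairwise commute times of a small graph through the pseudoinverse of its Laplacian, using the standard identity $c_{ij} = V_G (l^+_{ii} + l^+_{jj} - 2 l^+_{ij})$, where $V_G$ is the sum of degrees. The function name `commute_times` and the toy path graph are assumptions made for this example; building the mutual k-nearest-neighbor graph and the pruning rule are omitted, and the graph is assumed to be connected.

```python
import numpy as np

def commute_times(A):
    """Pairwise commute times for a connected, undirected graph.

    A is a symmetric adjacency (weight) matrix. This dense version costs
    O(n^3) because of the pseudoinverse.
    """
    A = np.asarray(A, dtype=float)
    deg = A.sum(axis=1)
    L = np.diag(deg) - A              # graph Laplacian
    L_pinv = np.linalg.pinv(L)        # Moore-Penrose pseudoinverse of L
    vol = deg.sum()                   # graph volume V_G
    d = np.diag(L_pinv)
    return vol * (d[:, None] + d[None, :] - 2.0 * L_pinv)

# Toy path graph 0 - 1 - 2 (hypothetical example data).
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
C = commute_times(A)
print(np.round(C, 6))
```

On a tree, the commute time between two nodes equals the graph volume times the number of edges separating them, which makes the path graph a convenient sanity check.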


Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear-time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear-time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers; also see work by Ioannis Koutis from CMU


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections
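A minimal sketch of the random projection idea, assuming a plain Gaussian projection matrix (this is not the PINN algorithm itself): points are mapped to a lower dimension and pairwise distances are approximately preserved, so a nearest-neighbor index can be built in the reduced space. The function name `random_project` and the sizes used are illustrative assumptions.

```python
import numpy as np

def random_project(X, target_dim, seed=0):
    """Project rows of X to target_dim dimensions with a Gaussian matrix.

    Scaling by 1/sqrt(target_dim) keeps squared distances unbiased.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R

# Hypothetical data: 20 points in 2000 dimensions, reduced to 500.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2000))
Y = random_project(X, 500)

original = np.linalg.norm(X[0] - X[1])
projected = np.linalg.norm(Y[0] - Y[1])
ratio = projected / original
print(Y.shape, round(ratio, 3))  # ratio should be close to 1
```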


PINN Algorithm (ICDM 2010)


PINN Guarantee

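The PINN algorithm provides probabilistic guarantees

Under certain assumptions about the intrinsic dimensionality (c), with high probability

$\frac{1 - \varepsilon}{1 + \varepsilon} \cdot \mathrm{LOF}(p) \le \widehat{\mathrm{LOF}}(p) \le \frac{1 + \varepsilon}{1 - \varepsilon} \cdot \mathrm{LOF}(p)$

where $\widehat{\mathrm{LOF}}(p)$ denotes the LOF score computed in the projected space

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust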


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms: using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms: utilization of device memory; minimize CPU ⇔ GPU transfer of data; keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model (host, kernel, grid on the device, thread blocks Block(0,0) to Block(1,2), threads Thread(0,0) to Thread(2,3)) and memory hierarchy (host memory; streaming multiprocessors SM 0 to SM 2 with registers, SP 0 to SP 7, and shared memory; global, texture, and constant memory on the device)]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory; the data layout in memory should be optimized

Each thread computes the distance between a pair of instances; this time can be used to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
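The block-wise decomposition above can be sketched on the CPU with NumPy. This is not a GPU implementation: it only mirrors the idea of processing one block of rows at a time against the full data set and keeping, for each instance, the distance to its k-th nearest neighbor. The function name `top_n_outliers`, the block size, and the toy data are assumptions for illustration.

```python
import numpy as np

def top_n_outliers(X, k, n_top, block=128):
    """Return indices and scores of the n_top points with the largest
    distance to their k-th nearest neighbor, computed block by block."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    kth = np.empty(n)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Squared distances between this block of rows and all points.
        d2 = sq[start:stop, None] + sq[None, :] - 2.0 * X[start:stop] @ X.T
        np.maximum(d2, 0.0, out=d2)   # guard against tiny negative values
        # Exclude each point's distance to itself.
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf
        kth[start:stop] = np.sqrt(np.partition(d2, k - 1, axis=1)[:, k - 1])
    order = np.argsort(-kth)[:n_top]
    return order, kth[order]

# Hypothetical data: a tight cluster plus one far-away point (index 40).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(40, 2)), [[50.0, 50.0]]])
idx, dist = top_n_outliers(X, k=3, n_top=1)
print(idx, dist)
```

Each block only needs its own rows plus the full set of points, which is the property that lets a GPU version stage blocks through shared memory.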



Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$); choose high-confidence rules ($P(Y|X)$)

For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$

Y is not observed when X is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute $P(Z)$ for test data record Z

Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$, with $A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t: for each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$

Score t based on all r-values: assign the minimum r-value as score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets: only consider subsets up to size k; ignore subsets with frequency less than a threshold $\alpha$; avoid comparing independent subsets of attributes, $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

$C(a_t)$: number of training instances with $A = a_t$

$N$: total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

ADTrees work faster for low-arity attributes
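The Laplace-smoothed ratio can be tried on a handful of records. The sketch below counts matches by a direct scan rather than with AD Trees; the function names, the toy records (modeled on the mushroom table), and the choice of attribute sets are assumptions for illustration only.

```python
def laplace_estimate(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(records, a_idx, b_idx, test):
    """Laplace-smoothed r(a_t, b_t) for disjoint attribute index sets."""
    n = len(records)
    a_t = tuple(test[i] for i in a_idx)
    b_t = tuple(test[i] for i in b_idx)
    c_a = sum(tuple(r[i] for i in a_idx) == a_t for r in records)
    c_b = sum(tuple(r[i] for i in b_idx) == b_t for r in records)
    c_ab = sum(
        tuple(r[i] for i in a_idx) == a_t and tuple(r[i] for i in b_idx) == b_t
        for r in records
    )
    return laplace_estimate(c_ab, n) / (
        laplace_estimate(c_a, n) * laplace_estimate(c_b, n)
    )

# Toy training records: (cap-shape, cap-surface, habitat, type).
records = [
    ("convex", "smooth", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
    ("bell", "smooth", "meadows", "edible"),
    ("convex", "scaly", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
]

# A = {habitat}, B = {type}
r_normal = r_value(records, [2], [3], ("convex", "smooth", "urban", "poisonous"))
r_unusual = r_value(records, [2], [3], ("convex", "smooth", "urban", "edible"))
print(r_normal, r_unusual)
```

The unseen combination (urban, edible) gets a lower r-value than the combination that co-occurs in training, which is the signal the test relies on.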


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure; plug it into a distance/density/clustering based method

Simplest: Hamming distance

$h(A_i, B_i) = 1$ if $A_i \ne B_i$, and $0$ if $A_i = B_i$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A| |B|}}$

The standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
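Both measures are short to write down. The sketch below uses the standard dynamic-programming LCS (not the faster Hunt-Szymanski variant); the function names are chosen for this example.

```python
import math

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a| * |b|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd login pwd login pwd logout".split()
print(hamming(s1, s2), lcs_distance(s1, s2))
```

Unlike the Hamming distance, the LCS-based distance accepts sequences of different lengths and tolerates insertions.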


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$ windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find its frequency in the normal dictionary; the anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues: penalizes low-frequency windows in the normal dictionary; rewards high-frequency windows that might not be relevant; anomalous sequences can be constructed that will escape detection
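A small sketch of the dictionary-based scheme follows. The exact form of the "inverse of the aggregate frequencies" is not specified here, so the score used below, $1 / (1 + \text{mean window frequency})$, is an assumption chosen to avoid division by zero; function names and sequences are illustrative.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Count every window that occurs in the normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def window_score(seq, counts, k):
    """Assumed score: 1 / (1 + mean dictionary frequency of the windows)."""
    ws = windows(seq, k)
    if not ws:
        return 0.0
    mean_freq = sum(counts[w] for w in ws) / len(ws)
    return 1.0 / (1.0 + mean_freq)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
    "login pwd web mail ssh web mail logout".split(),
]
test_seq = "login pwd login pwd login pwd logout".split()

counts = build_dictionary(normal, k=3)
score_normal = window_score(normal[0], counts, k=3)
score_test = window_score(test_seq, counts, k=3)
print(score_normal, score_test)
```

None of the test sequence's windows appear in the dictionary, so it receives the maximum score of 1.0 under this scoring choice.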


Using Probabilistic Models

Probability of occurrence of sequence S

$P(S) = \prod_{i=1}^{n} P(S_i | S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i | S_1 \ldots S_{i-1}) = P(S_i | S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i | S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

f is estimated from the normal dictionary

The anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace it with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
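The frequency-ratio estimate and the "longest sufficiently frequent suffix" fallback can be sketched with plain n-gram counts. This is not a Probabilistic Suffix Tree implementation; the names, the `min_count` threshold, the probability floor `eps`, and the use of the mean negative log-probability (a monotone stand-in for the inverse of the normalized probability) are all assumptions for illustration.

```python
import math
from collections import Counter

def train_ngram_counts(sequences, k):
    """Count all n-grams of length 1..k+1; counts[()] is the symbol total."""
    counts = Counter()
    for seq in sequences:
        seq = tuple(seq)
        counts[()] += len(seq)
        for length in range(1, k + 2):
            for i in range(len(seq) - length + 1):
                counts[seq[i:i + length]] += 1
    return counts

def symbol_probability(counts, context, symbol, min_count=1):
    """f(context + symbol) / f(context), backing off to shorter suffixes
    of the context until one has been seen at least min_count times."""
    for j in range(len(context), -1, -1):
        suffix = tuple(context[len(context) - j:])
        if counts[suffix] >= min_count:
            return counts[suffix + (symbol,)] / counts[suffix]
    return 0.0

def anomaly_score(counts, seq, k, eps=1e-6):
    """Mean negative log-probability per symbol (higher = more anomalous)."""
    seq = tuple(seq)
    log_p = 0.0
    for i, s in enumerate(seq):
        context = seq[max(0, i - k):i]
        log_p += math.log(max(symbol_probability(counts, context, s), eps))
    return -log_p / len(seq)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
    "login pwd web mail ssh web mail logout".split(),
]
counts = train_ngram_counts(normal, k=2)
score_normal = anomaly_score(counts, normal[0], k=2)
score_test = anomaly_score(counts, "login pwd login pwd login pwd logout".split(), k=2)
print(score_normal, score_test)
```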


Finding Contextual Anomalies

Sometimes contextual information is available about data: not used directly as a feature; well understood, no anomalies in the context; can reduce false positives and yield interesting anomalies

[Figure: Number of Fever Cases vs. Maximum Daily Temperature, marking the clearest outlier and a conditional anomaly. Example adapted from [30]]

Contextual anomalies - anomalous with respect to a context

Context is defined using environmental variables: spatial (latitude, longitude); graph context (edges, weights); temporal location; domain specific (demographic, other)

How to incorporate context: reduce to traditional anomaly detection (subset on context); explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

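Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: environmental attributes ($u_1$, $u_2$) grouped into Gaussians $U_1$, $U_2$, $U_3$ and indicator attributes ($v_1$, $v_2$) grouped into $V_1$, $V_2$, $V_3$, with example mapping $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j | U_i)$

4. Score a test instance $d = [u, v]$

$S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j) \, p(V_j | U_i)$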
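The scoring step (step 4) can be written directly once the Gaussians and the mapping are given. The sketch below assumes they have already been learned and simply evaluates $S$; learning the mixture models and the mapping is not shown. The function names and the hand-picked parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at x."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    cov = np.atleast_2d(np.asarray(cov, dtype=float))
    diff = x - mean
    norm = np.sqrt((2.0 * np.pi) ** diff.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    U and V are lists of (mean, covariance) pairs; mapping[i][j] = p(V_j|U_i).
    """
    total = 0.0
    for i, (mu_u, cov_u) in enumerate(U):
        inner = sum(
            gaussian_pdf(v, mu_v, cov_v) * mapping[i][j]
            for j, (mu_v, cov_v) in enumerate(V)
        )
        total += gaussian_pdf(u, mu_u, cov_u) * inner
    return total

# Hand-picked 1-D example: two contexts, two indicator Gaussians.
U = [(0.0, 1.0), (10.0, 1.0)]
V = [(0.0, 1.0), (5.0, 1.0)]
mapping = [[0.9, 0.1],   # context U_1 mostly goes with V_1
           [0.1, 0.9]]   # context U_2 mostly goes with V_2

s_expected = conditional_score(0.0, 0.0, U, V, mapping)   # v fits its context
s_unusual = conditional_score(0.0, 5.0, U, V, mapping)    # v unusual for context
s_unit = conditional_score(0.0, 0.0, [(0.0, 1.0)], [(0.0, 1.0)], [[1.0]])
print(s_expected, s_unusual)
```

A low $S$ indicates that the indicator values are unlikely given the environmental context, even if they are common in other contexts.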


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

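Find anomalous groups in data [34]

Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$

2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$

3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} | \eta_k)$, $k = 1, 2, \ldots, K$

4. For each point in the group

1. Draw topic membership $z_{mn} \sim M(\theta_m)$

2. Generate point $x_{mn} \sim P(x_{mn} | \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables $\alpha$, $\pi$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$ and plates N, M, K, T]

Model Parameters

$M(\pi)$ - Multinomial

Genre - $Dir(\alpha_t)$

Topic generators $P(\cdot | \eta_k)$ - GIW

Point generators $P(x_{mn} | \beta_{nk})$ - Multivariate Gaussian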


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: example with Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists: confusion matrix; traditional evaluation metrics (class imbalance); ROC curve

Validation set does not exist: use domain expertise to find TP and FP; FN is harder to estimate (pseudo false negative estimation techniques [25])

             Predicted
             a     n
Actual  a    TP    FN
        n    FP    TN

$Acc = \frac{TP + TN}{\sum}$

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
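The four metrics follow directly from the confusion-matrix counts. The sketch below uses made-up counts to show why accuracy alone is misleading under class imbalance; the function name and the numbers are assumptions for illustration.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from confusion counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    f = 2 * rec * prec / (rec + prec) if (rec + prec) else 0.0
    return acc, rec, prec, f

# Hypothetical detector: 10 true anomalies among 100 instances.
acc, rec, prec, f = evaluation_metrics(tp=8, fn=2, fp=4, tn=86)
print(acc, rec, prec, f)

# A detector that never flags anything still looks accurate.
acc_none, rec_none, prec_none, f_none = evaluation_metrics(tp=0, fn=10, fp=0, tn=90)
print(acc_none, rec_none)
```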


Evaluating Anomaly Detection Methods - Scores

Convert to binary output: use a threshold $\delta$ on the score (scale issues [21]); take the top x as anomalies

ROC curve by varying x or $\delta$

Quality of output: does the output "suggest" x or $\delta$? Which output is better?


Unifying Scores

Different methods assign scores in different ranges

kNN-based scores: [0, 1]; LOF scores: [1, 6]; ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity $\Rightarrow S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling); suited for high-dimensional data

3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling), where $cdf^{\gamma}_S(o) = P(k, S(o) / \theta)$ and P is the regularized Gamma function; suited for low-dimensional data
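Gaussian scaling is the easiest of these to try out. The sketch below estimates $\mu_S$ and $\sigma_S$ from the scores themselves (population standard deviation) and applies the erf-based formula; the function name and the sample scores are assumptions for illustration.

```python
import math
import numpy as np

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    s = np.asarray(scores, dtype=float)
    mu, sigma = s.mean(), s.std()
    if sigma == 0.0:
        return np.zeros_like(s)   # all scores identical: nothing stands out
    return np.array(
        [max(0.0, math.erf((v - mu) / (sigma * math.sqrt(2.0)))) for v in s]
    )

# Hypothetical regular scores: four normal instances and one anomaly.
scores = [1.0, 1.0, 1.0, 1.0, 10.0]
normalized = gaussian_scaling(scores)
print(normalized)
```

Scores at or below the mean map to 0, and the result can be read as a rough probability-like value in [0, 1].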


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available: multivariate continuous data [27]; multivariate categorical data [5]; discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = \langle x, y \rangle$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = \langle x', y' \rangle \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add z back to the data set

[Figure: points z and z′ in the (x, y) plane, with the injected anomaly marked]


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available: feature identification; defining normal and anomalous behavior; tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS) find attacks that match a set of predefined patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS) detect zero-day attacks without any pre-identified signature, by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: nodes 1 to 4 connected through a router, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

A link measurement is the traffic that enters at a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
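In the usual formulation of this relationship, link traffic is the routing matrix applied to the OD flow vector, $y = Ax$, with $A_{ij} = 1$ when OD flow j traverses link i. The tiny example below uses a made-up routing matrix and traffic volumes to show how a spike on one OD flow spreads over the links it crosses; all numbers are assumptions for illustration.

```python
import numpy as np

# Hypothetical routing matrix: 3 links (rows) x 3 OD flows (columns).
A = np.array([[1, 0, 0],
              [1, 1, 0],
              [0, 1, 1]])

x = np.array([100, 40, 25])          # OD flow volumes at time t
y = A @ x                            # resulting link measurements

# A volume anomaly of +30 on the second OD flow shows up on links 2 and 3.
x_anomalous = x + np.array([0, 30, 0])
y_anomalous = A @ x_anomalous
print(y, y_anomalous)
```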


Network Anomalies Detection Problem


Why care about OD flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Project onto the normal traffic subspace, called $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, called $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a residual vector in $\tilde{Z}$ is large, then it is an "anomaly"
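The five steps can be sketched with NumPy on synthetic data. Here rows are time bins and columns are link measurements, so each row's residual is its projection onto the abnormal subspace; the function name, the synthetic traffic, and the injected spike are assumptions for illustration, and the threshold step is replaced by simply looking at the largest residual norm.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Norm of each row's projection onto the residual (abnormal) subspace."""
    Z = np.asarray(Z, dtype=float)
    Zc = Z - Z.mean(axis=0)                       # center each column
    cov = np.cov(Zc, rowvar=False)                # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                           # Step 2: top-k PCs
    normal_part = Zc @ P @ P.T                    # Step 3: normal subspace
    residual = Zc - normal_part                   # Step 4: abnormal subspace
    return np.linalg.norm(residual, axis=1)       # Step 5: residual norms

# Synthetic traffic: three links driven by one common trend, plus noise.
rng = np.random.default_rng(0)
t = np.linspace(-1.0, 1.0, 50)
Z = np.outer(t, [1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(50, 3))
Z[25] += [5.0, -5.0, 0.0]                         # injected anomaly at row 25

norms = subspace_residual_norms(Z, k=1)
print(int(np.argmax(norms)), norms.max())
```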


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

Most important features include the 5-tuple: source & destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]

Port scan anomaly (the traffic feature distribution changes; however, traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature:

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure annotations: the port scan is dwarfed in volume metrics… but stands out in feature entropy, which also reveals its structure]
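A minimal sketch of the entropy measure described above, assuming the standard sample (Shannon) entropy $H = -\sum_i (n_i/S)\log_2(n_i/S)$ over the empirical distribution of one traffic feature. The function name `feature_entropy` is illustrative and not from the slides.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`.

    `values` is any iterable of hashable feature observations, for example
    the destination port of every packet seen in one time bin.
    """
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, a time bin in which every packet goes to port 80 has entropy 0, while a bin in which 1024 distinct ports are each hit once has entropy $\log_2 1024 = 10$ bits, even though the packet count in the second bin may be unremarkable.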


Entropy Based versus Volume Based

DoS/DDoS attacks - a spike in traffic data toward a dominant destination IP

Scan anomaly - a spike in traffic data from a dominant source IP

Flash crowd anomaly - again, a spike in traffic data to a dominant destination IP

Worm anomaly - a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domains and players
• Insurance (auto/health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
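The two activity-monitoring steps can be illustrated with a small sketch. It assumes a profile is just the mean and standard deviation of a per-period count (such as calls per month) and that a "significant deviation" means more than `z_threshold` standard deviations from the mean. Both choices, and the function names, are assumptions for illustration rather than the method of [13].

```python
from statistics import mean, stdev

def build_profiles(history):
    """history: {user: [count in each past period]} -> {user: (mean, std)}."""
    return {user: (mean(counts), stdev(counts))
            for user, counts in history.items()
            if len(counts) >= 2}          # stdev needs at least two points

def flag_deviations(profiles, current, z_threshold=3.0):
    """Return users whose current count deviates strongly from their profile."""
    flagged = []
    for user, value in current.items():
        if user not in profiles:
            continue                      # no historical profile available
        mu, sigma = profiles[user]
        if sigma == 0:
            if value != mu:               # any change from a constant history
                flagged.append(user)
        elif abs(value - mu) / sigma > z_threshold:
            flagged.append(user)
    return flagged
```

The sketch also makes the listed weaknesses concrete: one profile must be stored per user, users without history are skipped, and the fixed threshold directly trades detections against false positives.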


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is thereby transformed into a point in a multi-dimensional space
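The four egonet features listed above can be computed as in the following sketch. It assumes an undirected graph stored as a dense, symmetric, weighted adjacency matrix with a zero diagonal, and takes the egonet of node i to be i, its neighbors, and all edges among them. The function name `egonet_features` is illustrative.

```python
import numpy as np

def egonet_features(A, i):
    """Return (N_i, E_i, W_i, lambda_i) for node i.

    A : (n, n) symmetric weighted adjacency matrix with zero diagonal.
    """
    A = np.asarray(A, dtype=float)
    neighbors = np.flatnonzero(A[i])
    nodes = np.concatenate(([i], neighbors))   # ego plus its neighbors
    sub = A[np.ix_(nodes, nodes)]              # induced sub-graph (egonet)
    upper = np.triu(sub, k=1)                  # count each undirected edge once
    N = len(neighbors)                         # degree of node i
    E = int(np.count_nonzero(upper))           # number of edges in egonet
    W = float(upper.sum())                     # total weight of egonet
    lam = float(np.linalg.eigvalsh(sub)[-1])   # largest eigenvalue
    return N, E, W, lam
```

With unit weights, the center of a star has $E_i = N_i$, while a member of a clique has $E_i = N_i(N_i+1)/2$, which are the two extremes that the density power law on the following slide is meant to separate.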


Identifying Anomalies

Traditional anomaly detection (LOF)
• Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
• Identify relevant feature pairs and power law relationship
• E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ - detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = Cx^{\theta}$):

$S_i = \frac{\max(y_i, Cx_i^{\theta})}{\min(y_i, Cx_i^{\theta})} \cdot \log(|y_i - Cx_i^{\theta}| + 1)$
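The score above can be sketched as follows. The fitting procedure is an assumption: here $C$ and $\theta$ are estimated by ordinary least squares on the log-log values, which is one common way to fit a power law and not necessarily the procedure used in [1]. The function name `power_law_scores` is illustrative, and all `x` and `y` values are assumed positive.

```python
import numpy as np

def power_law_scores(x, y):
    """Score each point by its deviation from a fitted power law y = C * x**theta."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Fit log y = theta * log x + log C (least squares; an assumption).
    theta, log_c = np.polyfit(np.log(x), np.log(y), 1)
    y_fit = np.exp(log_c) * x ** theta
    ratio = np.maximum(y, y_fit) / np.minimum(y, y_fit)
    return ratio * np.log(np.abs(y - y_fit) + 1.0)
```

Points that lie exactly on the fitted line get a score of 0; the ratio term penalizes relative deviation and the log term penalizes absolute deviation.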


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
• RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
• Normality Score = mean of non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges E]
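One way to realize the relevance and normality scores above is a random walk with restart, computed by power iteration. The sketch below makes several assumptions that the slide does not state: the restart probability of 0.15, the fixed number of iterations, and the use of the stationary visit probabilities as the relevance scores. The function names are illustrative.

```python
import numpy as np

def rwr_relevance(B, a, restart=0.15, iters=200):
    """Relevance of every V1 node to query node a (a row index of B).

    B : (n1, n2) biadjacency matrix; B[a, t] > 0 means edge <a, t> exists.
    """
    n1, n2 = B.shape
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, n1:] = B                      # V1 -> V2 edges
    A[n1:, :n1] = B.T                    # V2 -> V1 edges
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0        # isolated nodes keep a zero column
    P = A / col_sums                     # column-stochastic transition matrix
    e = np.zeros(n1 + n2)
    e[a] = 1.0                           # the walk restarts at the query node
    r = e.copy()
    for _ in range(iters):
        r = (1 - restart) * (P @ r) + restart * e
    return r[:n1]                        # keep only the V1 part

def normality_score(B, t, restart=0.15):
    """Mean pairwise relevance among the V1 neighbors of V2 node t."""
    S_t = np.flatnonzero(B[:, t])
    if len(S_t) < 2:
        return float("nan")              # no off-diagonal entries to average
    R = np.array([rwr_relevance(B, a, restart)[S_t] for a in S_t])
    off_diagonal = ~np.eye(len(S_t), dtype=bool)
    return float(R[off_diagonal].mean())
```

A node t whose neighbors come from otherwise unrelated parts of the graph gets a low normality score, since its neighbors have little relevance to one another.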


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
• Anthrax attack

Data
• Emergency department visits
• Grocery data (Example: [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data
  ◦ ED cases involving cough ⇒ Flu or SARS?
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat from Step 2
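Steps 3 to 5 can be illustrated for single-assignment rules with the sketch below. It covers only the rule scoring: learning and sampling the Bayesian network (Steps 1 and 2) are outside its scope, and `baseline` is simply assumed to be a list of records already drawn from it. Records are represented as dictionaries, and the helper names are hypothetical. For a 2×2 table the $\chi^2$ statistic has one degree of freedom, so its p-value can be written with the complementary error function.

```python
import math

def chi2_pvalue_2x2(table):
    """p-value of the chi-square independence test on a 2x2 table (1 d.o.f.)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 1.0                       # degenerate table: no evidence
    chi2 = n * (a * d - b * c) ** 2 / (rows[0] * rows[1] * cols[0] * cols[1])
    return math.erfc(math.sqrt(chi2 / 2))

def rule_pvalue(current, baseline, rule):
    """Score a single-assignment rule (attribute, value)."""
    attr, value = rule
    cur = sum(1 for rec in current if rec.get(attr) == value)
    base = sum(1 for rec in baseline if rec.get(attr) == value)
    table = ((cur, len(current) - cur), (base, len(baseline) - base))
    return chi2_pvalue_2x2(table)

def most_significant_rule(current, baseline, rules):
    """Return the rule with the smallest p-value."""
    return min(rules, key=lambda rule: rule_pvalue(current, baseline, rule))
```

Because many rules are scored and only the best one is kept, the raw p-value of the winning rule is optimistic; correcting for this is outside the scope of the sketch.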


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events - droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (Too simplistic?)
• Brings spatial smoothness (e.g., a mountain top and nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI)
  ◦ Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
• Mean of all data for location i
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  ◦ 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
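Two of the base choices above, monthly means and monthly z-scores, can be sketched as follows. The sketch assumes a single location, a monthly series that starts in January, a length that is a multiple of 12, and a base computed over the whole record rather than a shorter reference period. The function name `monthly_anomalies` is illustrative.

```python
import numpy as np

def monthly_anomalies(v, zscore=False):
    """Anomaly series v' = v - b, with b the mean for each calendar month.

    v : 1-D monthly series starting in January, length a multiple of 12.
    If zscore is True, each anomaly is also divided by the standard
    deviation of its calendar month.
    """
    table = np.asarray(v, dtype=float).reshape(-1, 12)   # rows: years, cols: months
    base = table.mean(axis=0)                            # b_i per calendar month
    anomalies = table - base
    if zscore:
        sd = table.std(axis=0)
        anomalies = anomalies / np.where(sd > 0, sd, 1.0)  # guard constant months
    return anomalies.ravel()
```

A purely seasonal signal maps to an all-zero anomaly series, which is the sense in which this construction removes seasonality.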


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel, bwlabeln
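The two-step approach can be sketched in Python on a precipitation-anomaly array indexed as (time, latitude, longitude). The labeling below is a plain breadth-first search with 6-connectivity (face neighbors only); the connectivity, the threshold direction (values below the threshold count as dry), and the function names are assumptions made for this illustration.

```python
from collections import deque
import numpy as np

def label_components(mask):
    """Label 6-connected components of a 3-D boolean array.

    Returns (labels, count); labels is 0 outside the mask and 1..count inside.
    """
    mask = np.asarray(mask, dtype=bool)
    labels = np.zeros(mask.shape, dtype=int)
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    count = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                      # already part of an earlier component
        count += 1
        labels[start] = count
        queue = deque([start])
        while queue:
            t, y, x = queue.popleft()
            for dt, dy, dx in offsets:
                nb = (t + dt, y + dy, x + dx)
                inside = all(0 <= c < s for c, s in zip(nb, mask.shape))
                if inside and mask[nb] and not labels[nb]:
                    labels[nb] = count
                    queue.append(nb)
    return labels, count

def find_droughts(precip_anomaly, threshold, min_size):
    """Step 1: threshold. Step 2: keep components with at least min_size cells."""
    mask = np.asarray(precip_anomaly) < threshold
    labels, count = label_components(mask)
    sizes = np.bincount(labels.ravel(), minlength=count + 1)
    keep = [lab for lab in range(1, count + 1) if sizes[lab] >= min_size]
    return labels, keep
```

Isolated dry cells are discarded by the size filter, so only anomalies that persist across neighboring grid cells and time steps are reported.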

Follow-up science question: Explain cause

(drought_1901_2006_1.avi)

Figure: Video courtesy of Dr. Arindam Banerjee


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components

Example Src - http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  ◦ Not identified in timely fashion
  ◦ Resolution is not fine enough
  ◦ Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Fast estimation of commute time

Spielman and Srivastava [31] combined random projection and a linear time solver to build a structure from which the commute time between two nodes can be computed in $O(\log n)$ time

Complexity of CDOF: $O(n^3) \rightarrow O(n \log n)$

Uses a near-linear time solver for a linear system of equations $Ax = b$

Spielman and Teng solvers. Also see work by Iannis Koutis from CMU
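For reference, the exact computation that the fast estimator approximates can be sketched with the standard identity $c_{ij} = V_G\,(l^{+}_{ii} + l^{+}_{jj} - 2\,l^{+}_{ij})$, where $L^{+}$ is the pseudoinverse of the graph Laplacian and $V_G$ is the sum of all node degrees. This dense version costs $O(n^3)$ and is only a baseline; it does not implement the random projection or the near-linear solver described above. The function name `commute_times` is illustrative.

```python
import numpy as np

def commute_times(A):
    """All-pairs commute times of a connected, undirected, weighted graph.

    A : (n, n) symmetric weighted adjacency matrix.
    """
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    L = np.diag(degrees) - A                 # graph Laplacian
    L_pinv = np.linalg.pinv(L)               # O(n^3) pseudoinverse
    d = np.diag(L_pinv)
    volume = degrees.sum()                   # V_G
    return volume * (d[:, None] + d[None, :] - 2 * L_pinv)
```

On a path 0 - 1 - 2 with unit weights this gives a commute time of 4 between adjacent nodes and 8 between the two end nodes.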


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution for the curse of high dimensionality is to use random projections
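A minimal sketch of a random projection, assuming a dense Gaussian projection matrix scaled by $1/\sqrt{k}$ so that squared distances are preserved in expectation. The target dimension `k`, the seed handling, and the function names are illustrative choices, not taken from the slides.

```python
import numpy as np

def random_project(X, k, seed=0):
    """Project the rows of X from d dimensions down to k dimensions."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], k)) / np.sqrt(k)   # Gaussian projection
    return X @ R

def pairwise_distances(X):
    """Dense (n, n) matrix of Euclidean distances between rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)
```

Nearest-neighbor queries can then be run with an index built in the k-dimensional space, where indexes remain effective, at the price of a controlled distortion of the distances.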


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN algorithm provides probabilistic guarantees

Under certain assumptions about intrinsic dimensionality (c), with high probability:

$\frac{1-\epsilon}{1+\epsilon} \cdot LOF(p) \le \widehat{LOF}(p) \le \frac{1+\epsilon}{1-\epsilon} \cdot LOF(p)$

where $\widehat{LOF}(p)$ denotes the LOF estimate computed by PINN

In practice, we do not know the intrinsic dimensionality of the data. However, random projections are quite robust


Examples: high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples: high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples: local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
  Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  Utilization of device memory
  Minimize CPU ↔ GPU transfer of data
  Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution and memory model. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2); each thread block holds threads Thread(0,0) to Thread(2,3). The device contains streaming multiprocessors SM 0 to SM 2, each with registers, stream processors SP 0 to SP 7 and shared memory, plus global, texture and constant memory; the host has its own host memory.]


Implementing DB(k,N) on GPUs

Return the top N data instances whose distances to the k-th nearest neighbor are largest (serial DB(k,N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory
  Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
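
As a reference point for the GPU version, the serial $O(n^2)$ DB(k,N) definition above can be sketched in a few lines. This is a minimal CPU-only illustration in Python, not the CUDA implementation from the slides; the function name `db_k_n` and the toy points are assumptions made for the example.

```python
import heapq
import math

def db_k_n(points, k, n_top):
    """Indices of the n_top points whose distance to their k-th nearest neighbor is largest."""
    kth_dist = []
    for i, p in enumerate(points):
        # Distances from p to every other point: the O(n^2) pairwise step.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        kth_dist.append(dists[k - 1])
    # Selecting the top N corresponds to the sorting step on the slide.
    return heapq.nlargest(n_top, range(len(points)), key=kth_dist.__getitem__)

# Hypothetical toy data: four clustered points and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
top = db_k_n(points, k=2, n_top=1)
```

On a GPU, the inner distance computation is the part that would be distributed over threads, one pair of instances per thread.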


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks
Time Series: Sensor Networks, Healthcare
Discrete Sequences: Genomic, System Calls
Spatial: GIS, Image analysis
Spatio-temporal: Remote sensing, Climate
Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  Binarize data
  Learn rules ($X \Rightarrow Y$)
  Choose high-confidence rules ($P(Y \mid X)$)
  For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y \mid X)$
  Anomalous if $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]
  Learn Bayesian network structure and parameters
  Compute $P(Z)$ for test data record $Z$
  Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  Use a similarity measure ($S(X_1, X_2)$)
  Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$ r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset $$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$
  For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$
  Score $t$ based on all r-values
    Assign minimum r-value as score
    Take product of all r-values

Need to compare an exponential number of pairs of subsets
  Only consider subsets up to size $k$
  Ignore subsets with frequency less than a threshold $\alpha$
  Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_{\mu}$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$ \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)} $$

  $C(a_t)$: number of training instances with $A = a_t$
  $N$: total number of training instances

Laplace Smoothing

$$ E(p) = \frac{C(p) + 1}{N + 2} $$

$$ r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)} $$

Speedup Tricks
  Replace rare attribute values with a generic attribute value (reduce arity)
  Use an efficient data structure to query for counts (AD Trees [26])
  AD Trees work faster for low-arity attributes
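
The r-value with Laplace smoothing can be illustrated with plain counting. The sketch below is an assumption-laden simplification: it only compares single attributes against single attributes (subsets of size 1), scores a record by its minimum r-value, and uses hypothetical helper names (`build_counts`, `cpt_score`) and toy records loosely modeled on the mushroom table.

```python
from collections import Counter
from itertools import combinations

def laplace(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def build_counts(train):
    """Count single (attribute, value) items and pairs of items in the training records."""
    single, pair = Counter(), Counter()
    for record in train:
        items = list(enumerate(record))
        single.update(items)
        pair.update(combinations(items, 2))
    return single, pair

def cpt_score(record, single, pair, n):
    """Minimum r-value over all attribute pairs; a low score suggests an anomaly."""
    items = list(enumerate(record))
    return min(
        laplace(pair[(a, b)], n) / (laplace(single[a], n) * laplace(single[b], n))
        for a, b in combinations(items, 2)
    )

# Hypothetical training data: two consistent kinds of records.
train = [("convex", "smooth", "urban")] * 6 + [("bell", "scaly", "meadows")] * 6
single, pair = build_counts(train)
seen = cpt_score(("convex", "smooth", "urban"), single, pair, len(train))
unusual = cpt_score(("convex", "scaly", "urban"), single, pair, len(train))
```

The unusual record combines values that each occur often but never together, so its smallest r-value falls well below 1, while the familiar record scores above 1.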


Finding Anomalies in Discrete Sequences

Many problem formulations
  1. Anomalous symbols in a sequence
  2. Anomalous subsequence in a sequence
  3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
Plug into a distance/density/clustering based method

Simplest: Hamming distance

$$ h(A_i, B_i) = \begin{cases} 1, & A_i \ne B_i \\ 0, & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i) $$

Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence (LCS)

$$ D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}} $$

Standard dynamic programming method is slow
Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
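
Both measures are short to write down. The following sketch uses the standard dynamic-programming LCS (the slow variant mentioned above, not the Hunt-Szymanski method); the function names and the two example sessions are chosen for illustration only.

```python
import math

def hamming(a, b):
    """H(A, B): number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|A||B|) dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = lcs_distance(s1, s2)
```

Unlike the Hamming distance, `lcs_distance` accepts sequences of different lengths, which is why it can be plugged into a distance-based detector for session data like the example above.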


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary)
  Store all unique windows in all normal sequences and their counts

Testing
  For each window, find the frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low-frequency windows in the normal dictionary
  Rewards high-frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
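
A minimal sketch of the window dictionary follows. The aggregation is one possible reading of the slide: the score is the inverse of the mean relative window frequency, with a small additive term (an assumption made here) so that sequences with no known window do not divide by zero.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def train_dictionary(normal_seqs, k):
    """Normal dictionary: unique windows of the normal sequences with their counts."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, counts, k):
    """Inverse of the length-normalized aggregate window frequency."""
    total = sum(counts.values())
    ws = windows(seq, k)
    mean_freq = sum(counts[w] for w in ws) / (len(ws) * total)
    return 1.0 / (mean_freq + 1.0 / total)  # the 1/total term is an assumed smoothing

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
]
suspect = "login pwd login pwd login pwd logout".split()
counts = train_dictionary(normal, k=3)
score_normal = anomaly_score(normal[0], counts, k=3)
score_suspect = anomaly_score(suspect, counts, k=3)
```

None of the windows of the repeated-login session appears in the dictionary, so it receives the maximum score under this scheme.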


Using Probabilistic Models

Probability of occurrence of sequence $S$

$$ P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1}) $$

Short memory property of sequences

$$ P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1}) $$

Conditional probability estimates for a symbol $S_i$

$$ P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})} $$

$f$ is the frequency of a subsequence, estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?
  Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees
  Significantly reduces the size of the model
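
The fixed-order version of this model can be sketched with two counters. The code below is an illustration under stated assumptions: it uses a fixed memory length `k` rather than a probabilistic suffix tree, assigns unseen events a small floor probability `eps` (an assumption), and reports the negative mean log-probability, which increases exactly when the inverse of the length-normalized probability increases.

```python
import math
from collections import Counter

def train_counts(normal_seqs, k):
    """f(.) for contexts of length k and for contexts followed by the next symbol."""
    context, full = Counter(), Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            context[tuple(seq[i - k:i])] += 1
            full[tuple(seq[i - k:i + 1])] += 1
    return context, full

def anomaly_score(seq, context, full, k, eps=1e-6):
    """Negative mean log P(S_i | S_{i-k} ... S_{i-1}) over the symbols after the first k."""
    log_p = 0.0
    for i in range(k, len(seq)):
        c = context[tuple(seq[i - k:i])]
        f = full[tuple(seq[i - k:i + 1])]
        p = f / c if c and f else eps  # assumed floor for unseen contexts or symbols
        log_p += math.log(p)
    return -log_p / (len(seq) - k)

normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
]
suspect = "login pwd login pwd login pwd logout".split()
context, full = train_counts(normal, k=2)
score_normal = anomaly_score(normal[0], context, full, k=2)
score_suspect = anomaly_score(suspect, context, full, k=2)
```

The floor probability is exactly the weak spot the slide points out: a suffix-tree model would instead back off to the longest suffix with enough support.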


Finding Contextual Anomalies

Sometimes contextual information is available about data
  Not used directly as a feature
  Are well understood, no anomalies in the context
  Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: Number of Fever Cases plotted against Maximum Daily Temperature, with the "Clearest Outlier" and a "Conditional Anomaly" marked]

Contextual anomalies - anomalous with respect to a context

Context is defined using environmental variables
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

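Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$
  $d_U$ environmental attributes
  $d_V$ indicator attributes

[Figure: Gaussians $U_1$, $U_2$, $U_3$ in the environmental space (axes $u_1$, $u_2$) mapped to Gaussians $V_1$, $V_2$, $V_3$ in the indicator space (axes $v_1$, $v_2$), with $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$]

Algorithm [30]
  1. Learn a Gaussian Mixture Model (GMM) $U = U_1, U_2, \ldots, U_{n_U}$, each with dimensionality $d_U$
  2. Learn a set of Gaussians $V = V_1, V_2, \ldots, V_{n_V}$, each with dimensionality $d_V$
  3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
  4. Score a test instance $d = [u, v]$

$$ S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i) $$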
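
Step 4 can be illustrated once the Gaussians and the mapping are given. The sketch below skips the learning steps entirely: the one-dimensional Gaussians and the mapping matrix are hand-set, hypothetical values inspired by the temperature and fever example, and `cad_score` is a name chosen for this illustration.

```python
import math

def gauss(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) sum_j p(v|V_j) p(V_j|U_i); a low S suggests an anomaly."""
    return sum(
        gauss(u, *U[i]) * sum(gauss(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )

# Hypothetical parameters as (mean, std); in the algorithm these are learned.
U = [(10.0, 3.0), (30.0, 3.0)]      # environmental: cold days, hot days
V = [(50.0, 10.0), (5.0, 3.0)]      # indicator: many fever cases, few fever cases
mapping = [[0.9, 0.1], [0.1, 0.9]]  # p(V_j | U_i): cold -> many, hot -> few

typical = cad_score(10.0, 50.0, U, V, mapping)        # many cases on a cold day
conditional = cad_score(30.0, 50.0, U, V, mapping)    # many cases on a hot day
```

The indicator value is the same in both calls; only the context changes, and that alone lowers the score of the second instance.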


Finding Collective Anomalies

Find a collection of data points
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples
  1. Time series
  2. Image
  3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

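Find anomalous groups in data [34]

Example: spatial clusters of galaxies
  Topics: red, green, emissive
  Words: continuous features

Flexible Genre Model (FGM)

For each group
  1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$
  2. Draw topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$
  3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
  4. For each point in group
     1. Draw topic membership $z_{mn} \sim M(\theta_m)$
     2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

[Figure: plate diagram of the FGM over the variables $\pi$, $\alpha$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$ with plates $N$, $M$, $K$, $T$]

Model Parameters
  $M(\pi)$ - Multinomial
  Genre - $Dir(\alpha_t)$
  Topic generators $P(\cdot \mid \eta_k)$ - GIW
  Point generators $P(x_{mn} \mid \beta_{m,k})$ - Multivariate Gaussian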


Inference and Testing for FGM

Inference and Learning Parameters
  Approximate inference of latent variables (Gibbs sampling)
  Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: panels labeled Topic 1, Topic 2, Topic 3]

Anomaly Detection
  Infer the topic distribution $\theta_m$
  Compute negative log likelihood w.r.t. $\alpha_t$
  Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal)

             Predicted a   Predicted n
Actual a     TP            FN
Actual n     FP            TN

$$ Acc = \frac{TP + TN}{TP + FN + FP + TN} $$

$$ Rec\ (R) = \frac{TP}{TP + FN} $$

$$ Prec\ (P) = \frac{TP}{TP + FP} $$

$$ F = \frac{2 \cdot R \cdot P}{R + P} $$
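
A small worked example makes the class imbalance issue concrete. The counts below are hypothetical: 1000 instances, 10 of them anomalous, and a detector that finds half of the anomalies while raising five false alarms.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion matrix counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f

# Hypothetical counts for a rare-anomaly setting.
acc, rec, prec, f = metrics(tp=5, fn=5, fp=5, tn=985)
```

Accuracy comes out at 0.99 even though recall, precision and F are all 0.5, which is why accuracy alone is a poor guide when anomalies are rare.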


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use threshold $\delta$ on score (scale issues [21])
  Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output
  Does the output "suggest" $x$ or $\delta$?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  Regularity: $S(o) \ge 0 \;\forall o$, $S(o) \approx 0$ if $o$ is normal and $S(o) \gg 0$ if $o$ is anomalous
  Normality: $S$ is regular and $S(o) \in [0, 1] \;\forall o$


Regularization and Normalization of Scores

Regularization
  1. $R(o) = \max\{0, S(o) - base_S\}$
  2. $R(o) = S_{max} - S(o)$
  3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization
  1. $N(o) = \frac{S(o)}{S_{max}}$
  2. $N(o) = \max\left(0, \operatorname{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
     Suited for high dimensional data
  3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
     Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
     Suited for low dimensional data
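
Gaussian scaling needs only the mean and standard deviation of the scores. The sketch below applies it to a handful of made-up LOF-like scores; using the population standard deviation (`pstdev`) is an assumption, since the slide does not say which estimator is meant.

```python
import math
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2)))), mapped into [0, 1]."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores identical: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

# Hypothetical LOF-like scores: five ordinary instances and one outlier.
lof_scores = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]
scaled = gaussian_scaling(lof_scores)
```

Scores at or below the mean collapse to 0, while the single large score is stretched towards 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior described earlier.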


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
  Use generative models for normal and anomalous behavior
  Several generators available
    Multivariate continuous data [27]
    Multivariate categorical data [5]
    Discrete sequences using HMM [12]
  Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
  Given data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space
  Take a random sample $D$ of the entire data set
  Let $z' = \{x', y'\} \in D$ such that the distance between $y$ and $y'$ is maximum
  Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane, with the injected anomaly marked]
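
The perturbation step can be sketched as follows. The representation is an assumption made for the example: each data point is a pair `(x, y)` of numeric tuples, distance is Euclidean, and the function returns the perturbed point rather than modifying the data set in place.

```python
import math
import random

def inject_anomaly(data, z_index, sample_size, rng):
    """Swap the x-part of data[z_index] with that of the sampled point whose y-part is farthest."""
    x, y = data[z_index]
    sample = rng.sample(data, sample_size)  # random sample D of the data set
    x_far, _ = max(sample, key=lambda z: math.dist(y, z[1]))
    return (x_far, y)  # x replaced by x', y kept

# Hypothetical data where y = 2x, so the x and y parts are strongly related.
data = [((float(t),), (2.0 * t,)) for t in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data, z_index=0, sample_size=len(data), rng=rng)
```

The injected point keeps a perfectly ordinary `y` and receives a perfectly ordinary `x`, yet the combination breaks the relation between the two parts, which is what a contextual detector is expected to pick up.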


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period, vertical axis from 0 to 3,000,000]

Period                  Count
2002                    20,547
2003                    18,827
2004                    69,107
2005                    113,025
2006                    140,690
2007                    624,267
2008                    1,656,227
July 2009 - June 2010   2,704,836


Network Topology

A typical network

[Figure: a router connected to nodes 1 to 4, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

Link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

Relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why care about OD Flows

Volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, volume arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components with the highest eigenvalues as matrix $P$

Step 3: Normal traffic subspace, called $\hat{Z}$

$$ \hat{Z} = P P^T Z = C Z $$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$ \tilde{Z} = (I - P P^T) Z = \tilde{C} Z $$

Step 5: If the norm of a residual vector $\tilde{Z}$ is large, then it is an "anomaly"
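
The five steps can be traced on a tiny example. To stay dependency-free, the sketch below makes two simplifying assumptions: it keeps only one principal component ($k = 1$) and finds it by power iteration instead of a full eigenvalue decomposition. The input rows are assumed to be mean-centered already, and the two-link traffic values are made up.

```python
import math

def top_component(rows, iters=200):
    """Leading eigenvector of the covariance matrix of mean-centered rows (power iteration)."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[i] * r[j] for r in rows) / n for j in range(d)] for i in range(d)]
    p = [1.0] * d
    for _ in range(iters):
        q = [sum(cov[i][j] * p[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in q))
        p = [x / norm for x in q]
    return p

def residual_norm(z, p):
    """Norm of (I - p p^T) z, the part of z outside the normal subspace."""
    proj = sum(a * b for a, b in zip(z, p))
    return math.sqrt(sum((a - proj * b) ** 2 for a, b in zip(z, p)))

# Hypothetical centered measurements on two links that normally move together.
rows = [(-2.0, -4.0), (-1.0, -2.0), (0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
p = top_component(rows)
normal_residual = residual_norm((1.0, 2.0), p)   # follows the usual pattern
anomaly_residual = residual_norm((0.0, 5.0), p)  # traffic on one link only
```

A threshold on the residual norm then separates the two test vectors, even though their overall magnitudes are comparable.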


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

Most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure: port scan anomaly (the traffic feature changes, however traffic volume remains more or less the same). Annotations: "Port scan dwarfed in volume metrics..." and "But stands out in feature entropy, which also reveals its structure". Figure source: [22]]

We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed
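
The two extreme values of the entropy are easy to check numerically. The sketch below computes the sample entropy of one traffic feature; the choice of destination port as the feature and the port numbers themselves are made up for illustration.

```python
import math
from collections import Counter

def entropy(values):
    """Sample entropy (in bits) of the empirical distribution of a traffic feature."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

# Hypothetical destination ports seen in one time bin.
concentrated = [80] * 8                # all traffic goes to one port
dispersed = list(range(1024, 1032))    # a scan touching 8 distinct ports once each
h_concentrated = entropy(concentrated)
h_dispersed = entropy(dispersed)
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$ bits.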


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Domains and players
  Insurance (auto, health [a]): Claimant, Provider, Payer
  Telecommunications: Customer, Provider
  Credit Cards: Customer, Supplier, Bank
  Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     Example: user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  Identify relevant feature pairs and power law relationship
  E.g., Egonet Density Power Law: $N_i$ vs $E_i$ - detect near cliques and stars

$$ E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2 $$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$)

$$ S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right) $$
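
The score is simple to evaluate once the power law has been fitted. In the sketch below the fitted constants `C` and `theta` are hypothetical (no fitting is performed), and the egonet sizes for the star and the clique are derived by counting: a star with 10 neighbors has 10 edges, a clique of a node with its 10 neighbors has $10 + 45 = 55$ edges.

```python
import math

def power_law_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical fitted law E = N^1.35 and three egonets with N = 10 neighbors each.
C, theta = 1.0, 1.35
typical = power_law_score(10, 22, C, theta)  # close to the fitted curve
star = power_law_score(10, 10, C, theta)     # far fewer edges than expected
clique = power_law_score(10, 55, C, theta)   # far more edges than expected
```

Because the ratio uses max over min, deviations below the curve (stars) and above it (near cliques) both raise the score.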


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

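Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by the edge set $E$.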
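The normality score can be sketched as follows, under two assumptions that the slide does not state: relevance is approximated by a random walk with restart (restart probability 0.15), and the similarity entry for a pair $(a, b)$ is simply the relevance of $b$ with respect to $a$. Function names are hypothetical, and node labels must be unique across $V_1$ and $V_2$.

```python
def relevance(adj, nodes, start, restart=0.15, iters=100):
    """Visit probabilities of a random walk with restart at `start`."""
    p = {v: 0.0 for v in nodes}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for v, mass in p.items():
            for u in adj[v]:               # spread mass evenly over neighbors
                nxt[u] += (1 - restart) * mass / len(adj[v])
        nxt[start] += restart              # jump back to the query node
        p = nxt
    return p

def normality_score(edges, t, **kw):
    """Mean pairwise relevance among the V1 neighbors of node t."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = list(adj)
    s_t = sorted(adj[t])
    if len(s_t) < 2:
        return None                        # undefined for a single neighbor
    rel = {a: relevance(adj, nodes, a, **kw) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs)

# Two tight groups, (a1, a2) and (a3, a4); t5 bridges them.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"), ("a2", "t2"),
         ("a3", "t3"), ("a3", "t4"), ("a4", "t3"), ("a4", "t4"),
         ("a1", "t5"), ("a3", "t5")]
score_t1 = normality_score(edges, "t1")
score_t5 = normality_score(edges, "t5")
```

A node in $V_2$ whose neighbors are mutually irrelevant, such as `t5` here, receives a low normality score, and its links are the candidates for anomalous links.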


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks (e.g., an anthrax attack)

Data
  • Emergency department (ED) visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data (ED cases involving cough ⇒ flu or SARS)
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
  • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
  • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
  • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule); repeat from Step 2
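Step 4 can be illustrated for a single rule with a 2×2 contingency table, here assumed to have rows for $DB_{current}$ and $DB_{baseline}$ and columns for "rule matches" and "rule does not match". The sketch below is a plain Pearson $\chi^2$ test with one degree of freedom and no continuity correction; it is not the WSARE code, and it assumes all expected counts are non-zero.

```python
import math

def chi2_p_value(table):
    """Pearson chi-square test of independence for a 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            exp = rows[i] * cols[j] / n    # expected count under independence
            stat += (obs - exp) ** 2 / exp
    # With 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical counts: the rule matches 20 of 50 current records
# and 30 of 50 baseline records.
stat, p = chi2_p_value([[20, 30], [30, 20]])
```

A small p-value means the rule's frequency differs between the current and baseline data by more than chance would explain.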


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Needs more data to train


Anomaly Detection in Climate and Weather

Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period" (30-year moving window)

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
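Two of the base choices, monthly means and monthly z-scores, can be sketched as below. The input layout (one location, a flat list of monthly values starting in January, at least one full year) and the function name are assumptions; the z-score variant divides by the population standard deviation and would fail if a calendar month had zero variance.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, zscore=False):
    """Subtract each calendar month's mean; optionally divide by its std dev.

    values: monthly observations for one location, starting in January.
    """
    by_month = [values[m::12] for m in range(12)]    # all Januaries, all Februaries, ...
    base = [mean(v) for v in by_month]               # b_i per calendar month
    scale = [pstdev(v) if zscore else 1.0 for v in by_month]
    return [(v - base[i % 12]) / scale[i % 12] for i, v in enumerate(values)]

# Two years of data: the second year is uniformly 2 units higher.
values = list(range(12)) + [v + 2 for v in range(12)]
anoms = monthly_anomalies(values)
z_anoms = monthly_anomalies(values, zscore=True)
```

Subtracting a per-month base removes the seasonal cycle, which is why the first year comes out uniformly below and the second uniformly above the base.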


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatio-temporal anomalies in precipitation data

A two-step approach

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
  • MATLAB: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
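The two-step approach can be sketched without MATLAB as a breadth-first search over a (time, latitude, longitude) grid. The sketch assumes a nested-list grid of precipitation anomalies, treats values below a threshold as drought cells, and uses 6-connectivity (face neighbors only), which may differ from the connectivity used in the tutorial's experiments.

```python
from collections import deque

def connected_anomalies(grid, threshold, min_size=1):
    """Return connected components of cells with grid[t][y][x] < threshold.

    Components are lists of (t, y, x) cells, largest first.
    """
    T, Y, X = len(grid), len(grid[0]), len(grid[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if (t, y, x) in seen or grid[t][y][x] >= threshold:
                    continue
                queue, comp = deque([(t, y, x)]), []
                seen.add((t, y, x))
                while queue:                       # flood-fill one component
                    ct, cy, cx = queue.popleft()
                    comp.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                                and (nt, ny, nx) not in seen
                                and grid[nt][ny][nx] < threshold):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                if len(comp) >= min_size:
                    components.append(comp)
    return sorted(components, key=len, reverse=True)

# Two time steps on a 2x2 grid; the dry cells form one space-time component.
grid = [[[-2, 0], [0, 0]],
        [[-2, -2], [0, -3]]]
droughts = connected_anomalies(grid, threshold=-1)
```

Raising `min_size` keeps only the large, persistent components, which are the candidate drought events.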


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields

Methodology

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies are lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and its distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Scalability for Density-based method

The pruning rule for distance-based methods does not apply to density-based approaches

We can go from $O(n^2)$ to nearly $O(n \log n)$ by using an index

One solution to the curse of high dimensionality is to use random projections
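A Gaussian random projection, one common way to realize the last point, can be sketched as follows. The scaling by $1/\sqrt{k}$, the fixed seed, and the function name are illustrative assumptions; this is not the PINN implementation, which additionally builds an index in the projected space.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a Gaussian matrix.

    Entries are N(0, 1) / sqrt(k), so squared distances are preserved
    in expectation.
    """
    rng = random.Random(seed)
    d = len(points[0])
    R = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
    return [[sum(r[j] * p[j] for j in range(d)) for r in R] for p in points]

# Two random 200-dimensional points projected down to 100 dimensions.
rng = random.Random(1)
pts = [[rng.random() for _ in range(200)] for _ in range(2)]
proj = random_projection(pts, k=100)
ratio = math.dist(proj[0], proj[1]) / math.dist(pts[0], pts[1])
```

Nearest-neighbor queries can then run on the cheaper low-dimensional points, at the cost of a bounded distortion in distances.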


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN algorithm provides probabilistic guarantees

Under certain assumptions about the intrinsic dimensionality ($c$), with high probability

$$\frac{1 - \epsilon}{1 + \epsilon} \cdot LOF(p) \le \widehat{LOF}(p) \le \frac{1 + \epsilon}{1 - \epsilon} \cdot LOF(p)$$

where $\widehat{LOF}(p)$ denotes the LOF estimate computed by PINN

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust


Examples: high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples: high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples: local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
  • Using CUDA: Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  • Utilization of device memory
  • Minimize CPU ⇔ GPU transfer of data
  • Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

Figure: CUDA execution model (the host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2); each thread block holds threads Thread(0,0) to Thread(2,3)) and device memory layout (streaming multiprocessors SM 0 to SM 2, each with stream processors SP 0 to SP 7, registers, and shared memory; global, texture, and constant memory on the device; host memory on the host).


Implementing DB(kN) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k,N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j into shared memory
  • Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  • Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
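For reference, the serial $O(n^2)$ baseline that the GPU version parallelizes can be written in a few lines. This is a CPU sketch only (it says nothing about the CUDA tiling), the function name is hypothetical, and it assumes $k \le n - 1$ and Python 3.8+ for `math.dist`.

```python
import heapq
import math

def db_k_n(points, k, n_top):
    """Return indices of the n_top points with the largest k-th NN distance."""
    scores = []
    for i, p in enumerate(points):
        dists = [math.dist(p, q) for j, q in enumerate(points) if j != i]
        kth = heapq.nsmallest(k, dists)[-1]    # distance to the k-th nearest neighbor
        scores.append((kth, i))
    return [i for _, i in heapq.nlargest(n_top, scores)]

# Four points on a unit square and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = db_k_n(points, k=2, n_top=1)
```

The inner loop over all pairs is the part that maps naturally onto one GPU thread per pair of instances.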


Moving Beyond Multi-dimensional Record Data

Categorical (mixed): fraud detection, cyber networks

Time series: sensor networks, healthcare

Discrete sequences: genomic, system calls

Spatial: GIS, image analysis

Spatio-temporal: remote sensing, climate

Graphs: social networks, epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  • Binarize data
  • Learn rules ($X \Rightarrow Y$)
    • Choose high-confidence rules ($P(Y|X)$)
  • For a test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
    • Anomalous if $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]
  • Learn Bayesian network structure and parameters
  • Compute $P(Z)$ for test data record $Z$
  • Flag an anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  • Use a similarity measure ($S(X_1, X_2)$)
  • Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$$

  • $A \cap B = \emptyset$
  • Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$
  • For each mutually exclusive pair of attribute sets $A$, $B$, compute $r(a_t, b_t)$
  • Score $t$ based on all r-values
    • Assign the minimum r-value as the score
    • Take the product of all r-values

Need to compare exponentially many pairs of subsets
  • Only consider subsets up to size $k$
  • Ignore subsets with frequency less than a threshold $\alpha$
  • Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_{\mu}$ ($\mu$: mutual information between $A$ and $B$)


Estimating Probabilities for CPT [18]

Maximum likelihood estimation

$$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  • $C(a_t)$: number of training instances with $A = a_t$
  • $N$: total number of training instances

Laplace smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup tricks
  • Replace rare attribute values with a generic attribute value (reduce arity)
  • Use an efficient data structure to query for counts (AD Trees [26])
    • AD Trees work faster for low-arity attributes
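The smoothed r-value can be sketched for the simplest case, where $A$ and $B$ are single attributes (subset size $k = 1$) and the record score is the minimum r-value. Plain counters stand in for AD Trees, and the function name and tuple-based record format are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def cpt_scores(train, test):
    """Minimum Laplace-smoothed r-value over all attribute pairs, per test record."""
    n = len(train)
    n_attr = len(train[0])
    single = Counter((i, r[i]) for r in train for i in range(n_attr))
    pair = Counter((i, j, r[i], r[j]) for r in train
                   for i, j in combinations(range(n_attr), 2))

    def E(count):
        return (count + 1) / (n + 2)       # E(p) = (C(p) + 1) / (N + 2)

    scores = []
    for t in test:
        r_values = [E(pair[(i, j, t[i], t[j])]) /
                    (E(single[(i, t[i])]) * E(single[(j, t[j])]))
                    for i, j in combinations(range(n_attr), 2)]
        scores.append(min(r_values))       # low r => unusual combination
    return scores

# Hypothetical two-attribute training data (cap-shape, type).
train = [("convex", "poisonous")] * 10 + [("bell", "edible")] * 10
test = [("convex", "poisonous"), ("convex", "edible")]
scores = cpt_scores(train, test)
```

The second test record combines two values that are individually common but never co-occur in training, so its r-value is far below 1.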


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey

Example sequences:

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
  • Plug into a distance/density/clustering-based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

  • Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$$

  • Standard dynamic programming method is slow
  • Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  • Cannot localize anomalies within a sequence
  • Weak anomaly signals might get lost
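Both distances are short to write down. The sketch below uses the standard quadratic dynamic program for the LCS length (not the faster Hunt-Szymanski method) and assumes non-empty sequences; the function names are illustrative.

```python
import math

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """Length of the longest common subsequence (O(|a||b|) dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A||B|)."""
    return 1 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = nlcs_distance(s1, s2)
```

Unlike the Hamming distance, the LCS-based distance handles the unequal lengths of `s1` and `s2` directly.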


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence of length n (n − k + 1 windows)

Training (creating a normal dictionary):
  Store all unique windows in all normal sequences, together with their counts

Testing:
  For each window, find its frequency in the normal dictionary
  The anomaly score is the inverse of the aggregate frequency over all windows (normalized by length)

Many variants exist:
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:
  Penalizes low-frequency windows in the normal dictionary
  Rewards high-frequency windows that might not be relevant
  One can construct anomalous sequences that will escape detection
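
A minimal sketch of the dictionary-based scheme above. The exact form of the "inverse of the aggregate frequency" is not given on the slide, so the score used here, 1 / (1 + mean window count), is an assumption chosen only to keep the value finite and in (0, 1]; all function names are hypothetical.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    seq = tuple(seq)
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Training: count every window seen in the normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def window_score(seq, dictionary, k):
    """Testing: assumed score 1 / (1 + mean count of the test windows)."""
    wins = windows(seq, k)
    if not wins:
        raise ValueError("sequence is shorter than the window size")
    mean_count = sum(dictionary[w] for w in wins) / len(wins)
    return 1.0 / (1.0 + mean_count)
```

A sequence made only of unseen windows scores 1.0; sequences built from frequent normal windows score close to 0, which is exactly the behavior the "Issues" list warns can be gamed.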


Using Probabilistic Models

Probability of occurrence of a sequence S:

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences:

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimate for a symbol $S_i$:

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

where f, the frequency of a subsequence, is estimated from the normal dictionary

The anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

This also significantly reduces the size of the model
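
The estimate above, together with the "longest sufficiently frequent suffix" fallback, can be sketched as follows. This is a simplified stand-in for a probabilistic suffix tree, not the method of [32]: the names, the `min_count` threshold, the probability floor for unseen symbols, and the use of the mean negative log-probability as the score are all assumptions.

```python
import math
from collections import Counter

def train_counts(sequences, k):
    """Count every subsequence of length 1..k+1 in the normal data."""
    counts = Counter()
    for seq in sequences:
        seq = tuple(seq)
        for length in range(1, k + 2):
            for i in range(len(seq) - length + 1):
                counts[seq[i:i + length]] += 1
        counts[()] += len(seq)  # empty context: total number of symbols
    return counts

def cond_prob(counts, context, symbol, min_count=1, floor=1e-6):
    """f(context + symbol) / f(context), shortening rare contexts."""
    context = tuple(context)
    while context and counts[context] < min_count:
        context = context[1:]  # drop the oldest symbol, keep the longest suffix
    denom = counts[context]
    if denom == 0:
        return floor
    return max(counts[context + (symbol,)] / denom, floor)

def anomaly_score(counts, seq, k, min_count=1):
    """Mean negative log-probability per symbol (higher is more anomalous)."""
    seq = tuple(seq)
    if not seq:
        raise ValueError("sequence must be non-empty")
    log_p = 0.0
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - k):i]
        log_p += math.log(cond_prob(counts, context, symbol, min_count))
    return -log_p / len(seq)
```

Working in log space avoids underflow in the product over symbols, and dividing by the length plays the role of the normalization mentioned above.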


Finding Contextual Anomalies

Sometimes contextual information is available about the data
  Not used directly as a feature
  Well understood, with no anomalies in the context
  Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: number of fever cases vs. maximum daily temperature, with one point labeled "Clearest Outlier" and another labeled "Conditional Anomaly"]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables:
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context:
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

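Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussians $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to Gaussians $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]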

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j | U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j) \, p(V_j | U_i)$
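
The scoring step (step 4) can be sketched with NumPy, assuming the Gaussians and the mapping have already been learned. The function names and the representation of each Gaussian as a `(mean, covariance)` pair are assumptions made for illustration; fitting the mixtures and the mapping is not shown.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian at x."""
    x = np.asarray(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    diff = x - mean
    quad = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** x.size * np.linalg.det(cov))
    return float(np.exp(-0.5 * quad) / norm)

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) sum_j p(v|V_j) p(V_j|U_i).

    U, V: lists of (mean, cov) pairs; mapping[i, j] = p(V_j | U_i).
    A low S means the indicator values v are unexpected in context u.
    """
    p_u = np.array([gaussian_pdf(u, m, c) for m, c in U])
    p_v = np.array([gaussian_pdf(v, m, c) for m, c in V])
    return float(p_u @ (np.asarray(mapping, dtype=float) @ p_v))
```

With this form, an instance whose indicator part v is likely only under Gaussians that are rarely paired with its environmental component receives a low score even when v by itself is not unusual.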


Finding Collective Anomalies

Find a collection of data points
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when the data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples

1. Time series
2. Image
3. Spatial clusters of galaxies
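
The three-step "simple solution" can be illustrated on a time series. In this sketch the groups are fixed-size, non-overlapping windows, the features are the window mean and standard deviation, and the "traditional" detector is simply the distance to the median feature vector; all three choices are assumptions made for the example.

```python
import statistics

def group_features(series, size):
    """Steps 1 and 2: cut the series into windows, compute (mean, std) per window."""
    groups = [series[i:i + size] for i in range(0, len(series) - size + 1, size)]
    return [(statistics.fmean(g), statistics.pstdev(g)) for g in groups]

def group_scores(features):
    """Step 3: distance of each feature vector from the per-feature median."""
    median = [statistics.median(column) for column in zip(*features)]
    return [
        sum((f - m) ** 2 for f, m in zip(feature, median)) ** 0.5
        for feature in features
    ]
```

For example, in a series that alternates between 0 and 1, a flat stretch of 0.5 values contains no individually unusual point, yet its window has zero standard deviation and therefore stands out.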


Using Latent Dirichlet Allocation for Group Anomalies

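Find anomalous groups in data [34]

Example: spatial clusters of galaxies
  Topics: red, green, emissive
  Words: continuous features

Flexible Genre Model (FGM)

For each group m:

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim M(\pi)$
2. Draw the topic distribution for genre $y_m$: $\theta_m \in S^K$, $\theta_m \sim Dir(\alpha_{y_m})$
3. Draw K topics: $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point n in the group:
  1. Draw topic membership $z_{mn} \sim M(\theta_m)$
  2. Generate the point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates N, M, K, T]

Model Parameters

$M(\pi)$: multinomial

Genre: $Dir(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$: GIW (Gaussian-Inverse-Wishart)

Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian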


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: groups plotted in topic space, with axes Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log-likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group is unlikely to be generated from any genre

Geometric interpretation: map each group into a T-dimensional space and find anomalies there
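
One plausible reading of the scoring step is sketched below: given an inferred topic distribution for a group and the learned Dirichlet parameters of each genre, score the group by its negative log-likelihood under the best-fitting genre. Taking the minimum over genres is an assumption that follows the stated rationale; it is not taken from [34], and the function names are hypothetical.

```python
import math

def dirichlet_log_pdf(theta, alpha):
    """Log-density of Dir(alpha) at theta (all entries of theta must be > 0)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t) for a, t in zip(alpha, theta))

def group_anomaly_score(theta, genre_alphas):
    """Negative log-likelihood of theta under the most likely genre."""
    return min(-dirichlet_log_pdf(theta, alpha) for alpha in genre_alphas)
```

A group whose topic mixture resembles some genre gets a low score; a mixture far from every genre gets a high one.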


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist:
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomaly, n = normal):

              Predicted a   Predicted n
  Actual a    TP            FN
  Actual n    FP            TN

$Acc = \frac{TP + TN}{TP + FN + FP + TN}$

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
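
The four metrics follow directly from the confusion-matrix counts. The helper below is a small sketch (the name and the convention of returning 0.0 for an empty denominator are assumptions) that also makes the class-imbalance problem visible.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    def ratio(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "recall": recall,
        "precision": precision,
        "f": ratio(2 * recall * precision, recall + precision),
    }
```

With 10 anomalies among 100 instances, a detector that flags nothing (`evaluation_metrics(0, 10, 0, 90)`) still reaches an accuracy of 0.9 while its recall and F-measure are 0, which is why accuracy alone is a poor guide for anomaly detection.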


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
  Use a threshold δ on the score (scale issues [21])
  Take the top x as anomalies

ROC curve by varying x or δ

Quality of output:
  Does the output “suggest” x or δ?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges:
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized procedure for normalizing outlier scores [21]:
  Regularity: $S(o) \ge 0 \; \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  Normality: S is regular and $S(o) \in [0, 1] \; \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
   Suited for high-dimensional data
3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)
   where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function
   Suited for low-dimensional data
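
Gaussian scaling needs only the mean and standard deviation of the (regularized) scores. The sketch below uses the population standard deviation and returns all zeros when the scores are constant; both are assumptions of this example rather than prescriptions from [21].

```python
import math
import statistics

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # constant scores carry no signal
    return [
        max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0))))
        for s in scores
    ]
```

Scores at or below the mean map to 0, so the [0, 1] range is spent entirely on the upper tail, which is the "stretch interesting ranges, shrink irrelevant ones" behavior asked for earlier.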


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators are available:
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawback: might not capture the domain characteristics

Injecting Anomalies: Random Perturbation [30]

Given a data point $z = [x, y]$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = [x', y'] \in D$ be such that the distance between y and y' is maximum

Replace x with x' and add z back to the data set

[Figure: points z and z' in the (x, y) plane, with the injected anomaly marked]
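
The random perturbation procedure can be sketched as below. Each data point is represented as a pair `(x, y)` of tuples, and Euclidean distance is used to compare the y parts; the representation, the distance, and the function name are assumptions of this example.

```python
import math
import random

def inject_anomaly(z, data, sample_size, rng):
    """Swap the x part of z with that of the sampled point whose y is farthest.

    z and every element of data are (x, y) pairs of numeric tuples.
    """
    x, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: math.dist(y, point[1]))
    return (x_far, y)

# Usage: rng = random.Random(0); inject_anomaly(data[0], data, 50, rng)
```

The injected point keeps realistic values in each partition, but the combination of x' with y is one the data rarely or never exhibits.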


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available?
  Feature identification
  Defining normal and anomalous behavior
  Tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 to June 2010: 2,704,836]


Network Topology

A Typical network

[Figure: a router connected to four numbered nodes, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of the matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an “anomaly”
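
The five steps can be sketched with NumPy as follows. The function name is hypothetical, observations are stored as rows (so the projection is written Z P P^T rather than P P^T Z), and the data is mean-centered first, which the slide does not state explicitly.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Norm of each row's residual after removing the top-k principal subspace."""
    Zc = Z - Z.mean(axis=0)                    # center the measurements
    cov = np.cov(Zc, rowvar=False)             # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric input, ascending order
    top = np.argsort(eigvals)[::-1][:k]        # Step 2: top-k components
    P = eigvecs[:, top]
    normal_part = Zc @ P @ P.T                 # Step 3: normal subspace
    residual = Zc - normal_part                # Step 4: abnormal subspace
    return np.linalg.norm(residual, axis=1)    # Step 5: threshold these norms
```

Rows whose residual norm exceeds a chosen threshold are reported as anomalies; how to pick k and the threshold is left open here.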


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

  DoS attack: multiple source IP addresses concentrated on a single destination IP address

  Network scan: dispersed distribution of destination addresses

  Most worms/viruses also induce some change in the distribution of certain features

  However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure: port scan anomaly; the traffic features change while the traffic volume remains more or less the same. Annotations: "Port scan dwarfed in volume metrics…" and "But stands out in feature entropy, which also reveals its structure". Figure source: [22]]

We can use entropy to capture the variations in the traffic features
  Takes value 0 when the distribution is maximally concentrated
  Takes value $\log_2 N$ when the distribution is maximally dispersed
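
The entropy of an observed feature (for example, the destination ports seen in one time bin) can be computed as in this short sketch; the function name is an assumption.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy in bits: -sum_i p_i log2 p_i over distinct values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A port scan touches many distinct destination ports, pushing the port entropy toward $\log_2 N$, while a flood aimed at a single host concentrates the destination-address distribution and pulls that entropy toward 0.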


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data:
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges:
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Domains and players:
  Insurance (auto, health [a]): claimant, provider, payer
  Telecommunications: customer, provider
  Credit cards: customer, supplier, bank
  Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]:
1. Build profiles for individuals (customers, users, etc.) based on historic data
   Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile to find significant deviations

Clustering based [4]:
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses:
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives
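
Activity monitoring can be illustrated with a per-user profile made of a mean and a standard deviation. This is a deliberately simple sketch, not the method of [13]: the profile contents, the z-score style deviation, and the function names are all assumptions.

```python
import statistics

def build_profiles(history):
    """history maps a user to a list of past activity counts (e.g. calls per month)."""
    return {
        user: (statistics.fmean(values), statistics.pstdev(values))
        for user, values in history.items()
    }

def deviation(profiles, user, current):
    """How many standard deviations the current activity is from the user's mean."""
    mean, std = profiles[user]
    if std == 0:
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / std
```

Scoring a new observation is a dictionary lookup plus one division, which is why the approach suits real-time use; the cost is one profile per individual that has to be kept current.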


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node i
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:

Identify relevant feature pairs and their power-law relationship

E.g., Egonet Density Power Law ($N_i$ vs. $E_i$): detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
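
A small sketch of the two ingredients for the $N_i$ vs. $E_i$ pair: counting egonet features on an undirected graph stored as an adjacency dictionary, and evaluating the score for a given fitted law. The function names and the graph representation are assumptions, and fitting C and θ (typically by least squares in log-log space) is not shown.

```python
import math

def egonet_counts(adj, node):
    """Return (N_i, E_i): the degree and the number of edges in the egonet.

    adj maps each node to the set of its neighbors (undirected graph).
    """
    members = adj[node] | {node}
    edges = {
        frozenset((u, v))
        for u in members
        for v in adj[u]
        if v in members
    }
    return len(adj[node]), len(edges)

def power_law_score(y, x, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1).

    Assumes y > 0 and C x^theta > 0, which holds for non-isolated nodes.
    """
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)
```

A star has $E_i = N_i$ and a clique has $E_i$ growing quadratically in $N_i$, so nodes at either extreme deviate from a fitted exponent between 1 and 2 and receive a high score.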


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
  Healthcare data (beneficiaries vs. providers)
  Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the “relevance” of all other nodes in $V_1$ to a

RelevanceScore(a, b) ∝ number of times a “random walk” from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$:
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges E]
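
One way to realize the relevance and normality scores is a random walk with restart, sketched below with NumPy. The restart probability, the fixed number of power iterations, the function names, and the use of the relevance values themselves as similarity entries are assumptions of this example; every node is assumed to have at least one edge.

```python
import numpy as np

def relevance_matrix(edges, n1, n2, restart=0.15, iters=200):
    """R[a, b]: steady-state probability that a restarting walk from a is at b.

    edges: list of (a, t) pairs with a in range(n1) and t in range(n2).
    """
    n = n1 + n2
    A = np.zeros((n, n))
    for a, t in edges:
        A[a, n1 + t] = A[n1 + t, a] = 1.0
    W = A / A.sum(axis=0, keepdims=True)       # column-stochastic transitions
    R = np.zeros((n1, n1))
    for a in range(n1):
        e = np.zeros(n)
        e[a] = 1.0
        r = e.copy()
        for _ in range(iters):                 # power iteration
            r = restart * e + (1.0 - restart) * (W @ r)
        R[a] = r[:n1]
    return R

def normality_score(R, edges, t):
    """Mean off-diagonal relevance among the V1 nodes linked to t."""
    members = sorted({a for a, tt in edges if tt == t})
    if len(members) < 2:
        return float("nan")
    sub = R[np.ix_(members, members)]
    return float(sub[~np.eye(len(members), dtype=bool)].mean())
```

A node t whose neighbors belong to otherwise unrelated parts of the graph gets a low normality score, marking its links as candidates for anomalies.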


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
  Anthrax attack

Data:
  Emergency department (ED) visits
  Grocery data (example: [14])
  Clinical visits
  Weather/climate data

Challenges:
  Weak signals in the data
    ED cases involving cough ⇒ flu or SARS?
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute the contingency table for each rule for $DB_{baseline}$ and $DB_{current}$
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value, i.e. the most significant one. Repeat Step 2
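
Steps 3 to 5 can be illustrated for a single-assignment rule with a 2×2 contingency table. The sketch below computes the $\chi^2$ p-value with one degree of freedom using only the standard library and treats the lowest p-value as the strongest evidence against independence; the table layout, the rule names in the usage, and the function names are assumptions.

```python
import math

def chi2_p_value(table):
    """p-value of the chi-square independence test on a 2x2 table.

    table = [[a, b], [c, d]]: rows are DB_current and DB_baseline,
    columns are records matching and not matching the rule.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a degenerate table carries no evidence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

def most_significant_rule(tables):
    """tables maps a rule name to its 2x2 table; returns (name, p-value)."""
    return min(((name, chi2_p_value(t)) for name, t in tables.items()),
               key=lambda item: item[1])
```

Because many rules are tested, the winning p-value is optimistic; some correction for multiple testing (for example a randomization test) would be needed before raising an alarm.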


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships:
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:
  Identify natural and anthropogenic disasters
  Identify long-time-scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  Ground observations
  Remote sensing data (satellites, airborne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on “anomaly” time series:
  Difference from a “base period” (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena:
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter “reference period”
    30-year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
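
As an example of one base choice, the sketch below subtracts the monthly mean from each observation of a single location. The function name and the parallel-list input format are assumptions.

```python
import statistics
from collections import defaultdict

def monthly_anomalies(values, months):
    """v'_i = v_i - b_i, with b_i the mean of all values from the same calendar month."""
    by_month = defaultdict(list)
    for value, month in zip(values, months):
        by_month[month].append(value)
    base = {month: statistics.fmean(vs) for month, vs in by_month.items()}
    return [value - base[month] for value, month in zip(values, months)]
```

Swapping `statistics.fmean` for `statistics.median` gives the more robust median base; a z-score variant would additionally divide by the monthly standard deviation.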


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   Matlab: bwlabel, bwlabeln

Follow-up science question: explain the cause

Figure: video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
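
The two-step approach can be sketched in plain Python as a stand-in for the Matlab labeling functions. The input is assumed to be a nested list indexed as `precip[t][row][col]`, cells below the threshold are flagged, and neighbors are taken with 6-connectivity (shared faces in time and space); the connectivity, the input layout, and the function names are assumptions.

```python
from collections import deque

def flagged_cells(precip, threshold):
    """Step 1: cells (t, row, col) whose precipitation is below the threshold."""
    return {
        (t, r, c)
        for t, grid in enumerate(precip)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value < threshold
    }

def connected_components(cells):
    """Step 2: group flagged cells that touch in time or space, largest first."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    remaining = set(cells)
    components = []
    while remaining:
        start = remaining.pop()
        component = {start}
        queue = deque([start])
        while queue:                       # breadth-first flood fill
            t, r, c = queue.popleft()
            for dt, dr, dc in steps:
                neighbor = (t + dt, r + dr, c + dc)
                if neighbor in remaining:
                    remaining.remove(neighbor)
                    component.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)
```

Large components that span many time steps correspond to persistent dry regions; a minimum size or duration would then separate droughts from short dry spells.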


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution:
  Good validation data during design
  Clear definition of a domain anomaly and its distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of the International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


PINN Algorithm (ICDM 2010)


PINN Guarantee

The PINN Algorithm provides probabilistic guarantees

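Under certain assumptions about intrinsic dimensionality ($c$), with high probability:

$$\frac{1-\epsilon}{1+\epsilon} \cdot \mathrm{LOF}(p) \;\le\; \widehat{\mathrm{LOF}}(p) \;\le\; \frac{1+\epsilon}{1-\epsilon} \cdot \mathrm{LOF}(p)$$

where $\widehat{\mathrm{LOF}}(p)$ is the LOF of $p$ computed by PINN on the randomly projected data.

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust.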


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
  Using CUDA (Compute Unified Device Architecture, Nvidia)

Need to re-engineer existing algorithms
  Utilization of device memory
  Minimize CPU↔GPU transfer of data
  Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model (the host launches a kernel on the device as a grid of thread blocks, each block holding a 2-D array of threads) and memory hierarchy (host memory; device with streaming multiprocessors SM 0 to SM 2, each with registers, stream processors SP 0 to SP 7, and shared memory; global, texture, and constant memory)]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j into shared memory
  Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
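For reference, the serial computation that the GPU version parallelizes can be sketched as follows. This is a plain CPU baseline under the assumption of Euclidean distance, not the tiled CUDA kernel described above; the function name `db_k_n` is hypothetical.

```python
import heapq
import math


def db_k_n(points, k, n_top):
    """Serial DB(k, N): top n_top points ranked by distance to their k-th NN.

    Returns a list of (kth_nn_distance, index) pairs, largest distance first.
    The double loop over points is the O(n^2) pairwise-distance step.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    return heapq.nlargest(n_top, scores)
```

In the GPU formulation each inner distance evaluation maps to one thread, and the per-point sort is replaced by a parallel sort over the output block.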


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth              urban     poisonous
convex      smooth              grasses   edible
bell        smooth              meadows   edible
convex      scaly               urban     poisonous
convex      smooth              grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  Binarize data
  Learn rules ($X \Rightarrow Y$)
    Choose high-confidence rules ($P(Y|X)$)
  For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
    Anomaly: $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]
  Learn Bayesian network structure and parameters
  Compute $P(Z)$ for test data record $Z$
  Flag an anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  Use a similarity measure ($S(X_1, X_2)$)
  Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values:

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: If $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$:
  For each mutually exclusive pair of attribute sets $A$, $B$, compute $r(a_t, b_t)$
  Score $t$ based on all r-values
    Assign the minimum r-value as the score
    Take the product of all r-values

Need to compare an exponential number of pairs of subsets
  Only consider subsets up to size $k$
  Ignore subsets with frequency less than a threshold $\alpha$
  Avoid comparing independent subsets of attributes: require $\mu(A, B) \ge \beta_\mu$

Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  $C(a_t)$: Number of training instances with $A = a_t$
  $N$: Total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks
  Replace rare attribute values with a generic attribute value (reduce arity)
  Use an efficient data structure to query for counts (AD-trees [26])
    AD-trees work faster for low-arity attributes
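The Laplace-smoothed r-value can be sketched directly from counts. This minimal version assumes attribute sets of size one only (pairs of single attributes), uses plain counters instead of AD-trees, and scores a record by its minimum r-value. The function name `cpt_score` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def cpt_score(train, record):
    """Minimum Laplace-smoothed r-value over all pairs of single attributes.

    train: list of equal-length tuples of categorical values.
    record: one tuple with the same attributes. Lower scores are more anomalous.
    """
    n = len(train)
    single, pair = Counter(), Counter()
    for row in train:
        for i, a in enumerate(row):
            single[(i, a)] += 1
        for i, j in combinations(range(len(row)), 2):
            pair[(i, row[i], j, row[j])] += 1

    def e(count):
        # Laplace smoothing: E(p) = (C(p) + 1) / (N + 2)
        return (count + 1) / (n + 2)

    r_values = []
    for i, j in combinations(range(len(record)), 2):
        joint = e(pair[(i, record[i], j, record[j])])
        marginals = e(single[(i, record[i])]) * e(single[(j, record[j])])
        r_values.append(joint / marginals)
    return min(r_values)
```

A record that pairs two individually common values never seen together gets a small r-value, while smoothing keeps the ratio finite when a count is zero.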


Finding Anomalies in Discrete Sequences

Many problem formulations:

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey

Example sequences:
  login, pwd, mail, ssh, mail, web, logout
  login, pwd, mail, web, web, web, web, logout
  login, pwd, mail, ssh, mail, web, web, logout
  login, pwd, web, mail, ssh, web, mail, logout
  login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure
  Plug into a distance/density/clustering-based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: Unequal lengths, misalignment

Normalized length of the longest common subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

  Standard dynamic programming method is slow
  Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
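Both distances can be sketched in a few lines. The LCS length below uses the standard quadratic dynamic program (the slow variant mentioned above, not the Hunt-Szymanski method), and the sketch assumes non-empty sequences. Function names are illustrative.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence via row-by-row DP."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|); assumes non-empty inputs."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))
```

Unlike the Hamming distance, `nlcs_distance` accepts sequences of different lengths, which is why it suits session logs such as the login examples.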


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): Store all unique windows in all normal sequences and their counts

Testing
  For each window, find its frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low-frequency windows in the normal dictionary
  Rewards high-frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
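The training and testing steps can be sketched as below. The aggregation is one possible reading of the description: the score is the inverse of the average relative window frequency. That choice, the function names, and the requirement that a test sequence has at least $k$ symbols are assumptions.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Normal dictionary: every window seen in normal data with its count."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k):
    """Inverse of the mean relative frequency of the sequence's windows."""
    wins = windows(seq, k)
    total = sum(dictionary.values())
    avg_freq = sum(dictionary[w] / total for w in wins) / len(wins)
    return 1.0 / avg_freq if avg_freq > 0 else float("inf")
```

The last issue listed above is visible here: a sequence stitched together from frequent windows scores as normal even if the sequence as a whole never occurs.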


Using Probabilistic Models

Probability of occurrence of sequence $S$:

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences:

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$:

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

where $f$ is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?
  Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  Significantly reduces the size of the model
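A fixed-order version of this model can be sketched with two counters. The score used here is the negative average log-probability, a monotone stand-in for the inverse of the normalized probability. Unseen contexts are handled with a small probability floor rather than the suffix-tree fallback; the floor value and the function names are assumptions.

```python
import math
from collections import Counter


def train_counts(normal_seqs, k):
    """Count k-symbol contexts and (k+1)-symbol continuations in normal data."""
    ctx, full = Counter(), Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            ctx[tuple(seq[i - k:i])] += 1
            full[tuple(seq[i - k:i + 1])] += 1
    return ctx, full


def anomaly_score(seq, ctx, full, k, floor=1e-6):
    """Negative mean log of P(S_i | previous k symbols); higher is more anomalous."""
    log_p, n = 0.0, 0
    for i in range(k, len(seq)):
        c = ctx[tuple(seq[i - k:i])]
        p = full[tuple(seq[i - k:i + 1])] / c if c else 0.0
        # Floor unseen transitions so the logarithm stays finite (an assumption).
        log_p += math.log(max(p, floor))
        n += 1
    return -log_p / n if n else 0.0
```

Averaging over the number of predicted symbols plays the role of length normalization, so long and short test sequences remain comparable.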


Finding Contextual Anomalies

Sometimes contextual information is available about the data
  Not used directly as a feature
  Well understood; no anomalies in the context
  Can reduce false positives and yield interesting anomalies

[Figure: Number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly. Example adapted from [30]]

Contextual anomalies: Anomalous with respect to a context

Context is defined using environmental variables
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context?
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

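Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$
  $d_U$ environmental attributes
  $d_V$ indicator attributes

[Figure: Gaussians $U_1$, $U_2$, $U_3$ over the environmental attributes ($u_1$, $u_2$) mapped to Gaussians $V_1$, $V_2$, $V_3$ over the indicator attributes ($v_1$, $v_2$), with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j|U_i)$
4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\, p(V_j|U_i)$$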
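The scoring step (step 4) can be illustrated in one dimension, assuming the Gaussians and the mapping have already been learned. The parameter values in the usage note are made up for illustration, and a low score $S$ is taken to indicate an anomaly. Learning (steps 1 to 3) is not shown.

```python
import math


def gauss(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    U, V: lists of (mu, sigma) for environmental and indicator Gaussians.
    mapping[i][j]: learned p(V_j | U_i), assumed given here.
    """
    return sum(
        gauss(u, *U[i]) * sum(gauss(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )
```

With hypothetical parameters `U = [(10, 2), (30, 2)]` for temperature, `V = [(5, 1), (50, 5)]` for fever cases, and `mapping = [[0.9, 0.1], [0.1, 0.9]]`, the instance `(10, 50)` scores lower than `(10, 5)`: many fever cases are unusual only in the cold-weather context.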


Finding Collective Anomalies

Find a collection of data points
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples

1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

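Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
  Topics: red, green, emissive
  Words: continuous features

Flexible Genre Model (FGM)

For each group $m$:

1. Draw a genre: $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$
2. Draw a topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw $K$ topics: $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
4. For each point in the group:
   1. Draw topic membership: $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point: $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates $N$, $M$, $K$, $T$]

Model Parameters
  $\mathcal{M}(\pi)$: Multinomial
  Genre: $\mathrm{Dir}(\alpha_t)$
  Topic generators $P(\cdot \mid \eta_k)$: GIW
  Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian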


Inference and Testing for FGM

Inference and Learning Parameters
  Approximate inference of latent variables (Gibbs sampling)
  Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection
  Infer the topic distribution $\theta_m$
  Compute the negative log-likelihood w.r.t. $\alpha_t$
  Rationale: An anomalous group is unlikely to be generated from any genre

Geometric interpretation: Mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted
              a     n
Actual   a    TP    FN
         n    FP    TN

$$\mathrm{Acc} = \frac{TP + TN}{\sum}, \qquad \mathrm{Rec}\ (R) = \frac{TP}{TP + FN}, \qquad \mathrm{Prec}\ (P) = \frac{TP}{TP + FP}, \qquad F = \frac{2 \cdot R \cdot P}{R + P}$$
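The four metrics follow directly from the confusion matrix, as in this minimal sketch. It assumes labels are given as two parallel lists with `"a"` marking anomalies; returning zero when a denominator is empty is a convention chosen here, not something the tutorial specifies.

```python
def evaluate(actual, predicted, positive="a"):
    """Accuracy, recall, precision, and F-measure for the anomaly class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = len(pairs) - tp - fn - fp

    acc = (tp + tn) / len(pairs)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"acc": acc, "rec": rec, "prec": prec, "f": f}
```

The class-imbalance caveat shows up immediately: on data with two anomalies among ten instances, a detector that labels everything normal still reaches 0.8 accuracy while its recall is zero.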


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use threshold $\delta$ on score (scale issues [21])
  Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output
  Does the output "suggest" $x$ or $\delta$?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalous instances

Desired scaling: Stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
  Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \cdot \sqrt{2}}\right)\right)$ (Gaussian scaling)
   Suited for high-dimensional data
3. $N(o) = \max\left(0,\ \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma scaling)
   Where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   Suited for low-dimensional data
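Gaussian scaling is short enough to sketch with the standard library. The sketch assumes the scores are already regular (higher means more anomalous) and estimates $\mu_S$ and $\sigma_S$ from the scores themselves. The function name is illustrative.

```python
import math
from statistics import mean, pstdev


def gaussian_scaling(scores):
    """Map regular outlier scores into [0, 1] with max(0, erf((S - mu) / (sigma * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All scores identical: nothing stands out.
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

Scores at or below the mean collapse to zero, while scores far above it approach one, which is the "stretch interesting ranges, shrink irrelevant ones" behavior a plain division by the maximum does not give.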


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available Multivariate continuous data [27] Multivariate categorical data [5] Discrete sequences using HMM [12]

Drawbacks Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: the point z, the sampled point z′, and the injected anomaly plotted in the x-y feature space]
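The perturbation steps above can be sketched as follows. This is an illustrative reading of the slide, not the code of [30]: points are assumed to be pairs of numeric tuples, the distance is assumed to be Euclidean, and the function name `inject_anomaly` is hypothetical.

```python
import math
import random

def inject_anomaly(z, data, sample_size, rng):
    """Return a perturbed copy of z = (x, y) built from a random sample of data."""
    x, y = z
    sample = rng.sample(data, sample_size)  # random sample D of the data set
    # z' = (x', y') in D whose y' is farthest (Euclidean) from y
    x_prime, _ = max(sample, key=lambda p: math.dist(p[1], y))
    return (x_prime, y)  # x replaced by x', y kept

# Hypothetical data set in which x and y normally move together.
data = [((0.0,), (0.0,)), ((1.0,), (1.0,)), ((9.0,), (9.0,))]
injected = inject_anomaly(data[0], data, sample_size=len(data), rng=random.Random(0))
data.append(injected)  # add the perturbed point back to the data set
```

The injected point pairs an x value with a y value it never co-occurs with in the original data, which is what makes it anomalous.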


Applications Overview

How to set up an anomaly detection solution for a given application domain?

  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
      • Feature identification
      • Defining normal and anomalous behavior
      • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to the Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts (axis from 0 to 3,000,000). 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router with four links numbered 1-4, carrying OD flows $x_{1t}, \dots, x_{4t}$ and a link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
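As a small worked example of that relationship, link loads can be written as $y_t = A x_t$, where entry $A_{ij}$ is 1 if OD flow $j$ traverses link $i$. The 3-link, 4-flow network and the traffic values below are hypothetical.

```python
def link_loads(routing_matrix, od_flows):
    """y = A x: each link load is the sum of the OD flows routed over that link."""
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing_matrix]

# Hypothetical routing matrix: A[i][j] = 1 if OD flow j traverses link i.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
x_t = [100, 250, 80, 40]   # OD flow volumes in one interval
y_t = link_loads(A, x_t)   # [350, 330, 120]
```

Because each link aggregates several OD flows, a modest change in one flow can be hard to see in the link totals.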


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach will not work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) via eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a residual vector is large, then it is an "anomaly"
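The five steps can be sketched in dependency-free Python. To stay short, this illustration keeps only one principal component (k = 1) and finds it by power iteration instead of a full eigendecomposition; both are simplifications chosen here, and the two-link traffic data is made up.

```python
import math

def top_principal_component(rows, iters=200):
    """Steps 1-2 with k = 1: mean and leading eigenvector of the covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):  # power iteration converges to the top eigenvector
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v

def residual_norm(z, mean, p):
    """Steps 3-5: norm of (I - p p^T)(z - mean), the part outside the normal subspace."""
    c = [zi - mi for zi, mi in zip(z, mean)]
    score = sum(ci * pi for ci, pi in zip(c, p))       # coordinate along p
    resid = [ci - score * pi for ci, pi in zip(c, p)]  # remove the normal component
    return math.sqrt(sum(r * r for r in resid))

# Hypothetical traffic on two links that normally rise and fall together.
link_traffic = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1), (5.0, 9.9)]
mean, pc = top_principal_component(link_traffic)
normal_like = residual_norm((6.0, 12.0), mean, pc)  # follows the usual pattern
unusual = residual_norm((5.0, 0.0), mean, pc)       # breaks the usual pattern
```

A measurement that follows the dominant traffic pattern has a small residual even when its volume is high, while one that breaks the pattern has a large residual.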


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes substantially

Thus anomalies can be detected by thresholding the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

  » DoS attack: multiple source IP addresses concentrated on a single destination IP address

  » Network scan: dispersed distribution of destination addresses

  » Most worms/viruses also induce some change in the distribution of certain features

  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly: the traffic features change, but the traffic volume remains more or less the same

  • The port scan is dwarfed in volume metrics
  • But it stands out in feature entropy, which also reveals its structure

We can use entropy to capture the variations in the traffic features

  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure source: [22]
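A short sketch of the entropy computation, assuming the usual empirical (sample) entropy $H = -\sum_i (n_i/S) \log_2 (n_i/S)$ over the observed feature values. The port lists are hypothetical.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy, in bits, of a list of observed feature values."""
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical destination ports seen in one time bin.
normal_ports = [80, 80, 80, 443, 80, 80, 443, 80]  # concentrated on a few ports
scan_ports = list(range(1000, 1008))               # port scan: every port different

h_normal = sample_entropy(normal_ports)
h_scan = sample_entropy(scan_ports)
```

Both bins contain eight packets, so a volume metric cannot tell them apart, while the entropy of the destination port jumps during the scan.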


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data

  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges

  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Example domains and players

  • Insurance (auto, health [a]): Claimant, Provider, Payer
  • Telecommunications: Customer, Provider
  • Credit Cards: Customer, Supplier, Bank
  • Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historical data
     • Example: User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths

  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses

  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
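A minimal sketch of the activity-monitoring idea: the profile is just a mean and standard deviation, and a "significant deviation" is taken to be a z-score above a threshold. The profile form, the threshold of 3, and the call counts are assumptions for illustration, not the method of [13].

```python
import statistics

def build_profile(history):
    """Summarize historical behavior as (mean, sample standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)

def is_significant_deviation(current, profile, threshold=3.0):
    """Flag current behavior whose z-score against the profile exceeds threshold."""
    mean, std = profile
    if std == 0:
        return current != mean
    return abs(current - mean) / std > threshold

# Hypothetical: number of calls made by user X in January of five past years.
january_calls = [42, 38, 45, 40, 41]
profile = build_profile(january_calls)

typical = is_significant_deviation(43, profile)      # close to the profile
suspicious = is_significant_deviation(120, profile)  # far from the profile
```

Scoring a new observation is a constant-time lookup, which is why this style of method suits real-time use, but one profile must be stored and refreshed per individual.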


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph

  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node $i$
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space
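The first three egonet features can be computed directly from an adjacency structure. The sketch below assumes an undirected graph stored as a dict of dicts with each edge recorded in both directions; the example graph is made up, and the eigenvalue feature is omitted to keep the example dependency-free.

```python
def egonet_features(graph, i):
    """Return (N_i, E_i, W_i) for node i.

    graph maps node -> {neighbor: weight}; every undirected edge is stored twice.
    The egonet is node i, its neighbors, and all edges among those nodes.
    """
    neighbors = set(graph[i])
    members = neighbors | {i}
    seen_edges = set()
    total_weight = 0.0
    for u in members:
        for v, w in graph[u].items():
            edge = frozenset((u, v))
            if v in members and edge not in seen_edges:
                seen_edges.add(edge)  # count each undirected edge once
                total_weight += w
    return len(neighbors), len(seen_edges), total_weight

# Hypothetical call graph: a triangle a-b-c with a pendant node d attached to c.
graph = {
    "a": {"b": 1.0, "c": 2.0},
    "b": {"a": 1.0, "c": 3.0},
    "c": {"a": 2.0, "b": 3.0, "d": 4.0},
    "d": {"c": 4.0},
}
features = {node: egonet_features(graph, node) for node in graph}
```

Each node ends up as a small feature vector, which is the point in feature space that the later anomaly scoring works on.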


Identifying Anomalies

Traditional anomaly detection (LOF)

  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power law relationship

E.g., Egonet Density Power Law ($N_i$ vs. $E_i$): detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$)

$S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
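The score translates directly into code. In this sketch the fitted constants C and θ are passed in as given (fitting the power law is not shown), the natural logarithm is assumed, and both the observed and expected values are assumed to be positive.

```python
import math

def power_law_score(y_i, x_i, C, theta):
    """Deviation of observed y_i from the power-law fit y = C * x^theta."""
    expected = C * x_i ** theta
    ratio = max(y_i, expected) / min(y_i, expected)  # how many times off
    return ratio * math.log(abs(y_i - expected) + 1)  # scaled by log of absolute gap

# Hypothetical fit E = 1.0 * N^1.0 (a star-like egonet has about N edges).
on_fit = power_law_score(y_i=4.0, x_i=4.0, C=1.0, theta=1.0)
off_fit = power_law_score(y_i=8.0, x_i=2.0, C=1.0, theta=1.0)
```

A node exactly on the fitted line scores 0, and the score grows with both the relative and the absolute distance from the line.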


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph

  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$

  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \dots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \dots, t_n\}$ connected by edges $E$]
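One way to make the random-walk idea concrete is a random walk with restart, computed by power iteration. This sketch makes several assumptions that the slide does not spell out: the restart probability of 0.15, the use of the relevance score itself as the similarity matrix entry for a pair (a, b), and the small example graph.

```python
def relevance_scores(adj, query, restart=0.15, iters=200):
    """Random walk with restart from `query`; returns node -> visit probability.

    adj maps node -> list of neighbors (undirected, every node has a neighbor).
    """
    r = {v: 0.0 for v in adj}
    r[query] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for u, mass in r.items():
            share = (1.0 - restart) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share      # walk to a uniformly chosen neighbor
        nxt[query] += restart        # or jump back to the query node
        r = nxt
    return r

def normality_score(adj, t):
    """Mean pairwise relevance among the neighbors S_t of t (needs |S_t| >= 2)."""
    s_t = sorted(adj[t])
    rel = {a: relevance_scores(adj, a) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs)

# Hypothetical graph: a1, a2 share t1 and t2; a3, a4 share t3 and t4;
# t5 is the only node linking the two groups.
adj = {
    "a1": ["t1", "t2", "t5"], "a2": ["t1", "t2"],
    "a3": ["t3", "t4", "t5"], "a4": ["t3", "t4"],
    "t1": ["a1", "a2"], "t2": ["a1", "a2"],
    "t3": ["a3", "a4"], "t4": ["a3", "a4"],
    "t5": ["a1", "a3"],
}
within_group = normality_score(adj, "t1")
bridging = normality_score(adj, "t5")
```

The node t5, whose neighbors otherwise have little to do with each other, gets a lower normality score than t1, so its links are the more suspicious ones.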


Detecting Disease Outbreaks

Domain question

  • Early detection of disease outbreaks
  • Anthrax attack

Data

  • Emergency department visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges

  • Weak signals in the data
      • ED cases involving cough ⇒ flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
     • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
     • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
     • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant deviation). Repeat from Step 2
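Step 4 can be illustrated for a single rule with a 2×2 contingency table. This sketch uses the plain Pearson χ² statistic without continuity correction and the closed form of the 1-degree-of-freedom tail probability; the counts are hypothetical, and returning (0.0, 1.0) for a degenerate table is a choice made here.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table.

    table = [[a, b], [c, d]]: rows are DB_current / DB_baseline,
    columns are records matching / not matching the rule.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: treat as no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # survival function of chi-square, 1 dof
    return chi2, p_value

# Hypothetical rule "symptom = cough": 20 of 100 current records match,
# versus 5 of 100 records sampled from the baseline.
chi2, p = chi_square_2x2([[20, 80], [5, 95]])
same_chi2, same_p = chi_square_2x2([[10, 10], [10, 10]])
```

A small p-value means the rule matches today's records far more (or less) often than the baseline predicts.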


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships

  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]

  • Add nodes for temporal and spatial dependencies
  • Needs more data to train


Anomaly Detection in Climate and Weather

Science questions

  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data

  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges

  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series

  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena

  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?

  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
      • 30 year moving window

Different methods show statistically significant differences [19]

  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
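As an example of one of these choices, the sketch below builds the anomaly series for one location using monthly means as the base $b_i$. The function name and the tiny two-year series are illustrative only.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(months, values):
    """v' = v - b, where b is the mean of all values observed in the same month."""
    by_month = defaultdict(list)
    for m, v in zip(months, values):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for m, v in zip(months, values)]

# Hypothetical temperatures for January (1) and July (7) over two years.
months = [1, 7, 1, 7]
values = [10.0, 20.0, 12.0, 20.0]
anomalies = monthly_anomalies(months, values)  # [-1.0, 0.0, 1.0, 0.0]
```

The large January-to-July swing disappears from the anomaly series, leaving only the year-to-year departure from the typical value for each month.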


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
     • MATLAB: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: drought video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
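The two-step approach can be sketched without MATLAB. In this illustration the input is a time-indexed stack of 2-D grids, a cell is flagged when its value falls below a threshold, and flagged cells are grouped with 6-connectivity (neighbors in time, row, and column); the connectivity, the threshold, and the data are all assumptions, and `bwlabeln` itself may use a different default connectivity.

```python
from collections import deque

def connected_components(cells):
    """Group a set of (t, row, col) cells into 6-connected components."""
    remaining = set(cells)
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    components = []
    while remaining:
        start = remaining.pop()
        queue, comp = deque([start]), {start}
        while queue:  # breadth-first search over flagged neighbors
            t, r, c = queue.popleft()
            for dt, dr, dc in steps:
                nb = (t + dt, r + dr, c + dc)
                if nb in remaining:
                    remaining.remove(nb)
                    comp.add(nb)
                    queue.append(nb)
        components.append(comp)
    return components

def drought_events(precip, threshold, min_size):
    """Step 1: threshold each cell. Step 2: keep components with >= min_size cells."""
    dry = {(t, r, c)
           for t, grid in enumerate(precip)
           for r, row in enumerate(grid)
           for c, v in enumerate(row)
           if v < threshold}
    return [comp for comp in connected_components(dry) if len(comp) >= min_size]

# Hypothetical precipitation: two time steps on a 2 x 3 grid.
precip = [
    [[1.0, 9.0, 9.0],
     [1.0, 9.0, 0.5]],
    [[1.0, 9.0, 9.0],
     [9.0, 9.0, 9.0]],
]
events = drought_events(precip, threshold=2.0, min_size=2)
```

The isolated dry cell is discarded, while the dry patch that persists into the second time step is kept as one event.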


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

  • Water vapor content
  • Anomalies using a threshold
  • Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]

  • Surface winds and potential temperature fields

Methodology

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls

  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
  • Anomalies lost among false positives

Solution

  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

L Akoglu M McGlohon and C Faloutsos

OddBall Spotting Anomalies in Weighted GraphsIn In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) June 2010

A Asuncion and D J Newman

UCI machine learning repository[httparchiveicsucieduml] Irvine CA University of California 2007

S D Bay and M Schwabacher

Mining distance-based outliers in near linear time with randomization and a simple pruning ruleIn Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and datamining pages 29ndash38 ACM Press 2003

R Bolton and D Hand

Unsupervised profiling methods for fraud detectionIn Credit Scoring and Credit Control VII 1999

S Boriah V Chandola and V Kumar

Similarity measures for categorical data A comparative evaluationIn SDM pages 243ndash254 2008

M M Breunig H-P Kriegel R T Ng and J Sander

Lof identifying density-based local outliersIn Proceedings of 2000 ACM SIGMOD International Conference on Management of Data pages 93ndash104ACM Press 2000

S Budalakoti A Srivastava and M Otey

Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airlinesafetyProceedings of the IEEE International Conference on Systems Man and Cybernetics 37(6) 2007

S Byna Prabhat M Wehner and K Wu

Detecting atmospheric rivers in large climate datasets

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

In 2nd International Workshop on Petascale Data Analytics Challenges and Opportunities (PDAC-11)2011

V Chandola A Banerjee and V Kumar

Anomaly detection for discrete sequences A surveyIEEE Transactions on Knowledge and Data Engineering 99(PrePrints) 2010

V Chandola S Boriah and V Kumar

A framework for exploring categorical dataIn Proceedings of the ninth SIAM International Conference on Data Mining 2009

V Chandola V Mithal and V Kumar

A comparative evaluation of anomaly detection techniques for sequence dataIn Proceedings of International Conference on Data Mining 2008

V Chandola V Mithal and V Kumar

Understanding anomaly detection techniques for sequence dataTechnical Report 09-001 University of Minnesota Computer Science Department January 2009

T Fawcett and F Provost

Activity monitoring noticing interesting changes in behaviorIn Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and DataMining pages 53ndash62 ACM Press 1999

A Goldenberg G Shmueli R A Caruana and S E Fienberg

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication salesProceedings of the National Academy of Sciences of the United States of America 99(8)pp 5237ndash52402002

D Hawkins

Identification of outliersMonographs on Applied Probability and Statistics May 1980

S A Hofmeyr S Forrest and A Somayaji

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Intrusion detection using sequences of system callsJournal of Computer Security 6(3)151ndash180 1998

X Jiang and G F Cooper

A bayesian spatio-temporal method for disease outbreak detectionJAMIA pages 462ndash471 2010

J S Kaustav Das

Detecting anomalous records in categorical datasetsIn Proc of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data miningAug 2007

J Kawale S Chatterjee A Kumar S Liess M Steinbach and V Kumar

Anomaly construction in climate data Issues and challengesIn Proceedings of NASA Conference on Intelligent Data Understanding 2011

E M Knorr and R T Ng

Algorithms for mining distance-based outliers in large datasetsIn Proceedings of the 24rd International Conference on Very Large Data Bases pages 392ndash403 MorganKaufmann Publishers Inc 1998

H-P Kriegel P Kroger E Schubert and A Zimek

Interpreting and unifying outlier scoresIn SDM pages 13ndash24 2011

A Lakhina M Crovella and C Diot

Diagnosing network-wide traffic anomaliesIn Proceedings of ACM SIGCOMM pages 219ndash230 2004

X Li R Ramachandran S Graves S Movva B Akkiraju D Emmitt S Greco R Atlas J Terry and

J-C JusemAutomated detection of frontal systems from numerical model-generated dataIn Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in datamining KDD rsquo05 pages 782ndash787 New York NY USA 2005 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

M V Mahoney and P K Chan

Learning rules for anomaly detection of hostile network trafficIn ICDM pages 601ndash604 2003

S V Mane

False negative estimation theory techniques and applicationsProQuest UMI Dissertation Publishing 2011

A Moore and M S Lee

Cached sufficient statistics for efficient machine learning with large datasetsJ Artif Int Res 867ndash91 March 1998

Y Pei and O Zaane

A synthetic data generator for clustering and outlier analysisTechnical report University of Alberta 2006

S Ramaswamy R Rastogi and K Shim

Efficient algorithms for mining outliers from large data setsIn Proceedings of the 2000 ACM SIGMOD international conference on Management of data pages427ndash438 ACM Press 2000

N Satish M Harris and M Garland

Designing efficient sorting algorithms for manycore gpusIn IPDPS pages 1ndash10 2009

X Song M Wu C Jermaine and S Ranka

Conditional anomaly detectionIEEE Trans on Knowl and Data Eng 19631ndash645 May 2007

D A Spielman and N Srivastava

Graph sparsification by effective resistancesIn Proceedings of the 40th annual ACM symposium on Theory of computing STOC rsquo08 pages 563ndash568New York NY USA 2008 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
      • Anomalies in Climate
      • Validation of Anomalies

PINN Guarantee

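The PINN algorithm provides probabilistic guarantees.

Under certain assumptions about the intrinsic dimensionality (c), with high probability

$$\frac{1-\varepsilon}{1+\varepsilon} \cdot LOF(p) \;\le\; \widehat{LOF}(p) \;\le\; \frac{1+\varepsilon}{1-\varepsilon} \cdot LOF(p)$$

where $\widehat{LOF}(p)$ denotes the LOF of p computed after the random projection.

In practice we do not know the intrinsic dimensionality of the data. However, random projections are quite robust.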


Examples: high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples: high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples: local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
  • Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  • Utilization of device memory
  • Minimize CPU↔GPU transfer of data
  • Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution model. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2); each thread block holds threads Thread(0,0) to Thread(2,3).]

[Figure: CUDA memory hierarchy. Host with host memory; device with streaming multiprocessors SM 0 to SM 2, each with stream processors SP 0 to SP 7, registers, and shared memory; plus global, texture, and constant memory.]


Implementing DB(k,N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k,N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory
  • Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  • Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

[Figure: block-wise computation of the n × n pairwise distance matrix from n instances of dimension d]
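As a point of reference for the GPU version, the serial $O(n^2)$ computation of DB(k,N) can be sketched in plain Python. This is a minimal sketch and not the CUDA implementation described on the slide: the function names `kth_nn_distance` and `db_k_n`, the choice of Euclidean distance, and the toy data are assumptions made for illustration.

```python
import heapq
import math


def kth_nn_distance(i, data, k):
    """Distance from data[i] to its k-th nearest neighbor (Euclidean, assumed)."""
    dists = [math.dist(data[i], data[j]) for j in range(len(data)) if j != i]
    return heapq.nsmallest(k, dists)[-1]


def db_k_n(data, k, n_top):
    """Indices of the n_top instances with the largest k-th NN distance."""
    scored = [(kth_nn_distance(i, data, k), i) for i in range(len(data))]
    return [i for _, i in heapq.nlargest(n_top, scored)]


# Hypothetical toy data: four clustered points and one isolated point.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
top = db_k_n(data, k=2, n_top=1)
```

Every instance is compared with every other instance, which is the pairwise-distance work that the GPU version distributes across threads.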


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed)
  • Fraud Detection
  • Cyber Networks

Time Series
  • Sensor Networks
  • Healthcare

Discrete Sequences
  • Genomic
  • System Calls

Spatial
  • GIS
  • Image analysis

Spatio-temporal
  • Remote sensing
  • Climate

Graphs
  • Social networks
  • Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  • Binarize data
  • Learn rules ($X \Rightarrow Y$)
  • Choose high-confidence rules ($P(Y|X)$)
  • For a test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
    • Y is not observed when X is observed

Using Bayesian Networks [33]
  • Learn Bayesian network structure and parameters
  • Compute $P(Z)$ for test data record Z
  • Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  • Use a similarity measure ($S(X_1, X_2)$)
  • Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$$

  • $A \cap B = \emptyset$
  • Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t
  • For each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$
  • Score t based on all r-values
    • Assign the minimum r-value as score
    • Take the product of all r-values

Need to compare an exponential number of pairs of subsets
  • Only consider subsets up to size k
  • Ignore subsets with frequency less than a threshold α
  • Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  • $C(a_t)$: number of training instances with $A = a_t$
  • $N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks
  • Replace rare attribute values with a generic attribute (reduce arity)
  • Use an efficient data structure for querying counts (AD Trees [26])
    • AD Trees work faster for low-arity attributes
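The Laplace-smoothed r-value and the minimum-r scoring rule can be illustrated with a small sketch. It makes simplifying assumptions that are not part of the slides: only pairs of single attributes are compared (subsets of size 1), counts are kept in plain `Counter` objects rather than AD Trees, and the names `laplace` and `cpt_score` and the toy records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def laplace(count, n):
    """Smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def cpt_score(train, record):
    """Minimum r-value over all pairs of single attributes of `record`."""
    n = len(train)
    attrs = range(len(record))
    single, pair = Counter(), Counter()
    for row in train:
        for i in attrs:
            single[(i, row[i])] += 1
        for i, j in combinations(attrs, 2):
            pair[(i, row[i], j, row[j])] += 1
    r_values = []
    for i, j in combinations(attrs, 2):
        joint = laplace(pair[(i, record[i], j, record[j])], n)
        marg_i = laplace(single[(i, record[i])], n)
        marg_j = laplace(single[(j, record[j])], n)
        r_values.append(joint / (marg_i * marg_j))
    return min(r_values)


# Hypothetical training data: two consistent attribute combinations.
train = [("convex", "smooth", "edible")] * 10 + [("bell", "scaly", "poisonous")] * 10
usual = cpt_score(train, ("convex", "smooth", "edible"))
odd = cpt_score(train, ("convex", "smooth", "poisonous"))
```

A record that mixes values never seen together in training receives a much lower minimum r-value than a record that repeats a common combination.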


Finding Anomalies in Discrete Sequences

Many problem formulations
  1. Anomalous symbols in a sequence
  2. Anomalous subsequence in a sequence
  3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd webmail ssh webmail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
  • Plug into a distance/density/clustering-based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

  • Standard dynamic programming method is slow
  • Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  • Cannot localize anomalies within a sequence
  • Weak anomaly signals might get lost
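Both distances can be written in a few lines of Python. The sketch below uses the standard dynamic-programming LCS (the slow variant mentioned above, not the Hunt-Szymanski method); the function names and the example sequences are assumptions for illustration, and non-empty sequences are assumed.

```python
from math import sqrt


def hamming(a, b):
    """H(A, B): number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """|LCS(A, B)| by dynamic programming, keeping one row of the table at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1 - lcs_length(a, b) / sqrt(len(a) * len(b))


normal = "login pwd mail ssh mail web logout".split()
similar = "login pwd mail web web web web logout".split()
odd = "login pwd login pwd login pwd logout".split()

d_similar = nlcs_distance(normal, similar)
d_odd = nlcs_distance(normal, odd)
```

The resulting distances can be fed to any of the distance- or density-based detectors discussed earlier; unlike the Hamming distance, the LCS-based distance accepts sequences of unequal length.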


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing
  • For each window, find the frequency in the normal dictionary
  • Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  • For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  • Penalizes low-frequency windows in the normal dictionary
  • Rewards high-frequency windows that might not be relevant
  • Can construct anomalous sequences that will escape detection
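The training and testing steps of the basic sliding-window approach might look as follows. The exact form of the score is an assumption: here it is `1 / (1 + mean window frequency)`, where the added 1 only guards against division by zero; the function names and toy sequences are likewise hypothetical.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 windows of size k, as hashable tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_sequences, k):
    """Normal dictionary: every window seen in normal data, with its count."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k):
    """Inverse of the length-normalized aggregate window frequency (assumed form)."""
    wins = windows(seq, k)
    if not wins:
        raise ValueError("sequence is shorter than the window size")
    total = sum(dictionary[w] for w in wins)
    return 1.0 / (1.0 + total / len(wins))


normal_sequences = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
dictionary = build_dictionary(normal_sequences, k=3)
score_normal = anomaly_score(normal_sequences[0], dictionary, k=3)
score_odd = anomaly_score("login pwd login pwd login pwd logout".split(), dictionary, k=3)
```

None of the windows of the repeated-login sequence occur in the normal dictionary, so it receives the maximum score under this scoring rule.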


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is the frequency of a subsequence, estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?
  • Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees
  • Significantly reduces the size of the model
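A fixed-order version of this model is easy to sketch. The code below is not a Probabilistic Suffix Tree: it always conditions on exactly k previous symbols and, as an assumption, uses add-one smoothing to handle unseen contexts instead of backing off to a shorter suffix. The score is the average negative log-probability per symbol, which grows as the normalized probability shrinks. All names and the toy data are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(sequences, k):
    """Counts of length-k contexts and of contexts followed by a symbol."""
    context_counts, full_counts = Counter(), Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            context = tuple(seq[i - k:i])
            context_counts[context] += 1
            full_counts[context + (seq[i],)] += 1
    return context_counts, full_counts


def avg_neg_log_prob(seq, context_counts, full_counts, k, alphabet_size):
    """Average of -log P(S_i | S_{i-k} ... S_{i-1}) over positions i >= k."""
    if len(seq) <= k:
        raise ValueError("sequence is too short for this model order")
    total = 0.0
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        # Add-one smoothing (an assumption) keeps unseen events from giving P = 0.
        p = (full_counts[context + (seq[i],)] + 1) / (context_counts[context] + alphabet_size)
        total -= math.log(p)
    return total / (len(seq) - k)


normal_sequences = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
alphabet = {symbol for seq in normal_sequences for symbol in seq}
ctx, full = ngram_counts(normal_sequences, k=2)
score_normal = avg_neg_log_prob(normal_sequences[0], ctx, full, 2, len(alphabet))
score_odd = avg_neg_log_prob(
    "login pwd login pwd login pwd logout".split(), ctx, full, 2, len(alphabet)
)
```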


Finding Contextual Anomalies

Sometimes contextual information is available about data
  • Not used directly as a feature
  • Are well understood, no anomalies in the context
  • Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: Number of Fever Cases plotted against Maximum Daily Temperature, with the "Clearest Outlier" and a "Conditional Anomaly" marked]

Contextual anomalies - anomalous with respect to a context
  • Context is defined using environmental variables
    • Spatial (Latitude, Longitude)
    • Graph context (Edges, Weights)
    • Temporal location
    • Domain specific (Demographic, other)

How to incorporate context?
  • Reduce to traditional anomaly detection (subset on context)
  • Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

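Data instance $d \Rightarrow \{u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}\}$
  • $d_U$ environmental attributes
  • $d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]
  1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
  2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
  3. Learn a probabilistic mapping function $p(V_j|U_i)$
  4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$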
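The scoring step (step 4) can be illustrated once the model is given. The sketch below skips the learning steps entirely: the Gaussian components and the mapping $p(V_j|U_i)$ are hand-set, hypothetical values, and both the environmental and the indicator attribute are one-dimensional to keep the densities simple. A low score S indicates an anomaly.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def cad_score(u, v, env, ind, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    score = 0.0
    for i, (mean_u, std_u) in enumerate(env):
        inner = sum(
            gauss_pdf(v, mean_v, std_v) * mapping[i][j]
            for j, (mean_v, std_v) in enumerate(ind)
        )
        score += gauss_pdf(u, mean_u, std_u) * inner
    return score


# Hypothetical model: u = maximum daily temperature, v = number of fever cases.
env = [(10.0, 3.0), (30.0, 3.0)]   # U_1: cold days, U_2: hot days
ind = [(50.0, 5.0), (5.0, 2.0)]    # V_1: many cases, V_2: few cases
mapping = [
    [0.9, 0.1],  # p(V_j | U_1): cold days usually bring many cases
    [0.1, 0.9],  # p(V_j | U_2): hot days usually bring few cases
]

expected = cad_score(10.0, 50.0, env, ind, mapping)    # many cases on a cold day
surprising = cad_score(30.0, 50.0, env, ind, mapping)  # many cases on a hot day
```

The same indicator value (50 fever cases) scores far lower in the hot-day context than in the cold-day context, which is the conditional anomaly from the earlier fever example.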


Finding Collective Anomalies

Find a collection of data points
  • Each point by itself is normal
  • The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples
  1. Time series
  2. Image
  3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

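Find anomalous groups in data [34]

Example: spatial clusters of galaxies
  • topics: red, green, emissive
  • words: continuous features

Flexible Genre Model (FGM)

For each group
  1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$
  2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$
  3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
  4. For each point in the group
     1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
     2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of FGM with nodes $\alpha$, $\pi$, $\eta$, $y_m$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$ and plates N, M, K, T]

Model Parameters
  • $\mathcal{M}(\pi)$ - Multinomial
  • Genre - $Dir(\alpha_t)$
  • Topic generators $P(\cdot \mid \eta_k)$ - GIW (Gaussian-Inverse-Wishart)
  • Point generators $P(x_{mn} \mid \beta_{m,k})$ - Multivariate Gaussian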


Inference and Testing for FGM

Inference and Learning Parameters
  • Approximate inference of latent variables (Gibbs Sampling)
  • Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection
  • Infer the topic distribution $\theta_m$
  • Compute the negative log likelihood w.r.t. $\alpha_t$
  • Rationale: an anomalous group will be unlikely to be generated from any genre
  • Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  • Confusion matrix
  • Traditional evaluation metrics
    • Class imbalance
  • ROC Curve

Validation set does not exist
  • Use domain expertise to find TP and FP
  • FN is harder to estimate
    • Pseudo false negative estimation techniques [25]

              Predicted
              a     n
Actual   a    TP    FN
         n    FP    TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN}$$

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
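The effect of class imbalance on these metrics is easy to see in code. In the sketch below, the function names, the convention that `True` marks an anomaly, the choice to return 0.0 when a denominator is zero, and the toy validation set are assumptions for illustration.

```python
def confusion(actual, predicted):
    """Return (TP, FN, FP, TN); True marks an anomaly, False a normal instance."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif a:
            fn += 1
        elif p:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure; 0.0 when a ratio is undefined."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f


# Hypothetical validation set: 2 anomalies among 100 instances.
actual = [True] * 2 + [False] * 98

# A detector that never raises an alarm still looks accurate.
silent = metrics(*confusion(actual, [False] * 100))

# A detector that finds one anomaly at the cost of three false alarms.
noisy = metrics(*confusion(actual, [True, False] + [True] * 3 + [False] * 95))
```

The silent detector reaches 98% accuracy with zero recall, which is why accuracy alone is a poor guide when anomalies are rare.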


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  • Use threshold δ on score (scale issues [21])
  • Take top x as anomalies

ROC curve by varying x or δ

Quality of output
  • Does the output “suggest” x or δ?
  • Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  • kNN-based scores: [0, 1]
  • LOF scores: [1, 6]
  • ABOD scores: [0, 80000]
    • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  • Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  • Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization
  1. $R(o) = \max\{0, S(o) - base_S\}$
  2. $R(o) = S_{max} - S(o)$
  3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization
  1. $N(o) = \frac{S(o)}{S_{max}}$
  2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
     • Suited for high dimensional data
  3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling)
     • Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$; P is the regularized Gamma function
     • Suited for low dimensional data
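Baseline regularization followed by Gaussian scaling can be sketched with the standard library alone. The function names, the use of the population standard deviation for $\sigma_S$, and the LOF-like example scores are assumptions; Gamma scaling is omitted because it needs a regularized Gamma function that the standard library does not provide.

```python
import math
from statistics import mean, pstdev


def regularize_baseline(scores, base):
    """R(o) = max{0, S(o) - base_S}."""
    return [max(0.0, s - base) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]


# Hypothetical LOF-like scores: inliers near the baseline 1, one clear outlier.
lof_scores = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]
normalized = gaussian_scaling(regularize_baseline(lof_scores, base=1.0))
```

After scaling, every value lies in [0, 1]; the inliers collapse to 0 while the outlier stays close to 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior described on the previous slide.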


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
  • Use generative models for normal and anomalous behavior
  • Several generators available
    • Multivariate continuous data [27]
    • Multivariate categorical data [5]
    • Discrete sequences using HMM [12]
  • Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
  • Given data point $z = \{x, y\}$; x and y are partitions of the feature space
  • Take a random sample D of the entire data set
  • Let $z' = \{x', y'\} \in D$ such that the distance between y and y′ is maximum
  • Replace x with x′ and add z back to the data set

[Figure: injected anomaly, showing z and z′ in the (x, y) plane]
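The random perturbation scheme can be sketched as follows. For simplicity, and as an assumption, each data point is a pair of scalars `(x, y)` and the distance between y and y′ is the absolute difference; the function name `inject_anomaly` and the toy data are hypothetical.

```python
import random


def inject_anomaly(z, data, sample_size, rng):
    """Swap the x part of z with that of the sampled point whose y is farthest from z's y."""
    x, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: abs(point[1] - y))
    return (x_far, y)


# Hypothetical data set in which y is always twice x.
data = [(i, 2 * i) for i in range(20)]
z = data[0]

# Sampling the whole data set makes the outcome deterministic for this example.
injected = inject_anomaly(z, data, len(data), random.Random(0))
```

The injected point keeps its original y but takes its x from a distant region, so each part looks ordinary on its own while the combination breaks the relationship between x and y.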


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
    • Feature identification
    • Defining normal and anomalous behavior
    • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that correspond to a predefined set of patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks.

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents.

Network Anomaly Detection: finding unusual and large changes in the network traffic.

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart with the following values]

Period                  Value
2002                    20,547
2003                    18,827
2004                    69,107
2005                    113,025
2006                    140,690
2007                    624,267
2008                    1,656,227
July 2009 - June 2010   2,704,836


Network Topology

A typical network

[Figure: a router with links 1 to 4, link measurement $y_{1t}$, and OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A, i.e., $y_t = A x_t$


Network Anomalies Detection Problem


Why care about OD flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available.

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1 - Determine the PCs based on the eigenvalue decomposition of the covariance matrix of the data set

Step 2 - Choose the top k principal components with the highest eigenvalues as matrix P

Step 3 - Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4 - Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5 - If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
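The five steps can be sketched in pure Python for the special case k = 1. Two substitutions should be noted: the leading eigenvector is found by power iteration rather than a full eigenvalue decomposition, and the data are mean-centered before projection. The function names and the three-link toy traffic are assumptions for illustration.

```python
import math


def covariance(rows):
    """Column means and sample covariance matrix of the measurement rows."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    return mean, cov


def top_component(cov, iterations=200):
    """Leading eigenvector by power iteration (stand-in for steps 1 and 2, k = 1)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_norm(z, mean, p):
    """Norm of (I - P P^T) z for a single component p (steps 4 and 5)."""
    c = [zi - mi for zi, mi in zip(z, mean)]
    proj = sum(ci * pi for ci, pi in zip(c, p))        # P^T z
    resid = [ci - proj * pi for ci, pi in zip(c, p)]   # z minus its normal part
    return math.sqrt(sum(x * x for x in resid))


# Hypothetical traffic on three links that normally rise and fall together.
history = [(float(t), 2.0 * t, 3.0 * t) for t in range(1, 11)]
mean, cov = covariance(history)
p = top_component(cov)

normal_resid = residual_norm((5.0, 10.0, 15.0), mean, p)
anomaly_resid = residual_norm((5.0, 10.0, 30.0), mean, p)
```

A measurement that follows the usual joint pattern has an almost zero residual, while a spike on a single link leaves a large residual that a threshold can catch. Power iteration assumes the starting vector is not orthogonal to the leading eigenvector, which holds for this example.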


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this approach aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure: a port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source: [22]]

Port scan anomaly (the traffic feature changes, however the traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed
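The two boundary cases of the entropy measure can be checked with a short sketch. The function name `feature_entropy` and the example port lists are assumptions; the entropy is the usual empirical Shannon entropy of the observed feature values.

```python
import math
from collections import Counter


def feature_entropy(values):
    """Empirical Shannon entropy (in bits) of a traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# Hypothetical destination ports seen in two intervals with the same packet count.
web_traffic = [80] * 8                     # maximally concentrated
port_scan = list(range(1024, 1032))        # maximally dispersed over N = 8 ports

h_web = feature_entropy(web_traffic)
h_scan = feature_entropy(port_scan)
```

Both intervals carry the same volume (eight packets), yet the entropy of the destination port moves from 0 to $\log_2 8 = 3$ bits during the scan.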


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Insurance (auto/health)^a: Claimant, Provider, Payer

Telecommunications: Customer, Provider

Credit Cards: Customer, Supplier, Bank

Web Advertising: User, Advertiser, Publisher

^a http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
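
The two activity-monitoring steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the profile is reduced to a mean and standard deviation, the call counts are made up, and the cut-off `THRESHOLD` is a hypothetical parameter that is not part of [13].

```python
from statistics import mean, stdev

def build_profile(history):
    """Step 1: summarise historic activity as (mean, sample standard deviation)."""
    return mean(history), stdev(history)

def deviation(profile, current):
    """Step 2: distance of current activity from the profile, in standard deviations."""
    mu, sigma = profile
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

# Hypothetical January call counts for user X over five past years.
january_calls = [100, 110, 90, 105, 95]
profile = build_profile(january_calls)

THRESHOLD = 3.0  # assumed cut-off; lowering it raises the false-positive rate
for calls in (102, 300):
    flagged = deviation(profile, calls) > THRESHOLD
    print(calls, "anomalous" if flagged else "normal")
```

Scoring a new observation is a constant-time lookup, which matches the "fast" strength listed above; one profile per user is what makes the approach costly to maintain.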


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node $i$
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space
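
A small Python sketch of the first three egonet features, assuming the graph is stored as a dict of dicts with symmetric weights (an illustrative representation, not the one used in [1]). The principal eigenvalue is left out because it needs a linear algebra routine.

```python
def egonet_features(adj, i):
    """Return (N_i, E_i, W_i) for node i.

    adj maps each node to a dict {neighbor: weight}; the graph is assumed
    undirected, so every edge appears in both directions.
    """
    ego = {i} | set(adj[i])   # node i plus its direct neighbors
    n_i = len(adj[i])         # N_i: degree of node i
    e_i = 0                   # E_i: edges with both endpoints in the egonet
    w_i = 0.0                 # W_i: total weight of those edges
    seen = set()
    for u in ego:
        for v, w in adj[u].items():
            edge = frozenset((u, v))
            if v in ego and edge not in seen:
                seen.add(edge)
                e_i += 1
                w_i += w
    return n_i, e_i, w_i

# Hypothetical toy graphs: a star centred on node 0 and a triangle.
star = {0: {1: 1.0, 2: 1.0, 3: 1.0}, 1: {0: 1.0}, 2: {0: 1.0}, 3: {0: 1.0}}
triangle = {0: {1: 2.0, 2: 2.0}, 1: {0: 2.0, 2: 2.0}, 2: {0: 2.0, 1: 2.0}}

print(egonet_features(star, 0))      # star: E_i equals N_i
print(egonet_features(triangle, 0))  # clique: E_i grows faster than N_i
```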


Identifying Anomalies

Traditional anomaly detection (LOF)
• Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and power law relationships

E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ (detects near-cliques and stars)

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node $i$ w.r.t. a pair of features ($y = Cx^{\theta}$):

$$S_i = \frac{\max(y_i, Cx_i^{\theta})}{\min(y_i, Cx_i^{\theta})} \cdot \log\left(|y_i - Cx_i^{\theta}| + 1\right)$$
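
The score $S_i$ can be evaluated directly once a power law has been fitted. The Python sketch below assumes the fit is already available; the values `C = 1.0` and `theta = 1.5` are hypothetical, and both $y_i$ and $Cx_i^{\theta}$ are assumed to be positive.

```python
import math

def power_law_score(x, y, C, theta):
    """S_i = max(y, C*x^theta) / min(y, C*x^theta) * log(|y - C*x^theta| + 1)."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

C, theta = 1.0, 1.5  # hypothetical fitted power law E_i ~ C * N_i^theta

# A node with N_i = 9 that sits on the fitted line (E_i = 27) scores about 0.
on_line = power_law_score(9, 27, C, theta)
# A star-like node with N_i = 9 but only E_i = 9 edges scores much higher.
star_like = power_law_score(9, 9, C, theta)

print(on_line, star_like)
```

The ratio term measures how many times the observation is off the fitted line, and the logarithmic term keeps small absolute deviations from being over-penalised.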


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
• RelevanceScore($a$, $b$) $\propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
• Normality Score = mean of non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$
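
A hedged Python sketch of the normality score. The slide does not say how the similarity matrix is derived from the relevance vectors, so this sketch simply assumes its entries are the pairwise relevance scores themselves; the scores are hand-set toy values rather than the output of a random walk, and all names are illustrative.

```python
from itertools import permutations

def normality_score(t, edges, relevance):
    """Mean off-diagonal relevance among the V1 neighbors of node t in V2.

    edges: iterable of (a, t) pairs; relevance: dict {(a, b): score}.
    """
    s_t = sorted({a for a, node in edges if node == t})  # S_t = {a | <a, t> in E}
    pairs = list(permutations(s_t, 2))                   # non-diagonal entries only
    if not pairs:
        return float("nan")  # fewer than two neighbors: score is undefined
    return sum(relevance[(a, b)] for a, b in pairs) / len(pairs)

# Hypothetical bipartite edges and symmetric relevance scores.
edges = [("a1", "t1"), ("a2", "t1"), ("a3", "t1"), ("a1", "t2"), ("a4", "t2")]
raw = {("a1", "a2"): 0.4, ("a1", "a3"): 0.5, ("a2", "a3"): 0.3, ("a1", "a4"): 0.01}
relevance = {}
for (a, b), score in raw.items():
    relevance[(a, b)] = score
    relevance[(b, a)] = score

print(normality_score("t1", edges, relevance))  # neighbors are mutually relevant
print(normality_score("t2", edges, relevance))  # neighbors barely related: suspicious
```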


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
• Anthrax attack

Data
• Emergency department visits
• Grocery data (Example [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data
  • ED cases involving cough ⇒ Flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | Current Environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
   • Null Hypothesis: Rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2.
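
Steps 3 and 4 for a single rule can be sketched in Python as a 2×2 contingency table followed by a χ² test of independence. The counts are made up, the function name is illustrative, and the p-value uses the closed form for one degree of freedom, $p = \mathrm{erfc}(\sqrt{\chi^2/2})$, without any continuity correction.

```python
import math

def chi2_2x2(a, b, c, d):
    """Chi-square test of independence on the table [[a, b], [c, d]].

    Rows: DB_current, DB_baseline. Columns: rule matches, rule does not match.
    Returns (chi2 statistic, p-value) for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return chi2, math.erfc(math.sqrt(chi2 / 2))

# Hypothetical rule "symptom = respiratory": 30 of 100 current records match,
# against 5 of 100 baseline records.
chi2, p = chi2_2x2(30, 70, 5, 95)
print(chi2, p)

# Same match rate in both databases: the rule is not significant.
print(chi2_2x2(10, 90, 10, 90))
```

Because many rules are scored against the same data, the smallest p-value found this way is optimistic and would still need a correction for multiple testing.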


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI)
  • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
• Mean of all data for location $i$
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
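
As an illustration of one choice of $b_i$, the Python sketch below uses the monthly mean as the base, which removes seasonality. The `(month, value)` input format and the temperatures are assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(series):
    """Return v'_i = v_i - b_i, with b_i the mean of the same calendar month.

    series: list of (month, value) pairs for a single location i.
    """
    by_month = defaultdict(list)
    for month, value in series:
        by_month[month].append(value)
    base = {month: mean(values) for month, values in by_month.items()}
    return [value - base[month] for month, value in series]

# Hypothetical temperatures: January and July readings over two years.
temps = [(1, 10.0), (7, 30.0), (1, 14.0), (7, 26.0)]
print(monthly_anomalies(temps))
```

Swapping `mean` for `statistics.median` gives the more robust variant from the list above.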


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel/bwlabeln

Follow-up science question: Explain the cause

Figure: drought_1901_2006_1.avi. Video courtesy of Dr. Arindam Banerjee
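
The two-step approach can be sketched in plain Python as thresholding followed by a breadth-first connected-component search over a `[time][row][column]` grid. This sketch assumes 6-connectivity (face neighbors only), which may differ from the connectivity used with bwlabeln in the original work; the grid values and the threshold are made up.

```python
from collections import deque

def drought_mask(precip, threshold):
    """Step 1: mark cells whose precipitation falls below the threshold."""
    return [[[v < threshold for v in row] for row in grid] for grid in precip]

def connected_components(mask):
    """Step 2: group marked cells that touch in time or space (6-connectivity)."""
    T, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not mask[t][y][x] or (t, y, x) in seen:
                    continue
                seen.add((t, y, x))
                queue, comp = deque([(t, y, x)]), []
                while queue:
                    ct, cy, cx = queue.popleft()
                    comp.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        inside = 0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                        if inside and mask[nt][ny][nx] and (nt, ny, nx) not in seen:
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                components.append(comp)
    return components

# Hypothetical precipitation for 2 time steps on a 2x2 grid.
precip = [[[1, 9], [9, 9]],
          [[1, 9], [9, 2]]]
components = connected_components(drought_mask(precip, threshold=5))
print(sorted(len(c) for c in components))
```

Keeping only the largest components is what turns isolated dry cells into candidate persistent droughts.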


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example figure. Src: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Examples high-dim distance-based outliers

On a large database of images, the bright images show up as distance-based outliers


Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data parallel algorithms
• Using CUDA: Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
• Utilization of device memory
• Minimize CPU⇔GPU transfer of data
• Keep threads homogeneous

Most model based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

Figure: CUDA execution model (the host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2); each thread block contains threads, Thread(0,0) to Thread(2,3)) and memory hierarchy (host with host memory; device with streaming multiprocessors SM 0 to SM 2, each holding registers, stream processors SP 0 to SP 7, and shared memory, plus global, texture, and constant memory)


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j into shared memory
• Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
• Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

Figure: the $n \times d$ data matrix, its transpose ($T$), and the $n \times n$ pairwise distance matrix
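
For reference, a serial DB(k, N) baseline can be sketched in a few lines of Python. This is the $O(n^2)$ computation that the GPU version parallelises, not the GPU implementation itself; the function name and the toy points are illustrative assumptions.

```python
import math

def db_k_n(points, k, n_top):
    """Return indices of the n_top points with the largest k-th nearest-neighbor distance."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point: the O(n^2) pairwise step.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    scores.sort(reverse=True)  # the sorting step that [29] speeds up on GPUs
    return [i for _, i in scores[:n_top]]

# Hypothetical data: four points on a unit square and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(db_k_n(points, k=2, n_top=1))
```

Each row of the pairwise distance matrix is independent of the others, which is why one thread per pair of instances maps naturally onto the block layout described above.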


Moving Beyond Multi-dimensional Record Data

• Categorical (Mixed): Fraud Detection, Cyber Networks
• Time Series: Sensor Networks, Healthcare
• Discrete Sequences: Genomic, System Calls
• Spatial: GIS, Image analysis
• Spatio-temporal: Remote sensing, Climate
• Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth              urban     poisonous
convex      smooth              grasses   edible
bell        smooth              meadows   edible
convex      scaly               urban     poisonous
convex      smooth              grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
• Binarize data
• Learn rules ($X \Rightarrow Y$)
  • Choose high confidence rules ($P(Y|X)$)
• For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
  • $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]
• Learn Bayesian network structure and parameters
• Compute $P(Z)$ for test data record $Z$
• Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
• Use a similarity measure ($S(X_1, X_2)$)
• Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$$

$A \cap B = \phi$

Assumption: If $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$:
• For each mutually exclusive pair of attribute sets $\{A, B\}$, compute $r(a_t, b_t)$
• Score $t$ based on all r-values
  • Assign minimum r-value as score
  • Take product of all r-values

Need to compare exponential pairs of subsets
• Only consider subsets up to size $k$
• Ignore subsets with frequency less than a threshold $\alpha$
• Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_{\mu}$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$

$N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute value (reduce arity)

Use an efficient data structure to query for counts (AD Trees [26])

AD Trees work faster for low arity attributes
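The Laplace-smoothed r-value above can be sketched in a few lines. This is only an illustration under stated assumptions: records are tuples of categorical values, attribute sets are given as tuples of column positions, and the names `laplace`, `r_value`, and the toy `train` data are hypothetical. Counts are computed by a linear scan, not with AD Trees.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(train, test, a_idx, b_idx):
    """r(a_t, b_t) for two disjoint tuples of attribute positions."""
    def project(row, idx):
        return tuple(row[i] for i in idx)

    n = len(train)
    a_t, b_t = project(test, a_idx), project(test, b_idx)
    # Plain linear-scan counts C(a_t), C(b_t), C(a_t, b_t).
    c_a = sum(project(row, a_idx) == a_t for row in train)
    c_b = sum(project(row, b_idx) == b_t for row in train)
    c_ab = sum(project(row, a_idx) == a_t and project(row, b_idx) == b_t
               for row in train)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Toy training set (assumption): protocol and port always co-occur.
train = [("tcp", "80")] * 8 + [("udp", "53")] * 8
r_seen = r_value(train, ("tcp", "80"), (0,), (1,))
r_unseen = r_value(train, ("tcp", "53"), (0,), (1,))
```

A combination that never co-occurs in training (`r_unseen`) gets an r-value below 1, while a common combination (`r_seen`) gets a value above 1.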


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

Example sequences:

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure

Plug into a distance-, density-, or clustering-based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence (LCS)

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$$

Standard dynamic programming method is slow

Faster versions available (Hunt-Szymanski method [7])

Weaknesses: cannot localize anomalies within a sequence; weak anomaly signals might get lost
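The two distances above can be sketched as follows. This is a minimal illustration, not the faster Hunt-Szymanski variant: `lcs_length` uses the standard quadratic dynamic program, and all function names are hypothetical.

```python
from math import sqrt


def hamming(a, b):
    """H(A, B): number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """|LCS(A, B)| via the standard O(|A||B|) dynamic program."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1 - lcs_length(a, b) / sqrt(len(a) * len(b))
```

Either function can be passed as the distance to a nearest-neighbor or density-based detector; `lcs_distance` also handles sequences of unequal length.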


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing: for each window, find the frequency in the normal dictionary

Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:

Penalizes low frequency windows in the normal dictionary

Rewards high frequency windows that might not be relevant

Can construct anomalous sequences that will escape detection
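A minimal sketch of the window-dictionary scheme follows. The exact aggregation is an assumption: here each window contributes its relative frequency in the normal dictionary, the average is taken over the windows of the test sequence, and a small constant `eps` (hypothetical) avoids division by zero. The toy sequences are the login traces shown earlier.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 sliding windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Normal dictionary: counts of every window seen in normal sequences."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, counts, k, eps=1e-9):
    """Inverse of the average relative window frequency (higher = more anomalous)."""
    ws = windows(seq, k)
    total = sum(counts.values())
    avg_freq = sum(counts[w] / total for w in ws) / len(ws)
    return 1.0 / (avg_freq + eps)


normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
    "login pwd web mail ssh web mail logout",
)]
dictionary = build_dictionary(normal, k=3)
score_normal = anomaly_score(normal[0], dictionary, k=3)
score_odd = anomaly_score("login pwd login pwd login pwd logout".split(), dictionary, k=3)
```

The repeated-login trace shares no window of size 3 with the normal dictionary, so it receives a much larger score than a normal trace.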


Using Probabilistic Models

Probability of occurrence of sequence $S$

$$P(S) = \prod_{i=1}^{n} P(S_i | S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i | S_1 \ldots S_{i-1}) = P(S_i | S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i | S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

$f$ is the frequency estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

Significantly reduces the size of the model
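The fixed-order model above can be sketched as below. Two choices here are assumptions rather than the slide's method: unseen contexts are handled with add-one smoothing (a simple stand-in for probabilistic suffix trees), and the score is the average negative log probability per symbol, which increases as the normalized probability decreases. The names `train`, `score`, and `alphabet_size` are hypothetical.

```python
from collections import Counter
from math import log


def train(normal_seqs, k):
    """Count k-symbol contexts and (k+1)-symbol windows in normal data."""
    ctx, full = Counter(), Counter()
    for s in normal_seqs:
        for i in range(k, len(s)):
            ctx[tuple(s[i - k:i])] += 1
            full[tuple(s[i - k:i + 1])] += 1
    return ctx, full


def score(seq, ctx, full, k, alphabet_size):
    """Average negative log of P(S_i | S_{i-k} ... S_{i-1})."""
    nll, n = 0.0, 0
    for i in range(k, len(seq)):
        context, window = tuple(seq[i - k:i]), tuple(seq[i - k:i + 1])
        # Add-one smoothing (assumption) keeps unseen windows at non-zero probability.
        p = (full[window] + 1) / (ctx[context] + alphabet_size)
        nll -= log(p)
        n += 1
    return nll / n


normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
    "login pwd web mail ssh web mail logout",
)]
model = train(normal, 2)
s_normal = score(normal[0], *model, k=2, alphabet_size=6)
s_odd = score("login pwd login pwd login pwd logout".split(), *model, k=2, alphabet_size=6)
```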


Finding Contextual Anomalies

Sometimes contextual information is available about data

Not used directly as a feature

Well understood, no anomalies in the context

Can reduce false positives and yield interesting anomalies

[Figure: number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly. Example adapted from [30]]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables

Spatial (latitude, longitude)

Graph context (edges, weights)

Temporal location

Domain specific (demographic, other)

How to incorporate context

Reduce to traditional anomaly detection (subset on context)

Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

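Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ over the environmental attributes ($u_1$, $u_2$) mapped to components $V_1, V_2, V_3$ over the indicator attributes ($v_1$, $v_2$), with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = U_1, U_2, \ldots, U_{n_U}$, each with dimensionality $d_U$

2. Learn a set of Gaussians $V = V_1, V_2, \ldots, V_{n_V}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j|U_i)$

4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j) \, p(V_j|U_i)$$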
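The scoring step (step 4) can be illustrated with one-dimensional Gaussians. This sketch assumes the components and the mapping have already been learned; the parameter values, the names `gauss` and `cad_score`, and the temperature/fever interpretation are all hypothetical.

```python
from math import exp, pi, sqrt


def gauss(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    U and V are lists of (mean, std); mapping[i][j] holds p(V_j|U_i).
    """
    return sum(
        gauss(u, *U[i]) * sum(gauss(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )


# Hypothetical learned model: cold days (U_0) mostly map to many fever cases (V_1).
U = [(10.0, 3.0), (30.0, 3.0)]        # maximum daily temperature components
V = [(5.0, 2.0), (50.0, 5.0)]         # number-of-fever-cases components
mapping = [[0.1, 0.9], [0.9, 0.1]]
s_expected = cad_score(10.0, 50.0, U, V, mapping)   # many cases on a cold day
s_unusual = cad_score(30.0, 50.0, U, V, mapping)    # many cases on a hot day
```

A low value of $S$ marks the instance as unlikely given its context, so `s_unusual` is smaller than `s_expected` even though the indicator value is the same.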


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure and

When domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies
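The three-step simple solution can be sketched for a time series. The specific choices are assumptions for illustration only: fixed-size non-overlapping groups, the within-group standard deviation as the single feature, and a z-score as the "traditional" detector. The function name `group_scores` is hypothetical.

```python
from statistics import mean, pstdev


def group_scores(series, size):
    """Score each fixed-size group by how unusual its spread is."""
    # Step 1: break data into non-overlapping groups.
    groups = [series[i:i + size] for i in range(0, len(series) - size + 1, size)]
    # Step 2: compute one feature per group (within-group standard deviation).
    feats = [pstdev(g) for g in groups]
    # Step 3: apply a simple point detector (absolute z-score) to the features.
    mu, sd = mean(feats), pstdev(feats)
    return [abs(f - mu) / sd if sd else 0.0 for f in feats]


# Every value lies in [0, 1], but one group is flat instead of oscillating.
series = [0, 1, 0, 1] * 5 + [0.5, 0.5, 0.5, 0.5] + [0, 1, 0, 1] * 4
scores = group_scores(series, size=4)
flagged = scores.index(max(scores))
```

Each point of the flat group is individually unremarkable, yet the group as a whole receives the highest score.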


Using Latent Dirichlet Allocation for Group Anomalies

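Find anomalous groups in data [34]

Example: spatial clusters of galaxies; topics: red, green, emissive; words: continuous features

Flexible Genre Model (FGM)

For each group:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$

2. Draw topic distribution for $y_m$: $S_K \ni \theta_m \sim Dir(\alpha_{y_m})$

3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k}|\eta_k)$, $k = 1, 2, \ldots, K$

4. For each point in group:

   1. Draw topic membership $z_{mn} \sim M(\theta_m)$

   2. Generate point $x_{mn} \sim P(x_{mn}|\beta_{m,z_{mn}})$

[Figure: plate diagram of the FGM over $\pi$, $\alpha$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$, with plates of size $T$, $K$, $M$, and $N$]

Model Parameters

$M(\pi)$: Multinomial

Genre: $Dir(\alpha_t)$

Topic generators $P(\cdot|\eta_k)$: GIW

Point generators $P(x_{mn}|\beta_{m,k})$: Multivariate Gaussian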


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure labels: Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood w.r.t. $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists

Confusion matrix

Traditional evaluation metrics (class imbalance)

ROC curve

Validation set does not exist

Use domain expertise to find TP and FP

FN is harder to estimate

Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
Actual a      TP            FN
Actual n      FP            TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN}$$

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
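The metrics above are easy to compute from the four confusion-matrix cells. The sketch below uses hypothetical counts chosen to show the class-imbalance issue: accuracy looks excellent while recall and precision are poor.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn)
    prec = tp / (tp + fp)
    f = 2 * rec * prec / (rec + prec)
    return acc, rec, prec, f


# Hypothetical imbalanced case: 10 true anomalies among 1000 records.
acc, rec, prec, f = metrics(tp=5, fn=5, fp=10, tn=980)
```

With these counts accuracy is 0.985, although only half of the anomalies are found and two thirds of the alarms are false.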


Evaluating Anomaly Detection Methods - Scores

Convert to binary output

Use threshold $\delta$ on score (scale issues [21])

Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output

Does the output "suggest" $x$ or $\delta$?

Which output is better?
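Varying the threshold $\delta$ to trace an ROC curve can be sketched as follows. The sketch assumes that a higher score means more anomalous and that labels are 1 for anomalies and 0 for normal records; `roc_points` and `auc` are hypothetical helper names.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= delta and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= delta and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

The area equals the fraction of (anomaly, normal) pairs that the scores rank correctly, here 5 out of 6.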


Unifying Scores

Different methods assign scores in different ranges

kNN-based scores: $[0, 1]$

LOF scores: $[1, 6]$

ABOD scores: $[0, 80000]$ (anomalies have lower scores)

Direct scaling to $[0, 1]$ might lose distinction between normal and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous

Normality: $S$ is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0, \operatorname{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)

   Suited for high dimensional data

3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)

   where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function

   Suited for low dimensional data
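Gaussian scaling can be sketched with the standard library alone. The assumptions here: $\mu_S$ and $\sigma_S$ are estimated from the same scores being normalized, the input scores are already regular (higher means more anomalous), and the function names and toy scores are hypothetical. Linear scaling is included for comparison.

```python
from math import erf, sqrt
from statistics import mean, pstdev


def linear_scaling(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]


# Toy LOF-like scores (assumption): five normal points and one clear outlier.
raw = [1.0, 1.1, 0.9, 1.0, 1.2, 6.0]
lin = linear_scaling(raw)
gau = gaussian_scaling(raw)
```

Linear scaling leaves the normal points at non-zero values close to each other, while Gaussian scaling pushes them to 0 and the outlier close to 1, which is the "stretch interesting ranges" behavior described earlier.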


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available

Multivariate continuous data [27]

Multivariate categorical data [5]

Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $x$-$y$ plane, with the injected anomaly marked]
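The random perturbation procedure can be sketched for records with scalar $x$ and $y$ parts. The scalar parts, the absolute difference as the distance, and the name `inject_anomaly` are assumptions made for brevity.

```python
import random


def inject_anomaly(data, z_index, sample_size, rng):
    """Return a perturbed copy of data[z_index] as an injected anomaly.

    data is a list of (x, y) pairs; rng is a random.Random instance.
    """
    x, y = data[z_index]
    sample = rng.sample(data, sample_size)                 # random sample D
    x_prime, _ = max(sample, key=lambda p: abs(p[1] - y))  # z' with y' farthest from y
    return (x_prime, y)                                    # replace x with x', keep y


data = [(i, 2 * i) for i in range(100)]
injected = inject_anomaly(data, z_index=0, sample_size=100, rng=random.Random(0))
```

The injected record keeps a plausible $y$ and a plausible $x$, but the combination breaks the relationship between the two partitions.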


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available?

Feature identification

Defining normal and anomalous behavior

Tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts by period. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connecting nodes 1 to 4, labeled with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

Link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

Relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus, a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) based on eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components with the highest eigenvalues as matrix $P$

Step 3: Normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a residual vector (a column of $\tilde{Z}$) is large, then it is an "anomaly"
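The five steps can be sketched with NumPy. The orientation follows the formulas above: $Z$ has one row per link (feature) and one column per time step, so $P$ has one column per retained principal component. The synthetic traffic, the injected perturbation, and the function name `residual_norms` are assumptions for illustration.

```python
import numpy as np


def residual_norms(Z, k):
    """Norm of the residual (abnormal-subspace) component of each column of Z."""
    Zc = Z - Z.mean(axis=1, keepdims=True)          # center each feature
    cov = Zc @ Zc.T / (Zc.shape[1] - 1)             # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # Step 2: top-k components
    C_tilde = np.eye(Z.shape[0]) - P @ P.T          # Step 4: residual projector
    return np.linalg.norm(C_tilde @ Zc, axis=0)     # Step 5: residual norms


# Synthetic traffic: three links driven by one shared trend plus small noise.
rng = np.random.default_rng(0)
trend = rng.normal(0.0, 10.0, size=200)
Z = np.outer(np.ones(3), trend) + rng.normal(0.0, 0.1, size=(3, 200))
Z[:, 50] += np.array([5.0, -5.0, 0.0])              # perturbation off the trend
norms = residual_norms(Z, k=1)
suspect = int(np.argmax(norms))
```

The perturbed time step is small compared with the overall trend, yet it dominates the residual norm because it does not lie in the normal subspace. A threshold on `norms` would turn this into a detector.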


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes substantially

Thus, anomalies can be detected by applying a threshold to the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

Most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

DoS attack: multiple source IP addresses concentrated on a single destination IP address

Network scan: dispersed distribution of destination addresses

Most worms/viruses also induce some change in the distribution of certain features

However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

We can use entropy to capture the variations in the traffic feature

Takes value 0 when the distribution is maximally concentrated

Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure: port scan anomaly. The traffic feature changes while traffic volume remains more or less the same: the port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source [22]]
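The entropy of a traffic feature can be computed from the empirical distribution of its observed values. A minimal sketch follows; the function name `feature_entropy` and the toy port lists are hypothetical.

```python
from collections import Counter
from math import log2


def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


concentrated = feature_entropy([80] * 8)        # all packets hit one port
dispersed = feature_entropy(list(range(8)))     # a scan touching 8 distinct ports
```

Both windows contain the same number of packets, so a volume metric cannot separate them, while the entropy moves from 0 to $\log_2 8 = 3$ bits.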


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data

Transactions between different players in the system

Meta information about the individuals

An underlying graph structure

Challenges

Track and model human behavior

Anomalies caused by adaptive human adversaries

Massive data sizes

Domain                        Players
Insurance (auto/health) [a]   Claimant, Provider, Payer
Telecommunications            Customer, Provider
Credit Cards                  Customer, Supplier, Bank
Web Advertising               User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data (e.g., user X makes n calls on average in January)

2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers

2. Identify small clusters or outlying profiles as anomalies

Strengths

Anomaly detection is fast (good for real time)

Results are easy to explain

Weaknesses

Need to create and maintain a large number of profiles

Not dynamic

Adequate historical data might not be available

Too many false positives
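Activity monitoring can be sketched with a one-feature profile. The choices below are assumptions: the profile is the mean and sample standard deviation of a user's historical monthly call counts, and the deviation is measured in standard deviations. The names `build_profile` and `deviation` and the toy history are hypothetical.

```python
from statistics import mean, stdev


def build_profile(history):
    """Profile of one individual: mean and standard deviation of past activity."""
    return mean(history), stdev(history)


def deviation(current, profile):
    """Number of standard deviations between current behavior and the profile."""
    mu, sd = profile
    if sd == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sd


# Hypothetical history: calls made by user X in January over five years.
profile = build_profile([40, 42, 38, 41, 39])
usual = deviation(41, profile)
unusual = deviation(120, profile)
```

Scoring a new observation only needs the stored profile, which is why detection is fast, but one such profile must be kept current for every individual.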


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph

Communication networks (phone, email, SMS)

Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node

2. $E_i$: number of edges in egonet

3. $W_i$: total weight of egonet

4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional feature space


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and power law relationship

E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detect near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$)

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \times \log(|y_i - C x_i^{\theta}| + 1)$$
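The power-law score can be sketched directly from the formula. The fitted constants `C` and `theta` are assumed to be given (for example from a regression on log-log features, not shown), the natural logarithm is used, and the function name `power_law_score` is hypothetical.

```python
from math import log


def power_law_score(x, y, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * log(abs(y - expected) + 1)


# Hypothetical fit E = N^1.5, evaluated for nodes with N = 10 neighbors.
on_law = power_law_score(10, 10 ** 1.5, C=1.0, theta=1.5)   # follows the fit exactly
star = power_law_score(10, 10, C=1.0, theta=1.5)            # only ego-to-neighbor edges
clique = power_law_score(10, 55, C=1.0, theta=1.5)          # all 11 nodes connected
```

Nodes on the fitted line score 0; both the star (too few edges) and the near-clique (too many edges) deviate from it and receive positive scores.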


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph

Healthcare data (beneficiaries vs. providers)

Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

RelevanceScore($a$, $b$) $\propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality scores for a node $t \in V_2$

Find set $S_t = \{a \mid \langle a, t \rangle \in E\}$

Compute $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$

Normality score = mean of non-diagonal entries of similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$]


Detecting Disease Outbreaks

Domain Question: early detection of disease outbreaks (e.g., anthrax attack)

Data

Emergency department visits

Grocery data (example [14])

Clinical visits

Weather/climate data

Challenges

Weak signals in the data (ED cases involving cough $\Rightarrow$ flu or SARS?)

Integration of multiple signals (lag analysis)

Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn Bayesian network from historical data (environmental and response variables)

2. Sample from the BN ($DB_{baseline}$ | current environment)

3. Compute contingency table for rules for $DB_{baseline}$ and $DB_{current}$

   Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find p-value for rules using the $\chi^2$-test

   Null hypothesis: rows and columns of tables for $DB_{baseline}$ and $DB_{current}$ are independent

5. Find the rule with the smallest p-value (the most significant difference). Repeat Step 2
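Step 4 can be illustrated for a single rule with a 2x2 contingency table. This sketch uses the Pearson $\chi^2$ statistic without continuity correction and the closed form of the one-degree-of-freedom tail probability, so it needs only the standard library. The counts and the name `chi2_pvalue_2x2` are hypothetical, and the sketch does not cover the sampling in Step 2 or any correction for testing many rules.

```python
from math import erfc, sqrt


def chi2_pvalue_2x2(a, b, c, d):
    """p-value of the chi-square independence test on the table [[a, b], [c, d]].

    Rows: DB_current, DB_baseline. Columns: rule matches, rule does not match.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For one degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts for a rule such as "symptom = respiratory".
p_same = chi2_pvalue_2x2(10, 90, 100, 900)     # same 10% rate today and at baseline
p_shift = chi2_pvalue_2x2(30, 70, 100, 900)    # 30% today versus 10% at baseline
```

A rule whose match rate is unchanged yields a p-value of 1, while a rule matched three times as often today yields a very small p-value and would be a candidate for reporting.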


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships

What happened yesterday?

What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]

Add nodes for temporal and spatial dependencies

Need more data to train


Anomaly Detection in Climate and Weather

Science Questions

Identify natural and anthropogenic disasters

Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data

Ground observations

Remote sensing data (satellites, air-borne)

Climate model simulation outputs

Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges

Model spatio-temporal relationships across multiple variables

Explain the cause of anomalies

Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis done on "anomaly" time series

Difference from a "base period" (too simplistic)

Brings spatial smoothness (e.g., a mountain top and nearby valley can have very different temperatures)

Removes seasonality

Understand climate and weather phenomena

Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]

What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
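As a rough illustration of two of the base choices above (monthly mean and monthly z-score), the following sketch computes $v'_i = v_i - b_i$ for one location. The function name `monthly_anomalies` and the assumption that the series is a plain list of monthly values starting at a fixed calendar month are ours, not the tutorial's.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, zscore=False):
    """Anomaly series for one location (hypothetical helper).

    values[t] is the observation for calendar month t % 12; the series is
    assumed to cover at least one full year.
    """
    # Group observations by calendar month: values[m], values[m + 12], ...
    by_month = [values[m::12] for m in range(12)]
    # Base b_i: the mean of all observations for that calendar month.
    base = [mean(v) for v in by_month]
    # Optional z-score: divide by the monthly standard deviation (1.0 if it is zero).
    scale = [(pstdev(v) or 1.0) if zscore else 1.0 for v in by_month]
    return [(v - base[t % 12]) / scale[t % 12] for t, v in enumerate(values)]
```

Swapping `mean` for `statistics.median` would give the more robust median base from the list above.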


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
      • Matlab: bwlabel, bwlabeln

Follow-up science question: Explain cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
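The two-step approach can be sketched in plain Python as thresholding followed by a breadth-first search over a space-time grid. This is only an illustration: the name `connected_anomalies`, the `grid[t][y][x]` layout, and the choice of 6-connectivity (face neighbours in time and space) are assumptions; a labelling routine such as `bwlabeln` may use a different default connectivity.

```python
from collections import deque

def connected_anomalies(grid, threshold, min_size=1):
    """Return connected sets of cells whose anomaly is below `threshold`.

    grid[t][y][x] holds a precipitation anomaly (hypothetical layout).
    Components are returned largest first, each as a sorted list of (t, y, x).
    """
    T, H, W = len(grid), len(grid[0]), len(grid[0][0])
    # Step 1: threshold the anomalies.
    dry = {(t, y, x) for t in range(T) for y in range(H) for x in range(W)
           if grid[t][y][x] < threshold}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    components = []
    # Step 2: grow connected components across space and time.
    while dry:
        start = dry.pop()
        queue, comp = deque([start]), [start]
        while queue:
            t, y, x = queue.popleft()
            for dt, dy, dx in steps:
                nb = (t + dt, y + dy, x + dx)
                if nb in dry:
                    dry.remove(nb)
                    queue.append(nb)
                    comp.append(nb)
        if len(comp) >= min_size:
            components.append(sorted(comp))
    return sorted(components, key=len, reverse=True)
```

Raising `min_size` keeps only the large, persistent components that would be candidate droughts.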


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water Vapor Content
  • Anomalies using a threshold
  • Connected components

Example Src: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid
  2. Cluster grids into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.
[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.
[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.
[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.
[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.
[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.
[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.
[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.
[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.
[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.
[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.
[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.
[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.
[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.
[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.
[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.
[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.
[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.
[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.
[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.
[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.
[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.
[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.
[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.
[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.
[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.
[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.
[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.
[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.
[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.
[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.
[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.
[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Examples high-dim density-based outliers

On a large database of images, occluded images show up as density-based outliers


Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data parallel algorithms
  • Using CUDA: Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  • Utilization of device memory
  • Minimize CPU ⇔ GPU transfer of data
  • Keep threads homogeneous

Most model based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

Figure: CUDA programming model and memory hierarchy. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2); each thread block holds a 2-D array of threads, Thread(0,0) to Thread(2,3). Each multiprocessor (SM 0 to SM 2) has registers, processors SP 0 to SP 7, and shared memory; the device also provides global, texture, and constant memory, and the host has its own host memory.


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances
  • Load block i and block j to shared memory
      • Data layout in memory should be optimized
  • Each thread computes the distance between a pair of instances
      • Can utilize this time to load the next chunk of data from host to device memory
  • Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]
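For reference, the serial $O(n^2)$ baseline that the GPU version parallelizes can be sketched as follows. This is not the CUDA implementation described above; the function name `db_k_n` and the use of Euclidean distance are assumptions made for illustration.

```python
import heapq
import math

def db_k_n(points, k, n_top):
    """Serial DB(k, N) sketch: indices of the n_top points whose distance to
    their k-th nearest neighbour is largest (requires k <= len(points) - 1)."""
    scores = []
    for i, p in enumerate(points):
        # All pairwise distances from p; this is the part a GPU would tile.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    # Keep the n_top largest k-NN distances.
    return [i for _, i in heapq.nlargest(n_top, scores)]
```

On a GPU, the inner distance computation is what each thread would handle for one pair of instances, with the final selection replaced by a parallel sort.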


Moving Beyond Multi-dimensional Record Data

  • Categorical (Mixed): Fraud Detection, Cyber Networks
  • Time Series: Sensor Networks, Healthcare
  • Discrete Sequences: Genomic, System Calls
  • Spatial: GIS, Image analysis
  • Spatio-temporal: Remote sensing, Climate
  • Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape  cap-surface  ···  habitat  type
convex     smooth       ···  urban    poisonous
convex     smooth       ···  grasses  edible
bell       smooth       ···  meadows  edible
convex     scaly        ···  urban    poisonous
convex     smooth       ···  grasses  edible
···

Table Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  • Binarize data
  • Learn rules ($X \Rightarrow Y$)
      • Choose high confidence rules ($P(Y|X)$)
  • For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
      • $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]
  • Learn Bayesian network structure and parameters
  • Compute $P(Z)$ for test data record $Z$
  • Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  • Use a similarity measure ($S(X_1, X_2)$)
  • Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$$

  • $A \cap B = \emptyset$
  • Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$:
  • For each mutually exclusive pair of attribute sets $\{A, B\}$, compute $r(a_t, b_t)$
  • Score $t$ based on all r-values
      • Assign minimum r-value as score
      • Take product of all r-values

Need to compare exponential pairs of subsets
  • Only consider subsets up to size $k$
  • Ignore subsets with frequency less than a threshold $\alpha$
  • Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

  • $C(a_t)$: number of training instances with $A = a_t$
  • $N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks
  • Replace rare attribute values with a generic attribute (reduce arity)
  • Use an efficient data structure to query for counts (AD Trees [26])
      • AD Trees work faster for low arity attributes
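A minimal sketch of the r-value with Laplace smoothing, restricted to pairs of single attributes (subsets of size 1) and using the minimum r-value as the record score. The helper names `fit_counts` and `cpt_score` are hypothetical, and plain counters stand in for the AD Trees mentioned above.

```python
from collections import Counter
from itertools import combinations

def fit_counts(train):
    """Count single attribute values and pairs of values in the training records."""
    single, pair = Counter(), Counter()
    for rec in train:
        for i, a in enumerate(rec):
            single[(i, a)] += 1
        for (i, a), (j, b) in combinations(enumerate(rec), 2):
            pair[(i, a, j, b)] += 1
    return single, pair, len(train)

def cpt_score(rec, single, pair, n):
    """Minimum r(a_t, b_t) over all attribute pairs; low values suggest an anomaly."""
    def e(count):
        # Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2).
        return (count + 1) / (n + 2)
    return min(
        e(pair[(i, a, j, b)]) / (e(single[(i, a)]) * e(single[(j, b)]))
        for (i, a), (j, b) in combinations(enumerate(rec), 2)
    )
```

A record that combines values never seen together in training (for example a convex cap with a scaly surface in the toy data of the test) receives a much smaller score than a record seen many times.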


Finding Anomalies in Discrete Sequences

Many problem formulations
  1. Anomalous symbols in a sequence
  2. Anomalous subsequence in a sequence
  3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
  • Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$$

  • Standard Dynamic Programming method is slow
  • Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  • Cannot localize anomalies within a sequence
  • Weak anomaly signals might get lost
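Both distances are short enough to sketch directly. The LCS routine below is the standard (slow) dynamic programming method, not the Hunt-Szymanski variant; the function names are ours.

```python
import math

def hamming(a, b):
    """H(A, B): number of positions where equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """|LCS(A, B)| by dynamic programming, keeping one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|); works for unequal lengths."""
    return 1 - lcs_length(a, b) / math.sqrt(len(a) * len(b))
```

Either function can then be plugged into a distance-based or density-based detector in place of a Euclidean distance.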


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing
  • For each window, find the frequency in the normal dictionary
  • Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  • For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  • Penalizes low frequency windows in the normal dictionary
  • Rewards high frequency windows that might not be relevant
  • Can construct anomalous sequences that will escape detection
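The training and testing steps above might look like this in a few lines. The names are hypothetical, and the exact score, `1 / (1 + average window frequency)`, is an assumption chosen so that unseen windows do not cause a division by zero.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 windows of size k from a sequence."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def train_dictionary(normal_seqs, k):
    """Normal dictionary: every unique window in the normal data with its count."""
    return Counter(w for s in normal_seqs for w in windows(s, k))

def window_score(seq, dictionary, k):
    """Inverse of the length-normalized aggregate frequency (assumed form)."""
    ws = windows(seq, k)
    avg_freq = sum(dictionary[w] for w in ws) / len(ws)
    return 1.0 / (1.0 + avg_freq)
```

A test sequence whose windows never occur in the dictionary gets the maximum score of 1.0.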


Using Probabilistic Models

Probability of occurrence of sequence $S$

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

where $f$ is the frequency estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?
  • Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  • Significantly reduces the size of the model
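A small sketch of the frequency-ratio estimate with a back-off to the longest sufficiently frequent suffix. It stores plain n-gram counts rather than an actual probabilistic suffix tree, and it reports the average negative log-probability (a monotone stand-in for the inverse normalized probability). The names, the `min_count` threshold, and the `eps` floor for unseen symbols are assumptions.

```python
import math
from collections import Counter

def ngram_counts(normal_seqs, k):
    """f(.): counts of every contiguous subsequence of length 1 to k + 1."""
    f = Counter()
    for s in normal_seqs:
        for n in range(1, k + 2):
            for i in range(len(s) - n + 1):
                f[tuple(s[i:i + n])] += 1
    return f

def cond_prob(f, context, symbol, total, min_count=1):
    """f(context + symbol) / f(context), shortening the context from the left
    until it has been seen at least min_count times."""
    context = tuple(context)
    while context and f[context] < min_count:
        context = context[1:]  # drop the oldest symbol
    denom = f[context] if context else total  # total = number of training symbols
    return f[context + (symbol,)] / denom

def sequence_score(seq, f, k, total, eps=1e-6):
    """Average negative log-probability of a sequence; higher is more anomalous."""
    logp = 0.0
    for i, sym in enumerate(seq):
        p = cond_prob(f, seq[max(0, i - k):i], sym, total)
        logp += math.log(max(p, eps))
    return -logp / len(seq)
```

Here `total` is the number of symbols in the normal data, for example `sum(len(s) for s in normal_seqs)`.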


Finding Contextual Anomalies

Sometimes contextual information is available about data
  • Not used directly as a feature
  • Are well understood, no anomalies in the context
  • Can reduce false positives and yield interesting anomalies

Figure: Example adapted from [30]. Number of fever cases plotted against maximum daily temperature, with the clearest outlier and a conditional anomaly marked.

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables
  • Spatial (latitude, longitude)
  • Graph context (edges, weights)
  • Temporal location
  • Domain specific (demographic, other)

How to incorporate context
  • Reduce to traditional anomaly detection (subset on context)
  • Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

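Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$
  • $d_U$ environmental attributes
  • $d_V$ indicator attributes

Figure: Gaussians $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to Gaussians $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$

Algorithm [30]
  1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
  2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
  3. Learn a probabilistic mapping function $p(V_j|U_i)$
  4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\, p(V_j|U_i)$$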
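To make the scoring step concrete, here is a sketch of $S$ for one-dimensional $u$ and $v$. It assumes the Gaussians and the mapping $p(V_j|U_i)$ have already been learned and are passed in as plain lists; the learning steps 1 to 3 are not shown, and the names are ours.

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    U and V are lists of (mean, std) pairs; mapping[i][j] holds p(V_j|U_i).
    A low S means v is unlikely given the context u.
    """
    return sum(
        gauss_pdf(u, *U[i])
        * sum(gauss_pdf(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )
```

With made-up parameters in the spirit of the fever example (cold and hot days as contexts, many and few fever cases as indicators), a high case count on a hot day scores lower than the same count on a cold day.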


Finding Collective Anomalies

Find a collection of data points
  • Each point by itself is normal
  • The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples
  1. Time series
  2. Image
  3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

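Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
  • topics: red, green, emissive
  • words: continuous features

Flexible Genre Model (FGM)

For each group $m$:
  1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$
  2. Draw a topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
  3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
  4. For each point $n$ in the group:
      1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
      2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

Figure: Plate diagram of FGM with nodes $\pi$, $\alpha$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$ and plates $N$, $M$, $K$, $T$

Model Parameters
  • $\mathcal{M}(\pi)$: Multinomial
  • Genre: $\mathrm{Dir}(\alpha_t)$
  • Topic generators $P(\cdot \mid \eta_k)$: GIW
  • Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian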


Inference and Testing for FGM

Inference and Learning Parameters
  • Approximate inference of latent variables (Gibbs sampling)
  • Use samples to learn parameters (single step Monte Carlo EM)

Figure: Topic 1, Topic 2, Topic 3

Anomaly Detection
  • Infer the topic distribution $\theta_m$
  • Compute negative log likelihood w.r.t. $\alpha_t$
  • Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  • Confusion matrix
  • Traditional evaluation metrics
      • Class imbalance
  • ROC Curve

Validation set does not exist
  • Use domain expertise to find TP and FP
  • FN is harder to estimate
      • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

            Predicted a  Predicted n
  Actual a  TP           FN
  Actual n  FP           TN

$$Acc = \frac{TP + TN}{\sum}$$

where $\sum$ is the sum of all four cells.

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
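The four metrics follow directly from the confusion matrix cells. The sketch below (function name ours) also shows why accuracy is misleading under class imbalance: with the counts used in the usage note, accuracy is high while precision and F are low.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F from confusion-matrix counts.

    Ratios with an empty denominator are reported as 0.0 (an assumption).
    """
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f
```

For example, `metrics(tp=5, fn=5, fp=15, tn=975)` describes a detector that finds half of 10 anomalies among 1000 instances while raising 15 false alarms.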


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  • Use threshold $\delta$ on score (scale issues [21])
  • Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output
  • Does the output "suggest" $x$ or $\delta$?
  • Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  • kNN-based scores: [0, 1]
  • LOF scores: [1, 6]
  • ABOD scores: [0, 80000]
      • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  • Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
  • Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization
  1. $R(o) = \max\{0, S(o) - base_S\}$
  2. $R(o) = S_{max} - S(o)$
  3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization
  1. $N(o) = \frac{S(o)}{S_{max}}$
  2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
      • Suited for high dimensional data
  3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
      • Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
      • Suited for low dimensional data
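As a worked illustration, baseline regularization (item 1 under Regularization) followed by Gaussian scaling can be written with the standard library alone. The function names and the use of the population standard deviation for $\sigma_S$ are assumptions; Gamma scaling is omitted because it needs the regularized Gamma function, which the standard library does not provide.

```python
import math
from statistics import mean, pstdev

def regularize_baseline(scores, base):
    """R(o) = max{0, S(o) - base_S}, e.g. base = 1 for LOF-style scores."""
    return [max(0.0, s - base) for s in scores]

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        # All scores identical: nothing stands out.
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

Scores at or below the mean map to 0, and only scores well above the mean approach 1, which is the "stretch interesting ranges" behaviour described earlier.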


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

• Use generative models for normal and anomalous behavior
• Several generators available
  • Multivariate continuous data [27]
  • Multivariate categorical data [5]
  • Discrete sequences using HMM [12]
• Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

• Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
• Take a random sample $D$ of the entire data set
• Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
• Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane, with the injected anomaly marked]
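The perturbation steps can be sketched as follows. Points are represented as `(x, y)` pairs of tuples and Euclidean distance is used on the `y` part; both choices, and the name `inject_anomaly`, are assumptions made for illustration.

```python
import math
import random


def inject_anomaly(z, data, sample_size, rng):
    """Swap the x part of z for that of the sampled point whose y part is farthest."""
    x, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda p: math.dist(p[1], y))
    return (x_far, y)


# Toy data set of (x, y) pairs; each part is a 1-dimensional tuple.
data = [((0.0,), (0.0,)), ((1.0,), (1.0,)), ((5.0,), (9.0,))]
rng = random.Random(0)
injected = inject_anomaly(data[0], data, sample_size=len(data), rng=rng)
```

The injected point keeps a plausible `y` but pairs it with an `x` taken from a very different region, so each part looks normal on its own while the combination does not.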


Applications Overview

How to set up an anomaly detection solution for a given application domain?

• Available data
• Define anomalies, define normal behavior
• Identify requirements and constraints (online, real-time, limited resources)
• What domain knowledge is available?
  • Feature identification
  • Defining normal and anomalous behavior
  • Tuning parameters
• Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

• Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks
• Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature; they also profile the normal behavior of the network and address suspected incidents
• Network Anomaly Detection: finding unusual and large changes in the network traffic
• Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connected to nodes 1-4, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

• An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)
• A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)
• The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
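In the usual formulation (an assumption here, since the slide only names the matrix), the link loads at time t are y_t = A x_t, where A[i][j] = 1 when OD flow j traverses link i. The routing matrix and traffic volumes below are hypothetical.

```python
def link_loads(A, x):
    """Link traffic y = A x for a routing matrix A and OD-flow vector x."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


# Hypothetical routing matrix: 3 links (rows) by 4 OD flows (columns).
A = [
    [1, 1, 0, 0],   # link 1 carries OD flows 1 and 2
    [0, 1, 1, 0],   # link 2 carries OD flows 2 and 3
    [0, 0, 1, 1],   # link 3 carries OD flows 3 and 4
]
x_t = [10, 20, 30, 40]        # OD-flow volumes in one interval
y_t = link_loads(A, x_t)      # what the link counters would report
```

Because each link sums several OD flows, a change in a single flow shows up only as a smaller relative change in the link counts.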


Network Anomalies Detection Problem


Why care about OD Flows

• A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)
• If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable; thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1 - Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2 - Choose the top $k$ principal components, those with the highest eigenvalues, as the matrix $P$

Step 3 - Normal traffic subspace, called $\hat{Z}$:

$\hat{Z} = P P^{T} Z = C Z$

Step 4 - Abnormal traffic subspace, called $\tilde{Z}$:

$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5 - If the norm of a vector in $\tilde{Z}$ is large, then it is an “anomaly”
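The five steps can be sketched in plain Python. Power iteration with deflation stands in for a full eigendecomposition, the data are mean-centered first, and the toy link-traffic matrix is hypothetical; all three are assumptions of this illustration rather than details taken from the slides.

```python
import math


def mat_vec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def center_and_covariance(rows):
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    Z = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(z[i] * z[j] for z in Z) / (n - 1) for j in range(d)]
           for i in range(d)]
    return Z, cov


def top_components(cov, k, iters=500):
    """Steps 1-2: top-k eigenvectors of cov by power iteration with deflation."""
    d = len(cov)
    M = [row[:] for row in cov]
    P = []
    for _component in range(k):
        v = unit([float(i + 1) for i in range(d)])   # arbitrary start vector
        for _step in range(iters):
            w = mat_vec(M, v)
            if not any(w):
                break
            v = unit(w)
        lam = sum(a * b for a, b in zip(v, mat_vec(M, v)))
        P.append(v)
        # Deflate: remove the component just found from the matrix.
        M = [[M[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return P


def residual_energy(rows, k):
    """Steps 3-5: squared norm of each row's projection on the residual subspace."""
    Z, cov = center_and_covariance(rows)
    P = top_components(cov, k)
    energies = []
    for z in Z:
        z_hat = [0.0] * len(z)                       # normal part P P^T z
        for p in P:
            coef = sum(a * b for a, b in zip(p, z))
            z_hat = [h + coef * a for h, a in zip(z_hat, p)]
        energies.append(sum((a - h) ** 2 for a, h in zip(z, z_hat)))
    return energies


# Toy traffic on 3 links over 8 intervals, with a spike on link 2 at index 5.
traffic = [[b, 2 * b, 3 * b] for b in range(1, 9)]
traffic[5][1] += 6
spe = residual_energy(traffic, k=1)
suspect = max(range(len(spe)), key=spe.__getitem__)
```

The regular traffic pattern is absorbed by the first principal component, so the interval with the spike has by far the largest residual energy.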


Subspace Analysis Results

• During an anomaly the normal component changes little, while the residual component changes substantially
• Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

• Most anomalies induce a change in distributional aspects of packet header fields (called features)
• The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

• Port scan anomaly: traffic features change, but traffic volume remains more or less the same
• We can use entropy to capture the variations in the traffic features
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure: a port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Source: [22]]
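A small sketch of the entropy measure on one traffic feature. The function name and the two toy lists of destination ports are hypothetical; both lists contain the same number of packets, so a volume metric cannot tell them apart.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Empirical entropy (in bits) of the observed values of one feature."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


normal_ports = [80] * 6 + [443] * 2     # traffic concentrated on two ports
scan_ports = list(range(1, 9))          # port scan: every packet hits a new port

h_normal = sample_entropy(normal_ports)
h_scan = sample_entropy(scan_ports)
```

The scan reaches the maximum log2(8) = 3 bits for 8 packets, while the normal traffic stays below 1 bit.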


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
• Scan anomaly: a spike in traffic from a dominant source IP
• Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

• Domain question: identify fraudulent activities or players from observed transaction data
• Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure
• Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and players:

• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

• Activity monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     • E.g., user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations
• Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies
• Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain
• Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
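Activity monitoring can be sketched as a per-user profile plus a deviation test. The profile used here (mean and standard deviation of past January call counts) and the 3-sigma rule are assumptions for illustration, as are the user names and counts.

```python
from statistics import mean, stdev


def build_profiles(history):
    """Profile = (mean, standard deviation) of each user's historic call counts."""
    return {user: (mean(counts), stdev(counts)) for user, counts in history.items()}


def deviates(profiles, user, current, k=3.0):
    """True when current behavior is more than k standard deviations off profile."""
    mu, sigma = profiles[user]
    return abs(current - mu) > k * sigma


# Hypothetical number of calls made in January over five past years.
history = {"user_x": [40, 42, 38, 41, 39], "user_y": [5, 7, 6, 5, 7]}
profiles = build_profiles(history)
```

The check itself is a constant-time lookup, which matches the strength listed above; the cost lies in keeping one profile per individual up to date.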


Exploiting Graph Structure - Weighted Graphs

• Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks
• Objective: identify anomalous nodes
• For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]
• Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet
• Each node is thereby transformed into a point in a multi-dimensional space
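A minimal sketch of the first three egonet features for an undirected weighted graph stored as a dictionary of edges. The representation, the function name and the toy graph are assumptions; the eigenvalue feature is omitted to keep the example short.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    `edges` maps an undirected edge (u, v) to its weight; the egonet holds the
    node, its neighbors, and every edge among them.
    """
    neighbors = set()
    for u, v in edges:
        if u == node:
            neighbors.add(v)
        elif v == node:
            neighbors.add(u)
    members = neighbors | {node}
    ego = {e: w for e, w in edges.items() if e[0] in members and e[1] in members}
    return len(neighbors), len(ego), sum(ego.values())


# Hypothetical call graph: weights are numbers of calls between two parties.
edges = {
    ("a", "b"): 1.0,
    ("a", "c"): 1.0,
    ("a", "d"): 2.0,
    ("b", "c"): 1.0,
    ("d", "e"): 5.0,
}
features_a = egonet_features(edges, "a")
features_e = egonet_features(edges, "e")
```

The edge between d and e is left out of a's egonet because e is not a neighbor of a.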


Identifying Anomalies

• Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure
• Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

    $E_i \propto N_i^{\alpha}, \quad 1 \leq \alpha \leq 2$

• Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$):

  $S_i = \dfrac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
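The score can be evaluated directly once a power law has been fitted. In this sketch the fit (C = 1, θ = 1.5) is simply assumed rather than estimated, the natural logarithm is used, and the three example nodes are hypothetical.

```python
import math


def power_law_score(x, y, C, theta):
    """Distance of the point (x, y) from the fitted power law y = C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)


C, theta = 1.0, 1.5     # assumed fit of E_i against N_i

on_line = power_law_score(4, 8, C, theta)        # 4 ** 1.5 = 8, exactly on the line
star = power_law_score(100, 100, C, theta)       # 100 neighbors, no edges among them
near_clique = power_law_score(10, 45, C, theta)  # 10 neighbors, almost fully connected
```

Nodes on the fitted line score 0; both the star (far fewer edges than expected) and the near-clique (more than expected) score above 0, with the star furthest from the line in this example.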


Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)
• Objective: identify anomalous links
• Given a query node $a \in V_1$, find the “relevance” of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a “random walk” from $a$ reaches $b$
• Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by the edge set $E$]
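One way to make the relevance and normality scores concrete is a random walk with restart. The restart probability of 0.15, the use of the relevance of b with respect to a as entry (a, b) of the similarity matrix, and the toy graph are all assumptions of this sketch.

```python
def relevance(adj, start, restart=0.15, iters=200):
    """Steady-state visit probabilities of a random walk with restart at `start`."""
    p = {n: 0.0 for n in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in p.items():
            share = (1 - restart) * mass / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        nxt[start] += restart
        p = nxt
    return p


def normality(adj, t, restart=0.15):
    """Mean pairwise relevance among the V1 neighbors S_t of node t."""
    s_t = sorted(adj[t])
    if len(s_t) < 2:
        return float("nan")
    rel = {a: relevance(adj, a, restart) for a in s_t}
    off_diagonal = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diagonal) / len(off_diagonal)


# Two tight groups (a1, a2 share t1, t2; a3, a4 share t3, t4) and a bridge
# node t5 that links a2 and a3 across the groups.
edge_list = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"), ("a2", "t2"),
             ("a3", "t3"), ("a3", "t4"), ("a4", "t3"), ("a4", "t4"),
             ("a2", "t5"), ("a3", "t5")]
adj = {}
for a, t in edge_list:
    adj.setdefault(a, []).append(t)
    adj.setdefault(t, []).append(a)

score_inside = normality(adj, "t1")
score_bridge = normality(adj, "t5")
```

The bridge node t5 connects two nodes that are otherwise unrelated, so its normality score comes out lower than that of t1.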


Detecting Disease Outbreaks

• Domain question
  • Early detection of disease outbreaks
  • Anthrax attack
• Data
  • Emergency department visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data
• Challenges
  • Weak signals in the data
    • ED cases involving cough $\Rightarrow$ flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one); repeat Step 2
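The rule-scoring step can be illustrated with a chi-square test on a 2×2 contingency table. The sketch uses no continuity correction, relies on the fact that with one degree of freedom the p-value equals erfc(sqrt(χ²/2)), and picks the rule with the smallest p-value, i.e. the one for which independence is least plausible. The rule names and counts are hypothetical.

```python
import math


def chi2_p_value(table):
    """p-value of the chi-square independence test for a 2x2 table.

    table = [[current_match, current_other], [baseline_match, baseline_other]]
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts of records matching each rule today vs. in the baseline.
rules = {
    "symptom=respiratory": [[40, 160], [50, 950]],
    "gender=female": [[100, 100], [500, 500]],
}
p_values = {rule: chi2_p_value(table) for rule, table in rules.items()}
best_rule = min(p_values, key=p_values.get)
```

The second rule matches the same fraction of records in both data sets, so its p-value is 1 and it is never selected.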


Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?
• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Needs more data to train


Anomaly Detection in Climate and Weather

• Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)
• Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes
• Climate and weather extreme events are well defined
• The key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

• Most analysis is done on “anomaly” time series
  • Difference from a “base period” (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality
• Understand climate and weather phenomena
  • Southern Oscillation Index (SOI)
    • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

• Anomaly time series for a given location $i$:

  $v'_i = v_i - b_i$

  where $b_i$ is the base (reference)
• How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter “reference period”
    • 30-year moving window
• Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
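A short sketch of the monthly-mean choice of base. It assumes a flat list of monthly values for one location that starts in January; the function name and the two-year toy series are hypothetical.

```python
def monthly_anomalies(values):
    """v' = v - b, where b is the long-term mean of the same calendar month."""
    base = []
    for month in range(12):
        same_month = values[month::12]
        base.append(sum(same_month) / len(same_month))
    return [v - base[i % 12] for i, v in enumerate(values)]


# Two years of toy data: a seasonal ramp, with year 2 uniformly 2 units higher.
year_1 = [float(m) for m in range(12)]
year_2 = [v + 2.0 for v in year_1]
anomalies = monthly_anomalies(year_1 + year_2)
```

The seasonal ramp disappears completely: every month of year 1 has anomaly -1 and every month of year 2 has anomaly +1.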


Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatiotemporal anomalies in precipitation data
• A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     • Matlab: bwlabel, bwlabeln
• Follow-up science question: explain the cause

[Video: drought_1901_2006_1.avi. Video courtesy of Dr. Arindam Banerjee]
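The two-step approach can be sketched in plain Python as thresholding followed by breadth-first labeling on a (time, row, column) grid. The sketch links only face-adjacent cells (6-connectivity), which is an assumption and may differ from the Matlab defaults; the grid values and the threshold are hypothetical.

```python
from collections import deque

STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def component_sizes(mask):
    """Sizes of connected components of True cells in mask[t][i][j]."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    seen = [[[False] * C for _ in range(R)] for _ in range(T)]
    sizes = {}
    label = 0
    for t in range(T):
        for i in range(R):
            for j in range(C):
                if not mask[t][i][j] or seen[t][i][j]:
                    continue
                label += 1
                seen[t][i][j] = True
                queue = deque([(t, i, j)])
                size = 0
                while queue:
                    a, b, c = queue.popleft()
                    size += 1
                    for da, db, dc in STEPS:
                        na, nb, nc = a + da, b + db, c + dc
                        if (0 <= na < T and 0 <= nb < R and 0 <= nc < C
                                and mask[na][nb][nc] and not seen[na][nb][nc]):
                            seen[na][nb][nc] = True
                            queue.append((na, nb, nc))
                sizes[label] = size
    return sizes


# Toy precipitation: 3 time steps on a 3x3 grid; low values are dry cells.
precip = [
    [[5, 5, 5], [1, 1, 5], [5, 5, 5]],
    [[5, 5, 5], [5, 1, 5], [5, 5, 0]],
    [[5, 5, 5], [5, 1, 5], [5, 5, 5]],
]
threshold = 2
dry = [[[v < threshold for v in row] for row in grid] for grid in precip]  # step 1
sizes = component_sizes(dry)                                               # step 2
```

The dry cell that persists at the grid center across all three time steps forms one component of size 4, while the isolated dry cell at the corner stays a component of size 1 and would be discarded as too small.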


More Climate Extreme Events using Anomaly Detection

• Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov
• Cold Fronts [23]
  • Surface winds and potential temperature fields
  • Methodology
    1. Compute features for every grid cell
    2. Cluster grid cells into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

• Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives
• Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


P. Sun, S. Chawla, and B. Arunasalam.

Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner.

Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Póczos, and J. Schneider.

Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
  • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
      • Anomalies in Climate
      • Validation of Anomalies

Examples local density-based outliers

Examples of images ranked by LOF


Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data-parallel algorithms
Using CUDA (Compute Unified Device Architecture, Nvidia)

Need to re-engineer existing algorithms
Utilization of device memory
Minimize CPU ⇔ GPU transfer of data
Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA execution and memory model. The host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2), each made of threads, Thread(0,0) to Thread(2,3). On the device, streaming multiprocessors (SM 0 to SM 2) each hold registers, stream processors (SP 0 to SP 7) and shared memory, next to global, texture and constant memory; the host has its own host memory.]


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory
Data layout in memory should be optimized

Each thread computes the distance between a pair of instances

Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

[Figure: pairwise distance computation; the surviving dimension labels are d, n and T]
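The DB(k, N) definition above can be sketched serially as follows. This is a minimal CPU illustration in plain Python, not the GPU implementation the slide describes; the function names and the toy data are assumptions made for the example.

```python
import heapq
import math


def kth_nn_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (itself excluded)."""
    dists = [math.dist(points[i], q) for j, q in enumerate(points) if j != i]
    return heapq.nsmallest(k, dists)[-1]


def db_k_n_outliers(points, k, top_n):
    """Return (score, index) pairs for the top_n points with the largest k-NN distance."""
    scores = [(kth_nn_distance(points, i, k), i) for i in range(len(points))]
    return heapq.nlargest(top_n, scores)


# Toy data (hypothetical): four clustered points and one isolated point.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
top = db_k_n_outliers(data, k=2, top_n=1)
```

Every call to `kth_nn_distance` scans all n points, which is the O(n^2) cost that the block-wise GPU scheme parallelizes.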


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
Binarize data
Learn rules (X ⇒ Y)
Choose high-confidence rules (P(Y|X))
For test record Z = ⟨X, Y⟩, find rules of the form P(Y|X)
Y is not observed when X is observed

Using Bayesian Networks [33]
Learn Bayesian network structure and parameters
Compute P(Z) for test data record Z
Flag anomaly if P(Z) < δ

Using Similarity Metrics [10]
Use a similarity measure (S(X1, X2))
Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t:
For each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$
Score t based on all r-values
Assign the minimum r-value as the score
Take the product of all r-values

Need to compare exponentially many pairs of subsets
Only consider subsets up to size k
Ignore subsets with frequency less than a threshold α
Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$
N: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)
Use an efficient data structure to query for counts (AD Trees [26])
AD Trees work faster for low-arity attributes
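As a rough illustration of the Laplace-smoothed r-value, the sketch below counts attribute combinations by a linear scan over the training records (no AD Tree). The record layout, the function names and the toy mushroom-style data are assumptions for the example only.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(train, test, attrs_a, attrs_b):
    """r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)) for disjoint attribute sets A and B."""
    n = len(train)
    a_t = tuple(test[k] for k in attrs_a)
    b_t = tuple(test[k] for k in attrs_b)
    c_a = c_b = c_ab = 0
    for rec in train:
        match_a = tuple(rec[k] for k in attrs_a) == a_t
        match_b = tuple(rec[k] for k in attrs_b) == b_t
        c_a += match_a
        c_b += match_b
        c_ab += match_a and match_b
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Hypothetical training data: habitat and type are perfectly associated.
train = ([{"habitat": "urban", "type": "poisonous"}] * 4
         + [{"habitat": "grasses", "type": "edible"}] * 4)
seen = r_value(train, {"habitat": "urban", "type": "poisonous"}, ["habitat"], ["type"])
unseen = r_value(train, {"habitat": "urban", "type": "edible"}, ["habitat"], ["type"])
```

A combination that never co-occurs in training gets an r-value below 1, while a common combination gets a value above 1; smoothing keeps the unseen case from collapsing to exactly zero.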


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure
Plug into a distance/density/clustering based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

Standard dynamic programming method is slow
Faster versions available (Hunt-Szymanski method [7])

Weaknesses
Cannot localize anomalies within a sequence
Weak anomaly signals might get lost
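The two distances can be written down directly. The sketch below uses the standard quadratic dynamic program for the LCS (not the faster Hunt-Szymanski variant); the function names and the two example sessions are illustrative assumptions.

```python
import math


def hamming(a, b):
    """H(A, B): number of positions where equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a||b|) time, one row of memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
d12 = nlcs_distance(s1, s2)
```

Unlike the Hamming distance, `nlcs_distance` accepts sequences of unequal length, which is the first issue the slide lists.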


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence (n − k + 1 windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing
For each window, find its frequency in the normal dictionary
Anomaly score is the inverse of the aggregate frequency over all windows (normalized by length)

Many variants exist
For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
Penalizes low-frequency windows in the normal dictionary
Rewards high-frequency windows that might not be relevant
Can construct anomalous sequences that will escape detection
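A minimal sketch of the window-dictionary scheme follows. The exact aggregation is not spelled out on the slide, so the score used here (inverse of the mean window frequency, with +1 added to avoid division by zero) is an assumption, as are the function names and the toy sessions. It also assumes every sequence has at least k symbols.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Normal dictionary: every window seen in normal data with its count."""
    counts = Counter()
    for s in normal_seqs:
        counts.update(windows(s, k))
    return counts


def anomaly_score(seq, dictionary, k):
    """Inverse of the length-normalized aggregate frequency of the windows."""
    ws = windows(seq, k)
    avg = sum(dictionary[w] for w in ws) / len(ws)  # unseen windows count as 0
    return 1.0 / (avg + 1.0)  # +1 is an assumption to keep the score finite


normal = ["login pwd mail web logout".split()] * 3
dictionary = build_dictionary(normal, k=2)
score_normal = anomaly_score("login pwd mail web logout".split(), dictionary, k=2)
score_odd = anomaly_score("login pwd login pwd logout".split(), dictionary, k=2)
```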


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?
Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
Significantly reduces the size of the model
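The frequency-ratio estimate and the back-off to a shorter suffix can be sketched as below. This is a flat n-gram table rather than an actual probabilistic suffix tree, and the score is the negative mean log-probability (a monotone stand-in for the inverse normalized probability). The names `min_count` and `floor` and the toy sessions are assumptions.

```python
import math
from collections import Counter


def train_counts(normal_seqs, k):
    """Count every n-gram of length 1..k+1 in the normal sequences."""
    f = Counter()
    for s in normal_seqs:
        for n in range(1, k + 2):
            for i in range(len(s) - n + 1):
                f[tuple(s[i:i + n])] += 1
    return f


def cond_prob(f, history, symbol, min_count=1, floor=1e-6):
    """P(symbol | history) = f(history + symbol) / f(history), shortening rare histories."""
    history = tuple(history)
    while history and f[history] < min_count:
        history = history[1:]  # drop the oldest symbol and retry
    if history:
        denom = f[history]
    else:
        denom = sum(c for g, c in f.items() if len(g) == 1)  # total symbol count
    return max(f[history + (symbol,)] / denom, floor)


def anomaly_score(f, seq, k):
    """Negative average log-probability per symbol; higher means more anomalous."""
    logp = sum(math.log(cond_prob(f, seq[max(0, i - k):i], seq[i]))
               for i in range(len(seq)))
    return -logp / len(seq)


normal = ["login pwd mail web logout".split()] * 3
f = train_counts(normal, k=2)
score_normal = anomaly_score(f, "login pwd mail web logout".split(), k=2)
score_odd = anomaly_score(f, "login pwd login pwd logout".split(), k=2)
```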


Finding Contextual Anomalies

Sometimes contextual information is available about data
Not used directly as a feature
Well understood, no anomalies in the context
Can reduce false positives and yield interesting anomalies

[Figure: example adapted from [30]. Axes: maximum daily temperature and number of fever cases; the clearest outlier and a conditional anomaly are marked.]

Contextual anomalies: anomalous with respect to a context
Context is defined using environmental variables
Spatial (latitude, longitude)
Graph context (edges, weights)
Temporal location
Domain specific (demographic, other)

How to incorporate context?
Reduce to traditional anomaly detection (subset on context)
Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

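Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$
$d_U$ environmental attributes
$d_V$ indicator attributes

[Figure: Gaussian components U1, U2, U3 over the environmental attributes (u1, u2) and V1, V2, V3 over the indicator attributes (v1, v2), with p(V1|U3) = 0.2, p(V2|U3) = 0.2, p(V3|U3) = 0.6]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$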
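Step 4 can be illustrated with one-dimensional Gaussians. In the sketch below the component parameters and the mapping table are set by hand purely for illustration; in the algorithm of [30] they would be learned from data. All names and numbers are assumptions.

```python
import math


def gauss(x, mean, std):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))


def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i); a low S suggests an anomaly."""
    return sum(
        gauss(u, *U[i]) * sum(gauss(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )


# Hand-set (hypothetical) parameters, each component given as (mean, std).
U = [(10.0, 2.0), (30.0, 2.0)]       # environmental components, e.g. temperature
V = [(5.0, 1.0), (50.0, 5.0)]        # indicator components, e.g. fever cases
mapping = [[0.9, 0.1], [0.1, 0.9]]   # mapping[i][j] = p(V_j | U_i)

expected = cad_score(10.0, 5.0, U, V, mapping)    # indicator fits its context
surprising = cad_score(10.0, 50.0, U, V, mapping)  # same context, unusual indicator
```

The value v = 50 is not rare on its own (it sits at the mean of V_2), but it scores low because the context u = 10 maps mostly to V_1, which is the point of conditioning on the environment.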


Finding Collective Anomalies

Find a collection of data points
Each point by itself is normal
The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples
1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

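Find anomalous groups in data [34]

Example: spatial clusters of galaxies
topics: red, green, emissive
words: continuous features

Flexible Genre Model (FGM)

For each group m:
1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim M(\pi)$
2. Draw a topic distribution for $y_m$: $\theta_m \in S_K$, $\theta_m \sim Dir(\alpha_{y_m})$
3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
4. For each point n in the group:
   1. Draw topic membership $z_{mn} \sim M(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

[Figure: plate diagram of the FGM over π, y_m, α, θ_m, z_mn, x_mn, β and η, with plates N, M, K and T]

Model Parameters
$M(\pi)$: multinomial
Genre: $Dir(\alpha_t)$
Topic generators $P(\cdot \mid \eta_k)$: GIW
Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian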


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)
Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$
Compute negative log likelihood w.r.t. $\alpha_t$
Rationale: an anomalous group will be unlikely to be generated from any genre
Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
Confusion matrix
Traditional evaluation metrics
Class imbalance
ROC curve

Validation set does not exist
Use domain expertise to find TP and FP
FN is harder to estimate
Pseudo false negative estimation techniques [25]

                Predicted
                a     n
Actual    a     TP    FN
          n     FP    TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN} \qquad Rec\ (R) = \frac{TP}{TP + FN} \qquad Prec\ (P) = \frac{TP}{TP + FP} \qquad F = \frac{2 \cdot R \cdot P}{R + P}$$
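The four metrics follow directly from the confusion matrix. The sketch below uses the slide's labels "a" (anomalous) and "n" (normal); the function names and the ten-instance example are assumptions.

```python
def confusion(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN), treating `positive` as the anomalous class."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = len(actual) - tp - fn - fp
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure; empty denominators give 0.0."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f


# Hypothetical labels: 8 normal and 2 anomalous instances.
actual = list("nnnnnnnnaa")
predicted = list("nnnnnnnana")
acc, rec, prec, f = metrics(*confusion(actual, predicted))
```

With these labels the accuracy is 0.8 even though only half of the anomalies are found, which is the class-imbalance caveat noted on the slide.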


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
Use threshold δ on score (scale issues [21])
Take top x as anomalies

ROC curve by varying x or δ

Quality of output
Does the output "suggest" x or δ?
Which output is better?


Unifying Scores

Different methods assign scores in different ranges
kNN-based scores: [0, 1]
LOF scores: [1, 6]
ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal and anomalous instances

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
Regularity ⇒ $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization

1. $N(o) = S(o) / S_{max}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
   Suited for high-dimensional data
3. $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma scaling)
   where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function
   Suited for low-dimensional data
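A small sketch of baseline regularization, log-inverse regularization and Gaussian scaling, using only the standard library. The LOF-like example scores are made up, the mean and standard deviation are estimated from the scores themselves, and the code assumes strictly positive scores for the logarithm and a non-zero standard deviation.

```python
import math
import statistics


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base_S), e.g. base 1 for LOF."""
    return [max(0.0, s - base) for s in scores]


def regularize_log_inverse(scores):
    """R(o) = -log(S(o) / S_max), for methods where a low score means anomalous."""
    s_max = max(scores)
    return [-math.log(s / s_max) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]


# Hypothetical LOF-like scores: five inliers near 1 and one clear outlier.
lof = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]
normalized = gaussian_scaling(regularize_baseline(lof, base=1.0))
```

After scaling, the inliers sit at 0 and the outlier is close to 1, so the output can be read roughly as a degree of outlierness rather than as a method-specific score.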


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior
Several generators available
Multivariate continuous data [27]
Multivariate categorical data [5]
Discrete sequences using HMM [12]
Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point z = (x, y), where x and y are partitions of the feature space
Take a random sample D of the entire data set
Let z′ = (x′, y′) ∈ D such that the distance between y and y′ is maximum
Replace x with x′ and add z back to the data set

[Figure: points z and z′ in the (x, y) plane and the resulting injected anomaly]
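The perturbation step can be sketched as follows, with each record stored as a pair of tuples (x, y). The function name, the record layout, the Euclidean distance and the toy data are assumptions for illustration.

```python
import math
import random


def inject_anomaly(data, z, sample_size, rng):
    """Swap z's x-part for the x-part of the sampled record whose y-part is farthest from z's."""
    x, y = z
    sample = rng.sample(data, sample_size)            # random sample D
    x_far, _ = max(sample, key=lambda rec: math.dist(rec[1], y))
    return (x_far, y)                                 # perturbed record to add back


# Toy data (hypothetical): x and y parts move together.
data = [((i,), (float(i),)) for i in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data, data[0], sample_size=10, rng=rng)
```

The injected record pairs a y-part with an x-part that normally accompanies a very different y, so it breaks the relationship between the two partitions while each part stays individually plausible.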


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data
Define anomalies, define normal behavior
Identify requirements and constraints (online, real-time, limited resources)
What domain knowledge is available?
Feature identification
Defining normal and anomalous behavior
Tuning parameters
Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that correspond to a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network anomaly detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart, values by period. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: routers 1 to 4 with OD flows x1t, x2t, x3t, x4t and link measurement y1t]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., x1t)

Link measurement is the traffic that enters at a node during an interval (e.g., y1t)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from a single OD flow may not be noticeable; thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$: $\hat{Z} = P P^T Z = C Z$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$: $\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
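The five steps can be illustrated in plain Python for the special case k = 1, where the single principal component is found by power iteration instead of a full eigendecomposition. That simplification, the function names and the two-link toy traffic matrix are all assumptions made to keep the sketch dependency-free.

```python
import math


def mean_center(rows):
    """Subtract the per-column mean from every row."""
    d = len(rows[0])
    mu = [sum(r[j] for r in rows) / len(rows) for j in range(d)]
    return [[r[j] - mu[j] for j in range(d)] for r in rows]


def top_component(rows, iters=200):
    """Leading eigenvector of the covariance matrix of centered rows (power iteration)."""
    d = len(rows[0])
    cov = [[sum(r[a] * r[b] for r in rows) / len(rows) for b in range(d)]
           for a in range(d)]
    p = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        q = [sum(cov[a][b] * p[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in q))
        p = [x / norm for x in q]
    return p


def residual_norm(z, p):
    """Norm of (I - p p^T) z for a single unit-length component p."""
    proj = sum(a * b for a, b in zip(z, p))
    return math.sqrt(sum((a - proj * b) ** 2 for a, b in zip(z, p)))


# Hypothetical measurements on two links: normally link 2 carries twice link 1.
traffic = [[float(t), 2.0 * t] for t in range(1, 11)] + [[5.0, 20.0]]
centered = mean_center(traffic)
p = top_component(centered)
residuals = [residual_norm(z, p) for z in centered]
suspect = residuals.index(max(residuals))
```

The last row breaks the usual relationship between the two links, so most of it falls outside the normal subspace and its residual norm is the largest.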


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by thresholding the residual component

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address
» Network scan: dispersed distribution of destination addresses
» Most worms/viruses also induce some change in the distribution of certain features
» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure, source [22]: port scan anomaly. Traffic features change while traffic volume remains more or less the same. Annotations: "Port scan dwarfed in volume metrics…" and "But stands out in feature entropy, which also reveals its structure".]

We can use entropy to capture the variations in the traffic feature
Takes value 0 when the distribution is maximally concentrated
Takes value log2 N when the distribution is maximally dispersed
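The two boundary cases of the entropy measure are easy to check numerically. The sketch below computes the empirical Shannon entropy of one traffic feature; the function name and the example addresses are assumptions.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical Shannon entropy (base 2) of a list of feature values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical destination addresses seen in one time bin.
concentrated = ["10.0.0.1"] * 8                    # e.g. DoS toward one target
dispersed = [f"10.0.0.{i}" for i in range(8)]      # e.g. scan across 8 targets
h_low = entropy(concentrated)
h_high = entropy(dispersed)
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to log2 8 = 3.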


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP

Scan anomaly: a spike in traffic data from a dominant source IP

Flash crowd anomaly: again a spike in traffic data to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data
Transactions between different players in the system
Meta information about the individuals
An underlying graph structure

Challenges
Track and model human behavior
Anomalies caused by adaptive human adversaries
Massive data sizes

Insurance (auto, health [a]): Claimant, Provider, Payer
Telecommunications: Customer, Provider
Credit Cards: Customer, Supplier, Bank
Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     Example: user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives
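The two activity-monitoring steps can be sketched as follows. This is only an illustration under simple assumptions: the profile is a mean and standard deviation of past counts, and the deviation is an absolute z-score. The function names and the numbers are hypothetical and are not taken from [13].

```python
from statistics import mean, stdev

def build_profile(history):
    """history: past counts for one user (e.g. calls made each January)."""
    return {"mean": mean(history), "std": stdev(history)}

def deviation_score(profile, current):
    """Absolute z-score of the current count against the historical profile."""
    std = profile["std"] or 1.0  # guard against a constant history (assumption)
    return abs(current - profile["mean"]) / std

profile = build_profile([40, 42, 38, 41, 39])
print(deviation_score(profile, 41), deviation_score(profile, 120))
```

Scoring a new observation is a constant-time lookup, which matches the "fast" strength listed above; the cost lies in keeping one such profile per individual up to date.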


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space
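A small sketch of the first three egonet features, assuming an undirected graph stored as a dictionary from node pairs to weights (a representation chosen here for illustration). The principal eigenvalue is omitted to keep the example dependency-free.

```python
def egonet_features(edges, node):
    """edges: dict mapping frozenset({u, v}) -> weight (undirected graph).
    Returns (N_i, E_i, W_i) for the egonet of `node`."""
    neighbors = {v for e in edges if node in e for v in e if v != node}
    ego = neighbors | {node}
    inside = [e for e in edges if e <= ego]  # edges with both endpoints in the egonet
    return len(neighbors), len(inside), sum(edges[e] for e in inside)

# Hypothetical graphs: a star centred on "c" and a 4-clique.
star = {frozenset(("c", leaf)): 1.0 for leaf in ("l1", "l2", "l3", "l4")}
nodes = ["a", "b", "c", "d"]
clique = {frozenset((u, v)): 2.0 for i, u in enumerate(nodes) for v in nodes[i + 1:]}

print(egonet_features(star, "c"), egonet_features(clique, "a"))
```

The star has as many egonet edges as neighbors, while the clique has many more edges than neighbors, which is the contrast the feature pairs are meant to expose.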


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power-law relationship

E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

\[ E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2 \]

Anomaly score for node $i$ with respect to a pair of features ($y = Cx^{\theta}$):

\[ S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right) \]
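The score can be sketched in a few lines. The fit below uses ordinary least squares in log-log space, which is one simple way to obtain $C$ and $\theta$ and is an assumption of this example; the function names are illustrative.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return math.exp(my - theta * mx), theta

def outlier_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

C, theta = fit_power_law([1, 2, 4, 8], [2, 4, 8, 16])  # hypothetical (N_i, E_i) pairs
print(C, theta, outlier_score(4, 40, C, theta))
```

A point that lies on the fitted line scores zero, and the score grows with both the ratio and the absolute distance from the expected value.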


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (beneficiaries vs. providers)
  Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the “relevance” of all other nodes in $V_1$ to $a$
  RelevanceScore$(a, b) \propto$ number of times a “random walk” from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$
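A rough sketch of the normality score, under two assumptions that the slide does not fix: relevance is computed by a random walk with restart (restart probability 0.15), and the similarity between two nodes of $S_t$ is taken to be the relevance of one from the other. The graph and all names are hypothetical.

```python
from collections import defaultdict

def relevance(edges, source, restart=0.15, iters=200):
    """Random walk with restart from `source`; returns node -> visiting probability."""
    adj = defaultdict(set)
    for a, t in edges:
        adj[a].add(t)
        adj[t].add(a)
    p = {n: 0.0 for n in adj}
    p[source] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in p.items():
            for m in adj[n]:
                nxt[m] += (1 - restart) * mass / len(adj[n])
        nxt[source] += restart  # jump back to the query node
        p = nxt
    return p

def normality(edges, t):
    """Mean off-diagonal relevance among the V1 neighbours of t."""
    s_t = sorted(a for a, tt in edges if tt == t)
    rel = {a: relevance(edges, a) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs) if pairs else 0.0

# Two communities (a1, a2 with t1, t2; a3, a4 with t3, t4) and a bridging node t5.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"), ("a2", "t2"),
         ("a3", "t3"), ("a3", "t4"), ("a4", "t3"), ("a4", "t4"),
         ("a1", "t5"), ("a3", "t5")]
print(normality(edges, "t1"), normality(edges, "t5"))
```

The bridging node t5 links nodes from different communities, so its neighbours are less relevant to each other and its normality score is lower than that of t1.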


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department visits
  Grocery data (example [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ flu or SARS?
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What’s Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant difference)
   Repeat from Step 2
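Step 4 can be illustrated for a single rule with a 2x2 contingency table (rows: current vs. baseline data; columns: records matching vs. not matching the rule). The sketch below is a plain $\chi^2$ independence test with one degree of freedom and no continuity correction; the counts are hypothetical and the function name is illustrative.

```python
import math

def chi2_pvalue_2x2(a, b, c, d):
    """p-value of the chi-square independence test (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

# Rule matches 10% of both data sets vs. 40% of current and 10% of baseline.
print(chi2_pvalue_2x2(10, 90, 100, 900), chi2_pvalue_2x2(40, 60, 100, 900))
```

A rule whose match rate is the same in both data sets gets a p-value of 1, while a rule that is much more frequent in the current data gets a p-value close to 0.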


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science questions
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on “anomaly” time series
  Difference from a “base period” (too simplistic?)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

\[ v'_i = v_i - b_i \]

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location $i$
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter “reference period”
    30 year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
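Two of the choices for $b_i$ (monthly means and monthly z-scores) can be sketched as follows, assuming a monthly series for one location that starts in January and covers at least one full year. The data and the function name are hypothetical.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, zscore=False):
    """values: monthly series for one location, starting in January.
    Subtracts each calendar month's mean; optionally divides by its std."""
    by_month = [values[m::12] for m in range(12)]
    base = [mean(v) for v in by_month]
    # A zero standard deviation falls back to 1.0 (assumption for constant months).
    scale = [pstdev(v) or 1.0 for v in by_month] if zscore else [1.0] * 12
    return [(v - base[i % 12]) / scale[i % 12] for i, v in enumerate(values)]

# Two hypothetical years: a seasonal cycle, with the second year 2 units warmer.
series = [m for m in range(12)] + [m + 2 for m in range(12)]
print(monthly_anomalies(series))
```

After subtracting the monthly base, the seasonal cycle disappears and only the year-to-year shift remains: -1 for every month of the first year and +1 for every month of the second.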


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy Dr. Arindam Banerjee
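The two-step approach can be sketched without any library. The example assumes precipitation anomalies stored per (time, row, column) cell and uses 6-connectivity (face neighbours in space and time), which is one possible choice and not necessarily the connectivity used for the figure. All names and values are hypothetical.

```python
from collections import deque

def drought_cells(precip, threshold):
    """precip: dict (t, y, x) -> precipitation anomaly; keep cells below threshold."""
    return {cell for cell, v in precip.items() if v < threshold}

def connected_components(cells):
    """Group flagged (t, y, x) cells into components, largest first."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for start in cells:
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:  # breadth-first search over neighbouring flagged cells
            t, y, x = queue.popleft()
            comp.append((t, y, x))
            for dt, dy, dx in steps:
                nb = (t + dt, y + dy, x + dx)
                if nb in cells and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    return sorted(components, key=len, reverse=True)

precip = {(0, 0, 0): -2.0, (1, 0, 0): -2.0, (2, 0, 0): -2.0,  # persists for 3 steps
          (0, 5, 5): -2.0,                                    # isolated dry cell
          (0, 1, 1): 0.5}                                     # not dry
components = connected_components(drought_cells(precip, threshold=-1.0))
print([len(c) for c in components])
```

Keeping only the large components filters out isolated dry cells and retains anomalies that persist across space and time.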


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and its distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD ’05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC ’08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Addressing Scalability


Using GPUs for Anomaly Detection

Well suited for data parallel algorithms
  Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  Utilization of device memory
  Minimize CPU⇔GPU transfer of data
  Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

Figure: CUDA execution model (the host launches a kernel on the device as a grid of thread blocks, Block(0,0) to Block(1,2), each block holding threads Thread(0,0) to Thread(2,3)) and memory hierarchy (host memory on the host; on the device, streaming multiprocessors SM 0 to SM 2 with per-processor registers for SP 0 to SP 7 and shared memory, plus global, texture, and constant memory)


Implementing DB(k, N) on GPUs

Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory
  Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  Can utilize this time to load the next chunk of data from host to device memory

Writes results to the corresponding output block

Sorting can be done efficiently in CUDA [29]

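Figure: the $n \times d$ data matrix, its transpose (T), and the resulting $n \times n$ pairwise distance matrix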
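For reference, the serial $O(n^2)$ computation that the GPU version parallelizes can be sketched as below. This is a plain CPU illustration, not the CUDA implementation described on the slide; the function name and data are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import heapq
import math

def top_n_outliers(points, k, n):
    """Indices of the n points whose distance to their k-th nearest
    neighbour is largest (serial version: all pairwise distances)."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))  # distance to the k-th nearest neighbour
    return [i for _, i in heapq.nlargest(n, scores)]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(top_n_outliers(points, k=2, n=1))
```

The inner distance computations are independent of each other, which is why the pairwise step maps naturally onto one thread per pair of instances.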


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed): Fraud Detection, Cyber Networks

Time Series: Sensor Networks, Healthcare

Discrete Sequences: Genomic, System Calls

Spatial: GIS, Image analysis

Spatio-temporal: Remote sensing, Climate

Graphs: Social networks, Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth              urban     poisonous
convex      smooth              grasses   edible
bell        smooth              meadows   edible
convex      scaly               urban     poisonous
convex      smooth              grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]
  Binarize data
  Learn rules ($X \Rightarrow Y$)
    Choose high confidence rules ($P(Y|X)$)
  For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
    $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]
  Learn Bayesian network structure and parameters
  Compute $P(Z)$ for test data record $Z$
  Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]
  Use a similarity measure ($S(X_1, X_2)$)
  Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

\[ r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} \]

$A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous

For a test record $t$:
  For each mutually exclusive pair of attribute sets $A$, $B$, compute $r(a_t, b_t)$
  Score $t$ based on all r-values
    Assign the minimum r-value as the score
    Take the product of all r-values

Need to compare an exponential number of pairs of subsets
  Only consider subsets up to size $k$
  Ignore subsets with frequency less than a threshold $\alpha$
  Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

\[ \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)} \]

  $C(a_t)$: number of training instances with $A = a_t$
  $N$: total number of training instances

Laplace Smoothing

\[ E(p) = \frac{C(p) + 1}{N + 2} \]

\[ r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)} \]

Speedup Tricks
  Replace rare attribute values with a generic attribute (reduce arity)
  Use an efficient data structure to query for counts (AD Trees [26])
    AD Trees work faster for low arity attributes
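A minimal sketch of the Laplace-smoothed r-value for one pair of attribute sets. Counts are obtained by scanning the training list directly (no AD Tree), and the tiny two-attribute data set is hypothetical; the function names are illustrative.

```python
def laplace(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def r_value(train, a_idx, b_idx, record):
    """train: list of tuples; a_idx, b_idx: disjoint tuples of attribute positions."""
    def project(row, idx):
        return tuple(row[i] for i in idx)

    a, b = project(record, a_idx), project(record, b_idx)
    n = len(train)
    c_a = sum(project(r, a_idx) == a for r in train)
    c_b = sum(project(r, b_idx) == b for r in train)
    c_ab = sum(project(r, a_idx) == a and project(r, b_idx) == b for r in train)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))

# Hypothetical training data: (cap-shape, type).
train = [("convex", "edible")] * 10 + [("bell", "poisonous")] * 10
print(r_value(train, (0,), (1,), ("convex", "poisonous")),
      r_value(train, (0,), (1,), ("convex", "edible")))
```

A combination never seen together in training gets an r-value well below 1, while a common combination gets a value above 1; smoothing keeps the unseen case from collapsing to exactly zero.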


Finding Anomalies in Discrete Sequences

Many problem formulations
  1. Anomalous symbols in a sequence
  2. Anomalous subsequence in a sequence
  3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

Example sequences:
  login pwd mail ssh mail web logout
  login pwd mail web web web web logout
  login pwd mail ssh mail web web logout
  login pwd web mail ssh web mail logout
  login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
  Plug into a distance/density/clustering based method

Simplest: Hamming distance

\[ h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i) \]

Issues: unequal lengths, misalignment

Normalized length of the Longest Common Subsequence

\[ D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}} \]

  Standard dynamic programming method is slow
  Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
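The normalized LCS distance can be sketched with the standard dynamic programming recurrence (the slow method, not the Hunt-Szymanski variant). Function names are illustrative, and both sequences are assumed to be non-empty.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a| |b|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

normal = "login pwd mail ssh mail web logout".split()
odd = "login pwd login pwd login pwd logout".split()
print(nlcs_distance(normal, normal), nlcs_distance(normal, odd))
```

Unlike the Hamming distance, this measure handles sequences of unequal length and tolerates misalignment, at the cost of a quadratic computation per pair.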


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing
  For each window, find its frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low frequency windows in the normal dictionary
  Rewards high frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
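A minimal sketch of the training and testing steps, assuming the aggregate is the average window frequency and adding 1 before inverting so that unseen windows do not cause a division by zero (both are choices made for this example). Sequences are assumed to have at least $k$ symbols.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Counts of every length-k window seen in the normal sequences."""
    return Counter(w for s in normal_seqs for w in windows(s, k))

def anomaly_score(seq, dictionary, k):
    """Inverse of the average normal-dictionary frequency of the test windows."""
    ws = windows(seq, k)
    avg = sum(dictionary[w] for w in ws) / len(ws)
    return 1.0 / (avg + 1.0)  # +1 avoids division by zero (assumption)

normal_seqs = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout")]
dictionary = build_dictionary(normal_seqs, k=3)
test_seq = "login pwd login pwd login pwd logout".split()
print(anomaly_score(normal_seqs[0], dictionary, 3), anomaly_score(test_seq, dictionary, 3))
```

None of the windows of the repeated-login sequence appear in the normal dictionary, so it receives the maximum score of 1.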


Using Probabilistic Models

Probability of occurrence of sequence S

  $P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences

  $P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

  $P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

  f is the frequency, estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues
  What if the suffix occurs very infrequently in the normal data (or not at all)?
  Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

  Significantly reduces the size of the model
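A fixed-order version of the probabilistic model can be sketched as follows. Two choices here are assumptions of this sketch rather than the tutorial's method: additive smoothing stands in for the suffix-tree back-off of [32], and the score is the negative log-probability per symbol, which is a monotone substitute for the inverse of the normalized probability.

```python
import math
from collections import Counter

def train_counts(normal_sequences, k):
    """Count length-k contexts and the same contexts extended by the next symbol."""
    ctx, ext = Counter(), Counter()
    for seq in normal_sequences:
        for i in range(k, len(seq)):
            ctx[tuple(seq[i - k:i])] += 1
            ext[tuple(seq[i - k:i + 1])] += 1
    return ctx, ext

def anomaly_score(seq, ctx, ext, k, alphabet_size, smooth=1.0):
    """Negative log-probability per predicted symbol under the order-k model.

    Assumes len(seq) > k. Additive smoothing keeps unseen contexts from
    producing zero probabilities (an assumption of this sketch).
    """
    nll, n = 0.0, 0
    for i in range(k, len(seq)):
        num = ext[tuple(seq[i - k:i + 1])] + smooth
        den = ctx[tuple(seq[i - k:i])] + smooth * alphabet_size
        nll -= math.log(num / den)
        n += 1
    return nll / n
```

Trained on a strictly alternating sequence, the model gives "abab" a lower score than "aabb", because the transitions a-to-a and b-to-b never occur in the normal data.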

Finding Contextual Anomalies

Sometimes contextual information is available about the data
  Not used directly as a feature
  Well understood; no anomalies in the context
  Can reduce false positives and yield interesting anomalies

[Figure: example adapted from [30]. Axes: maximum daily temperature and number of fever cases. Marked points: "Clearest Outlier" and "Conditional Anomaly".]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

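Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$
  $d_U$ environmental attributes
  $d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ over the environmental attributes $(u_1, u_2)$ and $V_1, V_2, V_3$ over the indicator attributes $(v_1, v_2)$, with example mapping probabilities $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$.]

Algorithm [30]

  1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
  2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
  3. Learn a probabilistic mapping function $p(V_j|U_i)$
  4. Score a test instance $d = [u, v]$

  $S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\, p(V_j|U_i)$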
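The scoring step (step 4) can be illustrated with a toy model. Everything below is a hypothetical sketch: the components are one-dimensional Gaussians with hand-picked parameters instead of learned ones, and a low S is read here as "unlikely indicator values given the context".

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    U and V are lists of (mean, std) pairs; mapping[i][j] = p(V_j | U_i).
    """
    total = 0.0
    for i, (mu_u, sd_u) in enumerate(U):
        inner = sum(gauss_pdf(v, mu_v, sd_v) * mapping[i][j]
                    for j, (mu_v, sd_v) in enumerate(V))
        total += gauss_pdf(u, mu_u, sd_u) * inner
    return total

# Hypothetical toy model: cold days map mostly to many fever cases, hot days to few.
U = [(5.0, 3.0), (30.0, 3.0)]    # environmental attribute: temperature
V = [(10.0, 5.0), (80.0, 5.0)]   # indicator attribute: fever cases
mapping = [[0.1, 0.9],           # p(V_j | cold)
           [0.9, 0.1]]           # p(V_j | hot)
```

With these numbers, 80 fever cases on a 30-degree day gets a lower S than 80 cases on a 5-degree day, even though the value 80 is equally common overall.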


Finding Collective Anomalies

Find a collection of data points
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when the data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples
  1. Time series
  2. Image
  3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

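Find anomalous groups in data [34]

Example: spatial clusters of galaxies
  Topics: red, green, emissive
  Words: continuous features

Flexible Genre Model (FGM)

For each group m
  1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$
  2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$
  3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
  4. For each point n in the group
     1. Draw topic membership $z_{mn} \sim M(\theta_m)$
     2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with nodes $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates N, M, K, T.]

Model Parameters
  $M(\pi)$: Multinomial
  Genre: $Dir(\alpha_t)$
  Topic generators $P(\cdot \mid \eta_k)$: GIW
  Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian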


Inference and Testing for FGM

Inference and Learning Parameters
  Approximate inference of latent variables (Gibbs sampling)
  Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection
  Infer the topic distribution $\theta_m$
  Compute the negative log likelihood w.r.t. $\alpha_t$
  Rationale: an anomalous group will be unlikely to be generated from any genre
  Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal)

              Predicted a   Predicted n
  Actual a    TP            FN
  Actual n    FP            TN

  $Acc = \frac{TP + TN}{TP + FN + FP + TN}$

  $Rec\ (R) = \frac{TP}{TP + FN}$

  $Prec\ (P) = \frac{TP}{TP + FP}$

  $F = \frac{2 \cdot R \cdot P}{R + P}$
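The metrics above are simple to compute from the four confusion-matrix cells. The sketch below uses made-up counts to show why class imbalance makes accuracy misleading; returning 0.0 for an undefined ratio is a convention assumed here.

```python
def evaluate(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if tp + fn else 0.0     # undefined ratios reported as 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"accuracy": acc, "recall": rec, "precision": prec, "f": f}

# Hypothetical data set with 10 anomalies among 1000 instances.
detector = evaluate(tp=5, fn=5, fp=10, tn=980)
all_normal = evaluate(tp=0, fn=10, fp=0, tn=990)   # flags nothing at all
```

Here the detector that flags nothing has the higher accuracy (0.99 against 0.985), while its recall and F-measure are zero.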


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use a threshold δ on the score (scale issues [21])
  Take the top x as anomalies

ROC curve by varying x or δ

Quality of output
  Does the output "suggest" x or δ?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalous instances

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

  1. $R(o) = \max\{0,\ S(o) - base_S\}$
  2. $R(o) = S_{max} - S(o)$
  3. $R(o) = -\log\left(\frac{S(o)}{S_{max}}\right)$

Normalization

  1. $N(o) = \frac{S(o)}{S_{max}}$
  2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
     Suited for high dimensional data
  3. $N(o) = \max\left(0,\ \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
     where $cdf^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and P is the regularized Gamma function
     Suited for low dimensional data
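Linear scaling and Gaussian scaling can be compared with a few lines of standard-library Python. This is a sketch under assumptions: the mean and standard deviation are estimated from the scores themselves (population form), the scores are already regular, and they are not all identical.

```python
import math

def linear_scaling(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0)))) for s in scores]

# Hypothetical LOF-like scores: five normal points near 1 and one clear outlier.
scores = [1.0, 1.1, 0.9, 1.0, 1.2, 6.0]
linear = linear_scaling(scores)
gauss = gaussian_scaling(scores)
```

Linear scaling leaves the normal points at about 0.15 to 0.2, whereas Gaussian scaling maps every below-average score to 0 and keeps the outlier close to 1.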


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
  Use generative models for normal and anomalous behavior
  Several generators available
    Multivariate continuous data [27]
    Multivariate categorical data [5]
    Discrete sequences using HMM [12]
  Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
  Given a data point $z = (x, y)$, where x and y are partitions of the feature space
  Take a random sample D of the entire data set
  Let $z' = (x', y') \in D$ such that the distance between y and y' is maximum
  Replace x with x' and add z back to the data set

[Figure: points z and z' in the (x, y) plane, and the resulting injected anomaly.]
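The random-perturbation steps can be sketched as a single function. The data layout (a list of `(x, y)` pairs of float tuples), the squared Euclidean distance on y, and the function name are assumptions made for this illustration.

```python
import random

def inject_anomaly(data, z_index, sample_size, rng):
    """Return a perturbed copy of data[z_index] = (x, y).

    Picks z' = (x', y') from a random sample D so that y' is farthest
    from y, then pairs x' with the original y.
    """
    x, y = data[z_index]
    sample = rng.sample(data, sample_size)          # random sample D

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    x_far, _ = max(sample, key=lambda z: dist(z[1], y))
    return (x_far, y)                               # x replaced by x', y kept

# Hypothetical data in which x and y normally move together.
data = [((0.0,), (0.0,)), ((1.0,), (1.0,)), ((10.0,), (10.0,))]
injected = inject_anomaly(data, 0, 3, random.Random(0))
```

The injected point keeps plausible values in each partition separately but breaks the relationship between them, which is what makes it a useful test case.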


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a set of predefined patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature, by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836.]


Network Topology

A typical network

[Figure: a router with attached nodes 1 to 4, OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and a link measurement $y_{1t}$.]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach will not work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top k principal components (those with the highest eigenvalues) as the columns of matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

  $\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

  $\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
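The five steps can be sketched with NumPy. The layout (one row per time step, one column per link), the centering step, and the function name are assumptions of this sketch; the threshold that decides when a residual is "large" is left to the caller.

```python
import numpy as np

def residual_norms(Z, k):
    """Squared norm of each row of Z after removing the top-k principal subspace."""
    Zc = Z - Z.mean(axis=0)                         # center each column
    cov = np.cov(Zc, rowvar=False)                  # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                             # Step 2: top-k components
    normal_part = Zc @ P @ P.T                      # Step 3: P P^T applied to each row
    residual = Zc - normal_part                     # Step 4: (I - P P^T) applied to each row
    return np.sum(residual ** 2, axis=1)            # Step 5: compare against a threshold

# Hypothetical traffic: three links driven by one common pattern plus small noise.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
Z = np.outer(t, [1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(200, 3))
Z[50] += np.array([4.0, -2.0, 0.0])                 # anomaly outside the common pattern
norms = residual_norms(Z, k=1)
```

The perturbed time step has by far the largest residual norm because the perturbation lies outside the direction that explains normal traffic.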


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure: port scan anomaly; the traffic feature distribution changes while the traffic volume remains more or less the same. Annotations: "Port scan dwarfed in volume metrics..." and "But stands out in feature entropy, which also reveals its structure". Figure source: [22]]

We can use entropy to capture the variations in the traffic feature
  Takes value 0 when the distribution is maximally concentrated
  Takes value $\log_2 N$ when the distribution is maximally dispersed
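The two boundary values follow directly from the definition of empirical entropy, $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of packets carrying the i-th distinct feature value. A minimal sketch with made-up port numbers:

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical entropy (in bits) of a traffic feature, e.g. destination ports."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())

concentrated = feature_entropy([80] * 8)            # every packet to the same port
dispersed = feature_entropy([1, 2, 3, 4, 5, 6, 7, 8])  # a scan over 8 distinct ports
```

Both samples contain eight packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$ bits.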


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Domains and players
  Insurance (auto, health [a]): Claimant, Provider, Payer
  Telecommunications: Customer, Provider
  Credit cards: Customer, Supplier, Bank
  Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     Example: user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node i
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  Identify relevant feature pairs and power law relationships
  E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

  $E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$)

  $S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
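The score is easy to evaluate once C and θ have been fitted. The sketch below assumes both are given and that y and the predicted value are positive; the function name and the example numbers are hypothetical.

```python
import math

def power_law_score(y, x, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1.0)

# Hypothetical fitted law E = N^1.5 and three egonets with N = 10 neighbors.
on_the_line = power_law_score(y=10 ** 1.5, x=10, C=1.0, theta=1.5)
near_line = power_law_score(y=30, x=10, C=1.0, theta=1.5)
star_like = power_law_score(y=10, x=10, C=1.0, theta=1.5)   # far fewer edges than expected
```

The score is 0 for a node exactly on the fitted line and grows with both the ratio and the absolute distance, so the star-like egonet scores well above the one near the line.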


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (beneficiaries vs. providers)
  Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges E.]


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department visits
  Grocery data (example: [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ flu or SARS?
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value; repeat Step 2


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science questions
  Identify natural and anthropogenic disasters
  Identify long time-scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, airborne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i

  $v'_i = v_i - b_i$

  where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
    30-year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
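One of the listed choices for the base, the monthly mean, can be sketched as follows for a single location. The input layout (parallel lists of values and calendar months) and the function name are assumptions of this illustration.

```python
from collections import defaultdict

def monthly_anomalies(values, months):
    """v' = v - b, with b the mean of all values in the same calendar month."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, m in zip(values, months):
        sums[m] += v
        counts[m] += 1
    base = {m: sums[m] / counts[m] for m in sums}   # one base value per month
    return [v - base[m] for v, m in zip(values, months)]

# Hypothetical temperatures for three Januaries (month 1) and three Julys (month 7).
values = [0.0, 20.0, 2.0, 20.0, 4.0, 26.0]
months = [1, 7, 1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

The 20-degree gap between January and July disappears from the anomaly series, which is the sense in which this base removes seasonality; the last July value remains visible as a +4 anomaly.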


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     Matlab: bwlabel, bwlabeln

Follow-up science question: explain the cause

Figure: video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
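The two-step approach can be sketched in pure Python on a single two-dimensional grid. This is a simplified stand-in for the Matlab labeling routines named above: it handles space only (not time), uses 4-connectivity, and the threshold and minimum size are hypothetical parameters.

```python
from collections import deque

def large_components(grid, threshold, min_size):
    """Group cells below `threshold` into 4-connected components of >= min_size cells."""
    rows, cols = len(grid), len(grid[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or grid[r][c] >= threshold:
                continue                      # step 1: keep only anomalous cells
            queue, comp = deque([(r, c)]), []
            seen[r][c] = True
            while queue:                      # step 2: breadth-first flood fill
                i, j = queue.popleft()
                comp.append((i, j))
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < rows and 0 <= nj < cols
                            and not seen[ni][nj] and grid[ni][nj] < threshold):
                        seen[ni][nj] = True
                        queue.append((ni, nj))
            if len(comp) >= min_size:
                components.append(comp)
    return components

# Hypothetical precipitation anomalies; values below -1 count as dry.
grid = [[-2, -2, 0, 0],
        [-2, 0, 0, -2],
        [0, 0, 0, 0]]
droughts = large_components(grid, threshold=-1, min_size=2)
```

The isolated dry cell in the second row is discarded by the size filter, leaving one three-cell region. Extending the neighbor list with a time index would give the space-time version.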


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Using GPUs for Anomaly Detection

Well suited for data parallel algorithms
  Using CUDA - Compute Unified Device Architecture (Nvidia)

Need to re-engineer existing algorithms
  Utilization of device memory
  Minimize CPU ⇔ GPU transfer of data
  Keep threads homogeneous

Most model-based algorithms are naturally set up for the testing phase

Model building needs careful redesign

What about unsupervised algorithms?

[Figure: CUDA programming model (host, kernel, grid of thread blocks, threads within a block) and memory hierarchy (host memory; per-SM registers and shared memory; global, texture, and constant memory on the device)]


Implementing DB(k, N) on GPUs

Return top N data instances whose distances to the k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)

Involves computing pairwise distances

Load block i and block j to shared memory
  Data layout in memory should be optimized

Each thread computes the distance between a pair of instances
  Can utilize this time to load the next chunk of data from host to device memory

Writes results to corresponding output block

Sorting can be done efficiently in CUDA [29]
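As a point of reference for the GPU design, here is a minimal serial sketch of DB(k, N) in Python. It is the $O(n^2)$ baseline, not the CUDA implementation: the nested distance computation is the part that the slide tiles into blocks. Euclidean distance and the function names `kth_nn_distance` and `db_k_n` are assumptions of this sketch.

```python
import heapq
import math


def kth_nn_distance(data, i, k):
    """Distance from data[i] to its k-th nearest neighbour (Euclidean)."""
    dists = [math.dist(data[i], data[j]) for j in range(len(data)) if j != i]
    return heapq.nsmallest(k, dists)[-1]


def db_k_n(data, k, n_top):
    """Indices of the n_top instances with the largest k-th NN distance."""
    scores = [(kth_nn_distance(data, i, k), i) for i in range(len(data))]
    return [i for _, i in heapq.nlargest(n_top, scores)]
```

For example, with four points on a unit square and one point far away, `db_k_n(points, k=2, n_top=1)` returns the index of the far point.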


Moving Beyond Multi-dimensional Record Data

Categorical (Mixed)

Fraud Detection

Cyber Networks

Time Series

Sensor Networks

Healthcare

Discrete Sequences

Genomic

System Calls

Spatial

GIS

Image analysis

Spatio-temporal

Remote sensing

Climate

Graphs

Social networks

Epidemiology


Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data

Learn rules ($X \Rightarrow Y$)
  Choose high confidence rules ($P(Y \mid X)$)

For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y \mid X)$
  Y is not observed when X is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters

Compute $P(Z)$ for test data record Z

Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)

Apply distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \dfrac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$, with $A \cap B = \emptyset$

Assumption: If $r(a_t, b_t)$ is low and is observed in test record t, then t is anomalous

For a test record t
  For each mutually exclusive pair of attribute sets A, B, compute $r(a_t, b_t)$
  Score t based on all r-values
    Assign minimum r-value as score
    Take product of all r-values

Need to compare exponential pairs of subsets
  Only consider subsets up to size k
  Ignore subsets with frequency less than a threshold α
  Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\dfrac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \dfrac{C(a_t, b_t)}{N} \times \dfrac{N}{C(a_t)} \times \dfrac{N}{C(b_t)}$

$C(a_t)$: Number of training instances with $A = a_t$

N: Total number of training instances

Laplace Smoothing

$E(p) = \dfrac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \dfrac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low arity attributes
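The Laplace-smoothed r-value can be computed directly from counts. The sketch below is a plain linear scan over the training rows, with none of the speedup tricks (no AD Tree, no arity reduction); the function names and the choice to pass attribute sets as index tuples are assumptions made for illustration.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(train, a_idx, b_idx, record):
    """r(a_t, b_t) for disjoint attribute index sets a_idx and b_idx."""
    n = len(train)
    a_val = tuple(record[i] for i in a_idx)
    b_val = tuple(record[i] for i in b_idx)
    c_a = c_b = c_ab = 0
    for row in train:
        match_a = tuple(row[i] for i in a_idx) == a_val
        match_b = tuple(row[i] for i in b_idx) == b_val
        c_a += match_a
        c_b += match_b
        c_ab += match_a and match_b
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))
```

On the five mushroom rows shown earlier, comparing habitat against type gives r = 1.75 for a record with (grasses, edible) and r = 7/12 for a record with (urban, edible), so the unseen combination scores lower.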


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd web mail ssh web mail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
  Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: Unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$D(A, B) = 1 - \dfrac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$

Standard dynamic programming method is slow
  Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
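Both distances are short to write down. The sketch below uses the standard quadratic dynamic program for the LCS length (the slow method mentioned above, not Hunt-Szymanski) and assumes non-empty sequences for the normalized distance; the function names are illustrative.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a| * |b|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| * |B|); inputs must be non-empty."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))
```

Unlike `hamming`, `nlcs_distance` accepts sequences of different lengths, which addresses the first issue listed above.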


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence of length n ($n - k + 1$ windows)

Training (creating a normal dictionary): Store all unique windows in all normal sequences and their counts

Testing
  For each window, find the frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low frequency windows in the normal dictionary
  Rewards high frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
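A minimal sketch of the dictionary-based variant follows. The slide does not fix how frequencies are aggregated, so this version assumes the mean relative frequency of the test windows and adds a small constant `eps` to avoid division by zero; both choices, and the function names, are assumptions.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_sequences, k):
    """Count every window seen in the normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k, eps=1e-9):
    """Inverse of the mean relative frequency of the test windows."""
    wins = windows(seq, k)
    if not wins:
        raise ValueError("sequence is shorter than the window size")
    total = sum(dictionary.values())
    mean_freq = sum(dictionary[w] / total for w in wins) / len(wins)
    return 1.0 / (mean_freq + eps)
```

A sequence whose windows never occur in the dictionary gets the largest possible score, `1 / eps`.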


Using Probabilistic Models

Probability of occurrence of sequence S

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \dfrac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

f is the frequency of a subsequence, estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
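The fixed-order version of this model can be sketched as follows. It is not a Probabilistic Suffix Tree: when a context or transition is unseen it simply falls back to a small floor probability, which is an assumption of this sketch. The score is the negative mean log-probability, which increases as the normalized probability decreases.

```python
import math
from collections import Counter


def train(normal_sequences, k):
    """Count length-k contexts and the symbols that follow them."""
    joint, context = Counter(), Counter()
    for seq in normal_sequences:
        for i in range(k, len(seq)):
            ctx = tuple(seq[i - k:i])
            context[ctx] += 1
            joint[ctx + (seq[i],)] += 1
    return joint, context


def anomaly_score(seq, joint, context, k, floor=1e-6):
    """Negative mean log P(S_i | previous k symbols); higher is more anomalous."""
    if len(seq) <= k:
        raise ValueError("sequence is too short for the chosen order k")
    log_p = 0.0
    for i in range(k, len(seq)):
        ctx = tuple(seq[i - k:i])
        seen = context[ctx]
        p = joint[ctx + (seq[i],)] / seen if seen else 0.0
        log_p += math.log(max(p, floor))
    return -log_p / (len(seq) - k)
```

Replacing the floor with a back-off to shorter suffixes is the step that leads toward the suffix-tree approach of [32].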


Finding Contextual Anomalies

Sometimes contextual information is available about data
  Not used directly as a feature
  Are well understood, no anomalies in the context
  Can reduce false positives and yield interesting anomalies

Example adapted from [30]

[Figure: Number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly]

Contextual anomalies - Anomalous with respect to a context

Context is defined using environmental variables
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context?
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

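Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$

$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$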
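Step 4 can be illustrated with one-dimensional Gaussians. The sketch below only evaluates the score S; it assumes the components and the mapping $p(V_j \mid U_i)$ have already been learned (steps 1 to 3 are not shown), and every parameter passed in is a hypothetical value. A low S means the indicator value is unlikely given the environment.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def cad_score(u, v, env_components, ind_components, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    env_components and ind_components are lists of (mean, std) pairs;
    mapping[i][j] holds p(V_j | U_i).
    """
    score = 0.0
    for i, (mu_u, sd_u) in enumerate(env_components):
        inner = sum(
            gauss_pdf(v, mu_v, sd_v) * mapping[i][j]
            for j, (mu_v, sd_v) in enumerate(ind_components)
        )
        score += gauss_pdf(u, mu_u, sd_u) * inner
    return score
```

With made-up components for temperature (cold, hot) and fever cases (few, many), and a mapping in which hot days mostly produce few cases, a hot day with many cases receives a much lower score than a hot day with few cases.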


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when data has inherent structure, and

When the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

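Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
  topics: red, green, emissive
  words: continuous features

Flexible Genre Model (FGM)

For each group m

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim M(\pi)$
2. Draw a topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point n in the group
   1. Draw topic membership $z_{mn} \sim M(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with nodes π, $y_m$, α, $\theta_m$, $z_{mn}$, $x_{mn}$, β, η and plates N, M, K, T]

Model Parameters
  $M(\pi)$: Multinomial
  Genre: $\mathrm{Dir}(\alpha_t)$
  Topic generators $P(\cdot \mid \eta_k)$: GIW
  Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian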


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (single step Monte Carlo EM)

[Figure: example groups plotted against Topic 1, Topic 2, and Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood w.r.t. $\alpha_t$

Rationale: An anomalous group will be unlikely to be generated from any genre

Geometric interpretation: Mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC Curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

                Predicted
                a      n
Actual    a     TP     FN
          n     FP     TN

$Acc = \dfrac{TP + TN}{\sum}$

$Rec\ (R) = \dfrac{TP}{TP + FN}$

$Prec\ (P) = \dfrac{TP}{TP + FP}$

$F = \dfrac{2 \cdot R \cdot P}{R + P}$
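The metrics above are easy to compute from the four confusion-matrix counts. The sketch below also shows why accuracy is misleading under class imbalance; the label values "a" and "n" follow the table, while the function names and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
def confusion_counts(actual, predicted, anomaly="a"):
    """Return (TP, FN, FP, TN) with the anomaly class as positive."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == anomaly and p == anomaly:
            tp += 1
        elif a == anomaly:
            fn += 1
        elif p == anomaly:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from the counts."""
    total = tp + fn + fp + tn
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"accuracy": (tp + tn) / total, "recall": rec, "precision": prec, "f": f}
```

With 2 anomalies among 100 instances, a detector with TP = 1, FN = 1, FP = 2, TN = 96 reaches 97% accuracy while recall is 0.5, precision is 1/3, and F is 0.4.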


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use threshold δ on score (scale issues [21])
  Take top x as anomalies

ROC curve by varying x or δ

Quality of output
  Does the output "suggest" x or δ?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalies

Desired scaling: Stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  Regularity $\Rightarrow S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \dfrac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \dfrac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\dfrac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)

Suited for high dimensional data

3. $N(o) = \max\left(0,\ \dfrac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)

Where $cdf^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and P is the regularized Gamma function. Suited for low dimensional data
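A short sketch of two regularizations and Gaussian scaling follows. It assumes the mean and standard deviation are estimated from the score list itself, that the scores are not all identical (otherwise the standard deviation is zero), and that scores passed to the log-inverse regularization are strictly positive. Gamma scaling is omitted because it needs a regularized Gamma function that the Python standard library does not provide.

```python
import math
import statistics


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base), e.g. base = 1 for LOF-style scores."""
    return [max(0.0, s - base) for s in scores]


def regularize_log_inverse(scores):
    """R(o) = -log(S(o) / S_max), for scores where low means anomalous."""
    s_max = max(scores)
    return [-math.log(s / s_max) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

Applied to LOF-like scores clustered around 1 with one score of 6, baseline regularization followed by Gaussian scaling maps the typical scores to 0 and the extreme score close to 1.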


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawbacks: Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = (x, y)$; x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = (x', y') \in D$ such that the distance between y and y′ is maximum

Replace x with x′ and add z back to the data set

[Figure: points z and z′ in the x-y feature space and the resulting injected anomaly]
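The perturbation procedure can be sketched as below. Records are tuples, the two partitions are given as lists of column indices, and Euclidean distance is used on the y-part; these representation choices and the function name `inject_anomaly` are assumptions of this sketch. The caller supplies a `random.Random` instance so that runs are reproducible.

```python
import math
import random


def inject_anomaly(data, x_dims, y_dims, sample_size, rng):
    """Return a perturbed copy of a random record z = (x, y).

    The x-part is replaced by the x-part of the sampled record z'
    whose y-part is farthest from y.
    """
    z = list(rng.choice(data))
    sample = rng.sample(data, sample_size)
    y = [z[i] for i in y_dims]
    z_prime = max(sample, key=lambda r: math.dist(y, [r[i] for i in y_dims]))
    for i in x_dims:
        z[i] = z_prime[i]
    return tuple(z)
```

The returned record keeps its original y-values but pairs them with x-values taken from a very different part of the data, which is what makes it a plausible contextual anomaly.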


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match predefined patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart, vertical axis from 0 to 3,000,000. Values by period: 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: routers 1 to 4 carrying OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$, with link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
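In the formulation of [22], this relationship is usually written $y = Ax$, where entry $A_{ij}$ is 1 if OD flow j traverses link i and 0 otherwise. The toy sketch below uses a hypothetical routing matrix with three links and four OD flows; the numbers are made up for illustration.

```python
def link_loads(routing, od_flows):
    """Compute y = A x: the traffic on each link as a sum of OD flows."""
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing]


# Hypothetical routing matrix: rows are links, columns are OD flows.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
x = [100, 200, 300, 400]
y = link_loads(A, x)
```

Because each link aggregates several OD flows, a change in one flow shows up on a link only as part of a larger total.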


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, volume arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set.

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$.

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = PP^{T}Z = CZ$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - PP^{T})Z = \tilde{C}Z$

Step 5: If the norm of a vector in $\tilde{Z}$ is large, then it is an "anomaly".
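
The five steps can be sketched in plain Python for the special case $k = 1$. This is an illustrative sketch, not the implementation from [22]: power iteration stands in for a full eigendecomposition, and the toy measurements and function names are assumptions.

```python
import math

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def covariance(rows, means):
    """Sample covariance matrix of the rows about the given column means."""
    n, d = len(rows), len(means)
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200):
    """Power iteration: unit eigenvector with the largest eigenvalue (k = 1)."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def residual_norm(z, means, p):
    """Norm of (I - p p^T)(z - mean): the part of z outside the normal subspace."""
    zc = [a - m for a, m in zip(z, means)]
    proj = sum(a * b for a, b in zip(zc, p))
    return math.sqrt(sum((a - proj * b) ** 2 for a, b in zip(zc, p)))

# Toy "normal" link measurements that all lie along one direction.
normal = [[t, 2 * t, 3 * t] for t in range(1, 9)]
means = column_means(normal)
p = top_component(covariance(normal, means))

usual = residual_norm([4, 8, 12], means, p)    # follows the normal pattern
spike = residual_norm([5, 10, 25], means, p)   # third link deviates
```

Here `usual` is numerically zero while `spike` is large, so a threshold on the residual norm separates the two. For $k > 1$ one would keep several eigenvectors as the columns of $P$.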


Subspace Analysis Results

• Note that during an anomaly the normal component does not change much, while the residual component changes a great deal.

• Thus, anomalies can be detected by setting a threshold on the residual component.

Figure source [22]


Discussion: Typical Characteristics of Anomalies

• Most anomalies induce a change in the distributional aspects of packet header fields (called features).
• The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

Figure source [22]

Port scan anomaly: the traffic feature changes, but the traffic volume remains more or less the same.
• The port scan is dwarfed in volume metrics...
• But it stands out in feature entropy, which also reveals its structure.

We can use entropy to capture the variations in the traffic feature:
• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed
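
The two boundary values can be checked with a short sketch of empirical Shannon entropy over one traffic feature. The function name and the sample port lists are illustrative assumptions.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical Shannon entropy, in bits, of one traffic feature."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# All packets hit the same destination port: maximally concentrated.
concentrated = feature_entropy([80] * 8)

# A port scan touches N = 8 distinct ports once each: maximally dispersed.
dispersed = feature_entropy([21, 22, 23, 25, 53, 80, 110, 443])
```

Both samples contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$ bits.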


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

• Scan anomaly: a spike in traffic from a dominant source IP

• Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

• Domain question: identify fraudulent activities or players from observed transaction data
• Data
  » Transactions between different players in the system
  » Meta information about the individuals
  » An underlying graph structure
• Challenges
  » Track and model human behavior
  » Anomalies caused by adaptive human adversaries
  » Massive data sizes

Domains and their players:
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   » E.g., user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
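
As a rough illustration of the activity-monitoring idea, the sketch below builds a profile from historical call counts and measures how far the current value deviates from it. The z-score is one simple choice of deviation measure, not necessarily the one used in [13], and the data and function name are hypothetical.

```python
from statistics import mean, stdev

def deviation_score(history, current):
    """Standard deviations between the current value and the profile mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

# Hypothetical profile: calls made by user X in January of past years.
january_calls = [48, 52, 50, 47, 53]

typical = deviation_score(january_calls, 51)    # close to the profile
suspect = deviation_score(january_calls, 120)   # far outside the profile
```

A fixed cut-off such as 3 would flag only the second value. One such profile is needed per individual, which is the maintenance cost listed among the weaknesses.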


Exploiting Graph Structure - Weighted Graphs

• Represent data as a weighted graph
  » Communication networks (phone, email, SMS)
  » Provider referral networks
• Objective: identify anomalous nodes
• For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]
• Choose features that can highlight anomalous nodes:
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet
• Each node is thereby transformed into a point in a multi-dimensional space
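
A minimal sketch of the first three egonet features follows. It assumes an undirected graph stored as a dictionary from edge pairs to weights. The graph, node names and function name are hypothetical, and the principal eigenvalue is left out to keep the example dependency-free.

```python
def egonet_features(edges, ego):
    """Return (N_i, E_i, W_i) for the 1-step egonet of `ego`.

    `edges` maps an undirected edge (u, v) to its weight.
    """
    neighbors = ({v for u, v in edges if u == ego}
                 | {u for u, v in edges if v == ego})
    members = neighbors | {ego}
    inside = {e: w for e, w in edges.items()
              if e[0] in members and e[1] in members}
    return len(neighbors), len(inside), sum(inside.values())

edges = {
    ("hub", "a"): 1, ("hub", "b"): 1, ("hub", "c"): 1,
    ("a", "b"): 2,   # edge between two neighbors: inside the egonet
    ("c", "d"): 5,   # d is not a neighbor of hub: outside the egonet
}
features = egonet_features(edges, "hub")
```

The resulting tuple is the point that represents the node in feature space.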


Identifying Anomalies

• Traditional anomaly detection (e.g., LOF)
  » Can be slow, but can identify any type of anomalous structure
• Faster method to identify specific types of anomalous structures
  » Identify relevant feature pairs and their power law relationship
  » E.g., Egonet Density Power Law ($N_i$ vs. $E_i$): detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

• Anomaly score for node $i$ with respect to a pair of features ($y = Cx^{\theta}$):

$S_i = \frac{\max(y_i, Cx_i^{\theta})}{\min(y_i, Cx_i^{\theta})} \cdot \log(|y_i - Cx_i^{\theta}| + 1)$
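
The score can be evaluated directly once $C$ and $\theta$ have been fitted. The sketch below assumes a natural logarithm (the slide does not state the base) and uses made-up values $C = 1$, $\theta = 1.5$.

```python
import math

def power_law_score(y, x, C, theta):
    """Deviation of an observed y from the fitted power law y = C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# A node that sits on the fitted line: 4 neighbors, 8 egonet edges.
on_line = power_law_score(y=8, x=4, C=1.0, theta=1.5)

# A star-like node: 10 neighbors but only 10 egonet edges.
star = power_law_score(y=10, x=10, C=1.0, theta=1.5)
```

Nodes on the fitted line score zero, and the score grows with both the ratio and the absolute gap between observed and expected values.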


Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  » Healthcare data (Beneficiaries vs. Providers)
  » Insider trading (Traders vs. Stocks)
• Objective: identify anomalous links
• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  » RelevanceScore($a$, $b$) $\propto$ number of times a "random walk" from $a$ reaches $b$
• Use the relevance scores to compute the normality score for a node $t \in V_2$
  » Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  » Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  » Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$]


Detecting Disease Outbreaks

• Domain question: early detection of disease outbreaks
  » Anthrax attack
• Data
  » Emergency department (ED) visits
  » Grocery data (example: [14])
  » Clinical visits
  » Weather/climate data
• Challenges
  » Weak signals in the data (ED cases involving cough $\Rightarrow$ flu or SARS?)
  » Integration of multiple signals (lag analysis)
  » Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   » Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   » Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
   » Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one); repeat from Step 2
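
Steps 3 to 5 can be sketched for 2x2 contingency tables as follows. This is only an illustration of the scoring step: the rule names and counts are invented, no continuity correction is applied, and the closed form `erfc(sqrt(stat / 2))` is the chi-square tail probability for one degree of freedom.

```python
import math

def chi_square_p_value(table):
    """p-value of the chi-square independence test for a 2x2 table (1 d.o.f.).

    table = [[baseline_match, baseline_other], [current_match, current_other]]
    Assumes every row and column total is non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(stat / 2))

# Hypothetical single-assignment rules with counts in baseline vs. current data.
rules = {
    "symptom=respiratory": [[50, 950], [52, 948]],
    "symptom=rash": [[20, 980], [60, 940]],
}
p_values = {rule: chi_square_p_value(t) for rule, t in rules.items()}
most_significant = min(p_values, key=p_values.get)
```

The rule whose frequency shifted most between baseline and current data gets the smallest p-value and is the one reported.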


Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model spatial and temporal relationships
  » What happened yesterday?
  » What happened in the adjoining neighborhood (yesterday)?
• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  » Add nodes for temporal and spatial dependencies
  » Needs more data to train


Anomaly Detection in Climate and Weather

• Science questions
  » Identify natural and anthropogenic disasters
  » Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  » Ground observations
  » Remote sensing data (satellites, air-borne)
  » Climate model simulation outputs
  » Multiple variables, spatio-temporal (often with a height dimension as well)
• Challenges
  » Model spatio-temporal relationships across multiple variables
  » Explain the cause of anomalies
  » Massive data sizes
• Climate and weather extreme events are well defined
• The key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
  » Difference from a "base period" (too simplistic)
  » Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  » Removes seasonality
• Understand climate and weather phenomena
  » Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference).

• How to choose $b_i$?
  » Mean of all data for location $i$
  » Monthly mean values (account for seasonality)
  » Monthly z-score values
  » Median (more robust)
  » Using a shorter "reference period"
  » 30-year moving window
• Different methods show statistically significant differences [19]
  » What is the right strategy?
  » Weighted mean of different strategies (pick weights using Monte Carlo sampling)
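
As one concrete instance of $v'_i = v_i - b_i$, the sketch below uses the monthly mean as the base for a single location. The function name and the four-value temperature series are illustrative assumptions.

```python
from collections import defaultdict

def monthly_anomalies(values, months):
    """v' = v - b, where b is the mean of all values in the same calendar month."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: sum(vs) / len(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]

# Two Januaries and two Julys at one location (illustrative temperatures).
temps = [10, 20, 12, 24]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(temps, months)
```

The seasonal swing between January and July disappears from the anomaly series, leaving only the year-to-year departures. Swapping the mean for a median or a z-score changes only the `base` computation.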


Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatiotemporal anomalies in precipitation data
• A two-step approach:
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     » Matlab: bwlabel / bwlabeln
• Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
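
The two-step approach can be sketched on a small 2-D grid. This is a plain breadth-first labeling with 4-connectivity, written as a stand-in for the Matlab routines named on the slide. The threshold of -1.5 and the grid values are made up, and a real analysis would add the time axis as a third dimension.

```python
from collections import deque

def connected_components(mask):
    """Group 4-connected True cells; returns one list of (row, col) per region."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            queue, cells = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                cells.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(cells)
    return components

# Step 1: threshold the precipitation anomalies (hypothetical values).
precip_anomaly = [
    [-2.0, -1.8, 0.3, 0.1],
    [-1.6, 0.2, 0.4, -1.7],
    [0.5, 0.1, 0.2, 0.3],
]
dry = [[v < -1.5 for v in row] for row in precip_anomaly]

# Step 2: keep the large connected regions.
regions = connected_components(dry)
largest = max(regions, key=len)
```

The isolated dry cell forms its own one-cell region and can be discarded by a minimum-size filter, leaving the persistent three-cell region as the drought candidate.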


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields
• Methodology:
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
  » Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  » Anomalies are not actionable
     - Not identified in a timely fashion
     - Resolution is not fine enough
     - Cause not explained
  » Anomalies lost among false positives
• Solution
  » Good validation data during design
  » Clear definition of a domain anomaly and its distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

• Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

• Arindam Banerjee (University of Minnesota)


Implementing DB(k, N) on GPUs

• Return the top N data instances whose distances to their k-th nearest neighbor are largest (serial DB(k, N) is $O(n^2)$)
• Involves computing pairwise distances
• Load block i and block j to shared memory
  » Data layout in memory should be optimized
• Each thread computes the distance between a pair of instances
  » Can utilize this time to load the next chunk of data from host to device memory
• Writes results to the corresponding output block
• Sorting can be done efficiently in CUDA [29]

[Figure: diagram with dimension labels n, d, and T]
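
For reference, the serial $O(n^2)$ baseline that the GPU version parallelizes can be sketched as below. This is plain Python rather than CUDA, the function name and sample points are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math

def top_n_outliers(points, k, n_top):
    """Indices of the n_top points farthest from their k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point: the O(n^2) pairwise step.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append((dists[k - 1], i))
    scores.sort(reverse=True)
    return [i for _, i in scores[:n_top]]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
outliers = top_n_outliers(points, k=2, n_top=1)
```

Each inner distance computation is independent of the others, which is what makes the block-wise, one-thread-per-pair layout described above a natural fit.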


Moving Beyond Multi-dimensional Record Data

• Categorical (Mixed): Fraud Detection, Cyber Networks
• Time Series: Sensor Networks, Healthcare
• Discrete Sequences: Genomic, System Calls
• Spatial: GIS, Image analysis
• Spatio-temporal: Remote sensing, Climate
• Graphs: Social networks, Epidemiology


Handling Categorical Data

• Each attribute can belong to one of many categories
• No ordering between categories
• Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth              urban     poisonous
convex      smooth              grasses   edible
bell        smooth              meadows   edible
convex      scaly               urban     poisonous
convex      smooth              grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

Binarize data
Learn rules ($X \Rightarrow Y$)
Choose high-confidence rules (high $P(Y \mid X)$)
For a test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y \mid X)$
Anomaly: $Y$ is not observed when $X$ is observed

Using Bayesian Networks [33]

Learn Bayesian network structure and parameters
Compute $P(Z)$ for test data record $Z$
Flag an anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

Use a similarity measure ($S(X_1, X_2)$)
Apply a distance/density/clustering based method (e.g. LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset$$

Assumption: if $r(a_t, b_t)$ is low and the combination is observed in test record $t$, then $t$ is anomalous

For a test record $t$:
For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$
Score $t$ based on all r-values: assign the minimum r-value as the score, or take the product of all r-values

Need to compare an exponential number of pairs of subsets:
Only consider subsets up to size $k$
Ignore subsets with frequency less than a threshold $\alpha$
Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$

Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$$

$C(a_t)$: number of training instances with $A = a_t$
$N$: total number of training instances

Laplace Smoothing

$$E(p) = \frac{C(p) + 1}{N + 2}$$

$$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduces arity)
Use an efficient data structure to query for counts (AD Trees [26])
AD Trees work faster for low-arity attributes
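
The r-value with Laplace-smoothed estimates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (`laplace`, `project`, `r_value`) and the five-row toy training set are hypothetical, counts are obtained by a plain scan rather than an AD Tree, and only one attribute pair is scored.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def project(row, idx):
    """Values of a record restricted to the attribute positions in idx."""
    return tuple(row[i] for i in idx)


def r_value(train, record, a_idx, b_idx):
    """r(a_t, b_t) for two disjoint attribute sets given as index tuples."""
    n = len(train)
    a_t, b_t = project(record, a_idx), project(record, b_idx)
    c_a = sum(1 for row in train if project(row, a_idx) == a_t)
    c_b = sum(1 for row in train if project(row, b_idx) == b_t)
    c_ab = sum(1 for row in train
               if project(row, a_idx) == a_t and project(row, b_idx) == b_t)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Hypothetical toy training set: (cap-shape, cap-surface, habitat, type).
train = [
    ("convex", "smooth", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
    ("bell", "smooth", "meadows", "edible"),
    ("convex", "scaly", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
]
HABITAT, TYPE = (2,), (3,)

unusual = ("convex", "smooth", "urban", "edible")    # pairing unseen in training
typical = ("convex", "smooth", "grasses", "edible")  # pairing seen twice
print(r_value(train, unusual, HABITAT, TYPE))
print(r_value(train, typical, HABITAT, TYPE))
```

The unusual record gets an r-value below 1 and the typical one an r-value above 1; a full scorer would repeat this over all admissible attribute-set pairs and aggregate by minimum or product.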


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure
Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

Standard dynamic programming method is slow
Faster versions are available (Hunt-Szymanski method [7])

Weaknesses:
Cannot localize anomalies within a sequence
Weak anomaly signals might get lost
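
As a minimal sketch, the two measures can be written as follows. The function names are illustrative, the LCS routine is the standard quadratic dynamic program (not the faster Hunt-Szymanski variant), and both sequences are assumed to be non-empty.

```python
from math import sqrt


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x != y)


def lcs_length(a, b):
    """Length of the longest common subsequence, O(|a||b|) time, one row of memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|); assumes non-empty inputs."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
print(lcs_length(s1, s2), lcs_distance(s1, s2))
```

The resulting distance can be passed to any distance-based detector (for example a kNN score) in place of a Euclidean distance.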


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$ windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences, together with their counts

Testing:
For each window, find its frequency in the normal dictionary
Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist: for each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:
Penalizes low-frequency windows in the normal dictionary
Rewards high-frequency windows that might not be relevant
Can construct anomalous sequences that will escape detection
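
One way to read the training and testing steps is the sketch below. The exact form of the score is an assumption: here it is the number of test windows divided by one plus the sum of their dictionary counts, where the added 1 only avoids division by zero. Function names are illustrative.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Normal dictionary: every unique window with its count."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k):
    """Inverse of the aggregate window frequency, normalized by window count.

    Assumed form: n_windows / (1 + sum of counts); unseen windows count as 0.
    """
    wins = windows(seq, k)
    return len(wins) / (1 + sum(dictionary[w] for w in wins))


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspect = "login pwd login pwd login pwd logout".split()

dictionary = build_dictionary(normal, k=3)
print(anomaly_score(normal[0], dictionary, k=3))
print(anomaly_score(suspect, dictionary, k=3))
```

None of the windows of the repeated-login sequence occur in the dictionary, so it receives the largest possible score for its length.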


Using Probabilistic Models

Probability of occurrence of sequence S

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \dots S_{i-1})$$

Short memory property of sequences

$$P(S_i \mid S_1 \dots S_{i-1}) = P(S_i \mid S_{i-k} \dots S_{i-1})$$

Conditional probability estimate for a symbol $S_i$

$$P(S_i \mid S_{i-k} \dots S_{i-1}) = \frac{f(S_{i-k} \dots S_i)}{f(S_{i-k} \dots S_{i-1})}$$

$f$ is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

Significantly reduces the size of the model
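
A fixed-order version of this model can be sketched as follows. Two assumptions differ from the slide: rare or unseen contexts are handled with add-one smoothing instead of a probabilistic suffix tree, and the score is the negative mean log-probability, which increases monotonically with the inverse of the length-normalized probability. All names are illustrative.

```python
from collections import Counter
from math import log


def train_counts(normal_seqs, k):
    """Counts f(context) and f(context + symbol) for contexts of length k."""
    ctx, full = Counter(), Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            context = tuple(seq[i - k:i])
            ctx[context] += 1
            full[context + (seq[i],)] += 1
    return ctx, full


def anomaly_score(seq, ctx, full, k, alphabet_size):
    """Negative mean log P(S_i | S_{i-k} ... S_{i-1}) with add-one smoothing."""
    log_prob, steps = 0.0, 0
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        p = (full[context + (seq[i],)] + 1) / (ctx[context] + alphabet_size)
        log_prob += log(p)
        steps += 1
    return -log_prob / steps if steps else 0.0


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspect = "login pwd login pwd login pwd logout".split()
alphabet = {symbol for seq in normal for symbol in seq}

ctx, full = train_counts(normal, k=2)
print(anomaly_score(normal[0], ctx, full, 2, len(alphabet)))
print(anomaly_score(suspect, ctx, full, 2, len(alphabet)))
```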


Finding Contextual Anomalies

Sometimes contextual information is available about the data:
Not used directly as a feature
Is well understood; there are no anomalies in the context
Can reduce false positives and yield interesting anomalies

[Figure: example adapted from [30]. Number of fever cases versus maximum daily temperature, with the clearest outlier and a conditional anomaly marked.]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables:
Spatial (latitude, longitude)
Graph context (edges, weights)
Temporal location
Domain specific (demographic, other)

How to incorporate context:
Reduce to traditional anomaly detection (subset on context)
Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

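Data instance $d \Rightarrow (u_1, u_2, \dots, u_{d_U}, v_1, v_2, \dots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

[Figure: environmental space ($u_1$, $u_2$) with components $U_1$, $U_2$, $U_3$ mapped to indicator space ($v_1$, $v_2$) with components $V_1$, $V_2$, $V_3$; $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \dots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \dots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$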
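
The scoring step (step 4) can be illustrated with one-dimensional Gaussians. Everything except the formula for $S$ is an assumption of this sketch: the component parameters, the rows of the mapping other than $p(\cdot \mid U_3) = (0.2, 0.2, 0.6)$, and the reading that a low $S$ marks an anomaly because $S$ behaves like a likelihood. Learning the components and the mapping (steps 1 to 3) is not shown.

```python
from math import exp, pi, sqrt


def gauss_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    z = (x - mean) / std
    return exp(-0.5 * z * z) / (std * sqrt(2.0 * pi))


def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) p(V_j|U_i).

    U, V: lists of (mean, std) pairs; mapping[i][j] = p(V_j | U_i).
    """
    score = 0.0
    for i, (mean_u, std_u) in enumerate(U):
        inner = sum(gauss_pdf(v, mean_v, std_v) * mapping[i][j]
                    for j, (mean_v, std_v) in enumerate(V))
        score += gauss_pdf(u, mean_u, std_u) * inner
    return score


# Hypothetical components: u is an environmental attribute, v an indicator.
U = [(5.0, 3.0), (20.0, 3.0), (35.0, 3.0)]
V = [(10.0, 5.0), (50.0, 5.0), (90.0, 5.0)]
mapping = [
    [0.6, 0.3, 0.1],   # p(V_j | U_1), hypothetical
    [0.2, 0.6, 0.2],   # p(V_j | U_2), hypothetical
    [0.2, 0.2, 0.6],   # p(V_j | U_3), values taken from the figure
]

print(cad_score(35.0, 90.0, U, V, mapping))  # v agrees with the likely V_3
print(cad_score(35.0, 10.0, U, V, mapping))  # v falls in the less likely V_1
```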


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when the data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples

1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

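Find anomalous groups in data [34]

Example: spatial clusters of galaxies
topics: red, green, emissive
words: continuous features

Flexible Genre Model (FGM)

For each group $m$:

1. Draw a genre $y_m \in \{1, 2, \dots, T\}$, $y_m \sim \mathcal{M}(\pi)$
2. Draw a topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \dots, K$
4. For each point $n$ in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates $N$, $M$, $K$, $T$]

Model Parameters

$\mathcal{M}(\pi)$: Multinomial
Genre: $\mathrm{Dir}(\alpha_t)$
Topic generators $P(\cdot \mid \eta_k)$: GIW
Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian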


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: groups plotted against Topic 1, Topic 2 and Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood with respect to $\alpha_t$

Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
Confusion matrix
Traditional evaluation metrics (class imbalance)
ROC curve

Validation set does not exist:
Use domain expertise to find TP and FP
FN is harder to estimate
Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$$Acc = \frac{TP + TN}{\sum}, \qquad Rec\ (R) = \frac{TP}{TP + FN}, \qquad Prec\ (P) = \frac{TP}{TP + FP}, \qquad F = \frac{2 \cdot R \cdot P}{R + P}$$
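
The metrics follow directly from the four confusion-matrix cells. The sketch below uses the labels `"a"` and `"n"` from the matrix; the function name and the convention of returning 0.0 for an undefined ratio are assumptions.

```python
def evaluate(actual, predicted, positive="a"):
    """Accuracy, recall, precision and F-measure for the anomaly class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)

    def ratio(num, den):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    rec = ratio(tp, tp + fn)
    prec = ratio(tp, tp + fp)
    return {
        "acc": ratio(tp + tn, len(pairs)),
        "rec": rec,
        "prec": prec,
        "f": ratio(2 * rec * prec, rec + prec),
    }


actual = list("aannnnnnnn")
predicted = list("annannnnnn")
print(evaluate(actual, predicted))
```

With two anomalies among ten instances, accuracy is 0.8 even though only half of the anomalies are found, which is the class-imbalance problem noted above.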


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
Use a threshold $\delta$ on the score (scale issues [21])
Take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output:
Does the output "suggest" $x$ or $\delta$?
Which output is better?


Unifying Scores

Different methods assign scores in different ranges:
kNN-based scores: [0, 1]
LOF scores: [1, 6]
ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0 \;\forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous

Normality: $S$ is regular and $S(o) \in [0, 1] \;\forall o$


Regularization and Normalization of Scores

Regularization

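1. $R(o) = \max\{0,\; S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{\max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{\max}\right)$

Normalization

1. $N(o) = S(o) / S_{\max}$
2. Gaussian Scaling, suited for high-dimensional data:

$$N(o) = \max\left(0,\; \operatorname{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$$

3. Gamma Scaling, suited for low-dimensional data:

$$N(o) = \max\left(0,\; \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu^{\gamma}}{1 - \mu^{\gamma}}\right)$$

where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function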
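
Baseline regularization followed by Gaussian scaling can be sketched with the standard library alone. The example scores are made up to resemble LOF output with baseline 1, and using the population standard deviation for $\sigma_S$ is an assumption.

```python
from math import erf, sqrt
from statistics import mean, pstdev


def baseline_regularize(scores, base):
    """R(o) = max(0, S(o) - base_S)."""
    return [max(0.0, s - base) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores equal: nothing stands out
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2.0)))) for s in scores]


# Hypothetical LOF-like scores: four ordinary points and one clear outlier.
raw = [1.0, 1.1, 0.9, 1.05, 6.0]
normalized = gaussian_scaling(baseline_regularize(raw, base=1.0))
print(normalized)
```

The four ordinary points fall below the mean and map to 0, while the outlier maps close to 1, so the output can be read on a common [0, 1] scale across detectors.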


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available:
Multivariate continuous data [27]
Multivariate categorical data [5]
Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $x$-$y$ plane, with the injected anomaly marked]
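
The random perturbation step can be sketched as below. It assumes each record is stored as a pair of numeric tuples `(x, y)` and that the distance between `y` and `y'` is Euclidean; the function name and the toy data are illustrative.

```python
import random
from math import dist


def inject_anomaly(data, z_index, sample_size, rng):
    """Return a perturbed copy of data[z_index].

    data: list of (x, y) pairs, each part a tuple of floats.
    The x part is replaced by x' taken from the sampled point whose y'
    lies farthest (Euclidean distance, an assumption) from y.
    """
    x, y = data[z_index]
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: dist(y, point[1]))
    return (x_far, y)


# Toy data in which y is exactly twice x.
data = [((float(i),), (2.0 * i,)) for i in range(10)]
rng = random.Random(0)

# Sampling the whole data set makes the outcome independent of the seed.
injected = inject_anomaly(data, z_index=0, sample_size=len(data), rng=rng)
print(injected)
```

The returned point combines an `x` and a `y` that never occur together in the original data; appending it to the data set gives a labeled anomaly for validation.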


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data
Define anomalies, define normal behavior
Identify requirements and constraints (online, real-time, limited resources)
What domain knowledge is available? (feature identification, defining normal and anomalous behavior, tuning parameters)
Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS) find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS) aim to detect zero-day attacks without any pre-identified signature, by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g. Distributed Denial of Service, DDoS) or unusual network traffic (e.g. flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connected to nodes 1 to 4, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

A link measurement is the traffic that enters a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus the naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$$\hat{Z} = P P^{T} Z = C Z$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
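
The five steps can be sketched with NumPy as follows. This is an illustration only: measurements are stored as rows (so the projections appear transposed relative to the column-vector formulas), the data is mean-centered before projection, and the synthetic traffic, the choice $k = 1$, and the function names are all assumptions.

```python
import numpy as np


def fit_subspace(Z, k):
    """Steps 1-2: mean and top-k principal directions P (n_features x k)."""
    mean = Z.mean(axis=0)
    cov = np.cov(Z - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    return mean, eigvecs[:, top]


def residual_norms(Z, mean, P):
    """Steps 3-5: norm of the component outside the normal subspace."""
    centered = Z - mean
    normal_part = centered @ P @ P.T                # projection onto normal subspace
    residual = centered - normal_part               # (I - P P^T) applied to each row
    return np.linalg.norm(residual, axis=1)


# Synthetic "normal" traffic: three links driven by one common factor plus noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(200, 1))
Z = factor @ np.array([[1.0, 2.0, 3.0]]) + 0.01 * rng.normal(size=(200, 3))

mean, P = fit_subspace(Z, k=1)
anomaly = np.array([[3.0, -3.0, 1.0]])              # does not follow the common factor

print(residual_norms(Z, mean, P).max())
print(residual_norms(anomaly, mean, P)[0])
```

A threshold on the residual norm, chosen from the values seen on normal data, then separates the anomalous measurement from the rest.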


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

DoS attack: multiple source IP addresses concentrated on a single destination IP address
Network scan: dispersed distribution of destination addresses
Most worms/viruses also induce some change in the distribution of certain features
However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this approach aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source [22]

Port scan anomaly (traffic features change; however, traffic volume remains more or less the same)

Port scan dwarfed in volume metrics…

But stands out in feature entropy, which also reveals its structure

We can use entropy to capture the variations in the traffic features:
Takes value 0 when the distribution is maximally concentrated
Takes value $\log_2 N$ when the distribution is maximally dispersed
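
The two boundary cases can be checked with a short sketch, assuming the entropy meant here is the Shannon entropy of the empirical distribution of a feature (base 2). The function name and the example values are illustrative.

```python
from collections import Counter
from math import log2


def feature_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of a feature."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Destination ports seen in one time bin (made-up values).
concentrated = [80] * 8                 # all traffic to one port
dispersed = list(range(1024, 1032))     # eight distinct ports, as in a port scan

print(feature_entropy(concentrated))
print(feature_entropy(dispersed))
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$.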


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP

Scan anomaly: a spike in traffic data from a dominant source IP

Flash crowd anomaly: again a spike in traffic data to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data:
Transactions between different players in the system
Meta information about the individuals
An underlying graph structure

Challenges:
Track and model human behavior
Anomalies caused by adaptive human adversaries
Massive data sizes

Insurance (auto, health [a]): Claimant, Provider, Payer
Telecommunications: Customer, Provider
Credit Cards: Customer, Supplier, Bank
Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g. user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
Anomaly detection is fast (good for real time)
Results are easy to explain

Weaknesses:
Need to create and maintain a large number of profiles
Not dynamic
Adequate historical data might not be available
Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
Communication networks (phone, email, SMS)
Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is thereby transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:
Identify relevant feature pairs and their power-law relationship
E.g. Egonet Density Power Law, $N_i$ vs $E_i$: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \qquad 1 \le \alpha \le 2$$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
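
The score can be evaluated directly once $C$ and $\theta$ are known. In the sketch below the fitted values $C = 1$ and $\theta = 1.5$ are hypothetical, the logarithm is taken as the natural logarithm (the slide does not state the base), and the egonet sizes are made up; fitting the power law itself is not shown.

```python
from math import log


def power_law_score(y, x, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * log(abs(y - expected) + 1)


C, THETA = 1.0, 1.5   # hypothetical fit of E_i against N_i

# Egonets with 10 neighbours each (made-up edge counts).
on_the_line = power_law_score(y=10 ** 1.5, x=10, C=C, theta=THETA)
star = power_law_score(y=10, x=10, C=C, theta=THETA)          # only ego-to-neighbour edges
near_clique = power_law_score(y=55, x=10, C=C, theta=THETA)   # all 11 nodes fully connected

print(on_the_line, star, near_clique)
```

A node that lies exactly on the fitted line scores 0; both the star and the near-clique deviate from it and receive positive scores.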


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
Healthcare data (Beneficiaries vs Providers)
Insider trading (Traders vs Stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$:
Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \dots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \dots, t_n\}$ joined by edges $E$]


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks, e.g. an anthrax attack

Data:
Emergency department visits
Grocery data (example [14])
Clinical visits
Weather/climate data

Challenges:
Weak signals in the data (ED cases involving cough $\Rightarrow$ flu or SARS?)
Integration of multiple signals (lag analysis)
Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, with environmental and response variables
2. Sample from the BN ($DB_{baseline} \mid$ current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$. Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test. Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value. Repeat from Step 2


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
What happened yesterday?
What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
Add nodes for temporal and spatial dependencies
Needs more data to train


Anomaly Detection in Climate and Weather

• Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)
• Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes
• Climate and weather extreme events are well defined
• Key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality
• Used to understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

• Anomaly time series for a given location $i$:

  $v'_i = v_i - b_i$

  where $b_i$ is the base (reference)
• How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period": 30-year moving window
• Different methods show statistically significant differences [19]
• What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
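Two of the base choices listed above (monthly mean and monthly z-score) can be sketched for a single location as follows. The function name and the toy series are illustrative assumptions, and the population standard deviation is used for the z-score.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, months, use_zscore=False):
    """Return v'_i = v_i - b_i, with b_i the mean of all values from the same calendar month.

    With use_zscore=True the difference is also divided by the monthly standard deviation.
    """
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    spread = {m: pstdev(vs) for m, vs in by_month.items()}
    result = []
    for v, m in zip(values, months):
        anomaly = v - base[m]
        if use_zscore:
            anomaly = anomaly / spread[m] if spread[m] > 0 else 0.0
        result.append(anomaly)
    return result

# Hypothetical two-year series sampled in January (1) and July (7) only.
temps = [2.0, 20.0, 4.0, 24.0]
months = [1, 7, 1, 7]
anoms = monthly_anomalies(temps, months)
zscores = monthly_anomalies(temps, months, use_zscore=True)
```

The raw anomalies keep the larger July variability, while the z-scores put both months on the same scale.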


Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatio-temporal anomalies in precipitation data
• A two-step approach:
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time (Matlab: bwlabel/bwlabeln)
• Follow-up science question: explain the cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy Dr. Arindam Banerjee
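The two-step approach can be sketched in plain Python as thresholding followed by a breadth-first labeling of connected cells in a (time, row, column) grid. This is an illustration under stated assumptions: a cell counts as anomalous when precipitation falls below a fixed threshold, cells are connected through their 6 face neighbors, and the grid values are made up.

```python
from collections import deque

def drought_components(precip, threshold, min_size=1):
    """precip[t][y][x] -> connected components of cells below threshold, largest first."""
    T, H, W = len(precip), len(precip[0]), len(precip[0][0])
    dry = [[[precip[t][y][x] < threshold for x in range(W)]
            for y in range(H)] for t in range(T)]
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    components = []
    for t in range(T):
        for y in range(H):
            for x in range(W):
                if not dry[t][y][x] or (t, y, x) in seen:
                    continue
                # Breadth-first search over face-adjacent dry cells in space and time.
                queue = deque([(t, y, x)])
                seen.add((t, y, x))
                comp = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    comp.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < H and 0 <= nx < W
                                and dry[nt][ny][nx] and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                if len(comp) >= min_size:
                    components.append(comp)
    return sorted(components, key=len, reverse=True)

# Hypothetical precipitation: 2 time steps on a 2x3 grid.
precip = [
    [[1, 1, 9], [9, 9, 1]],
    [[1, 9, 9], [9, 9, 9]],
]
all_components = drought_components(precip, threshold=5)
large_components = drought_components(precip, threshold=5, min_size=2)
```

The `min_size` filter stands in for "large" components; a persistent drought shows up as a component that spans several time steps.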


More Climate Extreme Events using Anomaly Detection

• Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example (Src: http://newscenter.lbl.gov)
• Cold Fronts [23]
  • Surface winds and potential temperature fields
  • Methodology:
    1. Compute features for every grid
    2. Cluster grids into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge


Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies are lost among false positives
• Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and its distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.
[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.
[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29-38. ACM Press, 2003.
[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.
[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243-254, 2008.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104. ACM Press, 2000.
[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.
[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.
[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.
[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.
[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.
[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.
[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62. ACM Press, 1999.
[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237-5240, 2002.
[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.
[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151-180, 1998.
[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462-471, 2010.
[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.
[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.
[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392-403. Morgan Kaufmann Publishers Inc., 1998.
[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13-24, 2011.
[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219-230, 2004.
[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782-787, New York, NY, USA, 2005. ACM.
[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601-604, 2003.
[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.
[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67-91, March 1998.
[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.
[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.
[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1-10, 2009.
[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631-645, May 2007.
[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563-568, New York, NY, USA, 2008. ACM.
[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.
[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808-815, Menlo Park, California, August 2003. AAAI Press.
[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Moving Beyond Multi-dimensional Record Data

• Categorical (Mixed): Fraud Detection, Cyber Networks
• Time Series: Sensor Networks, Healthcare
• Discrete Sequences: Genomic, System Calls
• Spatial: GIS, Image analysis
• Spatio-temporal: Remote sensing, Climate
• Graphs: Social networks, Epidemiology


Handling Categorical Data

• Each attribute can belong to one of many categories
• No ordering between categories
• Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ...   habitat   type
convex      smooth        ...   urban     poisonous
convex      smooth        ...   grasses   edible
bell        smooth        ...   meadows   edible
convex      scaly         ...   urban     poisonous
convex      smooth        ...   grasses   edible
...

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

• Using Association Analysis [24]
  • Binarize data
  • Learn rules ($X \Rightarrow Y$)
  • Choose high-confidence rules (high $P(Y|X)$)
  • For a test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y|X)$
  • Flag an anomaly when $Y$ is not observed although $X$ is observed
• Using Bayesian Networks [33]
  • Learn Bayesian network structure and parameters
  • Compute $P(Z)$ for test data record $Z$
  • Flag anomaly if $P(Z) < \delta$
• Using Similarity Metrics [10]
  • Use a similarity measure $S(X_1, X_2)$
  • Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

• Identify unusual combinations of attribute values:

  $r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t) P(b_t)}$

  where $A \cap B = \emptyset$
• Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous
• For a test record $t$:
  • For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$
  • Score $t$ based on all r-values
    • Assign the minimum r-value as the score, or
    • Take the product of all r-values
• Need to compare an exponential number of pairs of subsets
  • Only consider subsets up to size $k$
  • Ignore subsets with frequency less than a threshold $\alpha$
  • Avoid comparing independent subsets of attributes: require $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

• Maximum Likelihood Estimation

  $\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

  • $C(a_t)$: number of training instances with $A = a_t$
  • $N$: total number of training instances
• Laplace Smoothing

  $E(p) = \frac{C(p) + 1}{N + 2}$

  $r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

• Speedup Tricks
  • Replace rare attribute values with a generic attribute value (reduces arity)
  • Use an efficient data structure for querying counts (AD Trees [26])
    • AD Trees work faster for low-arity attributes
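The Laplace-smoothed r-value can be computed directly from counts, as in the sketch below. It scans the training records linearly instead of using AD Trees, and the helper names and the toy mushroom-style records are illustrative assumptions.

```python
def laplace_estimate(count, n):
    """E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def project(record, attrs):
    """Values of a record restricted to a set of attributes."""
    return tuple(record[a] for a in attrs)

def r_value(train, attrs_a, attrs_b, test):
    """r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)) for disjoint attribute sets A and B."""
    n = len(train)
    a_t, b_t = project(test, attrs_a), project(test, attrs_b)
    c_a = sum(1 for r in train if project(r, attrs_a) == a_t)
    c_b = sum(1 for r in train if project(r, attrs_b) == b_t)
    c_ab = sum(1 for r in train
               if project(r, attrs_a) == a_t and project(r, attrs_b) == b_t)
    return laplace_estimate(c_ab, n) / (laplace_estimate(c_a, n) * laplace_estimate(c_b, n))

# Hypothetical training data: habitat and type are perfectly associated.
train = ([{"habitat": "urban", "type": "poisonous"}] * 4
         + [{"habitat": "grasses", "type": "edible"}] * 4)
r_common = r_value(train, ["habitat"], ["type"], {"habitat": "urban", "type": "poisonous"})
r_rare = r_value(train, ["habitat"], ["type"], {"habitat": "urban", "type": "edible"})
```

The unseen combination (urban, edible) gets an r-value well below 1, which is the signal the test looks for.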


Finding Anomalies in Discrete Sequences

• Many problem formulations:
  1. Anomalous symbols in a sequence
  2. Anomalous subsequence in a sequence
  3. Anomalous sequence in a database of sequences
• See [11] for a comparative evaluation and [9] for a survey

Example sequences:

login, pwd, mail, ssh, mail, web, logout
login, pwd, mail, web, web, web, web, logout
login, pwd, mail, ssh, mail, web, web, logout
login, pwd, web, mail, ssh, web, mail, logout
login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

• Utilize a distance/similarity measure
• Plug into a distance/density/clustering based method
• Simplest: Hamming distance

  $h(A_i, B_i) = 1$ if $A_i \ne B_i$, and $0$ if $A_i = B_i$

  $H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

  • Issues: unequal lengths, misalignment
• Normalized length of the Longest Common Subsequence (LCS)

  $D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$

  • Standard dynamic programming method is slow
  • Faster versions available (Hunt-Szymanski method [7])
• Weaknesses
  • Cannot localize anomalies within a sequence
  • Weak anomaly signals might get lost
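A minimal sketch of the normalized LCS distance, using the standard quadratic-time dynamic program (not the faster Hunt-Szymanski variant). Both sequences are assumed to be non-empty, and the example sessions are illustrative.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence, using two rows of the DP table."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| * |B|); assumes non-empty sequences."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

normal = ["login", "pwd", "mail", "ssh", "mail", "web", "logout"]
suspicious = ["login", "pwd", "login", "pwd", "login", "pwd", "logout"]
d_self = nlcs_distance(normal, normal)
d_other = nlcs_distance(normal, suspicious)
```

The resulting pairwise distances can then be fed to any distance-based detector, for example a nearest-neighbor score.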


Using Sliding Windows

• Slide a window of size $k$
• Extract all windows from a sequence ($n - k + 1$ windows)
• Training (creating a normal dictionary): store all unique windows in all normal sequences, with their counts
• Testing
  • For each window, find its frequency in the normal dictionary
  • Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)
• Many variants exist
  • For each window, find the Hamming distance to the closest window in the normal dictionary [16]
• Issues
  • Penalizes low-frequency windows in the normal dictionary
  • Rewards high-frequency windows that might not be relevant
  • Anomalous sequences can be constructed that escape detection
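The training and testing steps can be sketched as below. The exact aggregation is an assumption: the score here is the inverse of the mean relative frequency of the test windows, with a small constant (hypothetical `eps`) to avoid division by zero.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Normal dictionary: every window seen in normal sequences, with its count."""
    counts = Counter()
    for s in normal_seqs:
        counts.update(windows(s, k))
    return counts

def anomaly_score(seq, dictionary, k, eps=1e-9):
    """Inverse of the mean relative frequency of the sequence's windows."""
    ws = windows(seq, k)
    if not ws:
        return 0.0
    total = sum(dictionary.values())
    mean_freq = sum(dictionary[w] / total for w in ws) / len(ws)
    return 1.0 / (mean_freq + eps)

# Hypothetical symbol sequences.
normal_seqs = [list("abcabcabc"), list("abcabc")]
dictionary = build_dictionary(normal_seqs, k=3)
score_normal = anomaly_score(list("abcabc"), dictionary, k=3)
score_odd = anomaly_score(list("abccba"), dictionary, k=3)
```

A sequence made mostly of unseen windows gets a much larger score than one built from frequent windows.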


Using Probabilistic Models

• Probability of occurrence of sequence $S$:

  $P(S) = \prod_{i=1}^{n} P(S_i | S_1 \ldots S_{i-1})$

• Short memory property of sequences:

  $P(S_i | S_1 \ldots S_{i-1}) = P(S_i | S_{i-k} \ldots S_{i-1})$

• Conditional probability estimate for a symbol $S_i$:

  $P(S_i | S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

  • $f$ is the frequency of a subsequence, estimated from the normal dictionary
• Anomaly score for a test sequence is the inverse of the normalized probability of occurrence
• Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?
  • Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  • Significantly reduces the size of the model
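A fixed-order version of this model can be sketched as follows. Two choices here are assumptions rather than part of the slides: add-one smoothing stands in for the suffix-shortening idea of Probabilistic Suffix Trees, and the score is the average negative log-probability per symbol, a monotone stand-in for the inverse of the normalized probability.

```python
import math
from collections import Counter

def train_counts(normal_seqs, k):
    """Count length-k contexts and the length-(k+1) windows that extend them."""
    context_counts, window_counts = Counter(), Counter()
    for s in normal_seqs:
        for i in range(k, len(s)):
            context_counts[tuple(s[i - k:i])] += 1
            window_counts[tuple(s[i - k:i + 1])] += 1
    return context_counts, window_counts

def sequence_score(seq, context_counts, window_counts, k, alphabet_size):
    """Average negative log-probability per symbol under the order-k model."""
    nll, n = 0.0, 0
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        window = tuple(seq[i - k:i + 1])
        # Add-one smoothing (an assumption) keeps unseen windows at non-zero probability.
        p = (window_counts[window] + 1) / (context_counts[context] + alphabet_size)
        nll -= math.log(p)
        n += 1
    return nll / n if n else 0.0

# Hypothetical normal data: strictly alternating symbols.
normal_seqs = [list("abababab"), list("babababa")]
ctx, win = train_counts(normal_seqs, k=2)
score_normal = sequence_score(list("ababab"), ctx, win, k=2, alphabet_size=2)
score_odd = sequence_score(list("aabbaa"), ctx, win, k=2, alphabet_size=2)
```

Sequences that follow the learned transitions score close to zero; sequences with unseen transitions score higher.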


Finding Contextual Anomalies

• Sometimes contextual information is available about the data
  • Not used directly as a feature
  • Is well understood; no anomalies in the context
  • Can reduce false positives and yield interesting anomalies

[Figure, example adapted from [30]: number of fever cases versus maximum daily temperature, with the "clearest outlier" and a "conditional anomaly" marked]

• Contextual anomalies: anomalous with respect to a context
• Context is defined using environmental variables
  • Spatial (latitude, longitude)
  • Graph context (edges, weights)
  • Temporal location
  • Domain specific (demographic, other)
• How to incorporate context?
  • Reduce to traditional anomaly detection (subset on context)
  • Explicitly model contextual information (time series, spatial)


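Conditional Anomaly Detection

• Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$
  • $d_U$ environmental attributes
  • $d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ over the environmental attributes $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ over the indicator attributes $(v_1, v_2)$, e.g. $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j | U_i)$
4. Score a test instance $d = [u, v]$:

   $S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j) \, p(V_j | U_i)$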
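The scoring step (step 4) can be sketched as below, assuming the Gaussians and the mapping have already been learned. For brevity the sketch uses one-dimensional components, and all component parameters and mapping values are hypothetical. Since S behaves like a likelihood, a low value suggests an unusual indicator value for the given environment.

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def cad_score(u, v, env_comps, ind_comps, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    score = 0.0
    for i, (mu_u, sd_u) in enumerate(env_comps):
        inner = sum(gauss_pdf(v, mu_v, sd_v) * mapping[i][j]
                    for j, (mu_v, sd_v) in enumerate(ind_comps))
        score += gauss_pdf(u, mu_u, sd_u) * inner
    return score

# Hypothetical model: environment = max daily temperature, indicator = fever cases.
env_comps = [(0.0, 5.0), (30.0, 5.0)]     # U_1 "cold", U_2 "hot"
ind_comps = [(5.0, 2.0), (50.0, 10.0)]    # V_1 "few cases", V_2 "many cases"
mapping = [[0.2, 0.8], [0.9, 0.1]]        # mapping[i][j] = p(V_j | U_i)

s_expected = cad_score(30.0, 5.0, env_comps, ind_comps, mapping)   # hot day, few cases
s_unusual = cad_score(30.0, 50.0, env_comps, ind_comps, mapping)   # hot day, many cases
```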


Finding Collective Anomalies

• Find a collection of data points
  • Each point by itself is normal
  • The collection as a whole is anomalous
• Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies
• A simple solution
  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection
• Examples
  1. Time series
  2. Image
  3. Spatial clusters of galaxies


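Using Latent Dirichlet Allocation for Group Anomalies

• Find anomalous groups in data [34]
• Example: spatial clusters of galaxies
  • Topics: red, green, emissive
  • Words: continuous features
• Flexible Genre Model (FGM)

For each group $m$:

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim M(\pi)$
2. Draw the topic distribution for $y_m$: $\theta_m \in S_K$, $\theta_m \sim Dir(\alpha_{y_m})$
3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} | \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point $n$ in the group:
   1. Draw topic membership $z_{mn} \sim M(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} | \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with nodes $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\eta$, $\beta$ and plates $N$, $M$, $K$, $T$]

Model parameters

• $M(\pi)$: Multinomial
• Genre: $Dir(\alpha_t)$
• Topic generators $P(\cdot | \eta_k)$: GIW
• Point generators $P(x_{mn} | \beta_{m,k})$: Multivariate Gaussian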


Inference and Testing for FGM

• Inference and learning parameters
  • Approximate inference of latent variables (Gibbs sampling)
  • Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

• Anomaly detection
  • Infer the topic distribution $\theta_m$
  • Compute the negative log likelihood w.r.t. $\alpha_t$
  • Rationale: an anomalous group will be unlikely to be generated from any genre
• Geometric interpretation: map each group into a $T$-dimensional space and find anomalies


Evaluating Anomaly Detection Methods - Labels

• Labeled validation data set exists
  • Confusion matrix
  • Traditional evaluation metrics
    • Class imbalance
  • ROC curve
• Validation set does not exist
  • Use domain expertise to find TP and FP
  • FN is harder to estimate
    • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$Acc = \frac{TP + TN}{\sum}$, where $\sum = TP + FN + FP + TN$

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
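The metrics above can be computed from labeled predictions as in this small sketch. The label names "a" and "n" follow the confusion matrix; the example labels are made up, and undefined ratios are returned as 0.0 by convention here.

```python
def confusion_metrics(actual, predicted, positive="a"):
    """Return accuracy, recall, precision, and F-measure for the anomaly class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    acc = (tp + tn) / len(pairs)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"acc": acc, "rec": rec, "prec": prec, "f": f}

# Hypothetical validation set with 2 anomalies among 10 instances.
actual = ["a", "a", "n", "n", "n", "n", "n", "n", "n", "n"]
predicted = ["a", "n", "a", "n", "n", "n", "n", "n", "n", "n"]
metrics = confusion_metrics(actual, predicted)
```

The example shows the class-imbalance issue: accuracy is 0.8 even though only half of the anomalies are found and half of the flagged instances are wrong.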


Evaluating Anomaly Detection Methods - Scores

• Convert to binary output
  • Use threshold $\delta$ on score (scale issues [21])
  • Take top $x$ as anomalies
• ROC curve by varying $x$ or $\delta$
• Quality of output
  • Does the output "suggest" $x$ or $\delta$?
  • Which output is better?


Unifying Scores

• Different methods assign scores in different ranges
  • kNN-based scores: [0, 1]
  • LOF scores: [1, 6]
  • ABOD scores: [0, 80000]
    • Anomalies have lower scores
• Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies
• Desired scaling: stretch interesting ranges and shrink irrelevant ones
• Generalized procedure for normalizing outlier scores [21]
  • Regularity: $S(o) \ge 0 \; \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
  • Normality: $S$ is regular and $S(o) \in [0, 1] \; \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
   • Suited for high-dimensional data
3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma scaling)
   • where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   • Suited for low-dimensional data
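Gaussian scaling can be sketched with the standard library alone. The sketch estimates $\mu_S$ and $\sigma_S$ from the scores themselves (population standard deviation), which is an assumption, and the LOF-like input scores are made up.

```python
import math
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

# Hypothetical LOF-like scores: four inliers near 1 and one outlier at 6.
raw_scores = [1.0, 1.1, 0.9, 1.0, 6.0]
normalized = gaussian_scaling(raw_scores)
```

Scores at or below the mean collapse to 0, while the single outlier is stretched to a value close to 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior.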


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

• Use generative models for normal and anomalous behavior
• Several generators available
  • Multivariate continuous data [27]
  • Multivariate categorical data [5]
  • Discrete sequences using HMM [12]
• Drawback: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

• Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
• Take a random sample $D$ of the entire data set
• Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
• Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: scatter plot in the $(x, y)$ plane showing $z$, $z'$, and the injected anomaly]
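The perturbation steps can be sketched as below, following the slide literally: the point keeps its own y and receives the x of the sampled point whose y is farthest away. Scalar x and y with absolute difference as the distance are simplifying assumptions, as are the function name and the toy data.

```python
import random

def inject_anomaly(z, data, sample_size, rng):
    """Swap z's x-part with that of the sampled point whose y-part is farthest from z's."""
    x, y = z
    sample = rng.sample(data, min(sample_size, len(data)))
    x_far, _ = max(sample, key=lambda point: abs(point[1] - y))
    return (x_far, y)

# Hypothetical data set where y = 2x holds for every normal point.
data = [(t, 2 * t) for t in range(10)]
rng = random.Random(0)
# Sampling the full data set makes this example deterministic.
injected = inject_anomaly((1, 2), data, sample_size=len(data), rng=rng)
```

The injected point (9, 2) has ordinary values in each partition taken separately, but the combination breaks the y = 2x relationship, which is what a contextual detector should pick up.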


Applications Overview

• How to set up an anomaly detection solution for a given application domain?
  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
    • Feature identification
    • Defining normal and anomalous behavior
    • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network anomaly detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts, vertical axis from 0 to 3,000,000. Values in the order shown: 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

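A typical network

[Figure: network with a router and nodes 1-4, annotated with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]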

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network, e.g. $x_{1t}$

A link measurement is the traffic that enters at a node during an interval, e.g. $y_{1t}$

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
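A small worked example may help here. The relationship is commonly written $y_t = A x_t$, where entry $A_{ij}$ is 1 if OD flow j is routed over link i. The 3-link, 4-flow topology and the traffic values below are hypothetical and chosen only to illustrate the arithmetic.

```python
import numpy as np

# Hypothetical topology: 3 links (rows) and 4 OD flows (columns).
# A[i, j] = 1 if OD flow j is routed over link i, else 0.
A = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])

def link_traffic(A, x):
    """Link measurements implied by OD flow volumes x: y = A x."""
    return A @ x

x_t = np.array([10.0, 20.0, 5.0, 1.0])   # OD flow volumes at time t
y_t = link_traffic(A, x_t)               # -> [30., 25., 6.]
```

Because each link sums several OD flows, a change in a single flow can be masked in the link totals.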


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from a single OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P

Step 3: Project onto the normal traffic subspace to obtain $\hat{Z}$

$\hat{Z} = P P^{T} Z = C Z$

Step 4: Project onto the abnormal traffic subspace to obtain $\tilde{Z}$

$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, it is an "anomaly"
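The five steps can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions: the function name is hypothetical, samples are stored as rows (so the projection reads `Zc @ P @ P.T` rather than $P P^{T} Z$), the data is mean-centred first, and the choice of k and of the threshold is left to the caller.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Squared norm of the residual (abnormal-subspace) part of each sample.

    Z: array of shape (n_samples, n_features), one measurement vector per row.
    """
    Zc = Z - Z.mean(axis=0)                    # centre the data (assumption)
    cov = np.cov(Zc, rowvar=False)             # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # Step 2: top-k components
    P = eigvecs[:, top]                        # shape (n_features, k)
    normal = Zc @ P @ P.T                      # Step 3: normal subspace part
    residual = Zc - normal                     # Step 4: abnormal subspace part
    return np.sum(residual ** 2, axis=1)       # Step 5: compare to a threshold
```

Samples whose returned value exceeds a chosen threshold would be flagged as anomalies.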


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by thresholding the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, the approach discussed here aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]

Port scan anomaly: the traffic feature changes, but traffic volume remains more or less the same

The port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure

We can use entropy to capture the variations in the traffic feature:

• Takes value 0 when the distribution is maximally concentrated

• Takes value $\log_2 N$ when the distribution is maximally dispersed
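The two boundary values can be checked with a short sketch of the empirical (sample) entropy of a traffic feature. The function name and the example addresses are illustrative assumptions; the feature could be any header field, such as destination IP or port.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy (in bits) of the observed values of one feature."""
    counts = Counter(values)
    total = sum(counts.values())
    # H = sum_i p_i * log2(1 / p_i), with p_i = count_i / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Maximally concentrated: every packet goes to the same destination.
concentrated = sample_entropy(["10.0.0.1"] * 8)   # 0.0

# Maximally dispersed: N = 8 distinct values, each seen once.
dispersed = sample_entropy(range(8))              # log2(8) = 3.0
```

A port scan spreads traffic over many destination ports, so the port entropy rises even when the packet count barely changes.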


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data:
  Transactions between different players in the system
  Meta-information about the individuals
  An underlying graph structure

Challenges:
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Domains and players:
  Insurance (auto, health [a]): claimant, provider, payer
  Telecommunications: customer, provider
  Credit cards: customer, supplier, bank
  Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]:
  1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g. user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering-based [4]:
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths:
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses:
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives
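As a rough illustration of the activity-monitoring idea, the sketch below compares a current observation against a per-user historical profile. The z-score test, the threshold of 3, and the function names are assumptions made for the example; they are not taken from [13].

```python
import statistics

def deviation_score(history, current):
    """How many standard deviations `current` lies from the historical mean."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)          # needs at least two observations
    if std == 0:
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / std

def is_suspicious(history, current, threshold=3.0):
    """Flag behavior that deviates strongly from the user's own profile."""
    return deviation_score(history, current) > threshold

# Hypothetical profile: calls made by user X in previous Januaries.
january_calls = [100, 110, 90, 105, 95]
```

With this profile, `is_suspicious(january_calls, 102)` is false and `is_suspicious(january_calls, 400)` is true. The cost noted under weaknesses is visible here too: one such profile must be stored and refreshed per user.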


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:
  1. $N_i$: degree of node i
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thus transformed into a point in a multi-dimensional feature space
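The four egonet features can be computed as in the sketch below. The graph representation (a dict of neighbor-to-weight dicts for an undirected graph) and the function name are assumptions made for this example.

```python
import numpy as np

def egonet_features(graph, node):
    """Return (N_i, E_i, W_i, lambda_i) for the egonet of `node`.

    graph: dict mapping node -> {neighbor: weight}; undirected, so every
    edge appears in both directions with the same weight.
    """
    members = [node] + sorted(graph[node])           # ego plus its neighbors
    index = {v: k for k, v in enumerate(members)}
    A = np.zeros((len(members), len(members)))
    for u in members:
        for v, w in graph[u].items():
            if v in index:                           # keep edges inside the egonet
                A[index[u], index[v]] = w
    upper = np.triu(A, k=1)                          # count each edge once
    n_i = len(graph[node])                           # degree of the ego
    e_i = int(np.count_nonzero(upper))               # edges in the egonet
    w_i = float(upper.sum())                         # total edge weight
    lam_i = float(np.linalg.eigvalsh(A).max())       # principal eigenvalue
    return n_i, e_i, w_i, lam_i
```

For a star centred on the ego, $E_i = N_i$; for a clique, $E_i$ grows quadratically with $N_i$.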


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:

Identify relevant feature pairs and their power-law relationship

E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
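A worked sketch of the score: fit $C$ and $\theta$ by least squares in log-log space, then score each node by its distance from the fitted line. The function names, the log-log fitting procedure, and the use of the natural logarithm are assumptions of this example; the slide does not specify the log base.

```python
import math
import numpy as np

def fit_power_law(x, y):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    theta, log_c = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(log_c)), float(theta)

def outline_score(x_i, y_i, C, theta):
    """Score S_i for one node; x_i and y_i must be positive."""
    expected = C * x_i ** theta
    ratio = max(y_i, expected) / min(y_i, expected)
    return ratio * math.log(abs(y_i - expected) + 1.0)

# Hypothetical example with C = 2, theta = 1.5: the expected y at x = 4 is 16.
on_line = outline_score(4.0, 16.0, 2.0, 1.5)    # 0.0
off_line = outline_score(4.0, 32.0, 2.0, 1.5)   # 2 * log(17)
```

A node lying exactly on the fitted law scores 0, and the score grows with both the ratio and the absolute gap to the expected value.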


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
  Healthcare data (beneficiaries vs. providers)
  Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a

RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$:
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges E]
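One way to realise the relevance and normality scores is a random walk with restart, sketched below. Several choices here are assumptions rather than details given on the slide: the restart probability of 0.15, the closed-form solve of the steady state, and using the relevance scores themselves as entries of the similarity matrix. The function names are hypothetical.

```python
import numpy as np

def relevance_scores(M, restart=0.15):
    """M: |V1| x |V2| biadjacency matrix.

    Returns R where R[a, b] is the relevance of V1 node b to query node a.
    """
    n1, n2 = M.shape
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, n1:] = M
    A[n1:, :n1] = M.T
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0            # isolated nodes keep zero columns
    W = A / col_sums                         # column-stochastic transitions
    # Steady state for query a solves r = restart * e_a + (1 - restart) * W r.
    steady = restart * np.linalg.inv(np.eye(n1 + n2) - (1.0 - restart) * W)
    return steady[:n1, :n1].T

def normality_score(M, R, t):
    """Mean pairwise relevance among the V1 neighbors of V2 node t."""
    s_t = np.flatnonzero(M[:, t])
    if len(s_t) < 2:
        return float("nan")                  # undefined for fewer than two neighbors
    sim = R[np.ix_(s_t, s_t)]
    off_diagonal = sim[~np.eye(len(s_t), dtype=bool)]
    return float(off_diagonal.mean())
```

A node t whose neighbors belong to otherwise unrelated groups receives a low normality score, which marks its links as candidates for inspection.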


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks (e.g., an anthrax attack)

Data:
  Emergency department visits
  Grocery data (example: [14])
  Clinical visits
  Weather/climate data

Challenges:
  Weak signals in the data (ED cases involving cough ⇒ flu or SARS?)
  Integration of multiple signals (lag analysis)
  Accounting for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data (environmental and response variables)

2. Sample $DB_{baseline}$ from the BN, conditioned on the current environment

3. Compute contingency tables for rules on $DB_{baseline}$ and $DB_{current}$. Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find the p-value for each rule using the $\chi^2$ test. Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent

5. Find the rule with the smallest p-value (the most significant difference). Repeat from Step 2
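Steps 3 to 5 can be illustrated with a plain $\chi^2$ test on 2x2 contingency tables. This sketch is not the WSARE implementation: the table layout, the rule names, and the absence of a continuity correction are assumptions. For one degree of freedom the p-value is $\mathrm{erfc}(\sqrt{\chi^2 / 2})$, which avoids a dependency on a statistics library. The most surprising rule is the one with the smallest p-value.

```python
import math

def chi_square_2x2(table):
    """Chi-square statistic and p-value (1 degree of freedom) for a 2x2 table.

    table = [[current_match, current_other], [baseline_match, baseline_other]]
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0                       # degenerate table: no evidence
    chi2 = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def most_significant_rule(tables):
    """Name of the rule whose table has the smallest p-value."""
    return min(tables, key=lambda rule: chi_square_2x2(tables[rule])[1])

# Hypothetical counts of records matching each rule today vs. in the baseline.
rules = {
    "symptom=cough": [[20, 80], [5, 95]],
    "age=adult": [[50, 50], [50, 50]],
}
```

Here `most_significant_rule(rules)` returns the cough rule, whose proportion today differs most from the baseline.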


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships:
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:
  Difference from a "base period" (too simplistic?)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena:
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
  30-year moving window

Different methods show statistically significant differences [19]. What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
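As a small illustration of one choice of $b_i$, the sketch below subtracts the monthly mean from a monthly series for a single location. It assumes a series of whole years at monthly resolution whose first entry falls in the first calendar month; the function name is hypothetical.

```python
import numpy as np

def monthly_anomalies(values):
    """Subtract each calendar month's long-term mean from a monthly series.

    values: 1-D sequence covering at least one full year, with the first
    entry taken as month 0 (an assumption of this sketch).
    """
    values = np.asarray(values, dtype=float)
    months = np.arange(len(values)) % 12
    # b_i per calendar month: the mean over all years for that month.
    base = np.array([values[months == m].mean() for m in range(12)])
    return values - base[months]
```

A purely seasonal series maps to zeros, so what remains after subtraction is the departure from the usual value for that month.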


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time (MATLAB: bwlabel, bwlabeln)

Follow-up science question: explain the cause

Figure: drought_1901_2006_1.avi (video courtesy of Dr. Arindam Banerjee)
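The two-step approach can be sketched without MATLAB as a threshold followed by a breadth-first labeling of connected cells in a (time, latitude, longitude) array. The function names, the 6-neighbor connectivity, the "below threshold" definition of a dry cell, and the minimum component size are all assumptions of this example.

```python
from collections import deque
import numpy as np

def label_components(mask):
    """Label 6-connected components of a 3-D boolean array."""
    labels = np.zeros(mask.shape, dtype=int)
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue                          # already part of a component
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:
            t, i, j = queue.popleft()
            for dt, di, dj in steps:
                nb = (t + dt, i + di, j + dj)
                inside = all(0 <= nb[k] < mask.shape[k] for k in range(3))
                if inside and mask[nb] and not labels[nb]:
                    labels[nb] = current
                    queue.append(nb)
    return labels, current

def persistent_droughts(precip_anomaly, threshold, min_size):
    """Step 1: threshold the anomalies. Step 2: keep large components."""
    mask = precip_anomaly < threshold
    labels, n = label_components(mask)
    sizes = np.bincount(labels.ravel(), minlength=n + 1)
    keep = [lab for lab in range(1, n + 1) if sizes[lab] >= min_size]
    return labels, keep
```

Components that span many time steps and grid cells correspond to persistent events, while isolated dry cells are dropped by the size filter.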


More Climate Extreme Events using Anomaly Detection

Atmospheric rivers [8]:
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example source: http://newscenter.lbl.gov

Cold fronts [23]:
  Surface winds and potential temperature fields

Methodology:
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  Anomalies are not actionable:
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies are lost among false positives

Solution:
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository. [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Handling Categorical Data

Each attribute can belong to one of many categories

No ordering between categories

Mixed data (categorical and continuous attributes)

cap-shape   cap-surface   ···   habitat   type
convex      smooth        ···   urban     poisonous
convex      smooth        ···   grasses   edible
bell        smooth        ···   meadows   edible
convex      scaly         ···   urban     poisonous
convex      smooth        ···   grasses   edible
···

Table: Mushroom Data Set [2]


Approaches to Identify Categorical Anomalies

Using Association Analysis [24]

  • Binarize data
  • Learn rules ($X \Rightarrow Y$); choose high-confidence rules ($P(Y \mid X)$)
  • For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y \mid X)$
  • Flag an anomaly when $Y$ is not observed although $X$ is observed

Using Bayesian Networks [33]

  • Learn Bayesian network structure and parameters
  • Compute $P(Z)$ for test data record $Z$
  • Flag anomaly if $P(Z) < \delta$

Using Similarity Metrics [10]

  • Use a similarity measure ($S(X_1, X_2)$)
  • Apply a distance/density/clustering based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$, where $A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and the combination is observed in test record $t$, then $t$ is anomalous

For a test record $t$:

  • For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$
  • Score $t$ based on all r-values
    • Assign minimum r-value as score
    • Take product of all r-values

Need to compare an exponential number of pairs of subsets

  • Only consider subsets up to size $k$
  • Ignore subsets with frequency less than a threshold $\alpha$
  • Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

  • $C(a_t)$: number of training instances with $A = a_t$
  • $N$: total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

  • Replace rare attribute values with a generic attribute (reduce arity)
  • Use an efficient data structure to query for counts (AD Trees [26])
    • ADTrees work faster for low arity attributes
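The Laplace-smoothed ratio can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [18]: the helper names (`laplace`, `r_value`), the tuple-based record layout, and the toy records (borrowed from the mushroom table) are assumptions, and counts are obtained by a plain scan rather than an AD Tree.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(records, test, attrs_a, attrs_b):
    """r(a_t, b_t) for two disjoint tuples of attribute indices."""
    def project(rec, attrs):
        return tuple(rec[i] for i in attrs)

    n = len(records)
    a_t, b_t = project(test, attrs_a), project(test, attrs_b)
    c_a = sum(project(r, attrs_a) == a_t for r in records)
    c_b = sum(project(r, attrs_b) == b_t for r in records)
    c_ab = sum(project(r, attrs_a) == a_t and project(r, attrs_b) == b_t
               for r in records)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Hypothetical training records: (cap-shape, cap-surface, habitat, type).
train = [
    ("convex", "smooth", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
    ("bell", "smooth", "meadows", "edible"),
    ("convex", "scaly", "urban", "poisonous"),
    ("convex", "smooth", "grasses", "edible"),
]
usual = ("convex", "smooth", "urban", "poisonous")
unusual = ("convex", "smooth", "urban", "edible")  # pairing never seen in training

# Compare habitat (index 2) against type (index 3).
r_usual = r_value(train, usual, attrs_a=(2,), attrs_b=(3,))
r_unusual = r_value(train, unusual, attrs_a=(2,), attrs_b=(3,))
print(r_usual, r_unusual)
```

The unseen habitat/type pairing gets the lower r-value, which is the signal the test looks for. A fuller version would loop over all attribute-set pairs up to size k and take the minimum or the product.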


Finding Anomalies in Discrete Sequences

Many problem formulations

  1. Anomalous symbols in a sequence
  2. Anomalous subsequence in a sequence
  3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

Example sequences:

  login, pwd, mail, ssh, mail, web, logout
  login, pwd, mail, web, web, web, web, logout
  login, pwd, mail, ssh, mail, web, web, logout
  login, pwd, web, mail, ssh, web, mail, logout
  login, pwd, login, pwd, login, pwd, logout


Treating Sequences as Points

Utilize a distance/similarity measure

  • Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$h(A_i, B_i) = 1$ if $A_i \neq B_i$, and $0$ if $A_i = B_i$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

  • Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$

  • Standard dynamic programming method is slow
  • Faster versions available (Hunt-Szymanski method [7])

Weaknesses

  • Cannot localize anomalies within a sequence
  • Weak anomaly signals might get lost
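Both measures are short enough to sketch in Python. The function names and the two example sessions are illustrative assumptions. The LCS routine is the standard quadratic dynamic program, not the faster Hunt-Szymanski variant.

```python
import math


def hamming(a, b):
    """H(A, B): number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1 - lcs_length(a, b) / math.sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
print(lcs_length(s1, s2), nlcs_distance(s1, s2))
```

Unlike the Hamming distance, the normalized LCS distance accepts sequences of different lengths, as the pair above shows.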


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary)

  • Store all unique windows in all normal sequences and their counts

Testing

  • For each window, find the frequency in the normal dictionary
  • Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist

  • For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues

  • Penalizes low frequency windows in the normal dictionary
  • Rewards high frequency windows that might not be relevant
  • Can construct anomalous sequences that will escape detection
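A minimal sketch of the dictionary approach follows. The slides do not fix the exact aggregation, so the score used here, 1 / (1 + mean window frequency), is one assumed reading of "inverse of the aggregate frequencies"; the added 1 only avoids division by zero. Function names and training sessions are hypothetical.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Count every window seen in the normal training sequences."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts


def anomaly_score(seq, dictionary, k):
    """Assumed score: 1 / (1 + mean frequency of the sequence's windows)."""
    ws = windows(seq, k)
    mean_freq = sum(dictionary[w] for w in ws) / len(ws)
    return 1.0 / (1.0 + mean_freq)


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspicious = "login pwd login pwd login pwd logout".split()

dictionary = build_dictionary(normal, k=3)
print(anomaly_score(normal[0], dictionary, k=3))
print(anomaly_score(suspicious, dictionary, k=3))
```

None of the windows in the repeated-login session occur in the dictionary, so it receives the maximum score under this definition.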


Using Probabilistic Models

Probability of occurrence of sequence $S$

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

  • $f$ is the frequency estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?

  • Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  • Significantly reduces the size of the model
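The fixed-order version of this model can be sketched as follows. Two details are assumptions rather than part of the slides: probabilities are floored at a small `eps` so unseen contexts do not produce log(0), and the score is reported as the mean negative log-probability, a monotone stand-in for the inverse of the normalized probability. The suffix-tree fallback of [32] is not implemented.

```python
import math
from collections import Counter


def train_counts(normal_seqs, k):
    """Frequencies f of all k-grams and (k+1)-grams in the normal data."""
    grams = Counter()
    for seq in normal_seqs:
        for n in (k, k + 1):
            for i in range(len(seq) - n + 1):
                grams[tuple(seq[i:i + n])] += 1
    return grams


def markov_score(seq, grams, k, eps=1e-6):
    """Mean negative log of P(S_i | S_{i-k} ... S_{i-1}); needs len(seq) > k."""
    log_p = 0.0
    steps = 0
    for i in range(k, len(seq)):
        context = tuple(seq[i - k:i])
        full = tuple(seq[i - k:i + 1])
        p = grams[full] / grams[context] if grams[context] else 0.0
        log_p += math.log(max(p, eps))  # eps floor is an assumption
        steps += 1
    return -log_p / steps


normal = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
    "login pwd mail ssh mail web web logout".split(),
]
suspicious = "login pwd login pwd login pwd logout".split()

grams = train_counts(normal, k=2)
print(markov_score(normal[0], grams, k=2))
print(markov_score(suspicious, grams, k=2))
```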


Finding Contextual Anomalies

Sometimes contextual information is available about data

  • Not used directly as a feature
  • Well understood; no anomalies in the context
  • Can reduce false positives and yield interesting anomalies

[Figure: Number of Fever Cases vs. Maximum Daily Temperature, marking the "Clearest Outlier" and a "Conditional Anomaly"; example adapted from [30]]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables

  • Spatial (latitude, longitude)
  • Graph context (edges, weights)
  • Temporal location
  • Domain specific (demographic, other)

How to incorporate context

  • Reduce to traditional anomaly detection (subset on context)
  • Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

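Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

  • $d_U$ environmental attributes
  • $d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$]

Algorithm [30]

  1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
  2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
  3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
  4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$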
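To make the scoring step concrete, here is a one-dimensional sketch in Python. Everything numeric is hypothetical: the two environmental components (cold and hot days), the two indicator components (few and many fever cases), and the mapping table are hand-set rather than learned, and the learning steps of [30] are not shown. Since $S$ is a likelihood-style quantity, smaller values are read here as more anomalous.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    return sum(
        gauss_pdf(u, *U[i])
        * sum(gauss_pdf(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )


# Hypothetical (mean, std) components; not learned from data.
U = [(10.0, 3.0), (30.0, 3.0)]   # environmental: cold day, hot day
V = [(5.0, 2.0), (50.0, 5.0)]    # indicator: few cases, many cases
mapping = [[0.1, 0.9],           # p(V_j | cold)
           [0.9, 0.1]]           # p(V_j | hot)

typical = cad_score(30.0, 5.0, U, V, mapping)       # hot day, few cases
conditional = cad_score(30.0, 50.0, U, V, mapping)  # hot day, many cases
print(typical, conditional)
```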


Finding Collective Anomalies

Find a collection of data points

  • Each point by itself is normal
  • The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples

  1. Time series
  2. Image
  3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

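Find anomalous groups in data [34]

Example: spatial clusters of galaxies

  • topics: red, green, emissive
  • words: continuous features

Flexible Genre Model (FGM)

For each group $m$:

  1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim M(\pi)$
  2. Draw the topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
  3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
  4. For each point $n$ in the group:
     1. Draw topic membership $z_{mn} \sim M(\theta_m)$
     2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM over $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, and $\eta$, with plates $T$, $K$, $N$, and $M$]

Model Parameters

  • $M(\pi)$: Multinomial
  • Genre: $\mathrm{Dir}(\alpha_t)$
  • Topic generators $P(\cdot \mid \eta_k)$: GIW
  • Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian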


Inference and Testing for FGM

Inference and Learning Parameters

  • Approximate inference of latent variables (Gibbs Sampling)
  • Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: panels labeled Topic 1, Topic 2, Topic 3]

Anomaly Detection

  • Infer the topic distribution $\theta_m$
  • Compute negative log likelihood w.r.t. $\alpha_t$
  • Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists

  • Confusion matrix
  • Traditional evaluation metrics
    • Class imbalance
  • ROC Curve

Validation set does not exist

  • Use domain expertise to find TP and FP
  • FN is harder to estimate
    • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
  Actual a    TP            FN
  Actual n    FP            TN

$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}$

$\mathrm{Rec}\ (R) = \frac{TP}{TP + FN}$

$\mathrm{Prec}\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
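The four metrics follow directly from the confusion-matrix counts. The sketch below uses made-up counts for a set with 10 anomalies among 100 records; the function name and dictionary keys are illustrative.

```python
def evaluate(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    rec = tp / (tp + fn)
    prec = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "recall": rec,
        "precision": prec,
        "f": 2 * rec * prec / (rec + prec),
    }


# Hypothetical detector output: 8 of 10 anomalies found, 4 false alarms.
metrics = evaluate(tp=8, fn=2, fp=4, tn=86)
print(metrics)
```

Accuracy is 0.94 here mostly because normal records dominate. Recall, precision and F say more about how the detector handles the rare class, which is why class imbalance matters for evaluation.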


Evaluating Anomaly Detection Methods - Scores

Convert to binary output

  • Use threshold $\delta$ on score (scale issues [21])
  • Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output

  • Does the output "suggest" $x$ or $\delta$?
  • Which output is better?


Unifying Scores

Different methods assign scores in different ranges

  • kNN-based scores: [0, 1]
  • LOF scores: [1, 6]
  • ABOD scores: [0, 80000]
    • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

  • Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
  • Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

  1. $R(o) = \max\{0, S(o) - base_S\}$
  2. $R(o) = S_{max} - S(o)$
  3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

  1. $N(o) = \frac{S(o)}{S_{max}}$
  2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
     • Suited for high dimensional data
  3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
     • Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
     • Suited for low dimensional data
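A small sketch of baseline regularization followed by Gaussian scaling, using only the Python standard library. The LOF-like input scores and the baseline of 1 are made up for illustration, and using the population standard deviation (`statistics.pstdev`) for $\sigma_S$ is an assumption. The code fails with a division by zero if all scores are identical.

```python
import math
import statistics


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base_S)."""
    return [max(0.0, s - base) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)  # population std; an assumed choice
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]


lof_scores = [1.0, 1.1, 0.9, 1.0, 6.0]  # hypothetical LOF output
regular = regularize_baseline(lof_scores, base=1.0)
normalized = gaussian_scaling(regular)
print(normalized)
```

Scores at or below the mean collapse to 0, while the single large score is stretched toward 1. This matches the stated goal of stretching the interesting range and shrinking the irrelevant one.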


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available

  • Multivariate continuous data [27]
  • Multivariate categorical data [5]
  • Discrete sequences using HMM [12]

Drawbacks

  • Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

  • Given data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space
  • Take a random sample $D$ of the entire data set
  • Let $z' = \{x', y'\} \in D$ such that the distance between $y$ and $y'$ is maximum
  • Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane, with the injected anomaly marked]
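The perturbation step can be sketched as below. The representation of a point as a pair of tuples `(x, y)`, the Euclidean distance, and the toy data are assumptions. `math.dist` requires Python 3.8 or later.

```python
import math
import random


def inject_anomaly(data, z, sample_size, rng):
    """Swap z's x-part for the x-part of the sampled point whose y is farthest from z's y."""
    x, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: math.dist(point[1], y))
    return (x_far, y)


# Hypothetical data set in which x and y are perfectly correlated.
data = [((float(i),), (float(i),)) for i in range(10)]
rng = random.Random(0)

# Sampling the whole data set makes the outcome deterministic for this demo.
anomaly = inject_anomaly(data, data[0], sample_size=len(data), rng=rng)
print(anomaly)
```

The returned point keeps its original $y$ but carries an $x$ that normally goes with a very different $y$. Each part looks ordinary by itself, and only the combination is unusual.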


Applications Overview

How to set up an anomaly detection solution for a given application domain?

  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
    • Feature identification
    • Defining normal and anomalous behavior
    • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

  • Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; the system is therefore vulnerable to zero-day attacks
  • Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart by year. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

[Figure: a typical network, showing a router connected to nodes 1 to 4, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$]

  • An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)
  • A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)
  • The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why care about OD Flows

  • A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)
  • If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components (those with the highest eigenvalues) as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace: $\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal (residual) traffic subspace: $\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a residual vector is large, then it is an "anomaly"
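The five steps map closely onto a few NumPy calls. This is a sketch under assumptions: rows of `Z` are measurement vectors (so the projection is applied from the right), the data are mean-centered first, and the synthetic traffic with one injected off-subspace point is made up for the demonstration. Choosing `k` and the detection threshold is not covered.

```python
import numpy as np


def residual_norms(Z, k):
    """Norm of each row of Z after removing its projection onto the top-k PCs."""
    Zc = Z - Z.mean(axis=0)                       # center the data
    cov = np.cov(Zc, rowvar=False)                # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                           # Step 2: top-k PCs as columns
    C_res = np.eye(Z.shape[1]) - P @ P.T          # Step 4: residual projector
    residual = Zc @ C_res                         # rows are vectors; C_res is symmetric
    return np.linalg.norm(residual, axis=1)       # Step 5: residual norms


# Synthetic data: 200 points near a line in 3-D, plus one off-line point.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
Z = np.outer(t, [1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=(200, 3))
Z[50] += np.array([5.0, -5.0, 0.0])

norms = residual_norms(Z, k=1)
print(int(np.argmax(norms)), float(norms.max()))
```

The injected point dominates the residual norms, so any reasonable threshold on them would flag it.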


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes substantially

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

  • DoS attack: multiple source IP addresses concentrated on a single destination IP address
  • Network scan: dispersed distribution of destination addresses
  • Most worms/viruses also induce some change in the distribution of certain features
  • However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly (the traffic feature changes; however, traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature

  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure (source: [22]): the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure]
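The two extreme cases of the entropy measure are easy to check with an empirical-entropy function. The function name and the example values (destination IPs and ports) are hypothetical.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Empirical entropy (in bits) of the observed feature values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# All traffic aimed at one destination IP: maximally concentrated.
concentrated = ["10.0.0.1"] * 8
# A scan touching a different destination port each time: maximally dispersed.
dispersed = [1000 + i for i in range(8)]

print(sample_entropy(concentrated), sample_entropy(dispersed))
```

With N = 8 distinct values, the dispersed case reaches the maximum of log2(8) = 3 bits, and the concentrated case gives 0.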


Entropy Based versus Volume Based

  • DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
  • Scan anomaly: a spike in traffic from a dominant source IP
  • Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
  • Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data

  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges

  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and players

  • Insurance (auto, health [a]): Claimant, Provider, Payer
  • Telecommunications: Customer, Provider
  • Credit Cards: Customer, Supplier, Bank
  • Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

  1. Build profiles for individuals (customers, users, etc.) based on historic data
     • User X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths

  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses

  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph

  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space


Identifying Anomalies

Traditional anomaly detection (LOF)

  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

  • Identify relevant feature pairs and power law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detect near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
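The egonet features and the score formula can be sketched as follows. The adjacency-dict representation, the function names, and the fitted values `C = 1.0` and `theta = 1.0` are assumptions for illustration; in practice $C$ and $\theta$ would be fitted to the feature pairs of all nodes, and both features must be positive for the ratio to be defined.

```python
import math


def egonet_counts(adj, node):
    """Return (N_i, E_i): degree of the node and number of edges inside its egonet."""
    neighbors = adj[node]
    members = neighbors | {node}
    edges = sum(1 for u in members for v in adj[u] if v in members and u < v)
    return len(neighbors), edges


def power_law_score(y, x, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)


# Undirected graphs as dicts of neighbor sets (integer node ids).
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
clique = {i: {j for j in range(4) if j != i} for i in range(4)}

n_star, e_star = egonet_counts(star, 0)        # hub of a star
n_clique, e_clique = egonet_counts(clique, 0)  # member of a 4-clique

# Assumed fit E = 1.0 * N^1.0, which a star satisfies exactly.
print(power_law_score(e_star, n_star, C=1.0, theta=1.0))
print(power_law_score(e_clique, n_clique, C=1.0, theta=1.0))
```

A node lying on the fitted line scores 0, and the score grows with both the ratio and the absolute gap between the observed and expected feature value.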


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph

  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$

  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality Score = mean of non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$]


Detecting Disease Outbreaks

Domain Question

  • Early detection of disease outbreaks
  • Anthrax attack

Data

  • Emergency department visits
  • Grocery data (Example [14])
  • Clinical visits
  • Weather/climate data

Challenges

  • Weak signals in the data
    • ED cases involving cough $\Rightarrow$ Flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What’s Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, over environmental and response variables
2. Sample from the BN: ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one). Repeat Step 2.
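Steps 3 and 4 can be illustrated with a small sketch. It assumes 2x2 contingency tables (rule matched or not, in the current versus the baseline data), nonzero row and column totals, and hypothetical rule names and counts. For one degree of freedom, the $\chi^2$ tail probability equals $\mathrm{erfc}(\sqrt{\chi^2 / 2})$, which keeps the example free of external libraries. The rule kept here is the one with the smallest p-value, that is, the strongest evidence against independence.

```python
import math


def chi2_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def most_significant_rule(tables):
    """Return (rule, p_value) for the rule whose table has the smallest p-value."""
    scored = {rule: chi2_2x2(table)[1] for rule, table in tables.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]


# Hypothetical counts: rows = (current, baseline), columns = (rule matches, rule does not match).
rule_tables = {
    "symptom=cough": [[30, 70], [10, 90]],
    "gender=female": [[50, 50], [50, 50]],
}
```

A full WSARE run would also correct the winning p-value for the number of rules tried; that step is omitted here.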


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:
  Identify natural and anthropogenic disasters
  Identify long time-scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges:
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

Most analysis is done on “anomaly” time series:
  Difference from a “base period” (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena:
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference).

How to choose $b_i$?
  Mean of all data for location $i$
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter “reference period”
  30-year moving window

Different methods show statistically significant differences [19].

What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
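As a concrete instance of $v'_i = v_i - b_i$, the sketch below uses the monthly mean as the base $b_i$ and optionally divides by the monthly standard deviation to obtain z-scores. The function name and the toy readings are assumptions made for illustration; the other base choices listed above would only change how `base` is computed.

```python
from statistics import mean, pstdev


def monthly_anomalies(values, months, use_zscore=False):
    """Subtract each month's mean from the readings taken in that month.

    values: readings for one location; months: month label of each reading.
    With use_zscore=True the difference is also divided by the monthly
    (population) standard deviation.
    """
    by_month = {}
    for value, month in zip(values, months):
        by_month.setdefault(month, []).append(value)
    base = {month: mean(vals) for month, vals in by_month.items()}
    spread = {month: pstdev(vals) for month, vals in by_month.items()}

    anomalies = []
    for value, month in zip(values, months):
        anomaly = value - base[month]
        if use_zscore:
            # A month with no variation carries no anomaly signal.
            anomaly = anomaly / spread[month] if spread[month] > 0 else 0.0
        anomalies.append(anomaly)
    return anomalies


# Hypothetical temperatures: two Januaries (month 1) and two Julys (month 7).
temps = [10, 20, 12, 22]
labels = [1, 7, 1, 7]
```

Subtracting the monthly base removes the seasonal cycle: the January and July readings, ten degrees apart in raw form, end up on the same anomaly scale.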


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both.

Find persistent spatiotemporal anomalies in precipitation data.

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time (Matlab: bwlabel, bwlabeln)

Follow-up science question: explain the cause.

Figure: drought video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee.
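The two-step approach can be sketched without Matlab. The example below thresholds a small precipitation cube indexed as `precip[t][y][x]` and then groups dry cells into connected components with a breadth-first search. It assumes 6-connectivity (cells touching along time or along one spatial axis), which is a simplification chosen here and may differ from the connectivity used in the original analysis; the threshold and the toy data are hypothetical.

```python
from collections import deque


def drought_components(precip, threshold, min_size=1):
    """Connected sets of (t, y, x) cells whose precipitation is below `threshold`.

    Returns the components sorted from largest to smallest, each as a
    sorted list of cells; components smaller than `min_size` are dropped.
    """
    n_t, n_y, n_x = len(precip), len(precip[0]), len(precip[0][0])
    dry = {
        (t, y, x)
        for t in range(n_t) for y in range(n_y) for x in range(n_x)
        if precip[t][y][x] < threshold
    }
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

    seen = set()
    components = []
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            t, y, x = queue.popleft()
            component.append((t, y, x))
            for dt, dy, dx in steps:
                neighbour = (t + dt, y + dy, x + dx)
                if neighbour in dry and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        if len(component) >= min_size:
            components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


# Hypothetical cube: 2 time steps on a 2x3 grid; values below 1 count as dry.
cube = [
    [[0, 0, 5], [5, 5, 5]],
    [[0, 5, 5], [5, 5, 0]],
]
```

Raising `min_size` keeps only the persistent, spatially extended events, which is the sense in which step 2 asks for "large" components.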


More Climate Extreme Events using Anomaly Detection

Atmospheric rivers [8]:
  Water vapor content
  Anomalies using a threshold
  Connected components

Example source: http://newscenter.lbl.gov

Cold fronts [23]:
  Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  Anomalies are not actionable:
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution:
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD ’05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC ’08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Approaches to Identify Categorical Anomalies

Using association analysis [24]:
  Binarize data
  Learn rules ($X \Rightarrow Y$)
  Choose high-confidence rules ($P(Y \mid X)$)
  For test record $Z = \langle X, Y \rangle$, find rules of the form $P(Y \mid X)$
  $Y$ is not observed when $X$ is observed

Using Bayesian networks [33]:
  Learn Bayesian network structure and parameters
  Compute $P(Z)$ for test data record $Z$
  Flag anomaly if $P(Z) < \delta$

Using similarity metrics [10]:
  Use a similarity measure ($S(X_1, X_2)$)
  Apply a distance/density/clustering-based method (e.g., LOF)


Conditional Probability Test [18]

Identify unusual combinations of attribute values:

$r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}$, with $A \cap B = \emptyset$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous.

For a test record $t$:
  For each mutually exclusive pair of attribute sets $\{A, B\}$, compute $r(a_t, b_t)$
  Score $t$ based on all r-values:
    Assign the minimum r-value as the score
    Take the product of all r-values

Need to compare exponentially many pairs of subsets:
  Only consider subsets up to size $k$
  Ignore subsets with frequency less than a threshold $\alpha$
  Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$


Estimating Probabilities for CPT [18]

Maximum likelihood estimation:

$\frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

$C(a_t)$: number of training instances with $A = a_t$
$N$: total number of training instances

Laplace smoothing:

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup tricks:
  Replace rare attribute values with a generic attribute (reduce arity)
  Use an efficient data structure for querying counts (AD Trees [26])
  AD Trees work faster for low-arity attributes
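The conditional probability test with Laplace-smoothed estimates can be sketched as follows. This is an illustrative reading of the slide, assuming records stored as dictionaries, brute-force counting instead of AD Trees, the minimum r-value as the record score, and none of the pruning rules (frequency threshold $\alpha$, mutual-information threshold $\beta_\mu$). The function names and the toy data are hypothetical.

```python
from itertools import combinations


def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(train, record, attrs_a, attrs_b):
    """r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)) for two disjoint attribute tuples."""
    n = len(train)

    def count(attrs):
        # Number of training rows that agree with `record` on every attribute in `attrs`.
        return sum(all(row[k] == record[k] for k in attrs) for row in train)

    joint = laplace(count(attrs_a + attrs_b), n)
    return joint / (laplace(count(attrs_a), n) * laplace(count(attrs_b), n))


def cpt_score(train, record, max_size=1):
    """Minimum r-value over all disjoint pairs of attribute subsets up to `max_size`."""
    attrs = sorted(record)
    subsets = [c for k in range(1, max_size + 1) for c in combinations(attrs, k)]
    r_values = [
        r_value(train, record, a, b)
        for a, b in combinations(subsets, 2)
        if not set(a) & set(b)
    ]
    return min(r_values)


# Hypothetical training data: operating system and shell always go together.
train = (
    [{"os": "linux", "shell": "bash"}] * 4
    + [{"os": "windows", "shell": "powershell"}] * 4
)
```

A record that pairs values never seen together, such as `linux` with `powershell`, gets an r-value well below 1, while a familiar combination scores above 1.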


Finding Anomalies in Discrete Sequences

Many problem formulations:

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey.

Example sequences:
  login pwd mail ssh mail web logout
  login pwd mail web web web web logout
  login pwd mail ssh mail web web logout
  login pwd web mail ssh web mail logout
  login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance/density/clustering-based method.

Simplest: Hamming distance

$h(A_i, B_i) = \begin{cases} 1 & A_i \ne B_i \\ 0 & A_i = B_i \end{cases}$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence:

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$

The standard dynamic programming method is slow; faster versions are available (Hunt-Szymanski method [7]).

Weaknesses:
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
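Both distances are short to write down. The sketch below uses the standard dynamic programming recurrence for the longest common subsequence (the slow method the slide mentions, not the Hunt-Szymanski variant); the function names and the two example sessions are assumptions made for illustration.

```python
import math


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| * |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))


# Hypothetical user sessions.
normal = "login pwd mail ssh mail web logout".split()
odd = "login pwd login pwd login pwd logout".split()
```

Unlike the Hamming distance, `nlcs_distance` accepts sequences of different lengths, which addresses the first issue listed above. Either distance can then be passed to a nearest-neighbour or clustering-based detector.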


Using Sliding Windows

Slide a window of size $k$.

Extract all windows from a sequence ($n - k + 1$ windows).

Training (creating a normal dictionary): store all unique windows in all normal sequences, along with their counts.

Testing:
  For each window, find its frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist:
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:
  Penalizes low-frequency windows in the normal dictionary
  Rewards high-frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
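A minimal sketch of the window-based scheme follows. The slide does not fix how frequencies are aggregated, so this example assumes the average relative frequency of a test sequence's windows and inverts it, with a small constant `eps` added to avoid division by zero. Those choices, the function names, and the sample sessions are all assumptions.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 contiguous windows of length k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def build_dictionary(normal_seqs, k):
    """Count every window that occurs in the normal sequences."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts


def window_anomaly_score(seq, dictionary, k, eps=1e-9):
    """Inverse of the average relative frequency of the sequence's windows."""
    test_windows = windows(seq, k)
    total = sum(dictionary.values())
    # Counter returns 0 for windows that never occurred in the normal data.
    avg_freq = sum(dictionary[w] / total for w in test_windows) / len(test_windows)
    return 1.0 / (avg_freq + eps)


# Hypothetical normal sessions and one suspicious session.
normal_sessions = [
    "login pwd mail ssh mail web logout".split(),
    "login pwd mail web web web web logout".split(),
]
suspicious = "login pwd login pwd login pwd logout".split()
normal_dict = build_dictionary(normal_sessions, k=3)
```

None of the suspicious session's windows appear in the dictionary, so its score is far larger than that of any training session.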


Using Probabilistic Models

Probability of occurrence of sequence $S$:

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences:

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$:

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

$f$ is estimated from the normal dictionary.

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence.

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?
  Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  Significantly reduces the size of the model
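The fixed-order version of this model can be sketched as below. It estimates each conditional probability as the count of the length-$(k+1)$ window divided by the count of its length-$k$ prefix. As the score it uses the negative average log-probability, which grows as the normalized probability shrinks; this stands in for the "inverse" on the slide and is an assumption, as is the probability floor applied to unseen transitions. The suffix-shortening idea of Probabilistic Suffix Trees is not implemented.

```python
import math
from collections import Counter


def train_counts(normal_seqs, k):
    """Counts of length-k contexts and of length-(k+1) windows (context plus next symbol)."""
    context_counts = Counter()
    window_counts = Counter()
    for seq in normal_seqs:
        for i in range(k, len(seq)):
            context_counts[tuple(seq[i - k:i])] += 1
            window_counts[tuple(seq[i - k:i + 1])] += 1
    return context_counts, window_counts


def markov_anomaly_score(seq, context_counts, window_counts, k, floor=1e-6):
    """Negative average log-probability of the symbols given their k predecessors."""
    if len(seq) <= k:
        raise ValueError("sequence must be longer than the context length k")
    log_prob = 0.0
    steps = 0
    for i in range(k, len(seq)):
        seen_context = context_counts[tuple(seq[i - k:i])]
        if seen_context:
            p = window_counts[tuple(seq[i - k:i + 1])] / seen_context
        else:
            p = 0.0  # context never observed in the normal data
        log_prob += math.log(max(p, floor))
        steps += 1
    return -log_prob / steps


# Hypothetical normal data: a strictly alternating sequence.
ctx, win = train_counts([list("abababab")], k=1)
```

A test sequence that keeps alternating scores 0, while one with repeated symbols hits the probability floor and scores much higher.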


Finding Contextual Anomalies

Sometimes contextual information is available about the data:
  Not used directly as a feature
  Well understood, no anomalies in the context
  Can reduce false positives and yield interesting anomalies

Example adapted from [30].

Figure: number of fever cases plotted against maximum daily temperature, with the “clearest outlier” and a “conditional anomaly” marked.

Contextual anomalies: anomalous with respect to a context.

Context is defined using environmental variables:
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context?
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

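Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$
  $d_U$ environmental attributes
  $d_V$ indicator attributes

Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space ($u_1$, $u_2$) mapped to components $V_1, V_2, V_3$ in the indicator space ($v_1$, $v_2$), with $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$.

Algorithm [30]:

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$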
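Step 4 is easy to trace once the components are known. The sketch below evaluates $S = \sum_i p(u \mid U_i) \sum_j p(v \mid V_j)\, p(V_j \mid U_i)$ for one-dimensional Gaussians with hand-picked parameters. The learning steps 1 to 3 are skipped, and every number (means, standard deviations, mapping probabilities) is a hypothetical value chosen to mimic the temperature and fever example.

```python
import math


def gauss_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def cad_score(u, v, env_comps, ind_comps, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i); low values are anomalous.

    env_comps and ind_comps are lists of (mean, std); mapping[i][j] = p(V_j | U_i).
    """
    total = 0.0
    for i, (mean_u, std_u) in enumerate(env_comps):
        inner = sum(
            gauss_pdf(v, mean_v, std_v) * mapping[i][j]
            for j, (mean_v, std_v) in enumerate(ind_comps)
        )
        total += gauss_pdf(u, mean_u, std_u) * inner
    return total


# Hypothetical model: temperature is the environment, fever cases the indicator.
ENV = [(5.0, 3.0), (30.0, 3.0)]    # cold days, hot days
IND = [(10.0, 3.0), (50.0, 5.0)]   # few fever cases, many fever cases
MAPPING = [
    [0.1, 0.9],   # cold days: mostly many cases
    [0.9, 0.1],   # hot days: mostly few cases
]
```

With these numbers, 50 fever cases on a 5-degree day receives a much higher score than the same count on a 30-degree day, so the latter is the conditional anomaly even though 50 cases is not unusual overall.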


Finding Collective Anomalies

Find a collection of data points:
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies.

A simple solution:

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples:

1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

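Find anomalous groups in data [34].

Example: spatial clusters of galaxies
  Topics: red, green, emissive
  Words: continuous features

Flexible Genre Model (FGM)

For each group:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$
2. Draw a topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw $K$ topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
4. For each point in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

Figure: plate diagram of the FGM with variables $\alpha$, $\pi$, $\eta$, $\beta$, $y_m$, $\theta_m$, $z_{mn}$, $x_{mn}$ and plates $N$, $M$, $K$, $T$.

Model parameters:
  $\mathcal{M}(\pi)$: multinomial
  Genre: $\mathrm{Dir}(\alpha_t)$
  Topic generators $P(\cdot \mid \eta_k)$: GIW (Gaussian-Inverse-Wishart)
  Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian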


Inference and Testing for FGM

Inference and learning parameters:
  Approximate inference of latent variables (Gibbs sampling)
  Use samples to learn parameters (single-step Monte Carlo EM)

Figure: groups shown with respect to Topic 1, Topic 2, and Topic 3.

Anomaly detection:
  Infer the topic distribution $\theta_m$
  Compute the negative log-likelihood w.r.t. $\alpha_t$
  Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies.


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
  Confusion matrix
  Traditional evaluation metrics (class imbalance)
  ROC curve

Validation set does not exist:
  Use domain expertise to find TP and FP
  FN is harder to estimate: pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a    Predicted n
  Actual a    TP             FN
  Actual n    FP             TN

$Acc = \frac{TP + TN}{TP + FN + FP + TN}$

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
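The four metrics follow directly from the confusion matrix. The sketch below assumes labels `"a"` (anomalous) and `"n"` (normal), returns 0.0 when a denominator is zero, and uses a hypothetical set of ten labeled instances; the helper names are illustrative.

```python
def confusion_counts(actual, predicted, positive="a"):
    """Return (TP, FN, FP, TN), treating `positive` as the anomalous class."""
    tp = fn = fp = tn = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            tp += 1
        elif act == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if recall + precision:
        f_measure = 2 * recall * precision / (recall + precision)
    else:
        f_measure = 0.0
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f": f_measure,
    }


# Hypothetical labels: 3 true anomalies among 10 instances.
actual = list("aaannnnnnn")
predicted = list("aanannnnnn")
```

The class imbalance noted above shows up here: a detector that labels everything normal would still reach an accuracy of 0.7 on this data while its recall is 0, which is why recall, precision, and F are reported alongside accuracy.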


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
  Use a threshold $\delta$ on the score (scale issues [21])
  Take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output:
  Does the output “suggest” $x$ or $\delta$?
  Which output is better?


Unifying Scores

Different methods assign scores in different ranges:
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalies.

Desired scaling: stretch interesting ranges and shrink irrelevant ones.

Generalized procedure for normalizing outlier scores [21]:
  Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
  Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization:

1. $R(o) = \max\{0, S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization:

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
   Suited for high-dimensional data
3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma scaling)
   where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   Suited for low-dimensional data
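The baseline regularization $R(o) = \max\{0, S(o) - base_S\}$ and the Gaussian scaling $N(o) = \max(0, \mathrm{erf}((S(o) - \mu_S) / (\sigma_S \sqrt{2})))$ can be tried on a handful of scores. In this sketch $\mu_S$ and $\sigma_S$ are estimated from the scores themselves (population standard deviation), which is one reasonable choice rather than something the slide specifies; the LOF-like values are hypothetical.

```python
import math
from statistics import mean, pstdev


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base): shift so that normal objects score about 0."""
    return [max(0.0, s - base) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2)))), mapped into [0, 1]."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores identical: nothing stands out
    return [
        max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2.0))))
        for s in scores
    ]


# Hypothetical LOF-like scores: five ordinary objects and one clear outlier.
lof_scores = [1.0, 1.1, 0.9, 1.0, 1.05, 6.0]
normalized = gaussian_scaling(regularize_baseline(lof_scores, base=1.0))
```

Scores at or below the mean collapse to 0, while the outlier is pushed close to 1, which is the "stretch interesting ranges, shrink irrelevant ones" behaviour the procedure aims for.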


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawback: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: the point $z$ and the sampled point $z'$ plotted against the $x$ and $y$ partitions, with the injected anomaly marked]
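The four perturbation steps above can be sketched as follows. The function name, the tuple layout (the first `x_dims` features form $x$, the rest form $y$), and the Euclidean distance are assumptions made for illustration; the slide does not fix any of them.

```python
import math
import random

def inject_anomaly(data, index, x_dims, sample_size, rng):
    """Perturb data[index] = (x, y) following the random perturbation steps.

    x_dims: number of leading features that form the x partition (assumed layout).
    Returns the perturbed record and the sampled record z' it borrowed x' from.
    """
    z = data[index]
    y = z[x_dims:]
    sample = rng.sample(data, sample_size)                      # random sample D
    z_prime = max(sample, key=lambda p: math.dist(p[x_dims:], y))  # y' farthest from y
    injected = z_prime[:x_dims] + y                             # replace x with x'
    return injected, z_prime

# Hypothetical data set of 3-feature records; the first feature plays the role of x.
rng = random.Random(0)
data = [(float(i), float(i % 7), float(i % 5)) for i in range(50)]
original = data[10]
injected, z_prime = inject_anomaly(data, index=10, x_dims=1, sample_size=20, rng=rng)
data[10] = injected  # add the perturbed z back to the data set
```

The injected record keeps its own $y$ but carries the $x$ of a record whose $y$ is very different, so the combination is unlikely under the original data.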


Applications Overview

How to set up an anomaly detection solution for a given application domain

Available data
Define anomalies, define normal behavior
Identify requirements and constraints (online, real-time, limited resources)
What domain knowledge is available?
  Feature identification
  Defining normal and anomalous behavior
  Tuning parameters
Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts for 2002, 2003, 2004, 2005, 2006, 2007, 2008, and July 2009-June 2010: 20,547; 18,827; 69,107; 113,025; 140,690; 624,267; 1,656,227; 2,704,836]


Network Topology

A typical network

[Figure: a router with four attached nodes (1 to 4), OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$, and link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
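A common way to write this relationship is $y_t = A x_t$, where entry $A_{ij}$ is 1 if OD flow $j$ traverses link $i$ and 0 otherwise. The slide does not spell out this equation, so the sketch below is an illustration under that assumption, with a made-up routing matrix and traffic values.

```python
def link_loads(routing, od_flows):
    """Compute y = A x: the traffic on each link is the sum of the OD flows routed over it."""
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing]

# Hypothetical routing matrix: 3 links (rows) x 4 OD flows (columns).
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
x_t = [10, 20, 30, 40]          # OD flow volumes in one time interval
y_t = link_loads(A, x_t)        # link measurements for the same interval

# A volume spike on the second OD flow shows up on every link that carries it.
x_spike = [10, 35, 30, 40]
y_spike = link_loads(A, x_spike)
```

Because each link aggregates several OD flows, a spike that is large relative to one OD flow can be small relative to the total traffic on the links it crosses.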


Network Anomaly Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$\hat{Z} = P P^{T} Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
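The five steps can be sketched in plain Python. To stay dependency-free, this sketch swaps the full eigenvalue decomposition for power iteration with deflation, which only recovers the top $k$ eigenvectors; the synthetic traffic, the choice $k = 1$, and all function names are assumptions made for illustration.

```python
import math
import random

def center(rows):
    """Subtract the per-column mean from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def covariance(centered):
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_components(cov, k, iters=200, seed=0):
    """Top-k eigenvectors of a symmetric matrix (power iteration with deflation)."""
    rng = random.Random(seed)
    d = len(cov)
    work = [row[:] for row in cov]
    components = []
    for _ in range(k):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):
            w = mat_vec(work, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(a * b for a, b in zip(v, mat_vec(work, v)))
        components.append(v)
        for i in range(d):              # deflation: remove the direction just found
            for j in range(d):
                work[i][j] -= eigenvalue * v[i] * v[j]
    return components

def residual_norm(z, components):
    """Norm of z_tilde = (I - P P^T) z."""
    z_hat = [0.0] * len(z)
    for p in components:
        coef = sum(a * b for a, b in zip(p, z))
        z_hat = [h + coef * a for h, a in zip(z_hat, p)]
    return math.sqrt(sum((a - h) ** 2 for a, h in zip(z, z_hat)))

# Synthetic link traffic: three links driven by one shared load, plus small noise.
rng = random.Random(1)
rows = []
for t in range(200):
    s = rng.uniform(50, 100)
    rows.append([s + rng.gauss(0, 1), 2 * s + rng.gauss(0, 1), 3 * s + rng.gauss(0, 1)])
rows[120][0] += 30.0                    # injected volume anomaly on the first link

centered = center(rows)
P = top_components(covariance(centered), k=1)
scores = [residual_norm(z, P) for z in centered]
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
```

In this synthetic setup the residual norm of the perturbed interval is far larger than that of any other interval, so thresholding `scores` isolates it.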


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this approach aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source [22]

Port scan anomaly: the traffic feature changes; however, traffic volume remains more or less the same

We can use entropy to capture the variations in the traffic feature

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure annotations: port scan dwarfed in volume metrics… but stands out in feature entropy, which also reveals its structure]
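The two boundary cases of the entropy measure can be checked with a short sketch. The port numbers below are made up, and computing entropy over the empirical distribution of one feature in one time bin is an assumption about how the measure is applied.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports observed in one time bin.
normal_bin = [80] * 8                           # all packets go to a single port
scan_bin = [21, 22, 23, 25, 53, 80, 110, 443]   # port scan: 8 distinct ports

h_normal = feature_entropy(normal_bin)   # maximally concentrated distribution
h_scan = feature_entropy(scan_bin)       # maximally dispersed over N = 8 values
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$ bits.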


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Domains and players
  Insurance (auto/health [a]): Claimant, Provider, Payer
  Telecommunications: Customer, Provider
  Credit Cards: Customer, Supplier, Bank
  Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   Example: User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives
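A minimal sketch of the activity monitoring idea: each profile is reduced to a mean and standard deviation, and current behavior is flagged when it deviates by more than a threshold. The user names, call counts, and the threshold of 3 standard deviations are all hypothetical choices, not part of the cited method.

```python
import statistics

def build_profile(history):
    """Profile = mean and sample standard deviation of a user's historical activity."""
    return statistics.mean(history), statistics.stdev(history)

def deviation_score(profile, current):
    """Number of standard deviations the current value lies from the profile mean."""
    mean, std = profile
    return abs(current - mean) / std

# Hypothetical monthly call counts for two users.
history = {"user_x": [40, 42, 38, 41, 39], "user_y": [5, 7, 6, 5, 7]}
current = {"user_x": 43, "user_y": 60}
THRESHOLD = 3.0  # assumed cut-off, not taken from the tutorial

profiles = {user: build_profile(h) for user, h in history.items()}
flagged = [u for u in current if deviation_score(profiles[u], current[u]) > THRESHOLD]
```

Scoring a new observation is a constant-time lookup and comparison, which is why this style of detection suits real-time use, at the cost of maintaining one profile per individual.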


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
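The first three egonet features can be computed directly from an edge list. This sketch assumes an undirected graph without self-loops, stored as a dictionary from node pairs to weights; it omits $\lambda_i$, and the example graph is made up.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    edges: dict mapping frozenset({u, v}) -> weight for an undirected graph.
    """
    neighbors = {v for e in edges if node in e for v in e if v != node}
    ego = neighbors | {node}
    inside = [w for e, w in edges.items() if e <= ego]   # edges with both ends in the egonet
    return len(neighbors), len(inside), sum(inside)

# Hypothetical graph: a star around "hub" and a 4-clique p, q, r, s.
edges = {frozenset(("hub", leaf)): 1.0 for leaf in ("a", "b", "c", "d")}
clique = ["p", "q", "r", "s"]
for i, u in enumerate(clique):
    for v in clique[i + 1:]:
        edges[frozenset((u, v))] = 2.0

star_features = egonet_features(edges, "hub")    # many neighbors, few edges
clique_features = egonet_features(edges, "p")    # few neighbors, many edges
```

The star center has as many egonet edges as neighbors, while the clique member has many more edges than neighbors; this contrast is what the $N_i$ versus $E_i$ feature pair exposes.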


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power law relationship

E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$)

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
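The score above translates directly into code. In this sketch the fitted constants $C$ and $\theta$ are simply passed in; the values used in the example are arbitrary, and the natural logarithm is an assumption since the slide does not state the base.

```python
import math

def power_law_score(x, y, C, theta):
    """Deviation of the observed y from the power-law prediction C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# Arbitrary illustrative fit: y = 1.0 * x ** 1.0.
on_the_line = power_law_score(x=10, y=10, C=1.0, theta=1.0)
far_above = power_law_score(x=10, y=100, C=1.0, theta=1.0)
```

A node that lies exactly on the fitted line scores 0, and the score grows with both the ratio and the absolute distance between the observed and the predicted value.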


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (Beneficiaries vs. Providers)
  Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality scores for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  Normality score = mean of non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$]
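One way to sketch the procedure above is to compute relevance with a random walk with restart and to use the pairwise relevance scores themselves as the similarity entries. Both choices, along with the restart probability of 0.15 and the example graph, are assumptions of this sketch rather than details given on the slide.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from a list of (a, t) edges."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, set()).add(t)
        adj.setdefault(t, set()).add(a)
    return adj

def rwr(adj, start, restart=0.15, iters=100):
    """Steady-state visit probabilities of a random walk with restart at `start`."""
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, mass in p.items():
            share = (1.0 - restart) * mass / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        nxt[start] += restart
        p = nxt
    return p

def normality(adj, t):
    """Mean off-diagonal relevance among the neighbors S_t of node t."""
    s_t = sorted(adj[t])
    rel = {a: rwr(adj, a) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs)

# Hypothetical graph: two tight communities and one node t7 that bridges them.
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2", "t3")]
edges += [(a, t) for a in ("a4", "a5", "a6") for t in ("t4", "t5", "t6")]
edges += [("a1", "t7"), ("a4", "t7")]
adj = build_adjacency(edges)

score_t1 = normality(adj, "t1")   # neighbors all come from one community
score_t7 = normality(adj, "t7")   # neighbors come from two different communities
```

In this example the bridging node `t7` receives a lower normality score than `t1`, because its two neighbors are only weakly relevant to each other.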


Detecting Disease Outbreaks

Domain Question
  Early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department visits
  Grocery data (Example [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ Flu or SARS
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules on $DB_{baseline}$ and $DB_{current}$
   Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$-test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value; repeat from Step 2
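The contingency-table test for a single rule can be sketched without external libraries. The counts below are made up, the table layout (rows: current vs. baseline, columns: rule matches vs. does not match) is an assumption, and the p-value uses the closed form of the chi-square survival function for one degree of freedom.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function, 1 dof
    return chi2, p_value

# Hypothetical rule "symptom = cough":
# rows = (DB_current, DB_baseline), columns = (matches rule, does not match).
table = [[60, 940], [20, 980]]
chi2, p_value = chi_square_2x2(table)

# A rule whose frequency is unchanged between the two data sets.
chi2_same, p_same = chi_square_2x2([[20, 980], [20, 980]])
```

A small p-value means the rule's frequency differs between the current and baseline data by more than chance would suggest.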


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location $i$
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
  30 year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
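As an illustration of the monthly-mean choice of $b_i$, the sketch below subtracts from each reading the mean of all readings taken in the same calendar month. The input layout (one value per consecutive month for a single location) and the synthetic series are assumptions made for the example.

```python
import statistics

def monthly_anomalies(values, start_month=1):
    """Return v' = v - b, where b is the mean of all readings in the same calendar month.

    values: one reading per consecutive month for a single location.
    """
    months = [(start_month - 1 + k) % 12 for k in range(len(values))]
    base = {
        m: statistics.mean([v for v, mm in zip(values, months) if mm == m])
        for m in set(months)
    }
    return [v - base[m] for v, m in zip(values, months)]

# Synthetic series: a seasonal cycle, with the second year uniformly 1.0 warmer.
year_one = [float(m) for m in range(12)]
year_two = [v + 1.0 for v in year_one]
anomalies = monthly_anomalies(year_one + year_two)
```

The seasonal cycle cancels out, leaving only the year-to-year difference: every month of the first year sits 0.5 below its monthly base and every month of the second year 0.5 above it.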


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
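The two-step approach can be sketched on a small (time, latitude, longitude) grid. This is a plain-Python stand-in for the labeling step, using 6-connectivity (shared faces only) as a simplifying assumption; the grid values and the threshold are made up.

```python
from collections import deque

def threshold_mask(grid, threshold):
    """Step 1: mark cells whose precipitation anomaly falls below `threshold`."""
    return [[[v < threshold for v in row] for row in plane] for plane in grid]

def component_sizes(mask):
    """Step 2: sizes of 6-connected components of marked cells in a (time, lat, lon) mask."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, sizes = set(), []
    for start in ((t, r, c) for t in range(T) for r in range(R) for c in range(C)):
        if not mask[start[0]][start[1]][start[2]] or start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:                      # breadth-first search over marked neighbors
            t, r, c = queue.popleft()
            size += 1
            for dt, dr, dc in steps:
                n = (t + dt, r + dr, c + dc)
                if (0 <= n[0] < T and 0 <= n[1] < R and 0 <= n[2] < C
                        and mask[n[0]][n[1]][n[2]] and n not in seen):
                    seen.add(n)
                    queue.append(n)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical anomalies: 3 time steps on a 3 x 3 grid, 0.0 = normal, -2.0 = very dry.
grid = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
for t in range(3):
    grid[t][0][0] = grid[t][0][1] = -2.0   # dry region persisting through all time steps
grid[1][2][2] = -2.0                       # isolated dry cell in a single time step

sizes = component_sizes(threshold_mask(grid, threshold=-1.0))
```

The persistent dry region forms one component of six cells spanning all time steps, while the isolated dry cell forms a component of size one that a minimum-size filter would discard.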


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


L Akoglu M McGlohon and C Faloutsos

OddBall Spotting Anomalies in Weighted GraphsIn In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) June 2010

A Asuncion and D J Newman

UCI machine learning repository[httparchiveicsucieduml] Irvine CA University of California 2007

S D Bay and M Schwabacher

Mining distance-based outliers in near linear time with randomization and a simple pruning ruleIn Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and datamining pages 29ndash38 ACM Press 2003

R Bolton and D Hand

Unsupervised profiling methods for fraud detectionIn Credit Scoring and Credit Control VII 1999

S Boriah V Chandola and V Kumar

Similarity measures for categorical data A comparative evaluationIn SDM pages 243ndash254 2008

M M Breunig H-P Kriegel R T Ng and J Sander

Lof identifying density-based local outliersIn Proceedings of 2000 ACM SIGMOD International Conference on Management of Data pages 93ndash104ACM Press 2000

S Budalakoti A Srivastava and M Otey

Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airlinesafetyProceedings of the IEEE International Conference on Systems Man and Cybernetics 37(6) 2007

S Byna Prabhat M Wehner and K Wu

Detecting atmospheric rivers in large climate datasets

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

In 2nd International Workshop on Petascale Data Analytics Challenges and Opportunities (PDAC-11)2011

V Chandola A Banerjee and V Kumar

Anomaly detection for discrete sequences A surveyIEEE Transactions on Knowledge and Data Engineering 99(PrePrints) 2010

V Chandola S Boriah and V Kumar

A framework for exploring categorical dataIn Proceedings of the ninth SIAM International Conference on Data Mining 2009

V Chandola V Mithal and V Kumar

A comparative evaluation of anomaly detection techniques for sequence dataIn Proceedings of International Conference on Data Mining 2008

V Chandola V Mithal and V Kumar

Understanding anomaly detection techniques for sequence dataTechnical Report 09-001 University of Minnesota Computer Science Department January 2009

T Fawcett and F Provost

Activity monitoring noticing interesting changes in behaviorIn Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and DataMining pages 53ndash62 ACM Press 1999

A Goldenberg G Shmueli R A Caruana and S E Fienberg

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication salesProceedings of the National Academy of Sciences of the United States of America 99(8)pp 5237ndash52402002

D Hawkins

Identification of outliersMonographs on Applied Probability and Statistics May 1980

S A Hofmeyr S Forrest and A Somayaji

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Intrusion detection using sequences of system callsJournal of Computer Security 6(3)151ndash180 1998

X Jiang and G F Cooper

A bayesian spatio-temporal method for disease outbreak detectionJAMIA pages 462ndash471 2010

J S Kaustav Das

Detecting anomalous records in categorical datasetsIn Proc of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data miningAug 2007

J Kawale S Chatterjee A Kumar S Liess M Steinbach and V Kumar

Anomaly construction in climate data Issues and challengesIn Proceedings of NASA Conference on Intelligent Data Understanding 2011

E M Knorr and R T Ng

Algorithms for mining distance-based outliers in large datasetsIn Proceedings of the 24rd International Conference on Very Large Data Bases pages 392ndash403 MorganKaufmann Publishers Inc 1998

H-P Kriegel P Kroger E Schubert and A Zimek

Interpreting and unifying outlier scoresIn SDM pages 13ndash24 2011

A Lakhina M Crovella and C Diot

Diagnosing network-wide traffic anomaliesIn Proceedings of ACM SIGCOMM pages 219ndash230 2004

X Li R Ramachandran S Graves S Movva B Akkiraju D Emmitt S Greco R Atlas J Terry and

J-C JusemAutomated detection of frontal systems from numerical model-generated dataIn Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in datamining KDD rsquo05 pages 782ndash787 New York NY USA 2005 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

M V Mahoney and P K Chan

Learning rules for anomaly detection of hostile network trafficIn ICDM pages 601ndash604 2003

S V Mane

False negative estimation theory techniques and applicationsProQuest UMI Dissertation Publishing 2011

A Moore and M S Lee

Cached sufficient statistics for efficient machine learning with large datasetsJ Artif Int Res 867ndash91 March 1998

Y Pei and O Zaane

A synthetic data generator for clustering and outlier analysisTechnical report University of Alberta 2006

S Ramaswamy R Rastogi and K Shim

Efficient algorithms for mining outliers from large data setsIn Proceedings of the 2000 ACM SIGMOD international conference on Management of data pages427ndash438 ACM Press 2000

N Satish M Harris and M Garland

Designing efficient sorting algorithms for manycore gpusIn IPDPS pages 1ndash10 2009

X Song M Wu C Jermaine and S Ranka

Conditional anomaly detectionIEEE Trans on Knowl and Data Eng 19631ndash645 May 2007

D A Spielman and N Srivastava

Graph sparsification by effective resistancesIn Proceedings of the 40th annual ACM symposium on Theory of computing STOC rsquo08 pages 563ndash568New York NY USA 2008 ACM

P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)

  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
      • Anomalies in Climate
      • Validation of Anomalies
Conditional Probability Test [18]

Identify unusual combinations of attribute values:

$$ r(a_t, b_t) = \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)}, \qquad A \cap B = \emptyset $$

Assumption: if $r(a_t, b_t)$ is low and is observed in test record $t$, then $t$ is anomalous.

For a test record $t$:
  • For each mutually exclusive pair of attribute sets $A, B$, compute $r(a_t, b_t)$
  • Score $t$ based on all r-values
    • Assign the minimum r-value as the score
    • Take the product of all r-values

Need to compare an exponential number of pairs of subsets:
  • Only consider subsets up to size $k$
  • Ignore subsets with frequency less than a threshold $\alpha$
  • Avoid comparing independent subsets of attributes: $\mu(A, B) \ge \beta_\mu$

Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$$ \frac{P(a_t, b_t)}{P(a_t)\,P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)} $$

  • $C(a_t)$: number of training instances with $A = a_t$
  • $N$: total number of training instances

Laplace Smoothing

$$ E(p) = \frac{C(p) + 1}{N + 2}, \qquad r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)} $$

Speedup Tricks
  • Replace rare attribute values with a generic attribute (reduce arity)
  • Use an efficient data structure to query for counts (AD Trees [26])
  • AD Trees work faster for low-arity attributes
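The r-value with Laplace smoothing can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation from [18]: the names `laplace`, `count_matches` and `cpt_score`, the tuple-based record layout, and the use of the minimum r-value as the score are choices made here, and counts come from a linear scan rather than an AD Tree.

```python
from itertools import combinations

def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)

def count_matches(train, record, attrs):
    """C(.): number of training rows that agree with `record` on `attrs`."""
    return sum(all(row[i] == record[i] for i in attrs) for row in train)

def cpt_score(train, record, k=1):
    """Minimum r-value over disjoint attribute subsets of size up to k."""
    n = len(train)
    subsets = [s for size in range(1, k + 1)
               for s in combinations(range(len(record)), size)]
    r_values = []
    for a, b in combinations(subsets, 2):
        if set(a) & set(b):          # A and B must be disjoint
            continue
        e_ab = laplace(count_matches(train, record, a + b), n)
        e_a = laplace(count_matches(train, record, a), n)
        e_b = laplace(count_matches(train, record, b), n)
        r_values.append(e_ab / (e_a * e_b))
    return min(r_values)

# Hypothetical training data in which the two attributes always co-occur.
train = [("a", "x")] * 50 + [("b", "y")] * 50
print(cpt_score(train, ("a", "x")))  # frequent combination: r above 1
print(cpt_score(train, ("a", "y")))  # unseen combination: r close to 0
```

A record needs at least two attributes for any pair to exist; thresholds such as $\alpha$ and $\beta_\mu$ are left out to keep the sketch short.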

Finding Anomalies in Discrete Sequences

Many problem formulations:

1. Anomalous symbols in a sequence

2. Anomalous subsequence in a sequence

3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey.

Example sequences:

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd webmail ssh webmail logout
login pwd login pwd login pwd logout

Treating Sequences as Points

Utilize a distance/similarity measure:
  • Plug it into a distance-, density- or clustering-based method

Simplest: Hamming distance

$$ h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases}, \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i) $$

Issues: unequal lengths, misalignment

Normalized length of the longest common subsequence (LCS):

$$ D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}} $$

  • The standard dynamic programming method is slow
  • Faster versions are available (Hunt-Szymanski method [7])

Weaknesses:
  • Cannot localize anomalies within a sequence
  • Weak anomaly signals might get lost
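As a minimal sketch of the normalized LCS distance, the following Python uses the standard quadratic dynamic program (not the faster Hunt-Szymanski variant). The function names and the two login sequences are illustrative assumptions; both sequences must be non-empty.

```python
import math

def lcs_length(a, b):
    """Length of the longest common subsequence (standard O(|a||b|) DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A| |B|)."""
    return 1.0 - lcs_length(a, b) / math.sqrt(len(a) * len(b))

# Hypothetical session logs, tokenized into symbols.
s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
print(lcs_distance(s1, s1))  # identical sequences give distance 0
print(lcs_distance(s1, s2))  # partially overlapping sequences give a value in (0, 1)
```

The resulting distance can be handed to any distance-based detector, for example a k-nearest-neighbor score over a database of sequences.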

Using Sliding Windows

Slide a window of size $k$:
  • Extract all windows from a sequence ($n - k + 1$ windows)

Training (creating a normal dictionary):
  • Store all unique windows in all normal sequences, together with their counts

Testing:
  • For each window, find its frequency in the normal dictionary
  • Anomaly score is the inverse of the aggregate frequency over all windows (normalized by length)

Many variants exist:
  • For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:
  • Penalizes low-frequency windows in the normal dictionary
  • Rewards high-frequency windows that might not be relevant
  • Can construct anomalous sequences that will escape detection
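The training and testing steps can be sketched as follows. The exact form of the score is an assumption: here it is `1 / (1 + mean window frequency)`, where the added 1 only avoids division by zero, and all function names are hypothetical.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Normal dictionary: every unique window with its count."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def window_score(seq, dictionary, k):
    """Inverse of the average window frequency (assumed form of the score)."""
    ws = windows(seq, k)
    mean_freq = sum(dictionary[w] for w in ws) / len(ws)
    return 1.0 / (1.0 + mean_freq)

normal = ["login pwd mail web logout".split()] * 3
dictionary = build_dictionary(normal, k=3)
seen = "login pwd mail web logout".split()
unseen = "login pwd login pwd logout".split()
print(window_score(seen, dictionary, 3))    # every window is frequent
print(window_score(unseen, dictionary, 3))  # no window is in the dictionary
```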

Using Probabilistic Models

Probability of occurrence of sequence $S$:

$$ P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1}) $$

Short memory property of sequences:

$$ P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1}) $$

Conditional probability estimate for a symbol $S_i$:

$$ P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})} $$

where the frequency $f$ is estimated from the normal dictionary.

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence.

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?
  • Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  • Significantly reduces the size of the model
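A rough sketch of a fixed-order model with suffix back-off is given below. It only imitates the idea behind Probabilistic Suffix Trees and is not the algorithm of [32]: the n-gram counter, the `min_count` threshold, the add-one smoothing and the use of negative log-probability per symbol as the score are all assumptions.

```python
import math
from collections import Counter

def ngram_counts(sequences, k):
    """Frequencies f(.) of all n-grams of length 1..k+1 in the normal data."""
    counts = Counter()
    for seq in sequences:
        for n in range(1, k + 2):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts

def cond_prob(counts, context, symbol, min_count, n_symbols, total):
    """P(symbol | context), shortening the context until it is frequent enough."""
    while context and counts[context] < min_count:
        context = context[1:]                      # drop the oldest symbol
    num = counts[context + (symbol,)]
    den = counts[context] if context else total
    return (num + 1) / (den + n_symbols)           # add-one smoothing (assumption)

def sequence_score(counts, seq, k, min_count=2):
    """Negative log-probability per symbol; higher means more anomalous."""
    unigrams = [c for gram, c in counts.items() if len(gram) == 1]
    n_symbols, total = len(unigrams), sum(unigrams)
    log_p = 0.0
    for i, symbol in enumerate(seq):
        context = tuple(seq[max(0, i - k):i])
        log_p += math.log(cond_prob(counts, context, symbol,
                                    min_count, n_symbols, total))
    return -log_p / len(seq)

counts = ngram_counts([list("abababab")] * 5, k=2)
print(sequence_score(counts, list("abab"), k=2))  # follows the normal pattern
print(sequence_score(counts, list("aabb"), k=2))  # contains unseen transitions
```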

Finding Contextual Anomalies

Sometimes contextual information is available about data:
  • Not used directly as a feature
  • Well understood, no anomalies in the context
  • Can reduce false positives and yield interesting anomalies

[Figure: number of fever cases versus maximum daily temperature, marking the "clearest outlier" and a "conditional anomaly". Example adapted from [30].]

Contextual anomalies: anomalous with respect to a context.

Context is defined using environmental variables:
  • Spatial (latitude, longitude)
  • Graph context (edges, weights)
  • Temporal location
  • Domain specific (demographic, other)

How to incorporate context?
  • Reduce to traditional anomaly detection (subset on context)
  • Explicitly model contextual information (time series, spatial)
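The "subset on context" reduction can be illustrated with a per-context z-score. This is a deliberately simple sketch: the season labels, the fever counts and the choice of a z-score as the underlying detector are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

def contextual_scores(records):
    """records: list of (context, value). Returns |z| of each value within its context."""
    groups = defaultdict(list)
    for context, value in records:
        groups[context].append(value)
    stats = {c: (mean(v), pstdev(v)) for c, v in groups.items()}
    scores = []
    for context, value in records:
        mu, sigma = stats[context]
        scores.append(abs(value - mu) / sigma if sigma > 0 else 0.0)
    return scores

# Hypothetical daily fever counts: 40 cases is ordinary in winter, unusual in summer.
records = [("winter", v) for v in (40, 42, 38, 41, 39)] + \
          [("summer", v) for v in (5, 6, 4, 5, 40)]
scores = contextual_scores(records)
print(scores.index(max(scores)))  # position of the summer day with 40 cases
```

Without the context, the value 40 sits inside the overall range of the data and would not stand out.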

Conditional Anomaly Detection

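Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$
  • $d_U$ environmental attributes
  • $d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ over the environmental attributes $(u_1, u_2)$ and $V_1, V_2, V_3$ over the indicator attributes $(v_1, v_2)$, with mapping probabilities $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$.]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$

2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$

3. Learn a probabilistic mapping function $p(V_j \mid U_i)$

4. Score a test instance $d = [u, v]$:

$$ S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i) $$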
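The scoring step (step 4) can be sketched for one-dimensional attributes as below. The Gaussians and the mapping table are made-up values standing in for the output of steps 1 to 3, and the function names are hypothetical; a low $S$ indicates a conditional anomaly.

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    total = 0.0
    for i, (mu_u, sd_u) in enumerate(U):
        inner = sum(gauss_pdf(v, mu_v, sd_v) * mapping[i][j]
                    for j, (mu_v, sd_v) in enumerate(V))
        total += gauss_pdf(u, mu_u, sd_u) * inner
    return total

# Hypothetical model: u = daily temperature, v = number of fever cases.
U = [(0.0, 1.0), (10.0, 1.0)]        # (mean, std): cold days, hot days
V = [(5.0, 1.0), (50.0, 5.0)]        # (mean, std): few cases, many cases
mapping = [[0.1, 0.9],               # p(V_j | U_0): cold days mostly have many cases
           [0.9, 0.1]]               # p(V_j | U_1): hot days mostly have few cases
print(cad_score(0.0, 50.0, U, V, mapping))   # many cases on a cold day: expected
print(cad_score(10.0, 50.0, U, V, mapping))  # many cases on a hot day: lower score
```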

Finding Collective Anomalies

Find a collection of data points:
  • Each point by itself is normal
  • The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies.

A Simple Solution

1. Break data into groups

2. Compute features for each group

3. Apply traditional anomaly detection

Examples

1. Time series

2. Image

3. Spatial clusters of galaxies
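The three-step "simple solution" can be sketched on a time series as follows. The group width, the single feature (standard deviation within each group) and the z-score detector are illustrative assumptions.

```python
from statistics import mean, pstdev

def group_scores(series, width):
    """Split a series into consecutive groups and score each by the z-score of its spread."""
    groups = [series[i:i + width]
              for i in range(0, len(series) - width + 1, width)]   # 1. break into groups
    spreads = [pstdev(g) for g in groups]                          # 2. one feature per group
    mu, sigma = mean(spreads), pstdev(spreads)
    return [abs(s - mu) / sigma if sigma > 0 else 0.0              # 3. point detector
            for s in spreads]

# Hypothetical series: an oscillation with one flat stretch. Every value in the
# flat stretch lies within the normal range, so no single point is anomalous.
series = [0, 1] * 20
series[12:16] = [0.5, 0.5, 0.5, 0.5]
scores = group_scores(series, width=4)
print(scores.index(max(scores)))  # index of the flat group
```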

Using Latent Dirichlet Allocation for Group Anomalies

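Find anomalous groups in data [34].

Example: spatial clusters of galaxies
  • Topics: red, green, emissive
  • Words: continuous features

Flexible Genre Model (FGM)

For each group $m$:

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$

2. Draw the topic distribution for $y_m$: $\theta_m \in S_K$, $\theta_m \sim Dir(\alpha_{y_m})$

3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$

4. For each point $n$ in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM over the variables $\pi$, $\alpha$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$, with plates of size $T$, $K$, $N$ and $M$.]

Model Parameters
  • $\mathcal{M}(\pi)$: multinomial
  • Genre: $Dir(\alpha_t)$
  • Topic generators $P(\cdot \mid \eta_k)$: Gaussian-inverse-Wishart (GIW)
  • Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian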
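To make the generative story concrete, here is a heavily simplified sampler for a single group. It is a sketch under strong assumptions and not the model of [34]: points are one-dimensional, each topic is reduced to a Gaussian mean that is jittered around a fixed value instead of being drawn from a GIW prior, and all names and parameter values are hypothetical.

```python
import random

def sample_dirichlet(rng, alpha):
    """Dirichlet sample via normalized Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def sample_group(rng, pi, alpha, topic_means, n_points, jitter=0.1):
    """Follow steps 1-4 for one group with 1-D points."""
    y = rng.choices(range(len(pi)), weights=pi)[0]          # 1. genre
    theta = sample_dirichlet(rng, alpha[y])                  # 2. topic distribution
    beta = [rng.gauss(m, jitter) for m in topic_means]       # 3. per-group topics
    points = []
    for _ in range(n_points):                                # 4. points
        z = rng.choices(range(len(theta)), weights=theta)[0] #    topic membership
        points.append((z, rng.gauss(beta[z], 1.0)))          #    observed value
    return y, theta, points

rng = random.Random(0)
pi = [0.5, 0.5]                       # two genres
alpha = [[5.0, 1.0, 1.0],             # genre 0 favors topic 0
         [1.0, 1.0, 5.0]]             # genre 1 favors topic 2
y, theta, points = sample_group(rng, pi, alpha, [-5.0, 0.0, 5.0], n_points=20)
print(y, [round(t, 2) for t in theta])
```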

Inference and Testing for FGM

Inference and Learning Parameters
  • Approximate inference of latent variables (Gibbs sampling)
  • Use samples to learn parameters (single-step Monte Carlo EM)

[Figure legend: Topic 1, Topic 2, Topic 3.]

Anomaly Detection
  • Infer the topic distribution $\theta_m$
  • Compute the negative log likelihood with respect to $\alpha_t$
  • Rationale: an anomalous group will be unlikely to be generated from any genre
  • Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies there

Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
  • Confusion matrix
  • Traditional evaluation metrics
  • Class imbalance
  • ROC curve

Validation set does not exist:
  • Use domain expertise to find TP and FP
  • FN is harder to estimate
  • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

               Predicted a   Predicted n
  Actual a         TP            FN
  Actual n         FP            TN

$$ Acc = \frac{TP + TN}{TP + TN + FP + FN}, \qquad Rec\,(R) = \frac{TP}{TP + FN}, \qquad Prec\,(P) = \frac{TP}{TP + FP}, \qquad F = \frac{2 \cdot R \cdot P}{R + P} $$
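A small worked example shows why accuracy alone is misleading under class imbalance. The counts below are hypothetical, and the function assumes that at least one anomaly exists and at least one instance is flagged.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from a confusion matrix."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * recall * precision / (recall + precision)
    return accuracy, recall, precision, f_measure

# Hypothetical validation set: 10 anomalies among 1000 instances.
acc, rec, prec, f = evaluation_metrics(tp=8, fn=2, fp=4, tn=986)
print(acc, rec, prec, f)
```

Here accuracy is above 0.99 even though a third of the flagged instances are false alarms, which is why recall, precision and the F-measure are reported alongside it.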

Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
  • Use a threshold $\delta$ on the score (scale issues [21])
  • Take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output:
  • Does the output "suggest" $x$ or $\delta$?
  • Which output is better?

Unifying Scores

Different methods assign scores in different ranges:
  • kNN-based scores: [0, 1]
  • LOF scores: [1, 6]
  • ABOD scores: [0, 80000]
    • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies.

Desired scaling: stretch interesting ranges and shrink irrelevant ones.

Generalized Procedure for Normalizing Outlier Scores [21]
  • Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
  • Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$

Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$

2. $R(o) = S_{max} - S(o)$

3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$

2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
   • Suited for high-dimensional data

3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_\gamma}{1 - \mu_\gamma}\right)$ (Gamma Scaling)
   • Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   • Suited for low-dimensional data
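Gaussian scaling is straightforward to try out with the standard library. The sketch below assumes the scores are already regular (higher means more anomalous) and not all identical; the function name and the sample scores are hypothetical.

```python
import math
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
            for s in scores]

# Hypothetical raw scores: four ordinary instances and one outlier.
raw = [1, 1, 1, 1, 10]
normalized = gaussian_scaling(raw)
print(normalized)  # below-average scores map to 0, the outlier maps close to 1
```

Scores below the mean are clipped to 0, so the usable range of [0, 1] is spent on the instances that deviate upward.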

Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
  • Use generative models for normal and anomalous behavior
  • Several generators available
    • Multivariate continuous data [27]
    • Multivariate categorical data [5]
    • Discrete sequences using HMM [12]
  • Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
  • Given data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
  • Take a random sample $D$ of the entire data set
  • Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
  • Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane and the resulting injected anomaly.]
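The random perturbation scheme can be sketched for scalar partitions as follows. The function name, the absolute difference as the distance, and the toy data are assumptions made for illustration.

```python
import random

def inject_anomaly(data, index, sample_size, rng):
    """Perturb data[index] = (x, y) by borrowing x' from the sampled point whose y' is farthest from y."""
    x, y = data[index]
    sample = rng.sample(data, sample_size)               # random sample D
    x_far, _ = max(sample, key=lambda p: abs(p[1] - y))  # z' with maximum |y - y'|
    return (x_far, y)                                    # replace x with x'

# Hypothetical data set in which y is always twice x.
data = [(i, 2 * i) for i in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data, index=0, sample_size=len(data), rng=rng)
print(injected)  # pairs an x from the far end of the data with y = 0
```

The injected point keeps a realistic value in each partition while breaking the relationship between them.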

Applications Overview

How to set up an anomaly detection solution for a given application domain?
  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
    • Feature identification
    • Defining normal and anomalous behavior
    • Tuning parameters
  • Available ground truth (training, validation)

What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures. Such a system is therefore vulnerable to zero-day attacks.

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents.

Network Anomaly Detection: finding unusual and large changes in the network traffic.

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds).

Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart of yearly counts. Values in chart order: 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836.]

Network Topology

A typical network

[Figure: a router connected to nodes 1-4, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$.]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$).

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$).

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$.
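In the subspace-method literature this relationship is usually written as $y = A x$, where $A_{ij} = 1$ if OD flow $j$ is routed over link $i$. Assuming that form, a toy example (the network, the matrix and the traffic values are all hypothetical) looks like this:

```python
def link_traffic(A, x):
    """y = A x: each link carries the sum of the OD flows routed over it."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# Hypothetical network: 3 links (rows) and 4 OD flows (columns).
A = [[1, 1, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 1, 1]]
x = [100, 250, 80, 40]        # traffic on each OD flow in one interval
y = link_traffic(A, x)
print(y)                      # [350, 330, 120]
```

Because every link aggregates several OD flows, a change in one flow shows up only as a fraction of the traffic measured on each link it crosses.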

Network Anomalies Detection Problem

Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available.

Figure source: [22]

PCA and Subspace Method

Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset.

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as matrix $P$.

Step 3: Normal traffic subspace, called $\hat{Z}$:

$$ \hat{Z} = P P^T Z = C Z $$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$:

$$ \tilde{Z} = (I - P P^T) Z = \tilde{C} Z $$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly".
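The five steps can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the implementation from [22]: measurements are stored as rows (so the projection is applied on the right), the data are mean-centered first, the squared norm of the residual is used as the score, and the synthetic traffic matrix is made up.

```python
import numpy as np

def residual_norms(Z, k):
    """Squared norm of each measurement's projection onto the abnormal subspace."""
    Zc = Z - Z.mean(axis=0)                        # center each link
    cov = np.cov(Zc, rowvar=False)                 # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    P = eigvecs[:, -k:]                            # Step 2: top-k principal components
    C_tilde = np.eye(Z.shape[1]) - P @ P.T         # Step 4: projector onto residual subspace
    residual = Zc @ C_tilde                        # rows are measurements; C_tilde is symmetric
    return np.sum(residual ** 2, axis=1)           # Step 5: large norm suggests an anomaly

# Hypothetical link data: 200 time bins, 4 links sharing one dominant trend.
rng = np.random.default_rng(0)
trend = rng.normal(size=200)
Z = 10 * np.outer(trend, np.ones(4)) + 0.1 * rng.normal(size=(200, 4))
Z[50] += np.array([5.0, -5.0, 5.0, -5.0])          # disturbance outside the normal subspace
scores = residual_norms(Z, k=1)
print(int(np.argmax(scores)))                      # time bin with the largest residual
```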

Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably.

Thus anomalies can be detected by setting a threshold on the residual component.

Figure source: [22]

Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features).

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.
  • DoS attack: multiple source IP addresses concentrated on a single destination IP address
  • Network scan: dispersed distribution of destination addresses
  • Most worms/viruses also induce some change in the distribution of certain features
  • However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.

Limitation of Volume

[Figure: port scan anomaly; the traffic feature changes while traffic volume remains more or less the same. Annotations: "Port scan dwarfed in volume metrics..." and "But stands out in feature entropy, which also reveals its structure". Figure source: [22]]

We can use entropy to capture the variations in the traffic feature:
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed
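The two extreme cases are easy to verify with a sample entropy over observed feature values. The function name and the port numbers below are illustrative assumptions.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Sample entropy (in bits) of the empirical distribution of a traffic feature."""
    counts = Counter(values)
    n = sum(counts.values())
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical destination ports seen in one time bin.
web_only = [80] * 8                    # maximally concentrated
port_scan = list(range(1000, 1008))    # maximally dispersed over N = 8 ports
print(feature_entropy(web_only))       # 0.0
print(feature_entropy(port_scan))      # 3.0, i.e. log2(8)
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart while the entropy of the destination port does.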

Entropy Based versus Volume Based

  • DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP
  • Scan anomaly: a spike in traffic data from a dominant source IP
  • Flash crowd anomaly: again a spike in traffic data to a dominant destination IP
  • Worm anomaly: a spike in traffic with a dominant port

Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data.

Data:
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges:
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Players by domain:
  • Insurance (auto, health [a]): claimant, provider, payer
  • Telecommunications: customer, provider
  • Credit cards: customer, supplier, bank
  • Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud

A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: user X makes n calls on average in January

2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers

2. Identify small clusters or outlying profiles as anomalies

Strengths:
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses:
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives

Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: identify anomalous nodes.

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1].

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is thereby transformed into a point in a multi-dimensional space.

Identifying Anomalies

Traditional anomaly detection (LOF):
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:
  • Identify relevant feature pairs and power law relationships
  • E.g., Egonet Density Power Law, $N_i$ vs $E_i$: detects near-cliques and stars

$$ E_i \propto N_i^{\alpha}, \qquad 1 \le \alpha \le 2 $$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

$$ S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right) $$
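The egonet features $N_i$ and $E_i$ and the power-law score can be computed on small graphs as a sanity check. In this sketch the graph is an unweighted adjacency dictionary, and the constants `C` and `theta` are supplied by hand instead of being fitted to data; both are simplifying assumptions, as are the function names.

```python
import math

def egonet_features(adj):
    """Return {node: (N_i, E_i)} for an undirected graph given as {node: set_of_neighbors}."""
    features = {}
    for node, neighbors in adj.items():
        ego = neighbors | {node}
        # Each undirected edge inside the egonet is seen from both endpoints.
        edges = sum(1 for u in ego for v in adj[u] if v in ego) // 2
        features[node] = (len(neighbors), edges)
    return features

def power_law_score(y, x, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
clique = {i: {j for j in range(4) if j != i} for i in range(4)}
print(egonet_features(star)[0])     # (5, 5): E grows linearly with N
print(egonet_features(clique)[0])   # (3, 6): E grows quadratically with N
print(power_law_score(6, 3, C=1.0, theta=1.0))  # clique node scored against a linear law
```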

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
  • Healthcare data (beneficiaries vs providers)
  • Insider trading (traders vs stocks)

Objective: identify anomalous links.

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$:
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$:
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$.]
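The relevance and normality computations can be sketched on a toy graph. Several choices here are assumptions: relevance is approximated by a random walk with restart (restart probability 0.15, fixed number of iterations), every node must have at least one neighbor, a node in $V_2$ needs at least two neighbors, and the graph itself is made up.

```python
def relevance(adj, start, restart=0.15, iterations=100):
    """Visit probabilities of a random walk with restart from `start`."""
    prob = {node: 0.0 for node in adj}
    prob[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        for node, p in prob.items():
            for nbr in adj[node]:
                nxt[nbr] += (1 - restart) * p / len(adj[node])
        nxt[start] += restart
        prob = nxt
    return prob

def normality(adj, t):
    """Mean pairwise relevance among the V1 neighbors of a V2 node t."""
    s_t = sorted(adj[t])
    rel = {a: relevance(adj, a) for a in s_t}
    off_diag = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diag) / len(off_diag)

# Hypothetical graph: two tight communities, plus t5 linking a1 and a3 across them.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"), ("a2", "t2"),
         ("a3", "t3"), ("a3", "t4"), ("a4", "t3"), ("a4", "t4"),
         ("a1", "t5"), ("a3", "t5")]
adj = {}
for a, t in edges:
    adj.setdefault(a, set()).add(t)
    adj.setdefault(t, set()).add(a)
print(normality(adj, "t1"))  # neighbors belong to the same community
print(normality(adj, "t5"))  # neighbors come from different communities
```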

Detecting Disease Outbreaks

Domain question:
  • Early detection of disease outbreaks
  • Anthrax attack

Data:
  • Emergency department (ED) visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges:
  • Weak signals in the data
    • ED cases involving cough $\Rightarrow$ flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations

What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables
2. Sample from the BN ($DB_{baseline} \mid$ current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for the rules using a $\chi^2$-test
   Null hypothesis: Rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value. Repeat Step 2


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Needs more data to train


Anomaly Detection in Climate and Weather

Science questions
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI): Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location $i$
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
  30 year moving window

Different methods show statistically significant differences [19]. What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
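As an illustration of two of the base choices listed above, the sketch below computes $v'_i = v_i - b_i$ with $b_i$ taken as the long-term mean of the same calendar month, optionally dividing by the monthly standard deviation to get z-scores. The function name and the assumption that the series is monthly and starts in January are hypothetical.

```python
from statistics import mean, pstdev


def monthly_anomalies(values, zscore=False):
    """Anomaly series for one location from a monthly series.

    Assumes values[0] is a January value. The base b_i for each point is
    the mean of all values that fall in the same calendar month.
    """
    by_month = {m: values[m::12] for m in range(12)}
    anomalies = []
    for idx, v in enumerate(values):
        ref = by_month[idx % 12]
        a = v - mean(ref)
        if zscore:
            sd = pstdev(ref)
            a = a / sd if sd > 0 else 0.0  # constant month: no deviation
        anomalies.append(a)
    return anomalies


# Three years of a purely seasonal signal, with an unusual July in year 3.
series = list(range(12)) * 2 + [m + (5 if m == 6 else 0) for m in range(12)]
anoms = monthly_anomalies(series)
```

Because the seasonal cycle is subtracted month by month, only the perturbed July stands out as a large positive anomaly.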


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Drought video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
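The two-step approach can be sketched without any toolbox. The code below is an illustrative stand-in for a labelling routine such as bwlabeln, assuming 6-connectivity on a (time, row, column) grid; the function names, the threshold and the minimum component size are hypothetical parameters.

```python
from collections import deque


def connected_components(mask):
    """Group True cells of mask[t][i][j] into 6-connected components."""
    T, H, W = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, comps = set(), []
    for t in range(T):
        for i in range(H):
            for j in range(W):
                if not mask[t][i][j] or (t, i, j) in seen:
                    continue
                queue, comp = deque([(t, i, j)]), []
                seen.add((t, i, j))
                while queue:  # breadth-first flood fill
                    c = queue.popleft()
                    comp.append(c)
                    for d in steps:
                        n = (c[0] + d[0], c[1] + d[1], c[2] + d[2])
                        if (0 <= n[0] < T and 0 <= n[1] < H and 0 <= n[2] < W
                                and mask[n[0]][n[1]][n[2]] and n not in seen):
                            seen.add(n)
                            queue.append(n)
                comps.append(comp)
    return comps


def drought_candidates(precip_anomaly, threshold, min_size):
    """Step 1: threshold the anomalies. Step 2: keep large components."""
    mask = [[[v < threshold for v in row] for row in grid]
            for grid in precip_anomaly]
    return [c for c in connected_components(mask) if len(c) >= min_size]
```

A persistent dry cell that stays below the threshold for several consecutive time steps forms one component spanning time, while an isolated single-step dip is filtered out by `min_size`.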


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
  • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Estimating Probabilities for CPT [18]

Maximum Likelihood Estimation

$\frac{P(a_t, b_t)}{P(a_t) P(b_t)} = \frac{C(a_t, b_t)}{N} \times \frac{N}{C(a_t)} \times \frac{N}{C(b_t)}$

$C(a_t)$: Number of training instances with $A = a_t$

$N$: Total number of training instances

Laplace Smoothing

$E(p) = \frac{C(p) + 1}{N + 2}$

$r(a_t, b_t) = \frac{E(a_t, b_t)}{E(a_t) \times E(b_t)}$

Speedup Tricks

Replace rare attribute values with a generic attribute (reduce arity)

Use an efficient data structure for querying counts (AD Trees [26])

AD Trees work faster for low arity attributes
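The smoothed ratio $r(a_t, b_t)$ can be computed directly from counts. The sketch below is illustrative only: records are assumed to be dictionaries keyed by attribute name, and the helper names are hypothetical. Counts are obtained by a plain scan, not by an AD Tree.

```python
def laplace(count, n):
    """Laplace-smoothed estimate E(p) = (C(p) + 1) / (N + 2)."""
    return (count + 1) / (n + 2)


def r_value(records, attr_a, a_val, attr_b, b_val):
    """r(a_t, b_t) = E(a_t, b_t) / (E(a_t) * E(b_t)) over training records."""
    n = len(records)
    c_a = sum(1 for r in records if r[attr_a] == a_val)
    c_b = sum(1 for r in records if r[attr_b] == b_val)
    c_ab = sum(1 for r in records
               if r[attr_a] == a_val and r[attr_b] == b_val)
    return laplace(c_ab, n) / (laplace(c_a, n) * laplace(c_b, n))


# Toy training set: A=x always co-occurs with B=p, A=y with B=q.
training = [{"A": "x", "B": "p"}] * 50 + [{"A": "y", "B": "q"}] * 50
```

Here `r_value(training, "A", "x", "B", "p")` evaluates to 2.0, while the never-observed combination `("x", "q")` yields a ratio well below 1, which marks it as a rare co-occurrence.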


Finding Anomalies in Discrete Sequences

Many problem formulations

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation, [9] for a survey

Example sequences:
  login pwd mail ssh mail web logout
  login pwd mail web web web web logout
  login pwd mail ssh mail web web logout
  login pwd webmail ssh webmail logout
  login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure
Plug into a distance/density/clustering based method

Simplest: Hamming Distance

$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases}$

$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$

Issues: Unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$

Standard Dynamic Programming method is slow
Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  Cannot localize anomalies within a sequence
  Weak anomaly signals might get lost
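Both distances are short to write down. The sketch below uses the standard dynamic programming recurrence for the LCS length (the slow variant mentioned above, not the Hunt-Szymanski method); function names are hypothetical and sequences are assumed to be non-empty.

```python
from math import sqrt


def hamming(a, b):
    """H(A, B): number of positions where equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a, b):
    """Length of the longest common subsequence, O(|A||B|) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A||B|), for non-empty sequences."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))


s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
```

Unlike the Hamming distance, `nlcs_distance(s1, s2)` is defined even though the two sessions have different lengths. The resulting distances can then be passed to any distance, density or clustering based detector.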


Using Sliding Windows

Slide a window of size $k$

Extract all windows from a sequence ($n - k + 1$)

Training (creating a normal dictionary): Store all unique windows in all normal sequences and their counts

Testing
  For each window, find the frequency in the normal dictionary
  Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  Penalizes low frequency windows in the normal dictionary
  Rewards high frequency windows that might not be relevant
  Can construct anomalous sequences that will escape detection
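A minimal sketch of the training and testing steps follows. The exact aggregation is an assumption: here the score is the inverse of the mean relative frequency of the test windows, with a small constant `eps` (hypothetical) to avoid division by zero. Test sequences are assumed to have at least $k$ symbols.

```python
from collections import Counter


def windows(seq, k):
    """All n - k + 1 sliding windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def train(normal_seqs, k):
    """Normal dictionary: every window seen in normal data with its count."""
    dictionary = Counter()
    for s in normal_seqs:
        dictionary.update(windows(s, k))
    return dictionary


def anomaly_score(seq, dictionary, k, eps=1e-9):
    """Inverse of the mean relative frequency of the sequence's windows."""
    ws = windows(seq, k)
    total = sum(dictionary.values())
    avg_freq = sum(dictionary[w] / total for w in ws) / len(ws)
    return 1.0 / (avg_freq + eps)


normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
)]
suspicious = "login pwd login pwd login pwd logout".split()
dictionary = train(normal, k=3)
```

The repeated login attempts in `suspicious` produce windows that never occur in the normal dictionary, so its score is far higher than that of any training sequence.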


Using Probabilistic Models

Probability of occurrence of sequence $S$

$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$

Short memory property of sequences

$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$

Conditional probability estimates for a symbol $S_i$

$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$

$f$ is the frequency of a subsequence, estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issue: What if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)

Significantly reduces the size of the model
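The estimates above can be sketched with plain n-gram counts. This is an illustration under stated assumptions, not a probabilistic suffix tree: the back-off simply shortens the context until it has been seen at least `min_count` times, unseen events get a small `floor` probability, and the score is the negative mean log probability (a monotone stand-in for the inverse normalized probability). All names and defaults are hypothetical.

```python
from collections import Counter
from math import log


def count_ngrams(normal_seqs, k):
    """Frequencies f of all subsequences of length 1 .. k + 1."""
    counts = Counter()
    for s in normal_seqs:
        s = tuple(s)
        for n in range(1, k + 2):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


def cond_prob(counts, context, symbol, min_count=1, floor=1e-6):
    """P(symbol | context), backing off to the longest frequent suffix."""
    ctx = tuple(context)
    while ctx and counts[ctx] < min_count:
        ctx = ctx[1:]  # drop the oldest symbol
    if ctx:
        p = counts[ctx + (symbol,)] / counts[ctx]
    else:
        total = sum(c for gram, c in counts.items() if len(gram) == 1)
        p = counts[(symbol,)] / total
    return max(p, floor)


def anomaly_score(seq, counts, k):
    """Negative mean log probability of the sequence under the model."""
    seq = tuple(seq)
    ll = sum(log(cond_prob(counts, seq[max(0, i - k):i], seq[i]))
             for i in range(len(seq)))
    return -ll / len(seq)


normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
)]
counts = count_ngrams(normal, k=2)
```

A session such as `login pwd login pwd login pwd logout` contains transitions with zero frequency in the normal data, so its score is much larger than the score of a normal session.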


Finding Contextual Anomalies

Sometimes contextual information is available about data
  Not used directly as a feature
  Is well understood, no anomalies in the context
  Can reduce false positives and yield interesting anomalies

Figure (example adapted from [30]): Number of fever cases vs. maximum daily temperature, marking the clearest outlier and a conditional anomaly

Contextual anomalies: Anomalous with respect to a context

Context is defined using environmental variables
  Spatial (latitude, longitude)
  Graph context (edges, weights)
  Temporal location
  Domain specific (demographic, other)

How to incorporate context?
  Reduce to traditional anomaly detection (subset on context)
  Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

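Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

$d_U$ environmental attributes

$d_V$ indicator attributes

Figure: Environmental space ($u_1$, $u_2$) with Gaussians $U_1$, $U_2$, $U_3$ mapped to indicator space ($v_1$, $v_2$) with Gaussians $V_1$, $V_2$, $V_3$; $p(V_1 \mid U_3) = 0.2$, $p(V_2 \mid U_3) = 0.2$, $p(V_3 \mid U_3) = 0.6$

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$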
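The scoring step (step 4) can be illustrated with one-dimensional Gaussians. The sketch below assumes the components and the mapping have already been learned; all parameter values are made up for the fever example, and reading a low $S$ as "more anomalous" is an interpretation of the score as a likelihood.

```python
from math import exp, pi, sqrt


def gauss(x, mu, sigma):
    """Density of a one-dimensional Gaussian."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))


def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).

    U and V are lists of (mean, std) pairs; mapping[i][j] = p(V_j | U_i).
    """
    return sum(
        gauss(u, *U[i])
        * sum(gauss(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )


# Hypothetical model: u = max daily temperature, v = number of fever cases.
U = [(10.0, 3.0), (30.0, 3.0)]      # cold days, hot days
V = [(5.0, 2.0), (50.0, 5.0)]       # few cases, many cases
mapping = [[0.1, 0.9],              # cold -> mostly many cases
           [0.9, 0.1]]              # hot  -> mostly few cases
```

With these numbers, 50 fever cases on a 30 degree day scores lower than 50 cases on a 10 degree day: the count itself is common, but it is unusual given the context.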


Finding Collective Anomalies

Find a collection of data points
  Each point by itself is normal
  The collection as a whole is anomalous

Relevant when data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples

1. Time series
2. Image
3. Spatial clusters of galaxies


Using Latent Dirichlet Allocation for Group Anomalies

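Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
  topics: red, green, emissive
  words: continuous features

Flexible Genre Model (FGM)

For each group

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$
2. Draw topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw $K$ topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
4. For each point in group
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

Figure: Plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\eta$, $\beta$ and plates $N$, $M$, $K$, $T$

Model Parameters

$\mathcal{M}(\pi)$: Multinomial

Genre: $\mathrm{Dir}(\alpha_t)$

Topic generators $P(\cdot \mid \eta_k)$: GIW

Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian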


Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single step Monte Carlo EM)

Figure: Panels labeled Topic 1, Topic 2 and Topic 3

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute the negative log likelihood w.r.t. $\alpha_t$

Rationale: An anomalous group will be unlikely to be generated from any genre

Geometric interpretation: Mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC Curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

             Predicted a   Predicted n
  Actual a   TP            FN
  Actual n   FP            TN

$Acc = \frac{TP + TN}{\sum}$, where $\sum$ is the total over all four cells

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
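The metrics above follow directly from the four cells of the confusion matrix. A small sketch, assuming labels are the strings "a" (anomalous) and "n" (normal); the function name and the convention of returning 0.0 for an empty denominator are choices made here for illustration.

```python
def evaluate(actual, predicted, positive="a"):
    """Confusion-matrix counts and derived metrics for binary labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "Acc": (tp + tn) / len(pairs), "Rec": rec, "Prec": prec, "F": f}


actual = ["a", "a", "n", "n", "n", "n", "n", "n", "n", "n"]
predicted = ["a", "n", "a", "n", "n", "n", "n", "n", "n", "n"]
metrics = evaluate(actual, predicted)
```

The example shows the class imbalance issue: accuracy is 0.8 even though only half of the anomalies are found and only half of the flagged instances are real anomalies.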


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use threshold $\delta$ on score (scale issues [21])
  Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output
  Does the output "suggest" $x$ or $\delta$?
  Which output is better?
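Varying the threshold $\delta$ over the observed scores traces out the ROC curve. The sketch below assumes that higher scores mean "more anomalous" and that both classes are present; the function names are hypothetical, and the area under the curve is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping delta over the scores.

    labels[i] is True for anomalies; an instance is flagged when its
    score is >= delta.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(s >= delta and l for s, l in zip(scores, labels))
        fp = sum(s >= delta and not l for s, l in zip(scores, labels))
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [True, False, True, False, False]
curve = roc_points(scores, labels)
```

Since the curve depends only on the ranking of the scores, it allows outputs of different methods to be compared even when their score ranges differ.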


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: Stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal and $S(o) \gg 0$ if $o$ is anomalous
  Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
   Suited for high dimensional data
3. $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
   Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   Suited for low dimensional data
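Baseline regularization followed by Gaussian scaling needs only the error function from the standard library. The sketch below is illustrative: it estimates $\mu_S$ and $\sigma_S$ from the regularized scores themselves, and the LOF-like input values are made up.

```python
from math import erf, sqrt
from statistics import mean, pstdev


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base_S)."""
    return [max(0.0, s - base) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)  # all scores identical: nothing stands out
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]


# Hypothetical LOF-style scores: values near 1 are normal, 6.0 is an outlier.
lof = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]
normalized = gaussian_scaling(regularize_baseline(lof, base=1.0))
```

After scaling, every value lies in [0, 1]; the instances with scores near the baseline map to 0 and the outlier maps close to 1.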


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawbacks: Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space

Take a random sample $D$ of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between $y$ and $y'$ is maximum

Replace $x$ with $x'$ and add $z$ back to the data set

Figure: Points $z$ and $z'$ in the ($x$, $y$) plane, with the injected anomaly marked
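The random perturbation scheme can be sketched as follows. It assumes each data point is stored as a pair of numeric tuples `(x, y)` and uses Euclidean distance on the `y` part; the function name, the sample size parameter and the explicit random generator argument are hypothetical.

```python
import random


def inject_anomaly(data, z_index, sample_size, rng):
    """Return a perturbed copy of data[z_index].

    A random sample D is drawn, the point z' in D whose y' is farthest
    from y is selected, and x is replaced with x'.
    """
    x, y = data[z_index]
    sample = rng.sample(data, sample_size)

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    x_far, _ = max(sample, key=lambda z: dist(z[1], y))
    return (x_far, y)


# Toy data where y = 2 * x, so x and y are strongly related.
data = [((float(i),), (2.0 * i,)) for i in range(20)]
injected = inject_anomaly(data, z_index=0, sample_size=20, rng=random.Random(0))
```

The injected point keeps its original `y` but receives an `x` that normally goes with a very different `y`, so each part looks ordinary on its own while the combination breaks the relationship in the data.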


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that correspond to a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and addressing suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: Bar chart of yearly counts (y-axis from 0 to 3,000,000): 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836


Network Topology

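Figure: a typical network (a router and nodes 1 to 4, annotated with the flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and the link measurement $y_{1t}$)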

  • An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)
  • A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)
  • The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
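
To make the role of the routing matrix concrete, here is a minimal sketch that assumes the usual linear form $y = A x$, where $A_{ij} = 1$ if OD flow $j$ traverses link $i$. The 3-link, 4-flow matrix and the helper name `link_loads` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical routing matrix: A[i, j] = 1 if OD flow j traverses link i.
A = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])

def link_loads(A, x):
    """Return link traffic y = A x for a vector x of OD flow volumes."""
    return A @ np.asarray(x)
```

Because each link sums several OD flows, a spike in a single flow is added to the traffic of every other flow sharing that link, which is one reason link-level monitoring can hide it.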


Network Anomalies Detection Problem



Why Care About OD Flows?

  • A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)
  • If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components (those with the highest eigenvalues) as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
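
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation from [22]: rows of `Z` are taken to be observations (one per time step), the data is mean-centered first, and the function name `subspace_residual_norms` is hypothetical. Choosing $k$ and the detection threshold is left to the caller.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Squared norm of the residual (abnormal-subspace) part of each row of Z.

    Z: array of shape (n_samples, n_features), one observation per row.
    k: number of top principal components spanning the normal subspace.
    """
    Zc = Z - Z.mean(axis=0)                      # center each feature
    cov = np.cov(Zc, rowvar=False)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]          # indices of the k largest
    P = eigvecs[:, top]                          # columns are the top-k PCs
    C_tilde = np.eye(Z.shape[1]) - P @ P.T       # projector onto the residual subspace
    residual = Zc @ C_tilde                      # C_tilde is symmetric, so this projects each row
    return np.sum(residual ** 2, axis=1)
```

Rows whose returned value exceeds a chosen threshold would be flagged as anomalous.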


Subspace Analysis Results

  • During an anomaly, the normal component changes little, while the residual component changes considerably
  • Thus anomalies can be detected by applying a threshold to the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

  • Most anomalies induce a change in distributional aspects of packet header fields (called features)
  • The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
    » DoS attack: multiple source IP addresses concentrated on a single destination IP address
    » Network scan: dispersed distribution of destination addresses
    » Most worms/viruses also induce some change in the distribution of certain features
    » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
  • Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly: the traffic feature distribution changes, but the traffic volume remains more or less the same

  • The port scan is dwarfed in volume metrics...
  • ...but stands out in feature entropy, which also reveals its structure

We can use entropy to capture the variations in a traffic feature:

  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure source: [22]
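
As an illustration of the two entropy bounds, the sketch below computes the empirical (sample) entropy of one traffic feature, for example the destination ports observed in a time bin. It assumes the standard Shannon entropy in bits; the function name `sample_entropy` is hypothetical.

```python
import math
from collections import Counter

def sample_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of a feature."""
    counts = Counter(observations)
    total = sum(counts.values())
    # Sum of p * log2(1/p) over observed values; 0 for a single repeated value.
    return sum((c / total) * math.log2(total / c) for c in counts.values())
```

A port scan that touches many distinct ports pushes the destination-port entropy toward $\log_2 N$, even when the total packet count barely moves.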


Entropy Based versus Volume Based

  • DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
  • Scan anomaly: a spike in traffic from a dominant source IP
  • Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
  • Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and players
  • Insurance (auto, health [a]): Claimant, Provider, Payer
  • Telecommunications: Customer, Provider
  • Credit Cards: Customer, Supplier, Bank
  • Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     • E.g., user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
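
A minimal sketch of the profile-and-compare idea behind activity monitoring follows. It assumes a profile is simply the mean and standard deviation of a user's historical counts, and that deviation is measured as a z-score; both choices, and the name `deviation_score`, are illustrative assumptions rather than the method of [13].

```python
from statistics import mean, stdev

def deviation_score(history, current):
    """Absolute z-score of the current count against a user's historical counts.

    history: past counts for one user (e.g., calls per day), at least two values.
    current: the count being checked.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A perfectly constant profile: any change is an unbounded deviation.
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma
```

A caller would flag a user when the score exceeds a chosen cutoff (3 is a common but arbitrary choice). Each lookup is constant time once profiles exist, which matches the real-time strength listed above.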


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thus transformed into a point in a multi-dimensional space
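
The four egonet features can be computed directly from an edge list. The sketch below assumes an undirected graph with positive weights and no self-loops, stored as a dictionary from node pairs to weights; this representation and the name `egonet_features` are assumptions made for illustration.

```python
import numpy as np

def egonet_features(edges, node):
    """Return (N, E, W, lam) for the egonet of `node`.

    edges: dict {(u, v): weight} for an undirected graph without self-loops.
    """
    neighbors = set()
    for u, v in edges:
        if u == node:
            neighbors.add(v)
        elif v == node:
            neighbors.add(u)
    members = list(neighbors | {node})
    index = {v: i for i, v in enumerate(members)}
    # Weighted adjacency matrix of the sub-graph induced by the egonet members.
    A = np.zeros((len(members), len(members)))
    for (u, v), w in edges.items():
        if u in index and v in index:
            A[index[u], index[v]] = A[index[v], index[u]] = w
    upper = np.triu(A)                            # each undirected edge counted once
    N = len(neighbors)                            # degree of the node
    E = int(np.count_nonzero(upper))              # number of edges in the egonet
    W = float(upper.sum())                        # total edge weight in the egonet
    lam = float(np.linalg.eigvalsh(A)[-1])        # largest eigenvalue of A
    return N, E, W, lam
```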


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ - detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i,\, C x_i^{\theta})}{\min(y_i,\, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
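
The score above is straightforward to evaluate once $C$ and $\theta$ are known. The sketch below fits them by ordinary least squares in log-log space, which is an assumption about the fitting procedure, and then evaluates $S_i$ exactly as written. The names `fit_power_law` and `anomaly_score` are hypothetical, and all feature values are assumed to be positive.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    C = math.exp(my - theta * mx)
    return C, theta

def anomaly_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)
```

A node lying exactly on the fitted curve scores 0; the score grows with both the ratio and the absolute distance between the observed and expected values.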


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \dots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \dots, t_n\}$ joined by edges $E$
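
One way to realize the relevance and normality scores is sketched below. Several choices here are assumptions not fixed by the text above: relevance is computed by a random walk with restart solved by power iteration, the restart probability is 0.15, and entry $(i, j)$ of the similarity matrix is taken to be the relevance of $a_j$ to $a_i$. The function names are hypothetical.

```python
import numpy as np

def rwr(adj, start, restart=0.15, iters=200):
    """Random walk with restart from `start`; returns a relevance vector over all nodes."""
    col_sums = adj.sum(axis=0)
    P = adj / np.where(col_sums == 0, 1, col_sums)   # column-stochastic transition matrix
    e = np.zeros(adj.shape[0])
    e[start] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = (1 - restart) * (P @ r) + restart * e
    return r

def normality_score(B, t, restart=0.15):
    """Mean pairwise relevance among the V1 neighbors of node t in V2.

    B: |V1| x |V2| biadjacency matrix (B[a, t] > 0 if edge <a, t> exists).
    """
    n1, n2 = B.shape
    adj = np.block([[np.zeros((n1, n1)), B],
                    [B.T, np.zeros((n2, n2))]])
    S_t = np.flatnonzero(B[:, t])                    # neighbors of t in V1
    if len(S_t) < 2:
        return float("nan")                          # no pairs to compare
    sim = np.array([rwr(adj, a, restart)[S_t] for a in S_t])
    off_diagonal = sim[~np.eye(len(S_t), dtype=bool)]
    return float(off_diagonal.mean())
```

A node in $V_2$ whose neighbors are barely relevant to one another, for example one that bridges two otherwise separate communities, receives a low normality score.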


Detecting Disease Outbreaks

Domain Question: early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

  1. Learn a Bayesian network (BN) from historical data
     • Environmental and response variables
  2. Sample from the BN ($DB_{baseline}$ | current environment)
  3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
     • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
  4. Find the p-value for each rule using a $\chi^2$ test
     • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
  5. Find the rule with the smallest p-value (the most significant difference). Repeat Step 2
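
The rule-scoring step can be illustrated with a plain $\chi^2$ test of independence on a 2x2 contingency table. The sketch assumes rows are (current, baseline), columns are (matches rule, does not match), and no continuity correction; with one degree of freedom the p-value has the closed form $\mathrm{erfc}(\sqrt{\chi^2 / 2})$. The function name `chi2_p_value` is hypothetical.

```python
import math

def chi2_p_value(a, b, c, d):
    """Chi-square test of independence for the 2x2 table [[a, b], [c, d]].

    Rows: current vs. baseline records; columns: match vs. no match for a rule.
    Returns (chi2, p) with one degree of freedom and no continuity correction.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0                    # degenerate table: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p = math.erfc(math.sqrt(chi2 / 2))     # survival function of chi-square, 1 dof
    return chi2, p
```

For small cell counts an exact test would be more appropriate than this approximation.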


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events - droughts, atmospheric rivers, cold fronts, etc.

Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI)
  • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30 year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
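
As a small worked example of one of these choices, the sketch below uses the monthly mean as the base $b_i$, so that $v'_i = v_i - b_i$ removes the seasonal cycle at a single location. The function name `monthly_anomalies` and the flat list representation are assumptions for illustration; the other bases listed above (median, z-score, moving window) would replace the `mean` call.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(values, months):
    """Subtract each calendar month's mean from the observations in that month.

    values: observations for one location, in time order.
    months: calendar month (1-12) of each observation.
    """
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}   # b_i per calendar month
    return [v - base[m] for v, m in zip(values, months)]
```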


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     • Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: drought_1901_2006_1.avi (video courtesy of Dr. Arindam Banerjee)
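
The two-step approach can be sketched without any imaging toolbox. The code below is a plain breadth-first labeling over a (time, row, column) grid, offered as a stand-in for what `bwlabeln` does rather than a port of it. The 6-neighbor connectivity, the "below threshold means dry" convention, and the name `drought_components` are assumptions.

```python
from collections import deque

def drought_components(precip, threshold, min_size):
    """Connected space-time regions where the precipitation anomaly is below threshold.

    precip: nested lists indexed as precip[t][y][x].
    Returns a list of components, each a list of (t, y, x) cells.
    """
    # Step 1: threshold the anomalies.
    dry = {(t, y, x)
           for t, grid in enumerate(precip)
           for y, row in enumerate(grid)
           for x, v in enumerate(row)
           if v < threshold}
    # Step 2: group dry cells that touch in time or space (6-connectivity).
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for cell in sorted(dry):
        if cell in seen:
            continue
        seen.add(cell)
        queue, component = deque([cell]), []
        while queue:
            c = queue.popleft()
            component.append(c)
            for d in steps:
                n = (c[0] + d[0], c[1] + d[1], c[2] + d[2])
                if n in dry and n not in seen:
                    seen.add(n)
                    queue.append(n)
        if len(component) >= min_size:       # keep only large, persistent regions
            components.append(component)
    return components
```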


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

  • Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)
  • Arindam Banerjee (University of Minnesota)



Finding Anomalies in Discrete Sequences

Many problem formulations:

1. Anomalous symbols in a sequence
2. Anomalous subsequence in a sequence
3. Anomalous sequence in a database of sequences

See [11] for a comparative evaluation and [9] for a survey.

Example database of sequences:

login pwd mail ssh mail web logout
login pwd mail web web web web logout
login pwd mail ssh mail web web logout
login pwd webmail ssh webmail logout
login pwd login pwd login pwd logout


Treating Sequences as Points

Utilize a distance/similarity measure and plug it into a distance-, density-, or clustering-based method.

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases} \qquad H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment.

Normalized length of the longest common subsequence (LCS):

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A|\,|B|}}$$

• The standard dynamic programming method is slow
• Faster versions are available (Hunt-Szymanski method [7])

Weaknesses:

• Cannot localize anomalies within a sequence
• Weak anomaly signals might get lost
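The two distances on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the tutorial's code: it uses the plain dynamic program for the LCS (not the faster Hunt-Szymanski variant), assumes non-empty sequences, and the function names are made up for this example.

```python
from math import sqrt

def hamming(a, b):
    """H(A, B): number of positions where two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)

def lcs_length(a, b):
    """|LCS(A, B)| via the standard O(|A||B|) dynamic program."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A||B|); assumes non-empty inputs."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))

s1 = "login pwd mail ssh mail web logout".split()
s2 = "login pwd mail web web web web logout".split()
print(lcs_length(s1, s2), round(nlcs_distance(s1, s2), 3))
```

Unlike the Hamming distance, the LCS-based distance accepts sequences of different lengths, which is why it handles the misalignment issue noted above.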


Using Sliding Windows

Slide a window of size k and extract all windows from a sequence (n - k + 1 windows for a sequence of length n).

Training (creating a normal dictionary): store all unique windows in all normal sequences, together with their counts.

Testing:

• For each window, find its frequency in the normal dictionary
• The anomaly score is the inverse of the aggregate frequency over all windows (normalized by length)

Many variants exist:

• For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues:

• Penalizes low-frequency windows in the normal dictionary
• Rewards high-frequency windows that might not be relevant
• One can construct anomalous sequences that will escape detection
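A minimal sketch of the window-dictionary scheme, under stated assumptions: the aggregate is taken as the mean relative frequency of the test windows, and a small `eps` (an assumption, not from the slides) avoids division by zero when no window has been seen before.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 contiguous windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_seqs, k):
    """Training: count every window occurring in the normal sequences."""
    counts = Counter()
    for seq in normal_seqs:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, dictionary, k, eps=1e-9):
    """Inverse of the mean relative frequency of the test windows."""
    ws = windows(seq, k)
    if not ws:
        raise ValueError("sequence is shorter than the window size")
    total = sum(dictionary.values())
    avg_freq = sum(dictionary[w] / total for w in ws) / len(ws)
    return 1.0 / (avg_freq + eps)

normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
)]
d = build_dictionary(normal, k=3)
typical = anomaly_score("login pwd mail web logout".split(), d, k=3)
unusual = anomaly_score("login pwd login pwd login pwd logout".split(), d, k=3)
print(typical < unusual)
```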


Using Probabilistic Models

Probability of occurrence of a sequence S:

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences:

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimate for a symbol $S_i$:

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

where f, the frequency of a subsequence, is estimated from the normal dictionary.

The anomaly score for a test sequence is the inverse of the normalized probability of occurrence.

Issue: what if the suffix occurs very infrequently in the normal data (or not at all)?

• Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
• This significantly reduces the size of the model
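The short-memory model with suffix back-off can be sketched as follows. This is a simplified stand-in for a probabilistic suffix tree, not the method of [32]: counts are kept in a flat dictionary, `min_count` and `floor` are illustrative parameters, and the score is the negative mean log-probability (a monotone version of "inverse of the normalized probability").

```python
from collections import Counter
from math import log

def train_counts(normal_seqs, k):
    """f(.): counts of every contiguous subsequence of length 1..k+1."""
    counts = Counter()
    for seq in normal_seqs:
        for n in range(1, k + 2):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts

def cond_prob(counts, context, symbol, min_count=1, floor=1e-6):
    """Estimate P(symbol | context) as f(context + symbol) / f(context)."""
    context = tuple(context)
    # Back off to the longest suffix of the context seen often enough.
    while context and counts[context] < min_count:
        context = context[1:]
    if context:
        num, denom = counts[context + (symbol,)], counts[context]
    else:
        # Empty context: fall back to the unigram frequency.
        num = counts[(symbol,)]
        denom = sum(c for gram, c in counts.items() if len(gram) == 1)
    return max(num / denom, floor)

def anomaly_score(seq, counts, k):
    """Negative mean log-probability per symbol (higher = more anomalous)."""
    logp = sum(log(cond_prob(counts, seq[max(0, i - k):i], s))
               for i, s in enumerate(seq))
    return -logp / len(seq)

normal = [s.split() for s in (
    "login pwd mail ssh mail web logout",
    "login pwd mail web web web web logout",
    "login pwd mail ssh mail web web logout",
)]
counts = train_counts(normal, k=2)
typical = anomaly_score("login pwd mail web logout".split(), counts, k=2)
unusual = anomaly_score("login pwd login pwd login pwd logout".split(), counts, k=2)
```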


Finding Contextual Anomalies

Sometimes contextual information is available about the data:

• Not used directly as a feature
• Well understood; no anomalies in the context
• Can reduce false positives and yield interesting anomalies

[Figure: number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly. Example adapted from [30].]

Contextual anomalies: anomalous with respect to a context.

Context is defined using environmental variables:

• Spatial (latitude, longitude)
• Graph context (edges, weights)
• Temporal location
• Domain specific (demographic, other)

How to incorporate context:

• Reduce to traditional anomaly detection (subset on context)
• Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

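Data instance $d \Rightarrow (u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V})$

• $d_U$ environmental attributes
• $d_V$ indicator attributes

[Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with example mapping probabilities $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$.]

Algorithm [30]:

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$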
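The scoring step (step 4) can be illustrated with one-dimensional Gaussians. Everything concrete here is hypothetical: the components, the mapping table, and the temperature/fever-case reading are invented for the example, and learning the mixtures and the mapping (steps 1 to 3) is not shown.

```python
from statistics import NormalDist

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    return sum(
        U_i.pdf(u) * sum(V_j.pdf(v) * mapping[i][j] for j, V_j in enumerate(V))
        for i, U_i in enumerate(U)
    )

# Hypothetical 1-D components: temperature (environment), fever cases (indicator).
U = [NormalDist(10, 2), NormalDist(30, 2)]
V = [NormalDist(5, 1), NormalDist(50, 5)]
mapping = [[0.1, 0.9], [0.9, 0.1]]  # mapping[i][j] = p(V_j | U_i)

expected = conditional_score(10, 50, U, V, mapping)    # many cases when cold
surprising = conditional_score(30, 50, U, V, mapping)  # many cases when hot
print(expected > surprising)
```

A low score marks a conditional anomaly: the indicator value is plausible in general but unlikely given its environmental context.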


Finding Collective Anomalies

Find a collection of data points such that:

• Each point by itself is normal
• The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies.

A simple solution:

1. Break the data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples:

1. Time series
2. Image
3. Spatial clusters of galaxies
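The three-step "simple solution" can be sketched for a time series. The choices below are assumptions made for illustration: fixed-size non-overlapping groups, mean and standard deviation as group features, and a z-score as the "traditional" detector.

```python
from statistics import mean, pstdev

def group_features(series, size):
    """Steps 1-2: split into non-overlapping groups, compute (mean, std) each."""
    groups = [series[i:i + size] for i in range(0, len(series) - size + 1, size)]
    return [(mean(g), pstdev(g)) for g in groups]

def group_scores(features):
    """Step 3: largest absolute z-score across the features of each group."""
    columns = list(zip(*features))
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [max(abs(x - m) / s for x, (m, s) in zip(f, stats)) for f in features]

# Every value lies in [0, 1], but one group is flat instead of oscillating.
series = [0, 1, 0, 1] * 5 + [0.5] * 4 + [0, 1, 0, 1] * 5
scores = group_scores(group_features(series, size=4))
print(scores.index(max(scores)))
```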


Using Latent Dirichlet Allocation for Group Anomalies

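Find anomalous groups in data [34].

Example: spatial clusters of galaxies (topics: red, green, emissive; words: continuous features).

Flexible Genre Model (FGM)

For each group:

1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$
2. Draw the topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)\}_{k=1,2,\ldots,K}$
4. For each point in the group:
   1. Draw the topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate the point $x_{mn} \sim P(x_{mn} \mid \beta_{m,z_{mn}})$

[Figure: plate diagram of the FGM with variables $\pi$, $\alpha$, $\eta$, $y_m$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$ and plates N, M, K, T.]

Model parameters:

• $\mathcal{M}(\pi)$: multinomial
• Genre: $\mathrm{Dir}(\alpha_t)$
• Topic generators $P(\cdot \mid \eta_k)$: GIW
• Point generators $P(x_{mn} \mid \beta_{m,k})$: multivariate Gaussian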


Inference and Testing for FGM

Inference and learning parameters:

• Approximate inference of latent variables (Gibbs sampling)
• Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly detection:

• Infer the topic distribution $\theta_m$
• Compute the negative log-likelihood w.r.t. $\alpha_t$
• Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies.


Evaluating Anomaly Detection Methods - Labels

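Labeled validation data set exists:

• Confusion matrix
• Traditional evaluation metrics
  • Class imbalance
• ROC curve

Validation set does not exist:

• Use domain expertise to find TP and FP
• FN is harder to estimate
  • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted
              a     n
Actual   a    TP    FN
         n    FP    TN

$$Acc = \frac{TP + TN}{\Sigma} \qquad Rec\,(R) = \frac{TP}{TP + FN} \qquad Prec\,(P) = \frac{TP}{TP + FP} \qquad F = \frac{2 \cdot R \cdot P}{R + P}$$

where $\Sigma$ is the sum of all four entries of the confusion matrix.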
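A short sketch of these metrics shows why accuracy alone misleads under class imbalance. The counts are invented for illustration: 10 anomalies among 1000 instances.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from a confusion matrix."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f

detector = metrics(tp=5, fn=5, fp=10, tn=980)  # finds half of the anomalies
trivial = metrics(tp=0, fn=10, fp=0, tn=990)   # labels everything normal
print(detector)
print(trivial)
```

The trivial detector reaches a higher accuracy than the real one while its recall is zero, which is why recall, precision, and F are the more informative numbers here.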


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:

• Use a threshold δ on the score (scale issues [21])
• Take the top x as anomalies

ROC curve by varying x or δ

Quality of output:

• Does the output “suggest” x or δ?
• Which output is better?
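Sweeping the threshold δ over the observed scores yields the ROC curve. The sketch below assumes that higher scores mean "more anomalous", that labels are 1 for anomalies and 0 for normal instances, and that both classes are present; the data is made up.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
curve = roc_points(scores, labels)
print(auc(curve))
```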


Unifying Scores

Different methods assign scores in different ranges:

• kNN-based scores: [0, 1]
• LOF scores: [1, 6]
• ABOD scores: [0, 80000]
  • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies.

Desired scaling: stretch interesting ranges and shrink irrelevant ones.

Generalized procedure for normalizing outlier scores [21]:

• Regularity: $S(o) \ge 0 \;\forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
• Normality: S is regular and $S(o) \in [0, 1] \;\forall o$


Regularization and Normalization of Scores

Regularization:

1. $R(o) = \max\{0,\; S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{max}}$

Normalization:

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0,\; \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
   • Suited for high-dimensional data
3. $N(o) = \max\left(0,\; \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)
   • Where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function
   • Suited for low-dimensional data
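Two of these transformations are easy to try out with the standard library. The sketch below covers the logarithmic inversion (regularization 3, for scores where low means anomalous) and Gaussian scaling (normalization 2); the score lists are invented, and the code assumes positive, non-constant scores.

```python
from math import erf, log, sqrt
from statistics import mean, pstdev

def regularize_log_inverse(scores):
    """R(o) = -log(S(o) / S_max); turns 'low is anomalous' into 'high is anomalous'."""
    s_max = max(scores)
    return [-log(s / s_max) for s in scores]

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]

lof_like = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]  # last instance is the outlier
n = gaussian_scaling(lof_like)
abod_like = [80000.0, 60000.0, 100.0]        # lowest score is the outlier
r = regularize_log_inverse(abod_like)
print([round(v, 3) for v in n])
```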


Generating Labeled Data for Validation

Generating both normal and anomalous data:

• Use generative models for normal and anomalous behavior
• Several generators are available:
  • Multivariate continuous data [27]
  • Multivariate categorical data [5]
  • Discrete sequences using HMM [12]
• Drawback: might not capture the domain characteristics

Injecting anomalies: random perturbation [30]

• Given a data point z = (x, y), where x and y are partitions of the feature space
• Take a random sample D of the entire data set
• Let z′ = (x′, y′) ∈ D such that the distance between y and y′ is maximum
• Replace x with x′ and add z back to the data set

[Figure: points z and z′ in the (x, y) plane, with the injected anomaly marked.]
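The random-perturbation procedure can be sketched as below. Squared Euclidean distance and the tuple representation of (x, y) are assumptions for this example; the toy data makes x and y perfectly correlated so the injected point visibly breaks the relationship.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def inject_anomaly(z, data, sample_size, rng):
    """Swap z's x-part with that of the sampled point whose y-part is farthest."""
    x, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: sq_dist(y, point[1]))
    return (x_far, y)

# Toy data set in which x and y are perfectly correlated.
data = [((float(i),), (float(i),)) for i in range(10)]
injected = inject_anomaly(data[0], data, sample_size=len(data), rng=random.Random(0))
print(injected)
```

Each part of the injected point is individually plausible, because both come from real records; only their combination is unusual.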


Applications Overview

How to set up an anomaly detection solution for a given application domain?

• Available data
• Define anomalies, define normal behavior
• Identify requirements and constraints (online, real-time, limited resources)
• What domain knowledge is available?
  • Feature identification
  • Defining normal and anomalous behavior
  • Tuning parameters
• Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly detection or intrusion detection?

• Traditional Intrusion Detection Systems (IDS) find attacks that match a predefined set of patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks.
• Network Anomaly Detection Systems (NADS) aim to detect zero-day attacks without any pre-identified signature; they also profile the normal behavior of the network and flag suspected incidents.
• Network anomaly detection: finding unusual and large changes in the network traffic.
• Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds).


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart, vertical axis from 0 to 3000000, with the following values per period.]

Period                 Value
2002                   20547
2003                   18827
2004                   69107
2005                   113025
2006                   140690
2007                   624267
2008                   1656227
July 2009-June 2010    2704836


Network Topology

A typical network

[Figure: a router connected to nodes 1 to 4, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$.]

• An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$).
• A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$).
• The relationship between link traffic and OD flow traffic is captured by the routing matrix A.
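The role of the routing matrix can be shown with a toy example, assuming the usual linear relation y = A x between OD flows x and link loads y. The matrix and the flow values are hypothetical.

```python
def link_traffic(A, x):
    """y = A x: A[l][f] is 1 if OD flow f is routed over link l, else 0."""
    return [sum(a * flow for a, flow in zip(row, x)) for row in A]

# Two links, four OD flows (hypothetical routing).
A = [[1, 1, 0, 0],
     [0, 1, 1, 1]]
x = [10, 20, 30, 40]
y = link_traffic(A, x)
print(y)
```

Because each link aggregates several OD flows, a change in a single flow is diluted in the link measurement.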


Network Anomalies Detection Problem


Why care about OD Flows

• A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).
• If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus, a naive approach won't work if OD flow information is not available.

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set.

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of the matrix P.

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly".
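The five steps can be sketched without external libraries. This is an illustration under assumptions: power iteration with deflation stands in for a full eigendecomposition, the data is centered before projection, and the two-link toy data is invented.

```python
from math import sqrt

def mean_and_covariance(rows):
    """Column means and sample covariance matrix of the measurements."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mu[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return mu, cov

def top_components(cov, k, iters=200):
    """Steps 1-2: top-k eigenvectors by power iteration with deflation."""
    d = len(cov)
    cov = [row[:] for row in cov]
    components = []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(d)]  # fixed, deterministic start
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        components.append(v)
        # Deflate: remove the component just found.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return components

def residual_norm(z, mu, components):
    """Steps 3-5: norm of the part of z outside the normal subspace."""
    c = [a - m for a, m in zip(z, mu)]
    proj = [0.0] * len(c)
    for p in components:
        dot = sum(a * b for a, b in zip(c, p))
        proj = [q + dot * b for q, b in zip(proj, p)]
    return sqrt(sum((a - q) ** 2 for a, q in zip(c, proj)))

# Toy traffic on two links that normally rise and fall together.
rows = [(float(t), t + 0.01 * (-1) ** t) for t in range(20)]
mu, cov = mean_and_covariance(rows)
P = top_components(cov, k=1)
print(residual_norm((5.0, 5.01), mu, P), residual_norm((5.0, 15.0), mu, P))
```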


Subspace Analysis Results

• Note that during an anomaly the normal component does not change much, while the residual component changes considerably.
• Thus, anomalies can be detected by setting a threshold on the residual component.

Figure source: [22]


Discussion Typical Characteristics of Anomaly

• Most anomalies induce a change in the distributional aspects of packet header fields (called features).
• The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

[Figure: port scan anomaly. The traffic feature distribution changes, while the traffic volume remains more or less the same. The port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source: [22]]

We can use entropy to capture the variations in the traffic features:

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed
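The two extreme cases of the entropy are easy to check numerically. The sketch below computes the empirical Shannon entropy of a traffic feature; the port lists are invented, with the second one mimicking a scan that touches N = 8 distinct ports.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (in bits) of a list of feature values."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

concentrated = [80] * 8                           # all packets to one port
dispersed = [21, 22, 23, 25, 53, 80, 110, 443]    # 8 distinct ports

print(entropy(concentrated), entropy(dispersed))
```

Both samples contain the same number of packets, so a volume metric cannot tell them apart, while the entropy separates them.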


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
• Scan anomaly: a spike in traffic from a dominant source IP
• Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data.

Data:

• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges:

• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domains and players:

• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit cards: Customer, Supplier, Bank
• Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]:

1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]:

1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:

• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses:

• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
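Activity monitoring can be reduced to a very small sketch: one profile per user and a deviation measure. The profile form (mean and standard deviation of monthly call counts), the numbers, and the cut-off of 3 are all assumptions for illustration; the code also assumes the history is not constant.

```python
from statistics import mean, stdev

def build_profile(history):
    """Profile of one user: mean and standard deviation of past activity."""
    return mean(history), stdev(history)

def deviation(current, profile):
    """Number of standard deviations between current behavior and the profile."""
    mu, sigma = profile
    return abs(current - mu) / sigma

# Hypothetical call counts of user X in past Januaries.
profile = build_profile([100, 110, 90, 105, 95])
print(deviation(102, profile) > 3, deviation(400, profile) > 3)
```

The check itself is a single subtraction and division, which is why this family of methods is fast at detection time; the cost lies in keeping one profile per individual up to date.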


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:

• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: identify anomalous nodes.

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1].

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node i
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thus transformed into a point in a multi-dimensional space.


Identifying Anomalies

Traditional anomaly detection (LOF):

• Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:

• Identify relevant feature pairs and their power-law relationship
• E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
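The egonet features and the power-law score can be sketched on two tiny unweighted graphs. The adjacency-dictionary representation and the values of C and θ are assumptions; in practice C and θ would be fitted to all nodes of the graph, and both y and the fitted value must be positive.

```python
from math import log

def egonet_features(adj, i):
    """N_i (degree) and E_i (edges inside the egonet) of node i."""
    ego = adj[i] | {i}
    n_i = len(adj[i])
    e_i = sum(1 for u in ego for v in adj[u] if v in ego and u < v)
    return n_i, e_i

def power_law_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * log(abs(y - expected) + 1)

star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
clique = {i: set(range(4)) - {i} for i in range(4)}

print(egonet_features(star, 0), egonet_features(clique, 0))
# Hypothetical fit E = 1.0 * N^1.5: the star centre has far fewer edges than expected.
print(power_law_score(5, 5, C=1.0, theta=1.5))
```

A star sits at the lower end of the power law (E grows linearly with N) and a clique at the upper end (E grows quadratically), so both stand out against a fitted exponent in between.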


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:

• Healthcare data (beneficiaries vs. providers)
• Insider trading (traders vs. stocks)

Objective: identify anomalous links.

• Given a query node $a \in V_1$, find the “relevance” of all other nodes in $V_1$ to a
  • RelevanceScore(a, b) ∝ number of times a “random walk” from a reaches b
• Use the relevance scores to compute the normality score for a node $t \in V_2$:
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$, connected by edges E.]
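One way to make the random-walk relevance concrete is a random walk with restart, computed by power iteration. This is a sketch under assumptions: the restart probability, taking the similarity of a and b to be the relevance of b to a, and the toy trader/stock graph are all illustrative choices, and every node is assumed to have at least one neighbor.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from (a, t) edge pairs."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, set()).add(t)
        adj.setdefault(t, set()).add(a)
    return adj

def relevance(adj, a, restart=0.15, iters=100):
    """Visit probabilities of a random walk that restarts at the query node a."""
    p = {n: 0.0 for n in adj}
    p[a] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for u, mass in p.items():
            for v in adj[u]:
                nxt[v] += (1 - restart) * mass / len(adj[u])
        nxt[a] += restart
        p = nxt
    return p

def normality(adj, t):
    """Mean off-diagonal relevance among the neighbors of t (needs >= 2 neighbors)."""
    s_t = sorted(adj[t])
    rel = {a: relevance(adj, a) for a in s_t}
    pairs = [(a, b) for a in s_t for b in s_t if a != b]
    return sum(rel[a][b] for a, b in pairs) / len(pairs)

# Two trader communities; stock t5 links traders from different communities.
edges = [("a1", "t1"), ("a2", "t1"), ("a3", "t1"),
         ("a1", "t2"), ("a2", "t2"), ("a3", "t2"),
         ("a4", "t3"), ("a5", "t3"), ("a4", "t4"), ("a5", "t4"),
         ("a1", "t5"), ("a4", "t5")]
adj = build_adjacency(edges)
print(normality(adj, "t1") > normality(adj, "t5"))
```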


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks

• Anthrax attack

Data:

• Emergency department visits
• Grocery data (example: [14])
• Clinical visits
• Weather/climate data

Challenges:

• Weak signals in the data
  • ED cases involving cough ⇒ flu or SARS?
• Integration of multiple signals (lag analysis)
• Accounting for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules on $DB_{baseline}$ and $DB_{current}$
   • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one). Repeat Step 2.
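Step 4 can be illustrated for a single rule with a 2×2 contingency table. The sketch uses the plain Pearson χ² statistic with one degree of freedom, for which the p-value equals erfc(sqrt(χ²/2)); the counts are invented, and no continuity or multiple-testing correction is applied.

```python
from math import erfc, sqrt

def chi2_pvalue_2x2(a, b, c, d):
    """Pearson chi-square test on the table [[a, b], [c, d]].

    Rows: (current, baseline); columns: (rule matches, rule does not match).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, erfc(sqrt(chi2 / 2))  # p-value for 1 degree of freedom

# Hypothetical rule: 20 of 100 current cases match, versus 50 of 10000 at baseline.
chi2, p = chi2_pvalue_2x2(20, 80, 50, 9950)
# Same matching rate today and at baseline: no evidence of a change.
chi2_same, p_same = chi2_pvalue_2x2(10, 90, 100, 900)
print(p < 0.05, p_same)
```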


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships:

• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:

• Add nodes for temporal and spatial dependencies
• Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:

• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:

• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:

• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

Most analysis is done on “anomaly” time series:

• Difference from a “base period” (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena:

• Southern Oscillation Index (SOI)
  • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference).

How to choose $b_i$?
  • Mean of all data for location i
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter “reference period”
      • 30 year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
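A minimal sketch of one of the base choices above, the monthly mean, written in Python. The function name `monthly_anomalies` and the toy values are illustrative assumptions, not part of the tutorial.

```python
from statistics import mean

def monthly_anomalies(values, months):
    """Return v'_i = v_i - b_i, where b_i is the mean of all values in the same calendar month."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]

# Hypothetical two-year series for one location: January (1) and July (7).
values = [2.0, 20.0, 4.0, 22.0]
months = [1, 7, 1, 7]
anoms = monthly_anomalies(values, months)
```

The seasonal gap between January and July disappears from `anoms`; only the year-to-year departure from each month's own mean remains.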


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
      • Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
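The two steps can be sketched in plain Python as follows. This is an illustration only: the threshold, the toy grid, and the use of 6-connectivity (face neighbors in time, row, and column) are assumptions; Matlab's bwlabeln may use a different default connectivity.

```python
from collections import deque

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def connected_components(mask):
    """Label 6-connected components of True cells in a 3-D (time, row, col) mask."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * C for _ in range(R)] for _ in range(T)]
    sizes = {}
    next_label = 0
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or labels[t][r][c]:
                    continue
                next_label += 1
                labels[t][r][c] = next_label
                queue, size = deque([(t, r, c)]), 0
                while queue:  # breadth-first flood fill of one component
                    a, b, d = queue.popleft()
                    size += 1
                    for da, db, dd in NEIGHBORS:
                        na, nb, nd = a + da, b + db, d + dd
                        if (0 <= na < T and 0 <= nb < R and 0 <= nd < C
                                and mask[na][nb][nd] and not labels[na][nb][nd]):
                            labels[na][nb][nd] = next_label
                            queue.append((na, nb, nd))
                sizes[next_label] = size
    return labels, sizes

# Toy precipitation anomalies: 3 time steps on a 2x3 grid (made-up values).
precip = [
    [[-2.0, -1.5, 0.1], [0.3, 0.2, 0.0]],
    [[-1.8, 0.2, 0.1], [0.1, 0.4, -1.2]],
    [[-1.6, 0.0, 0.2], [0.2, 0.1, 0.3]],
]
threshold = -1.0  # assumed cut-off: cells below it count as dry
mask = [[[v < threshold for v in row] for row in grid] for grid in precip]  # step 1
labels, sizes = connected_components(mask)                                   # step 2
persistent = {lab: n for lab, n in sizes.items() if n >= 3}
```

Here the dry cell that persists at one location over all three time steps forms the large component, while the isolated dry cell is dropped by the size filter.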


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid
  2. Cluster grids into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
      • Not identified in timely fashion
      • Resolution is not fine enough
      • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
      • Anomalies in Climate
      • Validation of Anomalies

Treating Sequences as Points

Utilize a distance/similarity measure
  • Plug into a distance/density/clustering based method

Simplest: Hamming distance

$$h(A_i, B_i) = \begin{cases} 1 & A_i \neq B_i \\ 0 & A_i = B_i \end{cases}$$

$$H(A, B) = \sum_{i=1}^{n} h(A_i, B_i)$$

Issues: unequal lengths, misalignment

Normalized Length of Longest Common Subsequence

$$D(A, B) = 1 - \frac{|LCS(A, B)|}{\sqrt{|A||B|}}$$

  • Standard dynamic programming method is slow
  • Faster versions available (Hunt-Szymanski method [7])

Weaknesses
  • Cannot localize anomalies within a sequence
  • Weak anomaly signals might get lost
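Both measures above can be sketched in a few lines of Python. This uses the standard (slow) dynamic programming LCS, not the Hunt-Szymanski variant; the function names are illustrative, and non-empty sequences are assumed for the normalized distance.

```python
from math import sqrt

def hamming(a, b):
    """H(A, B): number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def lcs_length(a, b):
    """|LCS(A, B)| via the standard O(|A||B|) dynamic program, one row at a time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def nlcs_distance(a, b):
    """D(A, B) = 1 - |LCS(A, B)| / sqrt(|A||B|); assumes both sequences are non-empty."""
    return 1.0 - lcs_length(a, b) / sqrt(len(a) * len(b))
```

Either function can then serve as the distance inside a nearest-neighbor, density, or clustering based detector.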


Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence (n − k + 1)

Training (creating a normal dictionary)
  • Store all unique windows in all normal sequences and their counts

Testing
  • For each window, find the frequency in the normal dictionary
  • Anomaly score is the inverse of the aggregate frequencies for all windows (normalized by length)

Many variants exist
  • For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
  • Penalizes low frequency windows in the normal dictionary
  • Rewards high frequency windows that might not be relevant
  • Can construct anomalous sequences that will escape detection
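A minimal sketch of the training and testing steps above. The `+ 1` in the score is an added assumption to avoid dividing by zero when no window was seen in training, and test sequences are assumed to be at least k symbols long.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def train(normal_sequences, k):
    """Normal dictionary: window -> count over all normal sequences."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, counts, k):
    """Inverse of the length-normalized aggregate window frequency."""
    ws = windows(seq, k)
    avg_freq = sum(counts[w] for w in ws) / len(ws)
    return 1.0 / (1.0 + avg_freq)   # +1 is an assumed guard against division by zero

counts = train(["abcabcabc", "abcabc"], k=3)
normal_score = anomaly_score("abcabc", counts, k=3)
odd_score = anomaly_score("abxabc", counts, k=3)
```

The sequence containing the unseen symbol `x` hits three windows with zero frequency, so its score is higher than that of the normal-looking sequence.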


Using Probabilistic Models

Probability of occurrence of sequence S:

$$P(S) = \prod_{i=1}^{n} P(S_i | S_1 \ldots S_{i-1})$$

Short memory property of sequences:

$$P(S_i | S_1 \ldots S_{i-1}) = P(S_i | S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol $S_i$:

$$P(S_i | S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

where f, the frequency of a subsequence, is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: what if the suffix occurs very infrequently in the normal data (or not at all)?
  • Replace with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
  • Significantly reduces the size of the model
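A rough sketch of the frequency-ratio estimate with back-off to the longest sufficiently frequent suffix. It is a simplification, not a full Probabilistic Suffix Tree: `min_count`, the probability `floor`, and the use of the negative average log-probability as the score (instead of a literal inverse) are all assumptions made for illustration.

```python
from collections import Counter
from math import log

def ngram_counts(normal_sequences, k):
    """f(.): counts of every subsequence of length 1..k+1 in the normal data."""
    f = Counter()
    for seq in normal_sequences:
        for n in range(1, k + 2):
            for i in range(len(seq) - n + 1):
                f[tuple(seq[i:i + n])] += 1
    return f

def cond_prob(symbol, context, f, min_count=1, floor=1e-6):
    """P(symbol | context), backing off to the longest suffix seen at least min_count times."""
    context = tuple(context)
    while context and f[context] < min_count:
        context = context[1:]                      # drop the oldest symbol
    if context:
        p = f[context + (symbol,)] / f[context]
    else:                                          # no usable context: unigram frequency
        total = sum(c for gram, c in f.items() if len(gram) == 1)
        p = f[(symbol,)] / total
    return max(p, floor)                           # floor keeps the logarithm finite

def anomaly_score(seq, f, k):
    """Negative average log-probability of the sequence (higher = more anomalous)."""
    logp = sum(log(cond_prob(seq[i], seq[max(0, i - k):i], f)) for i in range(len(seq)))
    return -logp / len(seq)

f = ngram_counts(["abababab"], k=2)
```

With this toy model, `anomaly_score("abba", f, 2)` comes out well above `anomaly_score("abab", f, 2)` because the transition `ab -> b` never occurs in the normal data.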


Finding Contextual Anomalies

Sometimes contextual information is available about the data
  • Not used directly as a feature
  • Is well understood; no anomalies in the context
  • Can reduce false positives and yield interesting anomalies

Figure (example adapted from [30]): number of fever cases versus maximum daily temperature, with the clearest outlier and a conditional anomaly marked

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables
  • Spatial (latitude, longitude)
  • Graph context (edges, weights)
  • Temporal location
  • Domain specific (demographic, other)

How to incorporate context
  • Reduce to traditional anomaly detection (subset on context)
  • Explicitly model contextual information (time series, spatial)
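The first option, subsetting on context, can be sketched as a z-score computed separately within each context. The temperature bands and fever counts below are made up for illustration, and the z-score is only one possible choice of traditional detector.

```python
from statistics import mean, pstdev

def contextual_zscores(records):
    """records: list of (context, value). Returns |z| of each value within its own context."""
    groups = {}
    for ctx, v in records:
        groups.setdefault(ctx, []).append(v)
    stats = {c: (mean(vs), pstdev(vs)) for c, vs in groups.items()}
    scores = []
    for ctx, v in records:
        mu, sd = stats[ctx]
        scores.append(abs(v - mu) / sd if sd > 0 else 0.0)
    return scores

# Hypothetical daily fever cases, with the temperature band as context.
cold = [("cold", v) for v in [50, 52, 48, 51, 49]]
hot = [("hot", v) for v in [5, 6, 4, 5, 6, 4, 5, 5, 5, 50]]
records = cold + hot
z = contextual_zscores(records)
most_anomalous = max(range(len(z)), key=z.__getitem__)
```

Fifty cases is an ordinary count on a cold day but stands out on a hot day, so the last record gets the largest score even though its value is not globally unusual.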


Conditional Anomaly Detection

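Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$
  • $d_U$ environmental attributes
  • $d_V$ indicator attributes

Figure: mixture components $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$

Algorithm [30]
  1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
  2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
  3. Learn a probabilistic mapping function $p(V_j | U_i)$
  4. Score a test instance $d = [u, v]$:

$$S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j)\, p(V_j | U_i)$$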
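To make the scoring step concrete, here is a small sketch that evaluates S for one-dimensional Gaussian components. The components and the mapping are hand-picked, hypothetical values rather than learned ones, and reading a low S as "anomalous" is an assumption based on S being a likelihood.

```python
from math import exp, pi, sqrt

def gauss(x, mu, sigma):
    """Density of a 1-D Gaussian N(mu, sigma^2) at x."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i), with mapping[i][j] = p(V_j|U_i)."""
    return sum(
        gauss(u, *U_i) * sum(gauss(v, *V_j) * mapping[i][j] for j, V_j in enumerate(V))
        for i, U_i in enumerate(U)
    )

# Hypothetical model: environment = temperature, indicator = fever cases.
U = [(0.0, 1.0), (10.0, 1.0)]      # (mean, std) of "cold" and "hot" components
V = [(5.0, 1.0), (50.0, 5.0)]      # (mean, std) of "few cases" and "many cases"
mapping = [[0.1, 0.9],             # cold -> mostly many cases
           [0.9, 0.1]]             # hot  -> mostly few cases

expected = cad_score(10.0, 5.0, U, V, mapping)    # hot day, few cases
unusual = cad_score(10.0, 50.0, U, V, mapping)    # hot day, many cases
```

The hot day with many fever cases receives a much smaller S than the hot day with few cases, which matches the conditional anomaly in the figure.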


Finding Collective Anomalies

Find a collection of data points
  • Each point by itself is normal
  • The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A simple solution
  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples
  1. Time series
  2. Image
  3. Spatial clusters of galaxies
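The simple three-step solution might look like this for a time series. Non-overlapping windows, the within-group standard deviation as the only feature, and a z-score threshold of 2 are all illustrative assumptions.

```python
from statistics import mean, pstdev

def group_anomalies(series, group_size, z_threshold=2.0):
    # 1. Break data into groups (non-overlapping windows).
    groups = [series[i:i + group_size]
              for i in range(0, len(series) - group_size + 1, group_size)]
    # 2. Compute a feature for each group (here: spread within the group).
    feats = [pstdev(g) for g in groups]
    # 3. Apply traditional anomaly detection on the features (here: z-score).
    mu, sd = mean(feats), pstdev(feats)
    return [i for i, f in enumerate(feats) if sd > 0 and abs(f - mu) / sd > z_threshold]

# Every value is 0 or 1, so no single point is unusual; the flat run of 1s is.
series = [0, 1, 0, 1] * 5 + [1, 1, 1, 1] + [0, 1, 0, 1] * 4
flagged = group_anomalies(series, group_size=4)
```

Only the flat group is flagged, although each of its values also occurs in every normal group.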


Using Latent Dirichlet Allocation for Group Anomalies

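Find anomalous groups in data [34]

Example: spatial clusters of galaxies
  • Topics: red, green, emissive
  • Words: continuous features

Flexible Genre Model (FGM)

For each group
  1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim \mathcal{M}(\pi)$
  2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim Dir(\alpha_{y_m})$
  3. Draw K topics $\{\beta_{m,k} \sim P(\beta_{m,k} | \eta_k)\}_{k=1,2,\ldots,K}$
  4. For each point in the group
      1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
      2. Generate point $x_{mn} \sim P(x_{mn} | \beta_{m,z_{mn}})$

Figure: plate diagram of the model over $\alpha$, $\pi$, $\eta$, $y_m$, $\theta_m$, $\beta$, $z_{mn}$, $x_{mn}$ with plates N, M, K, T

Model parameters
  • $\mathcal{M}(\pi)$: Multinomial
  • Genre: $Dir(\alpha_t)$
  • Topic generators $P(\cdot | \eta_k)$: GIW (Gaussian-Inverse-Wishart)
  • Point generators $P(x_{mn} | \beta_{m,k})$: Multivariate Gaussian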


Inference and Testing for FGM

Inference and learning parameters
  • Approximate inference of latent variables (Gibbs sampling)
  • Use samples to learn parameters (single step Monte Carlo EM)

Figure (labels: Topic 1, Topic 2, Topic 3)

Anomaly detection
  • Infer the topic distribution $\theta_m$
  • Compute negative log likelihood w.r.t. $\alpha_t$
  • Rationale: an anomalous group will be unlikely to be generated from any genre

Geometric interpretation: mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  • Confusion matrix
  • Traditional evaluation metrics
      • Class imbalance
  • ROC curve

Validation set does not exist
  • Use domain expertise to find TP and FP
  • FN is harder to estimate
      • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal)

                Predicted
                a      n
  Actual   a    TP     FN
           n    FP     TN

$$Acc = \frac{TP + TN}{TP + FN + FP + TN}$$

$$Rec\ (R) = \frac{TP}{TP + FN}$$

$$Prec\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
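The metrics above are straightforward to compute from two label lists. The function name, the label encoding ("a" for anomalous, "n" for normal), and the toy labels are assumptions for this sketch.

```python
def evaluate(actual, predicted, positive="a"):
    """Confusion-matrix counts and the derived metrics for binary anomaly labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    tn = len(pairs) - tp - fn - fp
    acc = (tp + tn) / len(pairs)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"TP": tp, "FN": fn, "FP": fp, "TN": tn,
            "Acc": acc, "Rec": rec, "Prec": prec, "F": f}

actual    = ["a", "a", "n", "n", "n", "n", "n", "n", "n", "n"]
predicted = ["a", "n", "a", "n", "n", "n", "n", "n", "n", "n"]
m = evaluate(actual, predicted)
all_normal = evaluate(actual, ["n"] * 10)
```

The `all_normal` case illustrates the class imbalance issue: a detector that never raises an alarm still reaches 80% accuracy here, while its recall is zero.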


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  • Use threshold δ on score (scale issues [21])
  • Take top x as anomalies

ROC curve by varying x or δ

Quality of output
  • Does the output “suggest” x or δ?
  • Which output is better?
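A short sketch of how the ROC curve arises from sweeping the threshold δ over the scores. It assumes higher scores mean more anomalous, labels of 1 for anomalies and 0 for normal points, and at least one point of each class; the scores are made up.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold over every distinct score."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(s >= delta and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= delta and y == 0 for s, y in zip(scores, labels))
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
points = roc_points(scores, labels)
area = auc(points)
```

Taking the top x points as anomalies corresponds to stopping the sweep after x scores, so varying x traces the same curve.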


Unifying Scores

Different methods assign scores in different ranges
  • kNN-based scores: [0, 1]
  • LOF scores: [1, 6]
  • ABOD scores: [0, 80000]
      • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalous points

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized procedure for normalizing outlier scores [21]
  • Regularity $\Rightarrow$ $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  • Normality $\Rightarrow$ S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization
  1. $R(o) = \max\{0, S(o) - base_S\}$
  2. $R(o) = S_{max} - S(o)$
  3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization
  1. $N(o) = S(o) / S_{max}$
  2. Gaussian scaling: $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$
      • Suited for high dimensional data
  3. Gamma scaling: $N(o) = \max\left(0, \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$
      • where $cdf^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function
      • Suited for low dimensional data
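Gaussian scaling needs only the mean and standard deviation of the scores. A minimal sketch, assuming the scores are already regular (higher = more anomalous) and estimating $\mu_S$ and $\sigma_S$ from the scores themselves; the LOF-like values are made up.

```python
from math import erf, sqrt
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]   # all scores identical: nothing stands out
    return [max(0.0, erf((s - mu) / (sigma * sqrt(2)))) for s in scores]

lof_scores = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]   # hypothetical LOF-like scores
normalized = gaussian_scaling(lof_scores)
```

Scores at or below the mean collapse to 0 and the single large score is stretched toward 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior described earlier.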


Generating Labeled Data for Validation

Generating both normal and anomalous data
  • Use generative models for normal and anomalous behavior
  • Several generators available
      • Multivariate continuous data [27]
      • Multivariate categorical data [5]
      • Discrete sequences using HMM [12]
  • Drawbacks: might not capture the domain characteristics

Injecting anomalies: random perturbation [30]
  • Given a data point $z = (x, y)$, where x and y are partitions of the feature space
  • Take a random sample D of the entire data set
  • Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
  • Replace $x$ with $x'$ and add $z$ back to the data set

Figure: points $z$ and $z'$ and the injected anomaly in the (x, y) plane
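The random perturbation steps can be sketched as follows. Euclidean distance on y, the sample size, and the representation of each point as a pair of numeric tuples are assumptions made for this illustration.

```python
import random

def inject_anomaly(data, rng, sample_size=10):
    """data: list of (x, y) pairs of numeric tuples. Returns one injected anomaly (x', y)."""
    x, y = rng.choice(data)                                  # given data point z = (x, y)
    sample = rng.sample(data, min(sample_size, len(data)))   # random sample D

    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    x_far, _ = max(sample, key=lambda z: dist(y, z[1]))      # z' whose y' is farthest from y
    return (x_far, y)                                        # replace x with x'

# Hypothetical data set in which y depends on x (y = 2x).
data = [((i,), (2 * i,)) for i in range(20)]
rng = random.Random(0)
z_new = inject_anomaly(data, rng, sample_size=20)
```

The injected point pairs a y value with the x of the most dissimilar record, so it breaks the relationship between the two partitions while each partition on its own still looks ordinary.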


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
      • Feature identification
      • Defining normal and anomalous behavior
      • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g. Distributed Denial of Service, DDoS) or unusual network traffic (e.g. flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: bar chart of yearly counts (vertical axis 0 to 3,000,000)

  Period                   Count
  2002                     20,547
  2003                     18,827
  2004                     69,107
  2005                     113,025
  2006                     140,690
  2007                     624,267
  2008                     1,656,227
  July 2009 - June 2010    2,704,836


Network Topology

A typical network

Figure: a router with links numbered 1 to 4, link measurement $y_{1t}$, and OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g. $x_{1t}$)

Link measurement is the traffic that enters at a node during an interval (e.g. $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
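One common way to write this relationship, following the formulation in [22], is $y = Ax$: each link carries the sum of the OD flows routed over it. The small network below is hypothetical and only illustrates the bookkeeping.

```python
# Hypothetical 3-link, 4-OD-flow network: A[i][j] = 1 if OD flow j traverses link i.
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
x_t = [10.0, 5.0, 2.0, 8.0]   # OD flow volumes in interval t (made-up numbers)

def link_traffic(A, x):
    """y = A x: traffic on each link is the sum of the OD flows routed over it."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

y_t = link_traffic(A, x_t)
```

Because several OD flows share each link, a change in a single OD flow shows up only as part of one or more link totals.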


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P

Step 3: Project onto the normal traffic subspace to get $\hat{Z}$:

$$\hat{Z} = P P^{T} Z = C Z$$

Step 4: Project onto the abnormal traffic subspace to get $\tilde{Z}$:

$$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an “anomaly”
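The five steps translate almost line by line into NumPy. This is a sketch on synthetic data: the matrix layout (rows are time steps, columns are links), the two-dimensional "normal" traffic pattern, and the injected anomaly are all assumptions for illustration.

```python
import numpy as np

def subspace_residuals(Z, k):
    """Z: (time x links) matrix. Returns the squared residual norm for each time step."""
    Zc = Z - Z.mean(axis=0)                    # center each link's series
    cov = np.cov(Zc, rowvar=False)             # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues come back in ascending order
    P = eigvecs[:, -k:]                        # Step 2: top-k principal components
    C = P @ P.T                                # Step 3: projector onto the normal subspace
    residual = Zc - Zc @ C                     # Step 4: (I - P P^T) applied to each row
    return np.sum(residual ** 2, axis=1)       # Step 5: squared norm per time step

rng = np.random.default_rng(0)
W = np.array([[1, 1, 1, 1, 1, 1], [1, -1, 1, -1, 1, -1]], dtype=float)
Z = rng.normal(size=(200, 2)) @ W + 0.01 * rng.normal(size=(200, 6))
Z[50] += 2.0 * np.array([1, 1, -1, -1, 0, 0])  # synthetic anomaly outside the normal subspace
spe = subspace_residuals(Z, k=2)
```

The anomalous time step has by far the largest residual norm, so any reasonable threshold on `spe` separates it from the rest.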


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  • DoS attack: multiple source IP addresses concentrated on a single destination IP address
  • Network scan: dispersed distribution of destination addresses
  • Most worms/viruses also induce some change in the distribution of certain features
  • However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume

Limitation of Volume

Figure: Port scan anomaly. The traffic feature distribution changes, but the traffic volume remains more or less the same: the port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source: [22]

We can use entropy to capture the variations in the traffic features

• Takes value 0 when the distribution is maximally concentrated
• Takes value log2 N when the distribution is maximally dispersed
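The two boundary values of the entropy can be checked with a short sketch. It assumes the feature is observed as a list of values, one per packet or flow; the function name and the sample addresses are hypothetical.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical entropy (in bits) of one traffic feature, e.g. destination IP."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Maximally concentrated: every packet goes to the same destination.
concentrated = feature_entropy(["10.0.0.1"] * 8)
# Maximally dispersed: N = 8 distinct destinations, one packet each.
dispersed = feature_entropy(["10.0.0.%d" % i for i in range(8)])
```

With N = 8 distinct, equally likely values the entropy equals log2 8 = 3 bits.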

Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port

Fraud Detection

Domain question: Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Example domains and their players
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit cards: Customer, Supplier, Bank
• Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud

A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations
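The two activity-monitoring steps can be sketched as follows. The deviation measure (a z-score against the user's own history) and the call counts are assumptions made for illustration; the slide does not prescribe a specific statistic.

```python
import statistics

def deviation_score(history, current):
    """Number of standard deviations by which `current` departs from the profile."""
    mean = statistics.mean(history)
    std = statistics.stdev(history)
    if std == 0:
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / std

# Hypothetical profile: calls made by user X in past Januaries.
january_calls = [40, 42, 38, 41, 39]

score_typical = deviation_score(january_calls, 41)
score_unusual = deviation_score(january_calls, 95)
```

A cut-off such as three standard deviations would flag only the second observation.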

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives

Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
1. N_i: degree of node i
2. E_i: number of edges in the egonet
3. W_i: total weight of the egonet
4. λ_i: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space
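A small sketch of the first three egonet features, assuming an undirected graph without self-loops stored as a list of (u, v, weight) triples. The function name and the toy graph are hypothetical, and the principal eigenvalue is left out to keep the example short.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`."""
    neighbours = set()
    for u, v, _ in edges:
        if u == node:
            neighbours.add(v)
        elif v == node:
            neighbours.add(u)
    members = neighbours | {node}
    # Edges of the induced sub-graph: both endpoints lie inside the egonet.
    inside = [(u, v, w) for u, v, w in edges if u in members and v in members]
    return len(neighbours), len(inside), sum(w for _, _, w in inside)

# Hypothetical weighted graph.
edges = [("a", "b", 1.0), ("a", "c", 2.0), ("b", "c", 0.5), ("c", "d", 4.0)]
features_a = egonet_features(edges, "a")
```

Node "a" has two neighbours, and its egonet contains the three edges among a, b and c; the edge to d falls outside it.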

Identifying Anomalies

Traditional anomaly detection (LOF)
• Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
• Identify relevant feature pairs and their power-law relationship
• E.g., Egonet Density Power Law: N_i vs. E_i, which detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node i with respect to a pair of features ($y = Cx^{\theta}$):

$$S_i = \frac{\max(y_i, Cx_i^{\theta})}{\min(y_i, Cx_i^{\theta})} \cdot \log\left(|y_i - Cx_i^{\theta}| + 1\right)$$
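The score can be evaluated directly once C and θ have been fitted. In this sketch the fitted values C = 1.0 and θ = 1.5 are hypothetical, the natural logarithm is assumed, and the feature pair is (N_i, E_i).

```python
import math

def power_law_score(x, y, c, theta):
    """Distance of the observed y from the power-law fit y = C * x^theta."""
    expected = c * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

C, THETA = 1.0, 1.5  # hypothetical fit of E_i against N_i

typical = power_law_score(10, 32, C, THETA)  # close to the fitted line
star = power_law_score(10, 10, C, THETA)     # star: E_i = N_i
clique = power_law_score(10, 55, C, THETA)   # clique: E_i = N_i + N_i(N_i - 1)/2
```

Both the star and the clique deviate from the fit and therefore receive much larger scores than the typical node.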

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node a ∈ V1, find the "relevance" of all other nodes in V1 to a
• RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node t ∈ V2
• Find the set S_t = {a | ⟨a, t⟩ ∈ E}
• Compute the |S_t| × |S_t| similarity matrix using relevance vectors for a ∈ S_t
• Normality score = mean of the non-diagonal entries of the similarity matrix
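One way to realise these steps is a random walk with restart, sketched here under several assumptions: the restart probability of 0.15, the use of the relevance score itself as the similarity entry, and the toy graph are all illustrative choices rather than details taken from the slides.

```python
def relevance_vector(adj, start, restart=0.15, iters=100):
    """Approximate visiting probabilities of a random walk with restart at `start`."""
    p = {n: 0.0 for n in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for node, mass in p.items():
            share = (1 - restart) * mass / len(adj[node])
            for nb in adj[node]:
                nxt[nb] += share
        nxt[start] += restart  # the walker jumps back to the query node
        p = nxt
    return p

def normality_score(adj, t):
    """Mean off-diagonal relevance among the V1 neighbours of t (needs >= 2 neighbours)."""
    s_t = adj[t]
    total, count = 0.0, 0
    for a in s_t:
        rel = relevance_vector(adj, a)
        for b in s_t:
            if b != a:
                total += rel[b]
                count += 1
    return total / count

# Hypothetical graph: two communities of a-nodes, with t5 bridging them.
adj = {
    "a1": ["t1", "t2", "t5"], "a2": ["t1", "t2"], "a3": ["t1", "t2"],
    "a4": ["t3", "t4", "t5"], "a5": ["t3", "t4"], "a6": ["t3", "t4"],
    "t1": ["a1", "a2", "a3"], "t2": ["a1", "a2", "a3"],
    "t3": ["a4", "a5", "a6"], "t4": ["a4", "a5", "a6"],
    "t5": ["a1", "a4"],
}
score_t1 = normality_score(adj, "t1")
score_t5 = normality_score(adj, "t5")
```

The bridging node t5 links a-nodes that are weakly relevant to each other, so it receives the lower normality score.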

Figure: Bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} connected by edges E

Detecting Disease Outbreaks

Domain question: Early detection of disease outbreaks
• Example: anthrax attack

Data
• Emergency department (ED) visits
• Grocery data (example: [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data
  • ED cases involving cough ⇒ flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations

What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN (DB_baseline | current environment)
3. Compute contingency tables for rules for DB_baseline and DB_current
   • Rules are single assignment rules (X_i = Y_i^j) or conjunctions
4. Find the p-value for each rule using a χ² test
   • Null hypothesis: rows and columns of the tables for DB_baseline and DB_current are independent
5. Find the rule with the smallest p-value. Repeat Step 2
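Step 4 can be illustrated for a single rule with a 2 × 2 contingency table. The sketch assumes one degree of freedom, for which the χ² tail probability equals erfc(sqrt(χ²/2)); the rule names and counts are hypothetical.

```python
import math

def chi_square_p_value(table):
    """p-value of the chi-square independence test on a 2x2 table (1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))

# Rows: DB_current, DB_baseline. Columns: records matching the rule, records not matching.
rules = {
    "symptom = cough": [[30, 70], [100, 900]],   # 30% today versus 10% in the baseline
    "gender = female": [[10, 90], [100, 900]],   # the same 10% in both
}
p_values = {rule: chi_square_p_value(t) for rule, t in rules.items()}
most_significant = min(p_values, key=p_values.get)
```

The rule whose frequency differs most between the two data sets has the smallest p-value and is the one reported.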

Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train

Anomaly Detection in Climate and Weather

Science questions
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause

Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk

Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where b_i is the base (reference)

How to choose b_i?
• Mean of all data for location i
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
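As an illustration of one choice of base, the sketch below subtracts the monthly mean from each observation at a single location. The data layout (a list of (month, value) pairs) and the sample values are assumptions.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(series):
    """v' = v - b, where b is the mean of all values in the same calendar month."""
    by_month = defaultdict(list)
    for month, value in series:
        by_month[month].append(value)
    base = {month: mean(values) for month, values in by_month.items()}
    return [value - base[month] for month, value in series]

# Hypothetical temperatures for one location: two Januaries and two Julys.
series = [(1, 2.0), (7, 20.0), (1, 4.0), (7, 24.0)]
anomalies = monthly_anomalies(series)
```

Because each month has its own base, the large January-to-July difference disappears from the anomaly series, which is how this choice accounts for seasonality.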

Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel, bwlabeln
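The two steps can be sketched without Matlab using a breadth-first search over a (time, row, column) grid. The 6-neighbour connectivity, the threshold, and the toy precipitation values are assumptions for illustration.

```python
from collections import deque

NEIGHBOUR_OFFSETS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def drought_components(precip, threshold):
    """Connected sets of cells with precip[t][r][c] < threshold, largest first."""
    dry = {
        (t, r, c)
        for t, grid in enumerate(precip)
        for r, row in enumerate(grid)
        for c, value in enumerate(row)
        if value < threshold
    }
    seen, components = set(), []
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            t, r, c = queue.popleft()
            for dt, dr, dc in NEIGHBOUR_OFFSETS:
                nb = (t + dt, r + dr, c + dc)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    component.add(nb)
                    queue.append(nb)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical precipitation: 2 time steps on a 2 x 3 grid.
precip = [
    [[1, 9, 9], [9, 9, 2]],
    [[1, 9, 9], [1, 9, 9]],
]
components = drought_components(precip, threshold=5)
```

The largest component spans both time steps, which is the kind of persistent anomaly the approach looks for; the isolated dry cell forms a component of size one.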

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee

More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology
1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge

Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)

  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies
Using Sliding Windows

Slide a window of size k

Extract all windows from a sequence of length n (n − k + 1 windows)

Training (creating a normal dictionary): store all unique windows in all normal sequences and their counts

Testing
• For each window, find its frequency in the normal dictionary
• Anomaly score is the inverse of the aggregate frequency of all windows (normalized by length)
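The training and testing steps can be sketched as follows. Treating sequences as strings, averaging the window frequencies, and returning infinity when no window is known are illustrative assumptions.

```python
from collections import Counter

def windows(seq, k):
    """All n - k + 1 windows of size k."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]

def build_dictionary(normal_sequences, k):
    """Normal dictionary: every unique window with its count."""
    counts = Counter()
    for seq in normal_sequences:
        counts.update(windows(seq, k))
    return counts

def anomaly_score(seq, dictionary, k):
    """Inverse of the average dictionary frequency of the windows of `seq`."""
    wins = windows(seq, k)
    average = sum(dictionary[w] for w in wins) / len(wins)
    return float("inf") if average == 0 else 1.0 / average

# Hypothetical symbol sequences.
normal_dictionary = build_dictionary(["abcabcabc", "abcabc"], k=3)
score_normal = anomaly_score("abcabc", normal_dictionary, k=3)
score_anomalous = anomaly_score("acbacb", normal_dictionary, k=3)
```

The second test sequence uses the same symbols in an order never seen in training, so none of its windows appear in the dictionary.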

Many variants exist
• For each window, find the Hamming distance to the closest window in the normal dictionary [16]

Issues
• Penalizes low frequency windows in the normal dictionary
• Rewards high frequency windows that might not be relevant
• Can construct anomalous sequences that will escape detection

Using Probabilistic Models

Probability of occurrence of sequence S:

$$P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1})$$

Short memory property of sequences:

$$P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1})$$

Conditional probability estimates for a symbol S_i:

$$P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})}$$

where the frequency f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence
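A minimal sketch of the fixed-memory model follows. It departs from the slide in two labeled ways: it reports the negative average log-probability (which increases as the normalized probability decreases) rather than a literal inverse, and it replaces zero estimates with a small floor value. It also skips the first k symbols, which lack a full context. All names and sequences are hypothetical.

```python
import math
from collections import Counter

def train_counts(normal_sequences, k):
    """Count each window of length k + 1 and its length-k context."""
    full, context = Counter(), Counter()
    for seq in normal_sequences:
        for i in range(len(seq) - k):
            full[seq[i:i + k + 1]] += 1
            context[seq[i:i + k]] += 1
    return full, context

def anomaly_score(seq, full, context, k, floor=1e-6):
    """Negative average log of P(S_i | S_{i-k} ... S_{i-1}) over the sequence."""
    steps = len(seq) - k
    log_p = 0.0
    for i in range(steps):
        ctx, window = seq[i:i + k], seq[i:i + k + 1]
        p = full[window] / context[ctx] if context[ctx] else 0.0
        log_p += math.log(max(p, floor))  # floor is an assumed smoothing constant
    return -log_p / steps

full, context = train_counts(["abababab"], k=1)
score_normal = anomaly_score("abab", full, context, k=1)
score_anomalous = anomaly_score("abba", full, context, k=1)
```

The transition b → b never occurs in the training data, so the second sequence receives a much higher score.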

Issue: What if the suffix occurs very infrequently in the normal data (or not at all)?
• Replace it with the longest suffix that occurs a sufficient number of times [32] (Probabilistic Suffix Trees)
• Significantly reduces the size of the model

Finding Contextual Anomalies

Sometimes contextual information is available about the data
• Not used directly as a feature
• Is well understood, with no anomalies in the context
• Can reduce false positives and yield interesting anomalies

Figure: Number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly. Example adapted from [30]

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables
• Spatial (latitude, longitude)
• Graph context (edges, weights)
• Temporal location
• Domain specific (demographic, other)

How to incorporate context?
• Reduce to traditional anomaly detection (subset on context)
• Explicitly model contextual information (time series, spatial)

Conditional Anomaly Detection

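Data instance d ⇒ (u_1, u_2, ..., u_{d_U}, v_1, v_2, ..., v_{d_V})
• d_U environmental attributes
• d_V indicator attributes

Figure: Gaussians U_1, U_2, U_3 in the environmental space (u_1, u_2) mapped to Gaussians V_1, V_2, V_3 in the indicator space (v_1, v_2); for example, p(V_1|U_3) = 0.2, p(V_2|U_3) = 0.2, p(V_3|U_3) = 0.6

Algorithm [30]
1. Learn a Gaussian Mixture Model (GMM) U = {U_1, U_2, ..., U_{n_U}}, each with dimensionality d_U
2. Learn a set of Gaussians V = {V_1, V_2, ..., V_{n_V}}, each with dimensionality d_V
3. Learn a probabilistic mapping function p(V_j | U_i)
4. Score a test instance d = [u, v]:

$$S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i)$$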
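The scoring step can be evaluated as written once the three components have been learned. This sketch skips learning entirely: the one-dimensional Gaussians and the mapping table are hypothetical values loosely modeled on the fever-versus-temperature example, and a low score is read as anomalous.

```python
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def cad_score(u, v, env, ind, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i)."""
    total = 0.0
    for i, (mean_u, std_u) in enumerate(env):
        inner = sum(
            gaussian_pdf(v, mean_v, std_v) * mapping[i][j]
            for j, (mean_v, std_v) in enumerate(ind)
        )
        total += gaussian_pdf(u, mean_u, std_u) * inner
    return total

# Hypothetical model. Environment: daily maximum temperature (cold, hot).
env = [(0.0, 5.0), (30.0, 5.0)]
# Indicator: number of fever cases (few, many).
ind = [(10.0, 3.0), (50.0, 3.0)]
# mapping[i][j] = p(V_j | U_i): many fever cases are expected mostly when it is cold.
mapping = [[0.1, 0.9], [0.9, 0.1]]

score_cold_many = cad_score(0.0, 50.0, env, ind, mapping)
score_hot_many = cad_score(30.0, 50.0, env, ind, mapping)
```

The same fever count yields a much lower score on a hot day than on a cold day, which is what makes it a conditional anomaly rather than a global outlier.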

Finding Collective Anomalies

Find a collection of data points
• Each point by itself is normal
• The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples
1. Time series
2. Image
3. Spatial clusters of galaxies

Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
• Topics: red, green, emissive
• Words: continuous features

Flexible Genre Model (FGM)

For each group m:
1. Draw a genre y_m ∈ {1, 2, ..., T}, y_m ∼ M(π)
2. Draw a topic distribution for y_m: θ_m ∈ S_K, θ_m ∼ Dir(α_{y_m})
3. Draw K topics β_{m,k} ∼ P(β_{m,k} | η_k), k = 1, 2, ..., K
4. For each point n in the group:
   1. Draw topic membership z_{mn} ∼ M(θ_m)
   2. Generate point x_{mn} ∼ P(x_{mn} | β_{m,z_{mn}})

Figure: Plate diagram of the FGM with variables π, y_m, α, θ_m, z_{mn}, x_{mn}, β, η and plates N, M, K, T

Model Parameters
• M(π): Multinomial
• Genre: Dir(α_t)
• Topic generators P(· | η_k): GIW
• Point generators P(x_{mn} | β_{m,k}): Multivariate Gaussian

Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: labels Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution θm

Compute the negative log-likelihood w.r.t. αt

Rationale: An anomalous group is unlikely to be generated from any genre

Geometric interpretation: Map each group into a T-dimensional space and find anomalies there


Evaluating Anomaly Detection Methods - Labels

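Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
  Actual a    TP            FN
  Actual n    FP            TN

$$\mathrm{Acc} = \frac{TP + TN}{\Sigma}, \qquad \mathrm{Rec}\ (R) = \frac{TP}{TP + FN}, \qquad \mathrm{Prec}\ (P) = \frac{TP}{TP + FP}, \qquad F = \frac{2 \cdot R \cdot P}{R + P}$$

where $\Sigma = TP + FN + FP + TN$.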
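
A small worked example shows why class imbalance matters for these metrics. The counts below are invented for illustration, and the function name is an assumption.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from confusion-matrix counts."""
    total = tp + fn + fp + tn
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / total,
        "recall": recall,
        "precision": precision,
        "f_measure": 2 * recall * precision / (recall + precision),
    }

# Hypothetical data set: 10 anomalies among 1000 instances.
metrics = evaluation_metrics(tp=5, fn=5, fp=10, tn=980)
```

With these counts accuracy is 0.985 even though only half of the anomalies are found (recall 0.5) and two thirds of the alarms are false (precision 1/3), which is why recall, precision and F are usually reported alongside accuracy.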


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use threshold δ on score (scale issues [21])
  Take top x as anomalies

ROC curve by varying x or δ

Quality of output
  Does the output "suggest" x or δ?
  Which output is better?
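
Varying the threshold δ to trace a ROC curve can be sketched as below. The scores and labels are made up, higher scores are assumed to mean "more anomalous", and `roc_points` is a hypothetical helper name.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by thresholding at each distinct score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing flagged
    for delta in sorted(set(scores), reverse=True):
        flagged = [s >= delta for s in scores]
        tp = sum(1 for f, l in zip(flagged, labels) if f and l == 1)
        fp = sum(1 for f, l in zip(flagged, labels) if f and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

scores = [0.9, 0.8, 0.3, 0.2, 0.1]   # anomaly scores
labels = [1, 0, 1, 0, 0]             # 1 = true anomaly
curve = roc_points(scores, labels)
```

Taking the top x instances instead of a threshold visits the same operating points when all scores are distinct.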


Unifying Scores

Different methods assign scores in different ranges

  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal and $S(o) \gg 0$ if o is anomalous

Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$

2. $R(o) = S_{\max} - S(o)$

3. $R(o) = -\log\left(\frac{S(o)}{S_{\max}}\right)$

Normalization

1. $N(o) = \frac{S(o)}{S_{\max}}$

2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)

   Suited for high dimensional data

3. $N(o) = \max\left(0,\ \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)

   where $\mathrm{cdf}^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and P is the regularized Gamma function. Suited for low dimensional data
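
Two of the transformations above can be sketched with the standard library only. The LOF-like scores are invented, the function names are assumptions, and Gamma scaling is omitted because it needs a regularized Gamma function (for example from SciPy).

```python
import math
from statistics import mean, pstdev

def linear_inversion(scores):
    """Regularization R(o) = S_max - S(o), for methods where low scores are anomalous."""
    s_max = max(scores)
    return [s_max - s for s in scores]

def gaussian_scaling(scores):
    """Normalization N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores identical: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

lof_like = [1.0, 1.1, 0.9, 1.0, 1.2, 6.0]   # one clear outlier
normalized = gaussian_scaling(lof_like)

abod_like = [80000.0, 79000.0, 78000.0, 500.0]  # low value = anomalous
regular = linear_inversion(abod_like)
```

After Gaussian scaling every below-average score collapses to 0 and the outlier moves close to 1, which is the "stretch interesting ranges, shrink irrelevant ones" behaviour.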


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given a data point $z = \{x, y\}$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = \{x', y'\} \in D$ such that the distance between y and y′ is maximum

Replace x with x′ and add z back to the data set

[Figure: points z and z′ in the x-y feature space and the resulting injected anomaly]
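
The random perturbation steps can be sketched as follows, assuming each data point is a pair of numeric tuples (x, y) and that Euclidean distance is used on the y partition; both choices, and the function name, are assumptions made for illustration.

```python
import random

def inject_anomaly(z, data, sample_size, rng):
    """Return a perturbed copy of z = (x, y) whose x comes from a far-away sample point."""
    x, y = z
    sample = rng.sample(data, sample_size)       # random sample D of the data set

    def dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

    # z' = (x', y') in D with maximum distance between y and y'
    x_far, _ = max(sample, key=lambda point: dist(point[1], y))
    return (x_far, y)                            # replace x with x'

# Toy data in which x and y are perfectly correlated.
data = [((float(i),), (float(i),)) for i in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data[0], data, sample_size=len(data), rng=rng)
```

The injected point keeps a plausible y but pairs it with an x that normally occurs with a very different y, so it breaks the relationship between the two partitions without leaving the range of either.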


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data

Define anomalies, define normal behavior

Identify requirements and constraints (online, real-time, limited resources)

What domain knowledge is available?
  Feature identification
  Defining normal and anomalous behavior
  Tuning parameters

Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts, vertical axis from 0 to 3,000,000]

  Period                   Count
  2002                     20,547
  2003                     18,827
  2004                     69,107
  2005                     113,025
  2006                     140,690
  2007                     624,267
  2008                     1,656,227
  July 2009 - June 2010    2,704,836


Network Topology

A typical network

[Figure: routers 1 to 4 with OD flows x_{1t}, x_{2t}, x_{3t}, x_{4t} and link measurement y_{1t}]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A ($y = A x$)


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach will not work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top k principal components (those with the highest eigenvalues) as the columns of matrix P

Step 3: Project onto the normal traffic subspace to obtain $\hat{Z}$:

$$\hat{Z} = P P^{T} Z = C Z$$

Step 4: Project onto the abnormal traffic subspace to obtain $\tilde{Z}$:

$$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
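
The five steps can be sketched with NumPy on synthetic link traffic. Everything about the data (one latent traffic pattern, five links, a spike injected at time bin 120) is an assumption made for illustration, as is the function name; choosing the threshold on the residual norm is left out.

```python
import numpy as np

def subspace_residual_norms(Y, k):
    """Squared norm of each row's projection onto the abnormal (residual) subspace."""
    Yc = Y - Y.mean(axis=0)                       # center each link's measurements
    cov = np.cov(Yc, rowvar=False)                # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                           # Step 2: top-k PCs as columns
    normal = Yc @ P @ P.T                         # Step 3: P P^T applied to each row
    residual = Yc - normal                        # Step 4: (I - P P^T) applied to each row
    return np.sum(residual ** 2, axis=1)          # Step 5: squared residual norm

rng = np.random.default_rng(0)
pattern = rng.normal(size=(200, 1)) @ np.array([[1.0, 2.0, 1.5, 0.5, 1.0]])
Y = pattern + 0.05 * rng.normal(size=(200, 5))    # 200 time bins, 5 links
Y[120, 3] += 3.0                                  # injected volume anomaly
norms = subspace_residual_norms(Y, k=1)
suspect = int(np.argmax(norms))
```

The spike barely changes the normal component, because it does not follow the dominant traffic pattern, but it dominates the residual, so time bin 120 has the largest residual norm.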


Subspace Analysis Results

Note that during an anomaly the normal component changes little, while the residual component changes considerably

Thus anomalies can be detected by applying a threshold to the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this approach aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure: port scan anomaly; the traffic feature changes while the traffic volume remains more or less the same. Annotations: the port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source [22]]

We can use entropy to capture the variations in the traffic feature

• Takes value 0 when the distribution is maximally concentrated
• Takes value log2 N when the distribution is maximally dispersed
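
The two extreme entropy values can be checked with a short sketch. It assumes the usual sample entropy $H = \sum_i p_i \log_2 (1/p_i)$ over the empirical distribution of a feature; the addresses and ports are made up.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Entropy (in bits) of the empirical distribution of a traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Maximally concentrated: every packet goes to the same destination.
concentrated = sample_entropy(["10.0.0.1"] * 8)

# Maximally dispersed: a port scan touching N = 8 different ports once each.
dispersed = sample_entropy([20, 21, 22, 23, 25, 53, 80, 443])
```

The first value is 0 and the second is log2 8 = 3, even though both traces contain the same number of packets, which is why volume alone cannot tell them apart.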


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Typical players by domain:

  Domain                         Players
  Insurance (auto/health) [a]    Claimant, Provider, Payer
  Telecommunications             Customer, Provider
  Credit Cards                   Customer, Supplier, Bank
  Web Advertising                User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data
   Example: User X makes n calls on average in January

2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers

2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of the node

2. $E_i$: number of edges in the egonet

3. $W_i$: total weight of the egonet

4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space
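
The first three egonet features can be computed with a few lines of plain Python. The sketch assumes an undirected graph stored as a dictionary from node pairs to weights; the function name and the toy graphs are illustrative, and the eigenvalue feature is omitted to keep the example dependency-free.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    `edges` maps an undirected pair (u, v) to its weight.
    """
    neighbors = {v if u == node else u for (u, v) in edges if node in (u, v)}
    members = neighbors | {node}
    inside = [w for (u, v), w in edges.items() if u in members and v in members]
    return len(neighbors), len(inside), sum(inside)

# A star: the center's neighbors are not connected to each other.
star = {("c", leaf): 1.0 for leaf in ("l1", "l2", "l3", "l4")}

# A clique: every pair of the four nodes is connected.
clique = {(u, v): 2.0 for i, u in enumerate("abcd") for v in "abcd"[i + 1:]}

star_center = egonet_features(star, "c")
clique_member = egonet_features(clique, "a")
```

The star center has as many egonet edges as neighbors, while the clique member has many more edges than neighbors; this contrast is what the egonet features are meant to expose.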


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures

Identify relevant feature pairs and their power law relationship

E.g., Egonet Density Power Law: $N_i$ vs $E_i$, detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \qquad 1 \le \alpha \le 2$$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
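
The score can be evaluated directly once the power law has been fitted. In this sketch the fitted constants C and θ are simply assumed (C = 1, θ = 1), and the function name is hypothetical.

```python
import math

def power_law_score(x, y, C, theta):
    """Distance of (x, y) from the fitted power law y = C * x**theta.

    Assumes y and the expected value are both positive.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# Node that follows the fitted law: 10 neighbors, 10 egonet edges.
on_fit = power_law_score(x=10, y=10, C=1.0, theta=1.0)

# Near-clique: 10 neighbors that are all connected to each other (10 + 45 edges).
near_clique = power_law_score(x=10, y=55, C=1.0, theta=1.0)
```

A node on the fitted line scores 0, and the score grows with both the ratio and the absolute gap between the observed and expected values.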


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (Beneficiaries vs Providers)
  Insider trading (Traders vs Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a

RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  Normality Score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} joined by edge set E]


Detecting Disease Outbreaks

Domain Question
  Early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department visits
  Grocery data (Example [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ flu or SARS?
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables

2. Sample from the BN ($DB_{baseline}$ | Current Environment)

3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find the p-value for each rule using a $\chi^2$-test
   Null Hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent

5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
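
Scoring one rule with a χ² test can be sketched for a 2×2 contingency table (rule matched or not, current or baseline). The counts are invented, the function name is an assumption, and the closed form for the p-value holds only for one degree of freedom.

```python
import math

def chi2_p_value_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Rows: current / baseline records. Columns: rule matches / does not match.
    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical rule 1: matched by 40 of 100 current but 100 of 1000 baseline records.
p_surprising = chi2_p_value_2x2(40, 60, 100, 900)

# Hypothetical rule 2: matched by 11 of 100 current and 100 of 1000 baseline records.
p_ordinary = chi2_p_value_2x2(11, 89, 100, 900)

most_significant = min([("rule 1", p_surprising), ("rule 2", p_ordinary)],
                       key=lambda item: item[1])
```

The rule whose frequency differs most between today's data and the sampled baseline gets the smallest p-value and is reported as the most significant.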


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series

Difference from a "base period" (too simplistic?)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena

Southern Oscillation Index (SOI)
  Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
    30 year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
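
The monthly-mean choice of base can be sketched as follows for a single location. The temperatures are invented, and the function name is an assumption.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(values, months):
    """Subtract from each observation the mean of all observations of the same month."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}   # b_i, one value per month
    return [v - base[m] for v, m in zip(values, months)]

# Two years of January and July temperatures at one location.
months = [1, 7, 1, 7]
values = [0.0, 20.0, 2.0, 24.0]
anomalies = monthly_anomalies(values, months)
```

The 20-degree seasonal swing disappears and only the year-to-year departures remain, which is what "removes seasonality" refers to.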


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time
   Matlab: bwlabel / bwlabeln

Follow-up science question: Explain cause

Figure: Video (drought_1901_2006_1.avi) courtesy Dr. Arindam Banerjee
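
The two-step approach can be sketched in plain Python on a tiny (time, row, column) grid. The anomaly values, the threshold of -1.0 and the use of 6-connectivity are assumptions for illustration; the slide's Matlab functions are not reproduced here.

```python
from collections import deque

def connected_components(mask):
    """Group True cells of a 3-D (time, row, col) grid into 6-connected components."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    for start in ((t, r, c) for t in range(T) for r in range(R) for c in range(C)):
        if start in seen or not mask[start[0]][start[1]][start[2]]:
            continue
        seen.add(start)
        queue, cells = deque([start]), []
        while queue:                              # breadth-first flood fill
            t, r, c = queue.popleft()
            cells.append((t, r, c))
            for dt, dr, dc in steps:
                nt, nr, nc = t + dt, r + dr, c + dc
                if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                        and mask[nt][nr][nc] and (nt, nr, nc) not in seen):
                    seen.add((nt, nr, nc))
                    queue.append((nt, nr, nc))
        components.append(cells)
    return components

# Precipitation anomalies for 3 time steps on a 2 x 3 grid (made-up values).
anoms = [
    [[-2.0, 0.1, 0.0], [0.0, 0.2, -1.1]],
    [[-1.5, -1.2, 0.0], [0.3, 0.0, 0.4]],
    [[0.2, -1.8, 0.0], [0.1, -3.0, 0.2]],
]
dry = [[[v < -1.0 for v in row] for row in grid] for grid in anoms]   # step 1
components = connected_components(dry)                                 # step 2
droughts = [cells for cells in components if len(cells) >= 3]
```

The isolated dry cell is discarded, while the five cells that touch across space and time form one persistent event.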


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example Src: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields

Methodology

1. Compute features for every grid cell

2. Cluster grid cells into K clusters

3. Label clusters as anomalous or not using thresholds

4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29-38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243-254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

V Chandola A Banerjee and V Kumar

Anomaly detection for discrete sequences A surveyIEEE Transactions on Knowledge and Data Engineering 99(PrePrints) 2010

V Chandola S Boriah and V Kumar

A framework for exploring categorical dataIn Proceedings of the ninth SIAM International Conference on Data Mining 2009

V Chandola V Mithal and V Kumar

A comparative evaluation of anomaly detection techniques for sequence dataIn Proceedings of International Conference on Data Mining 2008

V Chandola V Mithal and V Kumar

Understanding anomaly detection techniques for sequence dataTechnical Report 09-001 University of Minnesota Computer Science Department January 2009

T Fawcett and F Provost

Activity monitoring noticing interesting changes in behaviorIn Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and DataMining pages 53ndash62 ACM Press 1999

A Goldenberg G Shmueli R A Caruana and S E Fienberg

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication salesProceedings of the National Academy of Sciences of the United States of America 99(8)pp 5237ndash52402002

D Hawkins

Identification of outliersMonographs on Applied Probability and Statistics May 1980

S. A. Hofmeyr, S. Forrest, and A. Somayaji.

Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151-180, 1998.

X. Jiang and G. F. Cooper.

A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462-471, 2010.

K. Das and J. Schneider.

Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar.

Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

E. M. Knorr and R. T. Ng.

Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392-403. Morgan Kaufmann Publishers Inc., 1998.

H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek.

Interpreting and unifying outlier scores. In SDM, pages 13-24, 2011.

A. Lakhina, M. Crovella, and C. Diot.

Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219-230, 2004.

X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem.

Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782-787, New York, NY, USA, 2005. ACM.

M. V. Mahoney and P. K. Chan.

Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601-604, 2003.

S. V. Mane.

False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

A. Moore and M. S. Lee.

Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67-91, March 1998.

Y. Pei and O. Zaïane.

A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

S. Ramaswamy, R. Rastogi, and K. Shim.

Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427-438. ACM Press, 2000.

N. Satish, M. Harris, and M. Garland.

Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1-10, 2009.

X. Song, M. Wu, C. Jermaine, and S. Ranka.

Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631-645, May 2007.

D. A. Spielman and N. Srivastava.

Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563-568, New York, NY, USA, 2008. ACM.

P. Sun, S. Chawla, and B. Arunasalam.

Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner.

Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808-815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Poczos, and J. Schneider.

Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies
Using Probabilistic Models

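Probability of occurrence of sequence S:

\[ P(S) = \prod_{i=1}^{n} P(S_i \mid S_1 \ldots S_{i-1}) \]

Short memory property of sequences:

\[ P(S_i \mid S_1 \ldots S_{i-1}) = P(S_i \mid S_{i-k} \ldots S_{i-1}) \]

Conditional probability estimates for a symbol $S_i$, where f denotes the frequency of a subsequence:

\[ P(S_i \mid S_{i-k} \ldots S_{i-1}) = \frac{f(S_{i-k} \ldots S_i)}{f(S_{i-k} \ldots S_{i-1})} \]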

f is estimated from the normal dictionary

Anomaly score for a test sequence is the inverse of the normalized probability of occurrence

Issues: What if the suffix occurs very infrequently in the normal data (or not at all)?

Replace with the longest suffix that occurs a sufficient number of times [32] - Probabilistic Suffix Trees

Significantly reduces the size of the model
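
A minimal sketch of the scoring scheme on this slide, assuming a k-th order Markov model. The add-one smoothing, the `min_count` cutoff, and all function names are illustrative assumptions and are not taken from [32]. When a context is too rare in the normal data, the code backs off to its longest sufficiently frequent suffix, which is the idea behind probabilistic suffix trees.

```python
import math
from collections import Counter

def train_counts(normal_seqs, k):
    """Count every n-gram of length 1..k+1 in the normal sequences (the frequencies f)."""
    counts = Counter()
    for seq in normal_seqs:
        for n in range(1, k + 2):
            for i in range(len(seq) - n + 1):
                counts[tuple(seq[i:i + n])] += 1
    return counts

def anomaly_score(counts, seq, k, alphabet, min_count=1):
    """Inverse of the length-normalized (geometric mean) probability of seq."""
    total = sum(c for gram, c in counts.items() if len(gram) == 1)
    log_p = 0.0
    for i, sym in enumerate(seq):
        ctx = tuple(seq[max(0, i - k):i])
        # Back off to the longest suffix seen at least min_count times.
        while ctx and counts[ctx] < min_count:
            ctx = ctx[1:]
        denom = counts[ctx] if ctx else total
        num = counts[ctx + (sym,)]
        # Add-one smoothing (an assumption) keeps unseen symbols from giving P = 0.
        log_p += math.log((num + 1) / (denom + len(alphabet)))
    return math.exp(-log_p / len(seq))

normal = ["abababab", "babababa"]
counts = train_counts(normal, k=2)
score_normal = anomaly_score(counts, "abab", k=2, alphabet="ab")
score_odd = anomaly_score(counts, "aabb", k=2, alphabet="ab")
```

Here `score_odd` comes out larger than `score_normal`, because "aa" and "bb" never occur in the normal dictionary.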

Finding Contextual Anomalies

Sometimes contextual information is available about data
  • Not used directly as a feature
  • Well understood, no anomalies in the context
  • Can reduce false positives and yield interesting anomalies

[Figure: number of fever cases versus maximum daily temperature, marking the clearest outlier and a conditional anomaly. Example adapted from [30]]

Contextual anomalies - anomalous with respect to a context
  • Context is defined using environmental variables
  • Spatial (latitude, longitude)
  • Graph context (edges, weights)
  • Temporal location
  • Domain specific (demographic, other)

How to incorporate context
  • Reduce to traditional anomaly detection (subset on context)
  • Explicitly model contextual information (time series, spatial)

Conditional Anomaly Detection

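Data instance $d \Rightarrow [u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}]$
  • $d_U$ environmental attributes
  • $d_V$ indicator attributes

[Figure: environmental attribute space (u_1, u_2) with Gaussian components U_1, U_2, U_3, and indicator attribute space (v_1, v_2) with components V_1, V_2, V_3. Example mapping: p(V_1|U_3) = 0.2, p(V_2|U_3) = 0.2, p(V_3|U_3) = 0.6]

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j \mid U_i)$
4. Score a test instance $d = [u, v]$:

\[ S = \sum_{i=1}^{n_U} p(u \mid U_i) \sum_{j=1}^{n_V} p(v \mid V_j)\, p(V_j \mid U_i) \]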
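
The scoring step can be sketched as follows, assuming one-dimensional Gaussians whose parameters and mapping are already learned. The numbers, the temperature/fever reading of the variables, and the function names are hypothetical; learning the GMMs and the mapping as in [30] is not shown.

```python
import math

def gauss_pdf(x, mean, std):
    """Density of a 1-D Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def cad_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i); low S means anomalous."""
    total = 0.0
    for i, (mu_u, sd_u) in enumerate(U):
        inner = sum(gauss_pdf(v, mu_v, sd_v) * mapping[i][j]
                    for j, (mu_v, sd_v) in enumerate(V))
        total += gauss_pdf(u, mu_u, sd_u) * inner
    return total

# Hypothetical learned model: (mean, std) per component.
U = [(5.0, 3.0), (30.0, 3.0)]    # environmental: cold days, hot days
V = [(10.0, 2.0), (40.0, 5.0)]   # indicator: few fever cases, many fever cases
mapping = [[0.1, 0.9],           # p(V_j | cold): mostly many cases
           [0.9, 0.1]]           # p(V_j | hot): mostly few cases

typical = cad_score(30.0, 10.0, U, V, mapping)   # hot day, few cases
unusual = cad_score(30.0, 40.0, U, V, mapping)   # hot day, many cases
```

The second instance receives the lower score: 40 cases is not an outlier on its own, but it is unlikely in the hot-day context.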

Finding Collective Anomalies

Find a collection of data points
  • Each point by itself is normal
  • The collection as a whole is anomalous

Relevant when data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples
1. Time series
2. Image
3. Spatial clusters of galaxies
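
The three-step "simple solution" can be sketched for a time series as below. The window width, the choice of the within-window standard deviation as the group feature, and the z-score threshold are all illustrative assumptions.

```python
import statistics

def window_features(series, width):
    """Steps 1 and 2: split into non-overlapping windows, one feature per window."""
    windows = [series[i:i + width] for i in range(0, len(series) - width + 1, width)]
    return [statistics.pstdev(w) for w in windows]

def flag_groups(features, threshold=2.0):
    """Step 3: a plain z-score detector applied to the group features."""
    mean = statistics.fmean(features)
    std = statistics.pstdev(features)
    if std == 0:
        return []
    return [i for i, f in enumerate(features) if abs(f - mean) / std > threshold]

# An oscillating signal with one flat stretch; every value lies within [0, 1].
series = [0, 1, 0, 1] * 2 + [0.5] * 4 + [0, 1, 0, 1] * 3
features = window_features(series, width=4)
flagged = flag_groups(features)
```

No single point is unusual here, yet window 2 is flagged because the flat stretch as a whole departs from the oscillating pattern.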

Using Latent Dirichlet Allocation for Group Anomalies

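Find anomalous groups in data [34]

Example: spatial clusters of galaxies
  • Topics: red, green, emissive
  • Words: continuous features

Flexible Genre Model (FGM)

For each group m:
1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$
2. Draw a topic distribution for $y_m$: $\theta_m \in S_K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point n in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables π, α, η, y_m, θ_m, z_mn, x_mn, β and plates N, K, T, M]

Model Parameters
  • $\mathcal{M}(\pi)$ - Multinomial
  • Genre - $\mathrm{Dir}(\alpha_t)$
  • Topic generators $P(\cdot \mid \eta_k)$ - GIW
  • Point generators $P(x_{mn} \mid \beta_{m,k})$ - Multivariate Gaussian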

Inference and Testing for FGM

Inference and Learning Parameters

Approximate inference of latent variables (Gibbs Sampling)

Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection

Infer the topic distribution $\theta_m$

Compute negative log likelihood w.r.t. $\alpha_t$

Rationale: An anomalous group will be unlikely to be generated from any genre

Geometric interpretation: Mapping each group into a T-dimensional space and finding anomalies

Evaluating Anomaly Detection Methods - Labels

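Labeled validation data set exists
  • Confusion matrix
  • Traditional evaluation metrics
  • Class imbalance
  • ROC Curve

Validation set does not exist
  • Use domain expertise to find TP and FP
  • FN is harder to estimate
  • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

             Predicted a   Predicted n
Actual a     TP            FN
Actual n     FP            TN

\[ \mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN} \]

\[ \mathrm{Rec}\ (R) = \frac{TP}{TP + FN} \]

\[ \mathrm{Prec}\ (P) = \frac{TP}{TP + FP} \]

\[ F = \frac{2 \cdot R \cdot P}{R + P} \]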
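
A small worked example of these metrics, with made-up counts, illustrates why class imbalance matters: a detector that labels everything normal still reaches high accuracy. Returning 0.0 for an undefined ratio is a convention chosen here, not something the slides specify.

```python
def evaluate(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from confusion-matrix counts."""
    def ratio(num, den):
        return num / den if den else 0.0   # 0.0 when the ratio is undefined
    rec = ratio(tp, tp + fn)
    prec = ratio(tp, tp + fp)
    return {
        "acc": ratio(tp + tn, tp + fn + fp + tn),
        "rec": rec,
        "prec": prec,
        "f": ratio(2 * rec * prec, rec + prec),
    }

# Hypothetical validation set: 10 anomalies among 100 instances.
detector = evaluate(tp=8, fn=2, fp=4, tn=86)
all_normal = evaluate(tp=0, fn=10, fp=0, tn=90)   # trivial "never alarm" baseline
```

The baseline scores 0.90 accuracy with zero recall, which is why recall, precision, and F are the more informative metrics for anomaly detection.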

Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  • Use threshold δ on score (scale issues [21])
  • Take top x as anomalies

ROC curve by varying x or δ

Quality of output
  • Does the output "suggest" x or δ?
  • Which output is better?
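
As a sketch of how an ROC curve is traced by sweeping the threshold δ, assuming higher scores mean "more anomalous" and labels use 1 for anomalies (the data below are made up):

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by lowering the threshold delta over all scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
```

Taking the top x instances instead of thresholding visits the same points when scores are distinct, since each cutoff rank corresponds to one value of δ.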

Unifying Scores

Different methods assign scores in different ranges
  • kNN-based scores [0, 1]
  • LOF scores [1, 6]
  • ABOD scores [0, 80000]
    • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  • Regularity $\Rightarrow S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
  • Normality: S is regular and $S(o) \in [0, 1]\ \forall o$

Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization

1. $N(o) = S(o) / S_{max}$
2. Gaussian Scaling, suited for high dimensional data:

\[ N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right) \]

3. Gamma Scaling, suited for low dimensional data:

\[ N(o) = \max\left(0, \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right) \]

Where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and P is the regularized Gamma function
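
To make the difference between linear and Gaussian scaling concrete, here is a minimal sketch on made-up LOF-like scores. It assumes the scores are already regular and not all identical, and it estimates μ_S and σ_S from the scores themselves.

```python
import math
import statistics

def linear_scaling(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)   # assumes the scores are not all equal
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

lof_like = [1.0, 1.1, 0.9, 1.0, 5.0]   # hypothetical scores, one clear outlier
linear = linear_scaling(lof_like)
gaussian = gaussian_scaling(lof_like)
```

With linear scaling the normal points land near 0.2, while Gaussian scaling maps them to 0 and the outlier close to 1, which is the "stretch interesting ranges" behaviour the previous slide asks for.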

Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
  • Use generative models for normal and anomalous behavior
  • Several generators available
    • Multivariate continuous data [27]
    • Multivariate categorical data [5]
    • Discrete sequences using HMM [12]
  • Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
  • Given a data point $z = [x, y]$, where x and y are partitions of the feature space
  • Take a random sample D of the entire data set
  • Let $z' = [x', y'] \in D$ such that the distance between y and y' is maximum
  • Replace x with x' and add z back to the data set

[Figure: points z and z' in the (x, y) plane and the resulting injected anomaly]
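
The perturbation procedure can be sketched as below, assuming scalar x and y and absolute difference as the distance (both simplifications; the function name is illustrative).

```python
import random

def inject_anomaly(data, index, sample_size, rng):
    """Perturb data[index] = (x, y) by borrowing x' from the sampled point
    whose y' is farthest from y, and return the injected point (x', y)."""
    x, y = data[index]
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: abs(point[1] - y))
    return (x_far, y)

# Toy data set in which y = 2x holds for every normal point.
data = [(x, 2 * x) for x in range(10)]
injected = inject_anomaly(data, index=0, sample_size=len(data), rng=random.Random(0))
```

Each attribute value of the injected point occurs in real data, but the combination breaks the relationship between x and y, so the point is anomalous only in context.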

Applications Overview

How to set up an anomaly detection solution for a given application domain?
  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
    • Feature identification
    • Defining normal and anomalous behavior
    • Tuning parameters
  • Available ground truth (training, validation)

What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS) find attacks that correspond to predefined patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS) detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)

Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]

Network Topology

A typical network

[Figure: a router connecting nodes 1-4, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

Link measurement is the traffic that enters at a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
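
A small worked example of that relationship, under the common formulation y = A x in which A[i][j] is 1 when OD flow j traverses link i. The three-link topology and the traffic volumes are made up for illustration.

```python
def link_loads(A, x):
    """y = A x: traffic on each link is the sum of the OD flows routed over it."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# Hypothetical routing matrix: rows are links, columns are OD flows.
A = [
    [1, 1, 0],   # link 0 carries flows 0 and 1
    [0, 1, 1],   # link 1 carries flows 1 and 2
    [1, 0, 1],   # link 2 carries flows 0 and 2
]
x_normal = [10, 20, 5]
x_spike = [10, 60, 5]          # volume anomaly on OD flow 1 only

y_normal = link_loads(A, x_normal)
y_spike = link_loads(A, x_spike)
```

A spike on a single OD flow shows up on every link that flow traverses (links 0 and 1 here), mixed with the other flows sharing those links.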

Network Anomalies Detection Problem

Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source [22]

PCA and Subspace Method

Subspace Method Algorithm

Step 1 - Determine the PCs based on eigenvalue decomposition of the covariance matrix of the data set

Step 2 - Choose the top k principal components with the highest eigenvalues as matrix P

Step 3 - Project onto the normal traffic subspace, giving $\hat{Z}$:

\[ \hat{Z} = P P^T Z = C Z \]

Step 4 - Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

\[ \tilde{Z} = (I - P P^T) Z = \tilde{C} Z \]

Step 5 - If the norm of the residual vector is large, then it is an "anomaly"
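
The five steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the implementation from [22]: rows of Z are time steps, columns are links, and the data are synthetic.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Squared norm of each row's projection onto the abnormal subspace."""
    Zc = Z - Z.mean(axis=0)                          # center each column
    cov = np.cov(Zc, rowvar=False)                   # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # Step 2: top-k PCs as columns
    normal_part = Zc @ P @ P.T                       # Step 3: rows projected by P P^T
    residual = Zc - normal_part                      # Step 4: (I - P P^T) applied per row
    return np.sum(residual ** 2, axis=1)             # Step 5: squared residual norms

# Synthetic traffic: three links that normally move together along one direction.
t = np.arange(50.0)
Z = np.outer(t, [1.0, 2.0, 3.0])
Z[25] += [5.0, -5.0, 0.0]          # injected anomaly off the normal direction

norms = subspace_residual_norms(Z, k=1)
suspect = int(np.argmax(norms))
```

Thresholding `norms` then flags time steps; choosing the threshold (for example via the Q-statistic) is outside this sketch.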

Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source [22]

Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack - multiple source IP addresses concentrated on a single destination IP address
  » Network scan - dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume

Limitation of Volume

Figure source [22]

[Figure: port scan anomaly. The traffic feature changes while the traffic volume remains more or less the same: the port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure]

We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed
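
A minimal sketch of the sample entropy of one traffic feature within a time bin, using the standard Shannon formula. The port numbers are hypothetical.

```python
import math
from collections import Counter

def feature_entropy(values):
    """H = sum_i p_i * log2(1 / p_i) over the empirical distribution of values."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Destination ports seen in one time bin.
concentrated = [80] * 8              # all packets to one port
dispersed = list(range(1000, 1008))  # port-scan-like: 8 packets, 8 distinct ports

h_low = feature_entropy(concentrated)
h_high = feature_entropy(dispersed)
```

Both bins contain the same number of packets, so a volume metric cannot separate them, whereas the entropy moves from 0 to log2 8 = 3.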

Entropy Based versus Volume Based

DoS/DDoS attacks - a spike in traffic toward a dominant destination IP

Scan anomaly - a spike in traffic from a dominant source IP

Flash crowd anomaly - again a spike in traffic to a dominant destination IP

Worm anomaly - a spike in traffic with a dominant port

Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

  • Insurance (auto/health) [a]: Claimant, Provider, Payer
  • Telecommunications: Customer, Provider
  • Credit Cards: Customer, Supplier, Bank
  • Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud

A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
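
A minimal sketch of activity monitoring, assuming each profile is just the mean and standard deviation of a user's historical monthly call counts and that a "significant deviation" means more than three standard deviations. Both choices, and the data, are illustrative.

```python
import statistics

def build_profiles(history):
    """Step 1: one (mean, std) profile per user from historical counts."""
    return {user: (statistics.fmean(c), statistics.stdev(c))
            for user, c in history.items()}

def deviates(profile, current, n_sigmas=3.0):
    """Step 2: flag current behavior that is far from the historical profile."""
    mean, std = profile
    return abs(current - mean) > n_sigmas * max(std, 1e-9)

history = {
    "user_x": [10, 12, 11, 9, 10],       # calls per month
    "user_y": [100, 90, 110, 95, 105],
}
profiles = build_profiles(history)
alert_x = deviates(profiles["user_x"], 40)    # sudden jump for a light user
alert_y = deviates(profiles["user_y"], 104)   # ordinary month for a heavy user
```

The check itself is a constant-time comparison, which is the "fast" strength listed above; the cost sits in keeping one profile per individual up to date.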

Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
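
The first three egonet features can be computed as in this sketch, which assumes an undirected graph stored as a nested dictionary (an illustrative representation). The principal eigenvalue is left out to keep the example free of dependencies.

```python
def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.
    adj maps each node to {neighbor: weight} and must be symmetric."""
    members = set(adj[node]) | {node}
    seen = set()
    n_edges, total_weight = 0, 0.0
    for u in members:
        for v, w in adj[u].items():
            # Count each undirected edge inside the egonet once.
            if v in members and (v, u) not in seen:
                seen.add((u, v))
                n_edges += 1
                total_weight += w
    return len(adj[node]), n_edges, total_weight

star = {"c": {"a": 1, "b": 1, "d": 1},
        "a": {"c": 1}, "b": {"c": 1}, "d": {"c": 1}}
clique = {u: {v: 1 for v in "abcd" if v != u} for u in "abcd"}

star_features = egonet_features(star, "c")
clique_features = egonet_features(clique, "a")
```

Both centers have degree 3, but the star's egonet has 3 edges and the clique's has 6, which is the contrast the egonet density power law exploits.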

Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and power law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$ - detects near-cliques and stars

\[ E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2 \]

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$):

\[ S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right) \]
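
The score can be evaluated directly once C and θ have been fitted. In this sketch they are simply given, the natural logarithm is assumed, and the feature values are made up.

```python
import math

def power_law_score(y, x, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical fit E = 1.0 * N^1.5 for the egonet density power law.
C, theta = 1.0, 1.5
on_line = power_law_score(y=27.0, x=9.0, C=C, theta=theta)    # E matches the fit
star_like = power_law_score(y=9.0, x=9.0, C=C, theta=theta)   # too few edges
clique_like = power_law_score(y=36.0, x=9.0, C=C, theta=theta)  # too many edges
```

A node on the fitted line scores 0, and the score grows with both the ratio and the absolute gap between observed and expected values.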

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  • RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets V_1 = {a_1, a_2, a_3, ..., a_n} and V_2 = {t_1, t_2, t_3, ..., t_n} joined by edges E]
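
One way to sketch this procedure is a random walk with restart solved by power iteration. The restart probability, the iteration count, and the use of the relevance score itself as the similarity entry are assumptions made for illustration; every node is assumed to have at least one edge.

```python
def relevance(edges, left, right, query, restart=0.15, iters=100):
    """Stationary visit probabilities of the left-side nodes for a walk
    that restarts at `query` with probability `restart` at every step."""
    nodes = left + right
    nbrs = {n: [] for n in nodes}
    for a, t in edges:
        nbrs[a].append(t)
        nbrs[t].append(a)
    p = {n: 0.0 for n in nodes}
    p[query] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n, mass in p.items():
            for m in nbrs[n]:
                nxt[m] += (1 - restart) * mass / len(nbrs[n])
        nxt[query] += restart
        p = nxt
    return {a: p[a] for a in left}

def normality(edges, left, right, t):
    """Mean pairwise relevance among the left-side neighbors of t."""
    s_t = [a for a, u in edges if u == t]
    pairs = [(a, b) for a in s_t for b in s_t if a != b]
    scores = {a: relevance(edges, left, right, a) for a in s_t}
    return sum(scores[a][b] for a, b in pairs) / len(pairs)

left = ["a1", "a2", "a3", "a4"]
right = ["t1", "t2", "t3", "t4", "t5"]
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"), ("a2", "t2"),
         ("a3", "t3"), ("a3", "t4"), ("a4", "t3"), ("a4", "t4"),
         ("a1", "t5"), ("a3", "t5")]      # t5 bridges two communities

normal_t = normality(edges, left, right, "t1")
bridge_t = normality(edges, left, right, "t5")
```

Node t5 links members of two otherwise separate communities, so its neighbors are weakly relevant to each other and its normality score is lower.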

Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data: ED cases involving cough ⇒ flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations

What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$-test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one). Repeat from Step 2
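The rule-scoring step can be sketched as follows. This is a minimal illustration, not the WSARE implementation: the function names `chi2_pvalue_2x2` and `score_rule` are hypothetical, records are assumed to be plain dictionaries, and the test is Pearson's chi-square on a 2x2 table (rule matches or not, current or baseline) without continuity correction.

```python
import math

def chi2_pvalue_2x2(a, b, c, d):
    """p-value of Pearson's chi-square test of independence for the
    2x2 contingency table [[a, b], [c, d]] (1 degree of freedom)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the survival function is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

def score_rule(current, baseline, rule):
    """Score one rule (a predicate on a record) by comparing how often it
    matches in the current records versus the baseline records."""
    a = sum(1 for r in current if rule(r))
    b = len(current) - a
    c = sum(1 for r in baseline if rule(r))
    d = len(baseline) - c
    return chi2_pvalue_2x2(a, b, c, d)
```

A small p-value means the rule matches the current data at a rate that is hard to explain under the baseline, which is why the most significant rule is the one reported.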


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
What happened yesterday?
What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
Add nodes for temporal and spatial dependencies
Needs more data to train


Anomaly Detection in Climate and Weather

Science questions
Identify natural and anthropogenic disasters
Identify long-time-scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
Ground observations
Remote sensing data (satellite, airborne)
Climate model simulation outputs
Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
Model spatio-temporal relationships across multiple variables
Explain the cause of anomalies
Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
Difference from a "base period" (too simplistic)
Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
Removes seasonality

Used to understand climate and weather phenomena
Southern Oscillation Index (SOI): difference between the sea level pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference) value

How to choose $b_i$?
Mean of all data for location $i$
Monthly mean values (account for seasonality)
Monthly z-score values
Median (more robust)
Using a shorter "reference period": 30-year moving window

Different methods show statistically significant differences [19]
What is the right strategy?
Weighted mean of different strategies (pick weights using Monte Carlo sampling)
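As one concrete instance of $v'_i = v_i - b_i$, the sketch below uses the monthly mean as the base $b_i$, so that the seasonal cycle is removed. The function name `monthly_anomalies` and the input layout (parallel lists of values and month numbers for a single location) are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(values, months):
    """Return v' = v - b, where b is the mean of all observations that
    fall in the same calendar month (one location, parallel lists)."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]
```

Swapping `mean` for `statistics.median` gives the more robust median base mentioned above.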


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatio-temporal anomalies in precipitation data

A two-step approach
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   Matlab: bwlabel, bwlabeln

Follow-up science question: explain the cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy of Dr. Arindam Banerjee
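The two-step approach can be sketched in plain Python. This is an illustrative stand-in for the Matlab labeling functions, not the code behind the video: the names `connected_components` and `find_droughts`, the `[time][row][col]` layout, the 6-neighbor connectivity, and the `min_size` filter are all assumptions.

```python
from collections import deque

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def connected_components(mask):
    """Group True cells of mask[t][i][j] into 6-connected components
    (adjacent in time, row, or column) using breadth-first search."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    seen, components = set(), []
    for t in range(T):
        for i in range(R):
            for j in range(C):
                if not mask[t][i][j] or (t, i, j) in seen:
                    continue
                seen.add((t, i, j))
                queue, comp = deque([(t, i, j)]), []
                while queue:
                    ct, ci, cj = queue.popleft()
                    comp.append((ct, ci, cj))
                    for dt, di, dj in NEIGHBORS:
                        nt, ni, nj = ct + dt, ci + di, cj + dj
                        inside = 0 <= nt < T and 0 <= ni < R and 0 <= nj < C
                        if inside and mask[nt][ni][nj] and (nt, ni, nj) not in seen:
                            seen.add((nt, ni, nj))
                            queue.append((nt, ni, nj))
                components.append(comp)
    return components

def find_droughts(precip_anomaly, threshold, min_size):
    """Step 1: flag cells whose precipitation anomaly is below threshold.
    Step 2: keep connected space-time components with at least min_size cells."""
    mask = [[[v < threshold for v in row] for row in grid] for grid in precip_anomaly]
    return [c for c in connected_components(mask) if len(c) >= min_size]
```

The size filter is what turns isolated dry cells into "persistent" events: only regions that stay connected across neighboring cells and consecutive time steps survive.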


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
Water vapor content
Anomalies using a threshold
Connected components
Example source: http://newscenter.lbl.gov

Cold Fronts [23]
Surface winds and potential temperature fields

Methodology
1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
Anomalies are algorithmically correct but not relevant (bad data, noise, too simplistic)
Anomalies are not actionable
  Not identified in a timely fashion
  Resolution is not fine enough
  Cause not explained
Anomalies are lost among false positives

Solution
Good validation data during design
Clear definition of a domain anomaly and its distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Finding Contextual Anomalies

Sometimes contextual information is available about the data
Not used directly as a feature
Is well understood: no anomalies in the context
Can reduce false positives and yield interesting anomalies

Figure (example adapted from [30]): number of fever cases versus maximum daily temperature, with the clearest outlier and a conditional anomaly marked

Contextual anomalies: anomalous with respect to a context

Context is defined using environmental variables
Spatial (latitude, longitude)
Graph context (edges, weights)
Temporal location
Domain specific (demographic, other)

How to incorporate context?
Reduce to traditional anomaly detection (subset on context)
Explicitly model contextual information (time series, spatial)


Conditional Anomaly Detection

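Data instance $d \Rightarrow (u_1, u_2, \dots, u_{d_U}, v_1, v_2, \dots, v_{d_V})$
$d_U$ environmental attributes
$d_V$ indicator attributes

Figure: environmental space ($u_1$, $u_2$) with Gaussian components $U_1, U_2, U_3$ mapped to indicator space ($v_1$, $v_2$) with components $V_1, V_2, V_3$; example mapping $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$

Algorithm [30]
1. Learn a Gaussian mixture model (GMM) $U = \{U_1, U_2, \dots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \dots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j|U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u|U_i) \sum_{j=1}^{n_V} p(v|V_j)\, p(V_j|U_i)$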
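The scoring step can be written out directly once the Gaussians and the mapping are given. The sketch below assumes one-dimensional environmental and indicator attributes and already-learned parameters, which is a simplification of the multivariate setting; `gauss_pdf` and `conditional_score` are illustrative names, and the learning steps are not shown.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a univariate Gaussian N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).
    U and V are lists of (mu, sigma); mapping[i][j] holds p(V_j|U_i)."""
    total = 0.0
    for i, (mu_u, sigma_u) in enumerate(U):
        inner = sum(
            gauss_pdf(v, mu_v, sigma_v) * mapping[i][j]
            for j, (mu_v, sigma_v) in enumerate(V)
        )
        total += gauss_pdf(u, mu_u, sigma_u) * inner
    return total
```

A low score indicates that the indicator value is unlikely given the environment, even if the same indicator value would be unremarkable in another context.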


Finding Collective Anomalies

Find a collection of data points

Each point by itself is normal

The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution
1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples
1. Time series
2. Image
3. Spatial clusters of galaxies
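The simple three-step solution can be sketched for the time series case. The choices here are assumptions for illustration only: non-overlapping fixed-width windows as groups, (mean, standard deviation) as features, and a z-score cutoff as the "traditional" detector; the function names are hypothetical.

```python
from statistics import mean, pstdev

def window_features(series, width):
    """Steps 1 and 2: split the series into non-overlapping windows and
    describe each window by its (mean, standard deviation)."""
    groups = [series[i:i + width] for i in range(0, len(series) - width + 1, width)]
    return [(mean(g), pstdev(g)) for g in groups]

def zscore_outliers(features, k, cutoff):
    """Step 3: flag groups whose k-th feature lies more than `cutoff`
    standard deviations from the mean of that feature over all groups."""
    column = [f[k] for f in features]
    mu, sd = mean(column), pstdev(column)
    if sd == 0:
        return []
    return [i for i, x in enumerate(column) if abs(x - mu) / sd > cutoff]
```

For example, in a signal that normally oscillates between 0 and 1, a flat stretch at 0.5 contains no unusual individual value, yet its window has zero variance and is flagged as a group.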


Using Latent Dirichlet Allocation for Group Anomalies

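Find anomalous groups in data [34]

Example: spatial clusters of galaxies
topics: red, green, emissive
words: continuous features

Flexible Genre Model (FGM)

For each group $m$:
1. Draw a genre $y_m \in \{1, 2, \dots, T\}$, $y_m \sim \mathcal{M}(\pi)$
2. Draw the topic distribution for genre $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k}|\eta_k)$, $k = 1, 2, \dots, K$
4. For each point $n$ in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn}|\beta_{m,z_{mn}})$

Figure: plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates $N$, $M$, $K$, $T$

Model parameters
$\mathcal{M}(\pi)$: multinomial
Genre: $\mathrm{Dir}(\alpha_t)$
Topic generators $P(\cdot|\eta_k)$: GIW (Gaussian-inverse-Wishart)
Point generators $P(x_{mn}|\beta_{m,k})$: multivariate Gaussian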


Inference and Testing for FGM

Inference and Learning Parameters
Approximate inference of latent variables (Gibbs sampling)
Use samples to learn parameters (single-step Monte Carlo EM)

Figure labels: Topic 1, Topic 2, Topic 3

Anomaly Detection
Infer the topic distribution $\theta_m$
Compute the negative log-likelihood w.r.t. $\alpha_t$
Rationale: an anomalous group is unlikely to be generated from any genre

Geometric interpretation: map each group into a $T$-dimensional space and find anomalies there


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
Confusion matrix
Traditional evaluation metrics
Class imbalance
ROC curve

Validation set does not exist
Use domain expertise to find TP and FP
FN is harder to estimate
Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

            Predicted a   Predicted n
Actual a    TP            FN
Actual n    FP            TN

$\mathrm{Acc} = \frac{TP + TN}{\sum}$, where $\sum = TP + FN + FP + TN$

$\mathrm{Rec}\,(R) = \frac{TP}{TP + FN}$

$\mathrm{Prec}\,(P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
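The four metrics follow directly from the confusion matrix counts. The helper below is a small sketch (the name `metrics` and the zero-division guards are additions for illustration); the numbers in the usage note are made up to show the class imbalance effect.

```python
def metrics(tp, fn, fp, tn):
    """Return (accuracy, recall, precision, F-measure) from confusion
    matrix counts, treating the anomalous class as positive."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f
```

With 10 true anomalies among 1000 instances, `metrics(8, 2, 8, 982)` gives an accuracy of 0.99 but a precision of only 0.5, which is why accuracy alone is a poor guide under class imbalance.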


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
Use a threshold $\delta$ on the score (scale issues [21])
Take the top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output
Does the output "suggest" $x$ or $\delta$?
Which output is better?
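Sweeping the threshold $\delta$ over the observed scores traces out the ROC curve. The sketch below assumes that higher scores mean "more anomalous" and that both classes are present in the labels; `roc_points` and `auc` are illustrative names, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs obtained by
    lowering the threshold one distinct score at a time.
    labels[i] is True for an anomaly; both classes must be present."""
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points, tp, fp, i = [(0.0, 0.0)], 0, 0, 0
    while i < len(ranked):
        j = i
        # Instances with tied scores cross the threshold together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp / neg, tp / pos))
        i = j
    return points

def auc(points):
    """Area under a piecewise-linear curve given as sorted (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Comparing two detectors by the area under this curve avoids committing to a single value of $x$ or $\delta$.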


Unifying Scores

Different methods assign scores in different ranges
kNN-based scores: [0, 1]
LOF scores: [1, 6]
ABOD scores: [0, 80000]
  Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal instances and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized procedure for normalizing outlier scores [21]
Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization
1. $R(o) = \max\{0, S(o) - base_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization
1. $N(o) = S(o) / S_{max}$
2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
   Suited for high-dimensional data
3. $N(o) = \max\left(0, \frac{cdf_S^{\gamma}(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)
   where $cdf_S^{\gamma}(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   Suited for low-dimensional data
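Gaussian scaling is straightforward to try on a list of regularized scores. The sketch below estimates $\mu_S$ and $\sigma_S$ from the scores themselves using the population standard deviation, which is an assumption about the estimator; `gaussian_scaling` is an illustrative name.

```python
import math
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2)))),
    with mu_S and sigma_S estimated from the given scores."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0] * len(scores)  # all scores identical: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

Scores at or below the mean map to 0, and only scores well above the mean approach 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior described earlier.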


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
Use generative models for normal and anomalous behavior
Several generators available
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]
Drawback: might not capture the domain characteristics

Injecting Anomalies: Random Perturbation [30]
Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
Take a random sample $D$ of the entire data set
Let $z' = (x', y') \in D$ be such that the distance between $y$ and $y'$ is maximum
Replace $x$ with $x'$ and add $z$ back to the data set

Figure: points $z$ and $z'$ in the $x$-$y$ plane, with the injected anomaly marked
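The random perturbation procedure can be sketched as below. It assumes each data point is a pair `(x, y)` of numeric tuples and uses Euclidean distance on the `y` part; the function name, the `sample_size` parameter, and the explicit random generator argument are illustrative choices. `math.dist` requires Python 3.8 or later.

```python
import math
import random

def inject_anomaly(z, data, sample_size, rng):
    """Build an injected anomaly from z = (x, y): draw a random sample D,
    pick z' = (x', y') in D whose y' is farthest from y, and pair x' with y."""
    x, y = z
    sample = rng.sample(data, min(sample_size, len(data)))
    x_prime, _ = max(sample, key=lambda point: math.dist(y, point[1]))
    return (x_prime, y)
```

A call such as `inject_anomaly(z, data, 50, random.Random(0))` returns a point whose two parts are each realistic on their own but mismatched with one another, which is what makes it a useful labeled anomaly for validation.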


Applications Overview

How to set up an anomaly detection solution for a given application domain?

Available data
Define anomalies, define normal behavior
Identify requirements and constraints (online, real-time, limited resources)
What domain knowledge is available?
  Feature identification
  Defining normal and anomalous behavior
  Tuning parameters
Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network anomaly detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: bar chart of yearly counts (y-axis from 0 to 3,000,000)

Period                 Count
2002                   20,547
2003                   18,827
2004                   69,107
2005                   113,025
2006                   140,690
2007                   624,267
2008                   1,656,227
July 2009-June 2010    2,704,836


Network Topology

A typical network

Figure: a router with link measurement $y_{1t}$ and four OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ between nodes 1 to 4

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

A link measurement is the traffic that enters at a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
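A tiny numeric sketch of the relationship between link traffic and OD flow traffic, written as $y = A x$. The routing matrix and flow volumes below are made up for illustration: entry `A[l][f]` is 1 when OD flow `f` traverses link `l`, and `link_loads` is a hypothetical helper name.

```python
def link_loads(A, x):
    """Compute y = A x: the traffic on each link is the sum of the
    OD flows routed over that link."""
    return [sum(a * flow for a, flow in zip(row, x)) for row in A]

# Hypothetical network: 3 links (rows) carrying 4 OD flows (columns).
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
normal = link_loads(A, [10, 20, 30, 40])
# A surge of 5 units on the second OD flow shows up on every link it crosses.
surge = link_loads(A, [10, 25, 30, 40])
```

Here a single OD flow anomaly is spread over two links and mixed with other flows on each, which hints at why it can be hard to see from link measurements alone.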


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus, a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace to obtain $\hat{Z}$:

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace to obtain $\tilde{Z}$:

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
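The five steps can be sketched in plain Python for the special case $k = 1$, where the single leading principal component is found by power iteration instead of a full eigenvalue decomposition. This simplification, the function names, and the fixed iteration count are assumptions for illustration; a real analysis would use a linear algebra library and a larger $k$.

```python
import math

def center(rows):
    """Subtract the column means from every measurement vector."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]

def covariance(X):
    """Sample covariance matrix of centered data X (rows are observations)."""
    n, m = len(X), len(X[0])
    return [[sum(r[i] * r[j] for r in X) / (n - 1) for j in range(m)] for i in range(m)]

def top_component(C, iters=200):
    """Leading eigenvector of C by power iteration, started from a uniform
    vector (assumes the start is not orthogonal to the leading eigenvector)."""
    m = len(C)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(C[i][j] * v[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def residual_norms(rows):
    """Norm of (I - p p^T) z for every centered measurement vector z."""
    X = center(rows)
    p = top_component(covariance(X))
    norms = []
    for z in X:
        proj = sum(a * b for a, b in zip(z, p))
        residual = [a - proj * b for a, b in zip(z, p)]
        norms.append(math.sqrt(sum(r * r for r in residual)))
    return norms
```

Measurements that follow the dominant traffic pattern have small residual norms, so thresholding `residual_norms(rows)` corresponds to Step 5.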


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in the distributional aspects of packet header fields (called features).

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

Figure source: [22]. Port scan anomaly (the traffic feature distribution changes; however, traffic volume remains more or less the same). Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure.

We can use entropy to capture the variations in the traffic feature:

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed
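The two boundary values of the entropy can be checked with a small sketch. It assumes the Shannon entropy of the empirical distribution of one traffic feature (for example, destination ports seen in a time bin); the function name `feature_entropy` is hypothetical.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Ten packets to a single port give entropy 0, while packets spread evenly over N distinct ports give $\log_2 N$, which is why a port scan stands out in destination-port entropy even when volume is unchanged.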


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP

Scan anomaly: a spike in traffic data from a dominant source IP

Flash crowd anomaly: again, a spike in traffic data to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data:
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges:
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domains and players:
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses:
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
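The activity monitoring idea can be illustrated with a minimal profile comparison. This sketch assumes a profile is just the mean and standard deviation of a historical count (for example, calls made each January) and that deviation is measured in standard deviations; `build_profile` and `deviation_score` are hypothetical names, not part of [13].

```python
import statistics

def build_profile(history):
    """Summarize historical counts as (mean, sample standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)

def deviation_score(current, profile):
    """Number of standard deviations between current behavior and the profile."""
    mean, std = profile
    if std == 0:
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / std
```

Scoring is a constant-time lookup per individual, which matches the "fast" strength listed above, while the need to store one profile per individual matches the first weakness.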


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional feature space
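As a sketch of the feature extraction, the first three egonet features can be computed directly from an edge list. The representation (undirected edges as `(u, v, weight)` tuples with no self-loops or duplicates) and the name `egonet_features` are assumptions made for illustration; the eigenvalue feature is omitted to keep the example short.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for `node` given undirected (u, v, weight) edges."""
    neighbors = set()
    for u, v, _ in edges:
        if u == node:
            neighbors.add(v)
        elif v == node:
            neighbors.add(u)
    ego = neighbors | {node}
    # Edges of the induced sub-graph: both endpoints lie in the egonet.
    inside = [(u, v, w) for u, v, w in edges if u in ego and v in ego]
    return len(neighbors), len(inside), sum(w for _, _, w in inside)
```

For the center of a star the egonet has as many edges as neighbors, whereas inside a clique the edge count grows quadratically with the degree.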


Identifying Anomalies

Traditional anomaly detection (LOF):
• Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:

• Identify relevant feature pairs and their power law relationship

• E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
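The score can be evaluated once $C$ and $\theta$ are known. The sketch below assumes they are fitted by ordinary least squares in log-log space and that the logarithm in the score is the natural log; both choices, and the names `fit_power_law` and `power_law_score`, are illustrative assumptions. All feature values are assumed to be strictly positive.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return math.exp(my - theta * mx), theta

def power_law_score(x, y, C, theta):
    """Ratio of observed to expected value, weighted by the log of their gap."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)
```

A node lying exactly on the fitted line scores 0; the score grows both with the ratio and with the absolute distance from the line.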


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$:

RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$:

• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
• Normality Score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$.
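One possible reading of this procedure is sketched below. It makes several assumptions that the slide does not state: relevance is the stationary visiting probability of a random walk with restart (restart probability 0.15), the similarity entry for a pair (a, b) is simply the relevance of b with respect to a, and the node t has at least two neighbors. The names `relevance` and `normality` are hypothetical.

```python
def relevance(adj, a, restart=0.15, iters=100):
    """Visiting probabilities of a random walk from `a` that jumps back to
    `a` with probability `restart`. `adj` maps node -> list of neighbors."""
    p = {n: 0.0 for n in adj}
    p[a] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in p.items():
            for m in adj[n]:
                nxt[m] += (1 - restart) * mass / len(adj[n])
        nxt[a] += restart
        p = nxt
    return p

def normality(adj, t):
    """Mean off-diagonal relevance among the neighbors S_t of node t."""
    s_t = adj[t]
    rel = {a: relevance(adj, a) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs)
```

Under these assumptions, a node in $V_2$ whose neighbors come from unrelated parts of the graph receives a lower normality score than one whose neighbors share many other links.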


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
• Anthrax attack

Data:
• Emergency department visits
• Grocery data (Example: [14])
• Clinical visits
• Weather/climate data

Challenges:
• Weak signals in the data
  • ED cases involving cough ⇒ Flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | Current Environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for rules using the $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
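Step 4 can be illustrated for a single rule. The sketch assumes a 2x2 contingency table (records matching or not matching the rule, in the current and baseline data), a Pearson $\chi^2$ statistic without continuity correction, and non-zero expected counts; with one degree of freedom the p-value has the closed form $\mathrm{erfc}(\sqrt{\chi^2/2})$. The name `chi2_p_value` is hypothetical.

```python
import math

def chi2_p_value(table):
    """Pearson chi-squared p-value for a 2x2 table [[a, b], [c, d]].
    Row 0: current data, row 1: baseline data.
    Column 0: records matching the rule, column 1: records not matching."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = rows[i] * cols[j] / n
            chi2 += (obs - expected) ** 2 / expected
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))
```

A rule that matches 40 of 100 current records but only 10 of 100 baseline records gets a very small p-value, while identical proportions give a p-value of 1.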


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions:
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges:
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena:
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference).

How to choose $b_i$?
• Mean of all data for location i
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
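As an illustration of one choice of base, the sketch below uses monthly mean values. It assumes a single location, a series of monthly observations that starts in January and covers at least one full year, and a hypothetical function name `monthly_anomalies`.

```python
def monthly_anomalies(values):
    """Subtract from each observation the mean of its calendar month."""
    base = []
    for month in range(12):
        month_values = values[month::12]  # every observation of this month
        base.append(sum(month_values) / len(month_values))
    return [v - base[t % 12] for t, v in enumerate(values)]
```

Because each calendar month has its own base, the seasonal cycle is removed and only the departure from the usual value for that month remains.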


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time
   • Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
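The two-step approach can be sketched without any image-processing library. The code below is an assumption-laden illustration: anomalies are stored in a dictionary keyed by `(t, row, col)`, a cell is flagged when its value falls below the threshold, and neighbors are the six face-adjacent cells in space and time, which may differ from the connectivity used by the Matlab routines. The name `large_components` is hypothetical.

```python
from collections import deque

def large_components(anomaly, threshold, min_size):
    """Return connected groups of cells whose anomaly is below `threshold`,
    keeping only groups with at least `min_size` cells."""
    # Step 1: flag cells by thresholding.
    flagged = {cell for cell, value in anomaly.items() if value < threshold}
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    # Step 2: breadth-first search over flagged cells in space and time.
    for start in flagged:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            t, r, c = queue.popleft()
            component.append((t, r, c))
            for dt, dr, dc in steps:
                nb = (t + dt, r + dr, c + dc)
                if nb in flagged and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(component) >= min_size:
            components.append(sorted(component))
    return sorted(components)
```

A dry cell that persists at the same location over several time steps forms one component, while an isolated dry cell is discarded by the size filter.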


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example src: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution:
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)




Conditional Anomaly Detection

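Data instance $d \Rightarrow u_1, u_2, \ldots, u_{d_U}, v_1, v_2, \ldots, v_{d_V}$

• $d_U$ environmental attributes
• $d_V$ indicator attributes

Figure: Gaussian components $U_1, U_2, U_3$ in the environmental space $(u_1, u_2)$ mapped to components $V_1, V_2, V_3$ in the indicator space $(v_1, v_2)$, with $p(V_1|U_3) = 0.2$, $p(V_2|U_3) = 0.2$, $p(V_3|U_3) = 0.6$.

Algorithm [30]

1. Learn a Gaussian Mixture Model (GMM) $U = \{U_1, U_2, \ldots, U_{n_U}\}$, each component with dimensionality $d_U$
2. Learn a set of Gaussians $V = \{V_1, V_2, \ldots, V_{n_V}\}$, each with dimensionality $d_V$
3. Learn a probabilistic mapping function $p(V_j | U_i)$
4. Score a test instance $d = [u, v]$:

$S = \sum_{i=1}^{n_U} p(u | U_i) \sum_{j=1}^{n_V} p(v | V_j) \, p(V_j | U_i)$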
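The scoring step (step 4) can be sketched once the three learned pieces are available. To keep the example short it assumes one-dimensional environmental and indicator attributes, components given as `(mean, std)` pairs, and a mapping table supplied by hand rather than learned; `gauss` and `conditional_score` are hypothetical names.

```python
import math

def gauss(x, mean, std):
    """Density of a univariate Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def conditional_score(u, v, U, V, mapping):
    """S = sum_i p(u|U_i) * sum_j p(v|V_j) * p(V_j|U_i).
    U, V: lists of (mean, std); mapping[i][j] = p(V_j | U_i)."""
    return sum(
        gauss(u, *U[i])
        * sum(gauss(v, *V[j]) * mapping[i][j] for j in range(len(V)))
        for i in range(len(U))
    )
```

An instance whose indicator value is unlikely given its environmental context receives a low score, even if that indicator value is common elsewhere in the data.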


Finding Collective Anomalies

Find a collection of data points:
• Each point by itself is normal
• The collection as a whole is anomalous

Relevant when the data has inherent structure, and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution:

1. Break data into groups
2. Compute features for each group
3. Apply traditional anomaly detection

Examples:

1. Time series
2. Image
3. Spatial clusters of galaxies
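For the time series example, the simple solution can be sketched as follows. The choices here are assumptions made for illustration: groups are non-overlapping windows, the only feature is the within-window standard deviation, and the "traditional" detector is the absolute deviation from the median feature. The name `window_scores` is hypothetical.

```python
import statistics

def window_scores(series, width):
    """Score each non-overlapping window by how far its standard deviation
    lies from the median standard deviation over all windows."""
    # Step 1: break the data into groups.
    windows = [series[i:i + width]
               for i in range(0, len(series) - width + 1, width)]
    # Step 2: compute one feature per group.
    features = [statistics.pstdev(w) for w in windows]
    # Step 3: apply a simple point-anomaly detector to the features.
    median = statistics.median(features)
    return [abs(f - median) for f in features]
```

In a series that alternates between 0 and 1, a flat run of 0.5 values contains no unusual individual value, yet the window holding it receives the highest score.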


Using Latent Dirichlet Allocation for Group Anomalies

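Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
• topics: red, green, emissive
• words: continuous features

Flexible Genre Model (FGM)

For each group m:

1. Draw a genre $\{1, 2, \ldots, T\} \ni y_m \sim M(\pi)$
2. Draw the topic distribution for $y_m$: $S^K \ni \theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics $\beta_{m,k} \sim P(\beta_{m,k} | \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point n in the group:
   1. Draw topic membership $z_{mn} \sim M(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} | \beta_{m, z_{mn}})$

Figure: plate diagram of the FGM with nodes $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates $N$, $M$, $K$, $T$.

Model Parameters:
• $M(\pi)$: Multinomial
• Genre: $\mathrm{Dir}(\alpha_t)$
• Topic generators $P(\cdot | \eta_k)$: GIW
• Point generators $P(x_{mn} | \beta_{m,k})$: Multivariate Gaussian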


Inference and Testing for FGM

Inference and Learning Parameters:
• Approximate inference of latent variables (Gibbs Sampling)
• Use samples to learn parameters (Single step Monte Carlo EM)

Figure: axes labeled Topic 1, Topic 2, and Topic 3.

Anomaly Detection:
• Infer the topic distribution $\theta_m$
• Compute the negative log likelihood with respect to $\alpha_t$
• Rationale: An anomalous group will be unlikely to be generated from any genre

Geometric interpretation: Mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
• Confusion matrix
• Traditional evaluation metrics
  • Class imbalance
• ROC Curve

Validation set does not exist:
• Use domain expertise to find TP and FP
• FN is harder to estimate
  • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
Actual a      TP            FN
Actual n      FP            TN

$Acc = \frac{TP + TN}{\sum}$, where $\sum$ is the sum of all four cells

$Rec\ (R) = \frac{TP}{TP + FN}$

$Prec\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
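The four metrics follow directly from the confusion matrix. The sketch below is a plain transcription of the formulas; the function name `evaluate` is hypothetical, and it assumes the denominators are non-zero.

```python
def evaluate(tp, fn, fp, tn):
    """Return (accuracy, recall, precision, F-measure) from confusion counts."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn)
    prec = tp / (tp + fp)
    f = 2 * rec * prec / (rec + prec)
    return acc, rec, prec, f
```

With 10 anomalies among 1000 instances, a detector that finds 5 of them and raises 5 false alarms reaches an accuracy of 0.99 while recall, precision, and F are only 0.5, which is the class imbalance issue noted above.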


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
• Use threshold δ on score (scale issues [21])
• Take top x as anomalies

ROC curve by varying x or δ

Quality of output:
• Does the output "suggest" x or δ?
• Which output is better?
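A ROC curve obtained by varying δ can be sketched as below. The example assumes that higher scores mean "more anomalous", that labels are 1 for anomalies and 0 for normal instances with both classes present, and that each distinct score is tried as a threshold; `roc_points` is a hypothetical name.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs obtained by
    flagging every instance whose score is >= delta, for each delta."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 0)
        points.append((fp / negatives, tp / positives))
    return points
```

Taking the top x instances instead of thresholding traces the same curve, since each x corresponds to the threshold at the x-th largest score.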


Unifying Scores

Different methods assign scores in different ranges:
• kNN-based scores: [0, 1]
• LOF scores: [1, 6]
• ABOD scores: [0, 80000]
  • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: Stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]:
• Regularity: $S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
• Normality: S is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization:
  1. $R(o) = \max\{0,\ S(o) - \text{base}_S\}$
  2. $R(o) = S_{\max} - S(o)$
  3. $R(o) = -\log \frac{S(o)}{S_{\max}}$

Normalization:
  1. $N(o) = \frac{S(o)}{S_{\max}}$
  2. $N(o) = \max\left(0,\ \operatorname{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
     Suited for high dimensional data
  3. $N(o) = \max\left(0,\ \frac{\text{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)
     where $\text{cdf}^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and P is the regularized Gamma function
     Suited for low dimensional data
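A minimal sketch of baseline regularization, linear inversion and Gaussian scaling follows. It assumes that the mean and standard deviation are estimated from the scores themselves; the function names and the LOF-like toy scores are illustrative. Gamma scaling is left out because it needs a regularized Gamma function from outside the standard library.

```python
import math
import statistics


def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base), e.g. base = 1 for LOF-like scores."""
    return [max(0.0, s - base) for s in scores]


def regularize_inverted(scores):
    """R(o) = S_max - S(o), for methods where anomalies score low."""
    s_max = max(scores)
    return [s_max - s for s in scores]


def normalize_gaussian(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    mu = statistics.fmean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores equal: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
            for s in scores]


lof_like = [1.0, 1.1, 0.9, 1.0, 6.0]  # hypothetical scores, one clear outlier
normalized = normalize_gaussian(regularize_baseline(lof_like, base=1.0))
print(normalized)
```

Scores at or below the mean map to 0 and only the clear outlier approaches 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior.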


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available:
  Multivariate continuous data [27]
  Multivariate categorical data [5]
  Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given data point z = (x, y), where x and y are partitions of feature space

Take a random sample D of the entire data set

Let z′ = (x′, y′) ∈ D such that distance between y and y′ is maximum

Replace x with x′ and add z back to data set

Figure: points z and z′ in the (x, y) plane, and the injected anomaly
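The random perturbation steps can be sketched as below. The sketch assumes that each data point is stored as a pair of tuples (x, y) and that distance is Euclidean; the function name, the sample size parameter and the toy data are illustrative assumptions.

```python
import math
import random


def inject_anomaly(z, data, sample_size, rng):
    """Return a perturbed copy of z = (x, y).

    A random sample D is drawn from `data`; the point z' = (x', y') in D
    whose y' is farthest from y is found, and x is replaced with x'.
    """
    x, y = z
    sample = rng.sample(data, min(sample_size, len(data)))
    x_far, _ = max(sample, key=lambda point: math.dist(y, point[1]))
    return (x_far, y)


# Hypothetical data set in which x and y move together.
data = [((0.0,), (0.0,)), ((1.0,), (1.0,)), ((9.0,), (9.0,))]
rng = random.Random(0)
injected = inject_anomaly(data[0], data, sample_size=3, rng=rng)
print(injected)  # x taken from the point whose y is farthest from y = (0.0,)
```

The injected point keeps a normal-looking y but pairs it with an x that normally occurs with a very different y, so it is anomalous in the joint space only.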


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature, by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: bar chart with categories 2002, 2003, 2004, 2005, 2006, 2007, 2008, July 2009-June 2010 and values 20547, 18827, 69107, 113025, 140690, 624267, 1656227, 2704836 (vertical axis from 0 to 3000000)


Network Topology

A typical network

Figure: a router with nodes 1 to 4, OD flows x1t, x2t, x3t, x4t and link measurement y1t

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (x1t)

Link measurement is the traffic that enters at a node during an interval (y1t)

Relationship between link traffic and OD flow traffic is captured by the routing matrix A


Network Anomalies Detection Problem


Why care about OD Flows?

Volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$\hat{Z} = PP^{T}Z = CZ$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$\tilde{Z} = (I - PP^{T})Z = \tilde{C}Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
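The five steps can be sketched with NumPy as follows. This is an illustration under assumptions, not the implementation from [22]: rows of Z are taken to be time steps, the data is mean-centered before PCA, and the synthetic traffic matrix and the name `residual_energy` are invented for the example.

```python
import numpy as np


def residual_energy(Z, k):
    """Squared norm of each row's projection onto the residual subspace."""
    Zc = Z - Z.mean(axis=0)                    # center each column
    cov = np.cov(Zc, rowvar=False)             # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]        # Step 2: top-k components
    P = eigvecs[:, top]
    C_res = np.eye(Z.shape[1]) - P @ P.T       # Step 4: I - P P^T
    residual = Zc @ C_res                      # rows are observations
    return np.sum(residual ** 2, axis=1)       # Step 5: squared norm


# Synthetic traffic: 4 links driven by one shared pattern plus small noise.
rng = np.random.default_rng(0)
shared = rng.normal(size=200)
Z = np.outer(shared, np.ones(4)) + 0.01 * rng.normal(size=(200, 4))
Z[50] += np.array([5.0, -5.0, 0.0, 0.0])       # anomaly off the normal subspace

spe = residual_energy(Z, k=1)
print(int(np.argmax(spe)))
```

A threshold on the returned values then separates anomalous time steps from normal ones; how to pick that threshold is not covered by this sketch.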


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes a lot

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

Most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]

Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure

Port scan anomaly: traffic features change, however traffic volume remains more or less the same

We can use entropy to capture the variations in the traffic feature
• Takes value 0 when distribution is maximally concentrated
• Takes value log2 N when distribution is maximally dispersed
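The entropy of a traffic feature can be sketched as the Shannon entropy of its empirical distribution. The function name and the port lists below are illustrative assumptions.

```python
import math
from collections import Counter


def feature_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


concentrated = [80] * 8        # every packet goes to the same port
dispersed = list(range(8))     # a scan touching 8 distinct ports once each

print(feature_entropy(concentrated))
print(feature_entropy(dispersed))
```

Both lists contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from its minimum to log2 N with N = 8.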


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP

Scan anomaly: a spike in traffic data from a dominant source IP

Flash crowd anomaly: again a spike in traffic data to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data:
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges:
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Domain and players:
  Insurance (auto/health)^a: Claimant/Provider, Payer
  Telecommunications: Customer, Provider
  Credit cards: Customer, Supplier, Bank
  Web advertising: User, Advertiser, Publisher

^a http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]:
  1. Build profiles for individuals (customers, users, etc.) based on historic data
       User X makes n calls on average in January
  2. Compare current behavior with historical profile for significant deviations

Clustering based [4]:
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths:
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses:
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
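The first three egonet features can be sketched in plain Python. The sketch assumes an undirected weighted graph given as (u, v, weight) triples; the function name and the toy graph are illustrative. The eigenvalue feature is omitted because it needs a linear algebra routine.

```python
from collections import defaultdict


def egonet_features(edges):
    """Map each node to (N_i, E_i, W_i) computed on its egonet.

    The egonet of a node is the sub-graph induced by the node and its
    direct neighbors.
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w

    features = {}
    for ego, neighbors in adj.items():
        members = set(neighbors) | {ego}
        edge_count = 0
        total_weight = 0.0
        for u in members:
            for v, w in adj[u].items():
                if v in members:       # edge lies inside the egonet
                    edge_count += 1
                    total_weight += w
        # Every undirected edge was visited from both endpoints.
        features[ego] = (len(neighbors), edge_count // 2, total_weight / 2)
    return features


edges = [("c", "a", 1.0), ("c", "b", 1.0), ("c", "d", 1.0), ("a", "b", 2.0)]
feats = egonet_features(edges)
print(feats["c"])
```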


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:
  Identify relevant feature pairs and power law relationship
  E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detect near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = Cx^{\theta}$):

$S_i = \frac{\max(y_i,\ Cx_i^{\theta})}{\min(y_i,\ Cx_i^{\theta})} \cdot \log\left(|y_i - Cx_i^{\theta}| + 1\right)$
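The score for one feature pair can be sketched as below. Fitting C and θ by least squares in log-log space is an assumption of this sketch, as are the function names and the toy values; the score itself follows the formula above and requires positive x and y.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = C * x**theta by least squares on (log x, log y)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    theta = sxy / sxx
    return math.exp(my - theta * mx), theta


def pair_anomaly_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)


xs = [1, 2, 4, 8]
ys = [2 * x ** 1.5 for x in xs]            # points lying exactly on a power law
C, theta = fit_power_law(xs, ys)
print(pair_anomaly_score(4, 2 * 4 ** 1.5, C, theta))   # on the fitted curve
print(pair_anomaly_score(4, 64.0, C, theta))           # far above the curve
```

A node on the fitted curve scores close to zero, and the score grows with both the ratio and the absolute distance to the curve.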


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
  Healthcare data (Beneficiaries vs. Providers)
  Insider trading (Traders vs. Stocks)

Objective: identify anomalous links

Given a query node a ∈ V1, find the "relevance" of all other nodes in V1 to a
  RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node t ∈ V2:
  Find set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute $|S_t| \times |S_t|$ similarity matrix using relevance vectors for a ∈ $S_t$
  Normality score = mean of non-diagonal entries of similarity matrix

Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} joined by edges E
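The relevance and normality scores can be sketched as follows. Several choices here are assumptions rather than the tutorial's method: the random walk is a random walk with restart (restart probability 0.15) computed by power iteration instead of simulation, and the similarity matrix entries are taken to be the relevance scores themselves. Function names and the toy graph are illustrative.

```python
def make_graph(edges):
    """Adjacency lists for an undirected bipartite graph from (a, t) pairs."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, []).append(t)
        adj.setdefault(t, []).append(a)
    return adj


def relevance_scores(adj, start, restart=0.15, iterations=100):
    """Stationary visit probabilities of a random walk with restart at `start`."""
    prob = {node: 0.0 for node in adj}
    prob[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in adj}
        for node, mass in prob.items():
            share = (1.0 - restart) * mass / len(adj[node])
            for neighbor in adj[node]:
                nxt[neighbor] += share
        nxt[start] += restart
        prob = nxt
    return prob


def normality_score(adj, t):
    """Mean pairwise relevance among the neighbors S_t of node t."""
    members = adj[t]
    if len(members) < 2:
        return None  # no off-diagonal entries to average
    rel = {a: relevance_scores(adj, a) for a in members}
    pairs = [rel[a][b] for a in members for b in members if a != b]
    return sum(pairs) / len(pairs)


# Two tight groups (a*, t*) and (b*, u*) plus node x bridging a1 and b1.
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2")]
edges += [(b, u) for b in ("b1", "b2", "b3") for u in ("u1", "u2")]
edges += [("a1", "x"), ("b1", "x")]
graph = make_graph(edges)
print(normality_score(graph, "t1"), normality_score(graph, "x"))
```

Node x links two groups that are otherwise unrelated, so its neighbors have low mutual relevance and it receives the lower normality score.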


Detecting Disease Outbreaks

Domain question:
  Early detection of disease outbreaks
  Anthrax attack

Data:
  Emergency department visits
  Grocery data (example [14])
  Clinical visits
  Weather/climate data

Challenges:
  Weak signals in the data
    ED cases involving cough ⇒ flu or SARS?
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network from historical data
     Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency table for rules for $DB_{baseline}$ and $DB_{current}$
     Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find p-value for rules using the $\chi^2$-test
     Null hypothesis: rows and columns of tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value. Repeat Step 2
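Scoring a single rule with a chi-squared test on a 2x2 contingency table can be sketched as below. The sketch takes the most significant rule to be the one with the smallest p-value, applies no continuity correction, and uses the closed form for one degree of freedom; function names and counts are illustrative assumptions, and the Bayesian network sampling step is not shown.

```python
import math


def chi2_p_value(cur_match, cur_other, base_match, base_other):
    """p-value of the chi-squared independence test on a 2x2 table.

    Rows are DB_current and DB_baseline; columns are records that match
    the rule and records that do not.
    """
    a, b, c, d = cur_match, cur_other, base_match, base_other
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of chi-squared with one degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


def most_significant_rule(tables):
    """Name of the rule whose table has the smallest p-value."""
    return min(tables, key=lambda rule: chi2_p_value(*tables[rule]))


# Hypothetical counts: (current match, current other, baseline match, baseline other)
tables = {
    "symptom = cough": (40, 60, 10, 90),
    "day = weekend": (10, 90, 10, 90),
}
print(most_significant_rule(tables))
```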


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
  Add nodes for temporal and spatial dependencies
  Need more data to train


Anomaly Detection in Climate and Weather

Science questions:
  Identify natural and anthropogenic disasters
  Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has height dimension as well)

Challenges:
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis done on "anomaly" time series:
  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena:
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
    30 year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
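One of the listed choices of base, the monthly mean, can be sketched as follows for a single location. The function name and the toy series are illustrative assumptions.

```python
from collections import defaultdict


def monthly_mean_anomalies(values, months):
    """Subtract from each value the mean of all values in the same month."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, month in zip(values, months):
        sums[month] += value
        counts[month] += 1
    base = {month: sums[month] / counts[month] for month in sums}
    return [value - base[month] for value, month in zip(values, months)]


# Hypothetical temperatures: two Januaries (month 1) and two Julys (month 7).
values = [10.0, 20.0, 12.0, 22.0]
months = [1, 7, 1, 7]
print(monthly_mean_anomalies(values, months))
```

The seasonal gap between January and July disappears from the output, leaving only the year-to-year difference. Swapping the mean for a median or a z-score changes only how the base is computed.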


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach:
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
       Matlab: bwlabel, bwlabeln

Follow-up science question: explain cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy of Dr. Arindam Banerjee
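The two step approach can be sketched in plain Python as thresholding followed by a breadth-first search for connected components over a (time, row, column) grid. Treating cells below the threshold as dry and using face connectivity (6 neighbors) are assumptions of this sketch; the function name and the toy grid are illustrative.

```python
from collections import deque


def drought_components(anomaly, threshold):
    """Connected sets of cells with anomaly below threshold, largest first.

    `anomaly[t][i][j]` is the precipitation anomaly at time t, row i,
    column j. Cells are connected if they share a face in space or time.
    """
    dry = {(t, i, j)
           for t, grid in enumerate(anomaly)
           for i, row in enumerate(grid)
           for j, value in enumerate(row)
           if value < threshold}                      # Step 1: threshold
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    components = []
    for cell in sorted(dry):                          # Step 2: components
        if cell in seen:
            continue
        seen.add(cell)
        queue = deque([cell])
        component = []
        while queue:
            t, i, j = queue.popleft()
            component.append((t, i, j))
            for dt, di, dj in steps:
                neighbor = (t + dt, i + di, j + dj)
                if neighbor in dry and neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)


# Two time steps of a 2 x 3 grid of hypothetical precipitation anomalies.
grid = [
    [[-2.0, -2.0, 0.0], [0.0, 0.0, 0.0]],
    [[-2.0, 0.0, 0.0], [0.0, 0.0, -2.0]],
]
components = drought_components(grid, threshold=-1.0)
print([len(c) for c in components])
```

Keeping only components above a minimum size gives the "large connected components", that is, deficits that persist across neighboring cells and consecutive time steps.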


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]:
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example (Src: http://newscenter.lbl.gov)

Cold Fronts [23]:
  Surface winds and potential temperature fields
  Methodology:
    1. Compute features for every grid cell
    2. Cluster grid cells into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution:
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


L Akoglu M McGlohon and C Faloutsos

OddBall Spotting Anomalies in Weighted GraphsIn In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) June 2010

A Asuncion and D J Newman

UCI machine learning repository[httparchiveicsucieduml] Irvine CA University of California 2007

S D Bay and M Schwabacher

Mining distance-based outliers in near linear time with randomization and a simple pruning ruleIn Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and datamining pages 29ndash38 ACM Press 2003

R Bolton and D Hand

Unsupervised profiling methods for fraud detectionIn Credit Scoring and Credit Control VII 1999

S Boriah V Chandola and V Kumar

Similarity measures for categorical data A comparative evaluationIn SDM pages 243ndash254 2008

M M Breunig H-P Kriegel R T Ng and J Sander

Lof identifying density-based local outliersIn Proceedings of 2000 ACM SIGMOD International Conference on Management of Data pages 93ndash104ACM Press 2000

S Budalakoti A Srivastava and M Otey

Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airlinesafetyProceedings of the IEEE International Conference on Systems Man and Cybernetics 37(6) 2007

S Byna Prabhat M Wehner and K Wu

Detecting atmospheric rivers in large climate datasets

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

In 2nd International Workshop on Petascale Data Analytics Challenges and Opportunities (PDAC-11)2011

V Chandola A Banerjee and V Kumar

Anomaly detection for discrete sequences A surveyIEEE Transactions on Knowledge and Data Engineering 99(PrePrints) 2010

V Chandola S Boriah and V Kumar

A framework for exploring categorical dataIn Proceedings of the ninth SIAM International Conference on Data Mining 2009

V Chandola V Mithal and V Kumar

A comparative evaluation of anomaly detection techniques for sequence dataIn Proceedings of International Conference on Data Mining 2008

V Chandola V Mithal and V Kumar

Understanding anomaly detection techniques for sequence dataTechnical Report 09-001 University of Minnesota Computer Science Department January 2009

T Fawcett and F Provost

Activity monitoring noticing interesting changes in behaviorIn Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and DataMining pages 53ndash62 ACM Press 1999

A Goldenberg G Shmueli R A Caruana and S E Fienberg

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication salesProceedings of the National Academy of Sciences of the United States of America 99(8)pp 5237ndash52402002

D Hawkins

Identification of outliersMonographs on Applied Probability and Statistics May 1980

S A Hofmeyr S Forrest and A Somayaji

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Intrusion detection using sequences of system callsJournal of Computer Security 6(3)151ndash180 1998

X Jiang and G F Cooper

A bayesian spatio-temporal method for disease outbreak detectionJAMIA pages 462ndash471 2010

J S Kaustav Das

Detecting anomalous records in categorical datasetsIn Proc of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data miningAug 2007

J Kawale S Chatterjee A Kumar S Liess M Steinbach and V Kumar

Anomaly construction in climate data Issues and challengesIn Proceedings of NASA Conference on Intelligent Data Understanding 2011

E M Knorr and R T Ng

Algorithms for mining distance-based outliers in large datasetsIn Proceedings of the 24rd International Conference on Very Large Data Bases pages 392ndash403 MorganKaufmann Publishers Inc 1998

H-P Kriegel P Kroger E Schubert and A Zimek

Interpreting and unifying outlier scoresIn SDM pages 13ndash24 2011

A Lakhina M Crovella and C Diot

Diagnosing network-wide traffic anomaliesIn Proceedings of ACM SIGCOMM pages 219ndash230 2004

X Li R Ramachandran S Graves S Movva B Akkiraju D Emmitt S Greco R Atlas J Terry and

J-C JusemAutomated detection of frontal systems from numerical model-generated dataIn Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in datamining KDD rsquo05 pages 782ndash787 New York NY USA 2005 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Finding Collective Anomalies

Find a collection of data points such that
  • Each point by itself is normal
  • The collection as a whole is anomalous

Relevant when the data has inherent structure and when the domain definition of anomalies cannot be described as point anomalies

A Simple Solution

  1. Break data into groups
  2. Compute features for each group
  3. Apply traditional anomaly detection

Examples

  1. Time series
  2. Image
  3. Spatial clusters of galaxies
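The three steps of the simple solution can be sketched for a time series. This is only an illustration under assumptions that are not in the slides: groups are non-overlapping windows, the group features are mean and spread, and the "traditional" detector is a robust z-score on the spread. The function names are hypothetical.

```python
import statistics


def group_features(series, window):
    """Steps 1-2: split the series into non-overlapping windows and
    summarise each window by (mean, population standard deviation)."""
    groups = [series[i:i + window]
              for i in range(0, len(series) - window + 1, window)]
    return [(statistics.mean(g), statistics.pstdev(g)) for g in groups]


def anomalous_groups(features, z_threshold=2.0):
    """Step 3: flag groups whose spread deviates strongly from the median
    spread, using a median/MAD based z-score (an assumed detector)."""
    spreads = [f[1] for f in features]
    med = statistics.median(spreads)
    # Fall back to a tiny value when all deviations are zero.
    mad = statistics.median(abs(s - med) for s in spreads) or 1e-12
    return [i for i, s in enumerate(spreads)
            if abs(s - med) / (1.4826 * mad) > z_threshold]


# Every value lies in [0, 1], so no single point is unusual,
# but the flat third window is unusual as a collection.
series = [0, 1, 0, 1] * 2 + [0.5] * 4 + [0, 1, 0, 1] * 2
flagged = anomalous_groups(group_features(series, window=4))
```

Any other point-anomaly detector could replace the z-score in step 3; the grouping and feature choices are where domain knowledge enters.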


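Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
  • topics: red, green, emissive
  • words: continuous features

Flexible Genre Model (FGM)

For each group $m$
  1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$
  2. Draw topic distribution for $y_m$: $\theta_m \in S_K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
  3. Draw $K$ topics $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
  4. For each point $n$ in the group
       1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
       2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables $\pi$, $y_m$, $\alpha$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$, $\eta$ and plates $T$, $K$, $N$, $M$]

Model Parameters
  • $\mathcal{M}(\pi)$: Multinomial
  • Genre: $\mathrm{Dir}(\alpha_t)$
  • Topic generators $P(\cdot \mid \eta_k)$: GIW
  • Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian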
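The generative process can be sketched in NumPy. This is a deliberately simplified illustration, not the model of [34]: the GIW topic generator is replaced by a Gaussian jitter around fixed topic centres, and points are drawn with unit covariance. The names `sample_group`, `eta` and `topic_spread` are hypothetical.

```python
import numpy as np


def sample_group(rng, pi, alpha, eta, n_points, topic_spread=0.1):
    """Draw one group from a simplified FGM.

    pi:    (T,) genre probabilities
    alpha: (T, K) Dirichlet parameters, one row per genre
    eta:   (K, d) topic centres (assumed stand-in for the GIW hyperparameters)
    """
    T, K = alpha.shape
    y = rng.choice(T, p=pi)                       # 1. genre
    theta = rng.dirichlet(alpha[y])               # 2. topic distribution
    # 3. group-specific topics: jittered copies of the centres (simplification)
    beta = eta + topic_spread * rng.standard_normal(eta.shape)
    z = rng.choice(K, size=n_points, p=theta)     # 4.1 topic memberships
    # 4.2 points: unit-variance Gaussian around the chosen topic
    x = beta[z] + rng.standard_normal((n_points, eta.shape[1]))
    return y, theta, z, x


rng = np.random.default_rng(0)
pi = np.array([0.5, 0.5])
alpha = np.array([[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]])
eta = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])
y, theta, z, x = sample_group(rng, pi, alpha, eta, n_points=50)
```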


Inference and Testing for FGM

Inference and Learning Parameters
  • Approximate inference of latent variables (Gibbs sampling)
  • Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly Detection
  • Infer the topic distribution $\theta_m$
  • Compute negative log likelihood w.r.t. $\alpha_t$
  • Rationale: An anomalous group will be unlikely to be generated from any genre
  • Geometric interpretation: Mapping each group into a T-dimensional space and finding anomalies
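Once $\theta_m$ has been inferred, the scoring step might look like the sketch below. It assumes the score is the negative log of a genre mixture of Dirichlet densities, $-\log \sum_t \pi_t \, \mathrm{Dir}(\theta_m \mid \alpha_t)$; the slides only say "negative log likelihood w.r.t. $\alpha_t$", so the mixing over genres is an assumption, and the function names are hypothetical.

```python
import math


def dirichlet_logpdf(theta, alpha):
    """Log density of a Dirichlet(alpha) distribution at theta."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    return log_norm + sum((a - 1.0) * math.log(t)
                          for a, t in zip(alpha, theta))


def group_anomaly_score(theta, pi, genre_alphas):
    """Negative log likelihood of theta under the genre mixture
    (assumed scoring rule; higher means more anomalous)."""
    logs = [math.log(p) + dirichlet_logpdf(theta, a)
            for p, a in zip(pi, genre_alphas)]
    m = max(logs)  # log-sum-exp for numerical stability
    return -(m + math.log(sum(math.exp(v - m) for v in logs)))


pi = [0.5, 0.5]
genre_alphas = [[20.0, 2.0, 2.0], [2.0, 20.0, 2.0]]
typical = group_anomaly_score([0.8, 0.1, 0.1], pi, genre_alphas)
unusual = group_anomaly_score([0.1, 0.1, 0.8], pi, genre_alphas)
```

A group dominated by the third topic matches neither genre, so it receives the higher score.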


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  • Confusion matrix
  • Traditional evaluation metrics
  • Class imbalance
  • ROC curve

Validation set does not exist
  • Use domain expertise to find TP and FP
  • FN is harder to estimate
  • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal)

                 Predicted
                 a      n
  Actual   a     TP     FN
           n     FP     TN

$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}$

$\mathrm{Rec}\ (R) = \frac{TP}{TP + FN}$

$\mathrm{Prec}\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
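The metrics above can be computed directly from the four confusion-matrix cells. The sketch below also shows why class imbalance matters: a detector that labels everything as normal still reaches high accuracy. Function names and the zero-division convention (return 0.0) are assumptions.

```python
def confusion_counts(actual, predicted):
    """Count TP, FN, FP, TN; truthy values mean 'anomalous'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)
    fn = sum(1 for a, p in pairs if a and not p)
    fp = sum(1 for a, p in pairs if not a and p)
    tn = sum(1 for a, p in pairs if not a and not p)
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure as defined above."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"acc": acc, "rec": rec, "prec": prec, "f": f}


# 10 anomalies among 1000 points, detector predicts "normal" for everything.
actual = [1] * 10 + [0] * 990
predicted = [0] * 1000
all_normal = metrics(*confusion_counts(actual, predicted))
```

Here accuracy is 0.99 while recall and F are 0, which is why accuracy alone is a poor summary for anomaly detection.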


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  • Use threshold $\delta$ on score (scale issues [21])
  • Take top x as anomalies

ROC curve by varying x or $\delta$

Quality of output
  • Does the output “suggest” x or $\delta$?
  • Which output is better?
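Both ways of converting scores to a binary output, and the ROC curve obtained by sweeping the threshold, can be sketched as follows. The sketch assumes that higher scores mean more anomalous and that both classes are present; function names are hypothetical.

```python
def top_x(scores, x):
    """Indices of the x highest-scoring points."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(order[:x])


def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by sweeping the threshold delta over every
    distinct score; a point is flagged when score >= delta."""
    pos = sum(1 for l in labels if l)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= delta and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= delta and not l)
        points.append((fp / neg, tp / pos))
    return points


scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
```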


Unifying Scores

Different methods assign scores in different ranges
  • kNN-based scores: [0, 1]
  • LOF scores: [1, 6]
  • ABOD scores: [0, 80000]
      • Anomalies have lower scores

Direct scaling to [0, 1] might lose distinction between normal and anomalies

Desired scaling: Stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
  • Regularity $\Rightarrow$ $S(o) \ge 0 \ \forall o$, $S(o) \approx 0$ if $o$ is normal and $S(o) \gg 0$ if $o$ is anomalous
  • Normality $\Rightarrow$ $S$ is regular and $S(o) \in [0, 1] \ \forall o$


Regularization and Normalization of Scores

Regularization
  1. $R(o) = \max\{0, S(o) - \mathrm{base}_S\}$
  2. $R(o) = S_{\max} - S(o)$
  3. $R(o) = -\log \frac{S(o)}{S_{\max}}$

Normalization
  1. $N(o) = \frac{S(o)}{S_{\max}}$
  2. $N(o) = \max\left(0, \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
       • Suited for high dimensional data
  3. $N(o) = \max\left(0, \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)
       • Where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
       • Suited for low dimensional data
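Two of the steps above, baseline regularization and Gaussian scaling, can be sketched with the standard library alone. The sketch assumes $\mu_S$ and $\sigma_S$ are estimated as the sample mean and population standard deviation of the scores; the function names are hypothetical.

```python
import math
import statistics


def regularize_baseline(scores, base):
    """R(o) = max{0, S(o) - base_S}."""
    return [max(0.0, s - base) for s in scores]


def gaussian_scaling(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores)
    if sigma == 0:
        # All scores identical: nothing stands out.
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2))))
            for s in scores]


# LOF-like scores: values near 1 are normal, 6.0 is a clear outlier.
lof_like = [1.0, 1.1, 0.9, 1.0, 6.0]
normalized = gaussian_scaling(regularize_baseline(lof_like, base=1.0))
```

Scores at or below the mean map to 0, and the outlier maps close to 1, so the interesting range is stretched as the previous slide asks.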


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
  • Use generative models for normal and anomalous behavior
  • Several generators available
      • Multivariate continuous data [27]
      • Multivariate categorical data [5]
      • Discrete sequences using HMM [12]
  • Drawbacks: Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
  • Given data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
  • Take a random sample $D$ of the entire data set
  • Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
  • Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: x-y plot marking $z$, $z'$ and the injected anomaly]
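The random perturbation procedure can be sketched as below. It assumes each data point is a tuple whose first `split` entries form $x$ and whose remaining entries form $y$, and that distance is Euclidean; both choices and the function name are assumptions made for illustration.

```python
import math
import random


def inject_anomaly(z, data, split, sample_size, rng):
    """Return a perturbed copy of z = (x, y): x is replaced by the x' of the
    sampled point whose y' is farthest from y."""
    y = z[split:]
    sample = rng.sample(data, min(sample_size, len(data)))
    farthest = max(sample, key=lambda c: math.dist(y, c[split:]))
    return tuple(farthest[:split]) + tuple(y)


# Toy data in which x and y are perfectly correlated.
data = [(float(i), float(i)) for i in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data[0], data, split=1, sample_size=10, rng=rng)
```

The injected point combines an `x` taken from one end of the data with a `y` from the other end, so each part looks normal on its own while the combination does not.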


Applications Overview

How to set up an anomaly detection solution for a given application domain?

  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
      • Feature identification
      • Defining normal and anomalous behavior
      • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

  • Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks
  • Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period]

  Period                   Count
  2002                     20,547
  2003                     18,827
  2004                     69,107
  2005                     113,025
  2006                     140,690
  2007                     624,267
  2008                     1,656,227
  July 2009 - June 2010    2,704,836


Network Topology

A typical network

[Figure: a router connecting nodes 1-4, with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

  • An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)
  • A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)
  • The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
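The role of the routing matrix can be illustrated with a small example. It assumes the usual linear form $y_t = A x_t$, where $A_{ij} = 1$ if OD flow $j$ traverses link $i$; the 3-link, 4-flow topology below is hypothetical and is not the network in the figure.

```python
import numpy as np

# Hypothetical routing matrix: rows are links, columns are OD flows.
A = np.array([
    [1, 1, 0, 0],   # link 1 carries OD flows 1 and 2
    [0, 1, 1, 0],   # link 2 carries OD flows 2 and 3
    [0, 0, 1, 1],   # link 3 carries OD flows 3 and 4
])


def link_loads(A, x):
    """Link measurements y_t = A x_t for one time interval."""
    return A @ np.asarray(x)


normal = link_loads(A, [10, 20, 30, 40])
# A spike of +5 on OD flow 1 changes only the links that carry it,
# and there it is mixed with the other flows on the same link.
spiked = link_loads(A, [15, 20, 30, 40])
```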


Network Anomalies Detection Problem


Why care about OD Flows?

  • A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)
  • If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix $P$

Step 3: Normal traffic subspace, called $\hat{Z}$

$\hat{Z} = P P^{T} Z = C Z$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5: If the norm of a residual vector in $\tilde{Z}$ is large, then it is an "anomaly"
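The five steps can be sketched in NumPy. The sketch assumes rows of the data matrix are time intervals and columns are links, centres the data before the decomposition, and scores each interval by the squared norm of its residual; the function name and the synthetic data are hypothetical.

```python
import numpy as np


def residual_norms(Z, k):
    """Squared norm of the residual (abnormal subspace) component per row."""
    Zc = Z - Z.mean(axis=0)
    cov = np.cov(Zc, rowvar=False)                  # Step 1
    eigvals, eigvecs = np.linalg.eigh(cov)          # ascending eigenvalues
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]   # Step 2: top-k PCs
    C_res = np.eye(Z.shape[1]) - P @ P.T            # Step 4: I - P P^T
    residual = Zc @ C_res                           # C_res is symmetric
    return np.sum(residual ** 2, axis=1)            # Step 5: score


# Synthetic traffic: one dominant pattern shared by 4 links plus small noise.
rng = np.random.default_rng(0)
base = 10.0 * rng.standard_normal((200, 1))
Z = base @ np.ones((1, 4)) + 0.1 * rng.standard_normal((200, 4))
Z[50] += np.array([5.0, -5.0, 0.0, 0.0])   # injected anomaly at t = 50
scores = residual_norms(Z, k=1)
```

A threshold on `scores` then separates anomalous intervals; how to choose that threshold is not covered by this sketch.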


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source & destination IP addresses, source and destination port numbers, and IP protocol
  • DoS attack: multiple source IP addresses concentrated on a single destination IP address
  • Network scan: dispersed distribution of destination addresses
  • Most worms/viruses also induce some change in the distribution of certain features
  • However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly: the traffic feature changes, but traffic volume remains more or less the same

We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when distribution is maximally concentrated
  • Takes value $\log_2 N$ when distribution is maximally dispersed

[Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure]

Figure source: [22]
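The two boundary values of the entropy can be checked with a short sketch. It computes the empirical Shannon entropy (base 2) of one traffic feature, for example the destination ports seen in an interval; the function name and the toy port lists are assumptions.

```python
import math
from collections import Counter


def feature_entropy(values):
    """Empirical Shannon entropy (in bits) of the observed feature values."""
    values = list(values)
    n = len(values)
    counts = Counter(values)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# All packets go to one port: maximally concentrated.
concentrated = feature_entropy([80] * 8)
# Every packet goes to a different port (as in a port scan): maximally dispersed.
dispersed = feature_entropy(range(1000, 1008))
```

Both intervals contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to $\log_2 8 = 3$ bits.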


Entropy Based versus Volume Based

  • DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
  • Scan anomaly: a spike in traffic from a dominant source IP
  • Flash crowd anomaly: again a spike in traffic to a dominant destination IP
  • Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

  Domain                         Players
  Insurance (auto/health) [a]    Claimant, Provider, Payer
  Telecommunications             Customer, Provider
  Credit Cards                   Customer, Supplier, Bank
  Web Advertising                User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
       • User X makes n calls on average in January
  2. Compare current behavior with historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
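The first three egonet features can be computed from an adjacency dictionary without any graph library. The sketch assumes an undirected graph stored symmetrically as `adj[u][v] = weight`; the function name and the toy graphs are hypothetical, and the eigenvalue feature $\lambda_i$ is left out for brevity.

```python
def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    adj: dict mapping node -> dict of neighbour -> edge weight (symmetric).
    """
    members = {node} | set(adj[node])
    # Collect each undirected edge inside the egonet once.
    edges = {frozenset((u, v)): w
             for u in members
             for v, w in adj[u].items()
             if v in members}
    n_i = len(adj[node])          # degree of the node
    e_i = len(edges)              # number of edges in the egonet
    w_i = sum(edges.values())     # total weight of the egonet
    return n_i, e_i, w_i


# A star: the centre's neighbours are not connected to each other.
star = {"c": {"a": 1, "b": 1, "d": 1},
        "a": {"c": 1}, "b": {"c": 1}, "d": {"c": 1}}
# A clique on 4 nodes with weight 2 on every edge.
clique = {i: {j: 2 for j in range(4) if j != i} for i in range(4)}

star_point = egonet_features(star, "c")
clique_point = egonet_features(clique, 0)
```

With the same degree, the star has the fewest possible egonet edges and the clique the most, which is what the power-law check on the following slide exploits.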


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and power law relationship
  • E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ (detects near-cliques and stars)

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$)

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
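The score $S_i$ is a direct function of the observed value $y_i$ and the value $C x_i^{\theta}$ predicted by the fitted power law. A minimal sketch, assuming $C$ and $\theta$ have already been fitted and that both values are positive (the numbers below are made up):

```python
import math


def power_law_score(y, x, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)


# Assumed fit: E = N^1.5. A node with N = 10 is expected to have about 31.6 edges.
on_the_line = power_law_score(y=8.0, x=4.0, C=1.0, theta=1.5)
near_expected = power_law_score(y=30.0, x=10.0, C=1.0, theta=1.5)
star_like = power_law_score(y=10.0, x=10.0, C=1.0, theta=1.5)
```

The score is 0 for a node that lies exactly on the fitted line and grows with both the ratio and the absolute distance from it.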


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the “relevance” of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a “random walk” from $a$ reaches $b$

Use the relevance scores to compute the normality scores for a node $t \in V_2$
  • Find set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality Score = mean of non-diagonal entries of similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$]
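One way to realise the relevance and normality scores is sketched below. The slides do not fix the details, so several choices here are assumptions: relevance is computed by a random walk with restart (restart probability 0.15) solved by power iteration, similarity between relevance vectors is cosine similarity, every node has at least one edge, and $|S_t| \ge 2$. Function names are hypothetical.

```python
import numpy as np


def relevance_vector(B, a, restart=0.15, iters=100):
    """Visit probabilities over V1 for a random walk with restart from a.

    B: (n1, n2) biadjacency matrix, B[i, j] = 1 if a_i is linked to t_j.
    """
    n1, n2 = B.shape
    M = np.zeros((n1 + n2, n1 + n2))
    M[:n1, n1:] = B
    M[n1:, :n1] = B.T
    P = M / M.sum(axis=0, keepdims=True)   # column-stochastic transitions
    e = np.zeros(n1 + n2)
    e[a] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = (1.0 - restart) * (P @ r) + restart * e
    return r[:n1]


def normality_score(B, t, **kwargs):
    """Mean off-diagonal cosine similarity of the relevance vectors of S_t."""
    S_t = np.flatnonzero(B[:, t])
    R = np.array([relevance_vector(B, a, **kwargs) for a in S_t])
    R = R / np.linalg.norm(R, axis=1, keepdims=True)
    sim = R @ R.T
    k = len(S_t)
    return (sim.sum() - np.trace(sim)) / (k * (k - 1))


# Two communities (a0-a2 with t0-t2, a3-a5 with t3-t5) plus one node t6
# that links members of both communities.
B = np.zeros((6, 7))
B[:3, :3] = 1
B[3:, 3:6] = 1
B[0, 6] = B[3, 6] = 1
within = normality_score(B, t=0)
across = normality_score(B, t=6)
```

A node in $V_2$ whose neighbours come from unrelated parts of $V_1$ gets a low normality score, marking its links as candidates for inspection.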


Detecting Disease Outbreaks

Domain Question
  • Early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department visits
  • Grocery data (Example [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data
      • ED cases involving cough $\Rightarrow$ Flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

  1. Learn Bayesian network from historical data
       • Environmental and response variables
  2. Sample from the BN ($DB_{baseline}$ | Current Environment)
  3. Compute contingency table for rules for $DB_{baseline}$ and $DB_{current}$
       • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
  4. Find p-value for rules using the $\chi^2$-test
       • Null hypothesis: Rows and columns of tables for $DB_{baseline}$ and $DB_{current}$ are independent
  5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
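Step 4 for a single rule can be sketched with the standard library. The sketch assumes a 2x2 contingency table (rows: $DB_{current}$ and $DB_{baseline}$; columns: records that match the rule and records that do not), no continuity correction, and non-zero row and column totals. For one degree of freedom the $\chi^2$ tail probability equals $\mathrm{erfc}(\sqrt{\chi^2 / 2})$. The function name and counts are made up.

```python
import math


def chi2_p_value(table):
    """p-value of the chi-square independence test for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


# Rule matches 40 of 100 current records but only 10 of 100 baseline records.
unusual_rule = chi2_p_value([[40, 60], [10, 90]])
# Rule matches the same share of records in both data sets.
ordinary_rule = chi2_p_value([[50, 50], [50, 50]])
```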


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often has height dimension as well)

Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis done on “anomaly” time series
  • Difference from a “base period” (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI)
      • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter “reference period”
      • 30 year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
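Two of the base choices, monthly means and monthly z-scores, can be sketched in NumPy. The sketch assumes a single location with a monthly series that starts in January and covers whole years; the function name and the synthetic series are hypothetical.

```python
import numpy as np


def monthly_anomalies(v, zscore=False):
    """v'_i = v_i - b_i with b_i the long-term mean of the same calendar month.

    With zscore=True the anomaly is also divided by the monthly standard
    deviation (months with zero spread are left unscaled).
    """
    years = np.asarray(v, dtype=float).reshape(-1, 12)
    anomalies = years - years.mean(axis=0)
    if zscore:
        std = years.std(axis=0)
        std[std == 0] = 1.0
        anomalies = anomalies / std
    return anomalies.ravel()


# Three years of a purely seasonal signal with one warm month in year 3.
seasonal = 10.0 * np.sin(2 * np.pi * np.arange(12) / 12)
v = np.tile(seasonal, 3)
v[30] += 2.0
anom = monthly_anomalies(v)
```

The seasonal cycle disappears from `anom`, leaving only the departure at index 30 (and the matching small negative offsets in the same month of the other years, since the event is part of the base).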


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
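The two-step approach (threshold, then connected components across space and time) can be sketched in pure Python. This is only an illustration under stated assumptions: the data layout `precip[t][i][j]`, the function names, the 6-connectivity rule, and the toy values are all hypothetical. In practice a library routine such as Matlab's `bwlabeln` or SciPy's `scipy.ndimage.label` would normally do the labeling.

```python
from collections import deque

def label_components(mask):
    """Group True cells of mask[t][i][j] into connected components.

    Cells are neighbors if they differ by one step in time, row, or column
    (6-connectivity, an assumption of this sketch).
    """
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    for t in range(T):
        for i in range(R):
            for j in range(C):
                if not mask[t][i][j] or (t, i, j) in seen:
                    continue
                comp, queue = [], deque([(t, i, j)])
                seen.add((t, i, j))
                while queue:
                    a, b, c = queue.popleft()
                    comp.append((a, b, c))
                    for da, db, dc in steps:
                        n = (a + da, b + db, c + dc)
                        inside = 0 <= n[0] < T and 0 <= n[1] < R and 0 <= n[2] < C
                        if inside and mask[n[0]][n[1]][n[2]] and n not in seen:
                            seen.add(n)
                            queue.append(n)
                components.append(comp)
    return components

def find_droughts(precip, threshold, min_size):
    # Step 1: precipitation anomalies via a threshold.
    mask = [[[v < threshold for v in row] for row in grid] for grid in precip]
    # Step 2: keep only large components across space and time.
    return [c for c in label_components(mask) if len(c) >= min_size]

# Toy data: 3 time steps on a 3x3 grid, normal precipitation is 10.
precip = [[[10.0] * 3 for _ in range(3)] for _ in range(3)]
for cell in [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 0, 1), (0, 2, 2)]:
    precip[cell[0]][cell[1]][cell[2]] = 1.0
droughts = find_droughts(precip, threshold=5.0, min_size=3)
```

The isolated dry cell at (0, 2, 2) is dropped by the size filter, while the four cells that persist at the same location over time are returned as one event.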


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology
1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution:
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

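Using Latent Dirichlet Allocation for Group Anomalies

Find anomalous groups in data [34]

Example: Spatial clusters of galaxies
• Topics: red, green, emissive
• Words: continuous features

Flexible Genre Model (FGM)

For each group m:
1. Draw a genre $y_m \in \{1, 2, \ldots, T\}$, $y_m \sim \mathcal{M}(\pi)$
2. Draw topic distribution for $y_m$: $\theta_m \in S^K$, $\theta_m \sim \mathrm{Dir}(\alpha_{y_m})$
3. Draw K topics: $\beta_{m,k} \sim P(\beta_{m,k} \mid \eta_k)$, $k = 1, 2, \ldots, K$
4. For each point n in the group:
   1. Draw topic membership $z_{mn} \sim \mathcal{M}(\theta_m)$
   2. Generate point $x_{mn} \sim P(x_{mn} \mid \beta_{m, z_{mn}})$

[Figure: plate diagram of the FGM with variables $\alpha$, $\pi$, $\eta$, $y_m$, $\theta_m$, $z_{mn}$, $x_{mn}$, $\beta$ and plates N, M, K, T]

Model Parameters
• $\mathcal{M}(\pi)$: Multinomial
• Genre: $\mathrm{Dir}(\alpha_t)$
• Topic generators $P(\cdot \mid \eta_k)$: GIW
• Point generators $P(x_{mn} \mid \beta_{m,k})$: Multivariate Gaussian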


Inference and Testing for FGM

Inference and Learning Parameters
• Approximate inference of latent variables (Gibbs Sampling)
• Use samples to learn parameters (Single step Monte Carlo EM)

[Figure: panels Topic 1, Topic 2, Topic 3]

Anomaly Detection
• Infer the topic distribution $\theta_m$
• Compute negative log likelihood w.r.t. $\alpha_t$
• Rationale: An anomalous group will be unlikely to be generated from any genre
• Geometric interpretation: Mapping each group into a T-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists:
• Confusion matrix
• Traditional evaluation metrics
  • Class imbalance
• ROC Curve

Validation set does not exist:
• Use domain expertise to find TP and FP
• FN is harder to estimate
  • Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a    Predicted n
Actual a      TP             FN
Actual n      FP             TN

$\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$

$\mathrm{Rec}\ (R) = \frac{TP}{TP + FN}$

$\mathrm{Prec}\ (P) = \frac{TP}{TP + FP}$

$F = \frac{2 \cdot R \cdot P}{R + P}$
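A short worked example makes the class imbalance problem concrete. The sketch below computes the four metrics from confusion-matrix counts; the function name `metrics` and the counts (2 anomalies among 100 records) are made up for illustration.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return acc, rec, prec, f

# Hypothetical data set: 2 anomalies among 100 records.
# Detector A predicts "normal" for everything.
acc_a, rec_a, prec_a, f_a = metrics(tp=0, fn=2, fp=0, tn=98)
# Detector B finds one of the two anomalies and raises three false alarms.
acc_b, rec_b, prec_b, f_b = metrics(tp=1, fn=1, fp=3, tn=95)
```

Detector A reaches 98% accuracy without finding a single anomaly, while detector B has lower accuracy (96%) but non-zero recall (0.5) and F-measure (about 0.33). This is why accuracy alone is a poor metric under class imbalance.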


Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
• Use threshold δ on score (scale issues [21])
• Take top x as anomalies

ROC curve by varying x or δ

Quality of output:
• Does the output "suggest" x or δ?
• Which output is better?
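Sweeping the threshold δ over the observed scores yields one ROC point per threshold. The following is a minimal sketch, assuming higher scores mean "more anomalous" and that both classes are present; the function names and the toy scores are illustrative only.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs from sweeping a threshold delta over the scores.

    labels: 1 for anomaly, 0 for normal. A point is flagged if score >= delta.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    # Trapezoidal rule over consecutive ROC points.
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 0, 1, 0, 0, 0]
curve = roc_points(scores, labels)
auc = area_under_curve(curve)  # 0.875 for this toy input
```

The area under the curve gives one threshold-free way to answer "which output is better" when comparing two scoring methods on the same labeled data.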


Unifying Scores

Different methods assign scores in different ranges:
• kNN-based scores: [0, 1]
• LOF scores: [1, 6]
• ABOD scores: [0, 80000]
  • Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies.

Desired scaling: stretch interesting ranges and shrink irrelevant ones.

Generalized Procedure for Normalizing Outlier Scores [21]
• Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if o is normal, and $S(o) \gg 0$ if o is anomalous
• Normality: S is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

Regularization
1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{\max} - S(o)$
3. $R(o) = -\log \frac{S(o)}{S_{\max}}$

Normalization
1. $N(o) = \frac{S(o)}{S_{\max}}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
   • Suited for high dimensional data
3. $N(o) = \max\left(0,\ \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
   • Where $\mathrm{cdf}^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and P is the regularized Gamma function
   • Suited for low dimensional data
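The baseline regularization and the Gaussian scaling can be sketched with the standard library alone. The function names and the LOF-like toy scores are hypothetical, and the baseline of 1 for LOF-like scores is an assumption of this example. Gamma scaling is omitted because it needs the regularized Gamma function, which the standard library does not provide.

```python
import math
from statistics import mean, pstdev

def regularize_baseline(scores, base):
    """R(o) = max(0, S(o) - base)."""
    return [max(0.0, s - base) for s in scores]

def normalize_linear(scores):
    """N(o) = S(o) / S_max."""
    s_max = max(scores)
    return [s / s_max for s in scores]

def normalize_gaussian(scores):
    """N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

# LOF-like toy scores: values near 1 are normal, 6.0 is an outlier.
raw = [1.0, 1.1, 0.9, 1.05, 0.95, 6.0]
linear = normalize_linear(raw)
gaussian = normalize_gaussian(regularize_baseline(raw, base=1.0))
```

With linear scaling the normal points still receive scores around 0.17, whereas regularization followed by Gaussian scaling maps them to 0 and the outlier close to 1. This is the "stretch interesting ranges" behavior described above.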


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data
• Use generative models for normal and anomalous behavior
• Several generators available:
  • Multivariate continuous data [27]
  • Multivariate categorical data [5]
  • Discrete sequences using HMM [12]
• Drawbacks: Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]
• Given data point $z = (x, y)$, where x and y are partitions of the feature space
• Take a random sample D of the entire data set
• Let $z' = (x', y') \in D$ such that the distance between y and y' is maximum
• Replace x with x' and add z back to the data set

[Figure: points z and z' in the (x, y) feature space, with the injected anomaly marked]
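The random perturbation procedure can be sketched as follows. The representation of a point as a pair of tuples `(x, y)`, the Euclidean distance on y, and the function name `inject_anomaly` are assumptions made for this illustration, not details taken from [30].

```python
import random

def inject_anomaly(z, data, sample_size, rng):
    """Swap the x part of z with the x part of the sampled point whose y is farthest.

    z and every element of data is a pair (x, y) of equal-length float tuples.
    """
    x, y = z
    sample = rng.sample(data, min(sample_size, len(data)))

    def dist(u, v):
        # Euclidean distance (assumed here; any distance on y would do).
        return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

    x_far, _ = max(sample, key=lambda zp: dist(y, zp[1]))
    return (x_far, y)

# Toy data in which y is always twice x, so x and y are strongly related.
data = [((float(i),), (2.0 * i,)) for i in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data[0], data, sample_size=len(data), rng=rng)
```

The injected point ((9.0,), (0.0,)) keeps a perfectly ordinary y and a perfectly ordinary x, but the combination violates the relationship between the two partitions, which is what makes it anomalous.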


Applications Overview

How to set up an anomaly detection solution for a given application domain?
• Available data
• Define anomalies, define normal behavior
• Identify requirements and constraints (online, real-time, limited resources)
• What domain knowledge is available?
  • Feature identification
  • Defining normal and anomalous behavior
  • Tuning parameters
• Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

• Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks
• Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature; they also profile the normal behavior of the network and address suspected incidents
• Network Anomaly Detection: finding unusual and large changes in the network traffic
• Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart, vertical axis from 0 to 3,000,000; bar values by period listed below]

Period                   Value
2002                     20,547
2003                     18,827
2004                     69,107
2005                     113,025
2006                     140,690
2007                     624,267
2008                     1,656,227
July 2009 - June 2010    2,704,836


Network Topology

A typical network

[Figure: a typical network with routers numbered 1 to 4, OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$, and link measurement $y_{1t}$]

• Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)
• Link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)
• The relationship between link traffic and OD flow traffic is captured by the routing matrix A
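The role of the routing matrix can be illustrated with a small example. The relationship is commonly written as $y_t = A x_t$, where entry $A_{ij}$ is 1 if OD flow j traverses link i and 0 otherwise. The network below (3 links, 4 OD flows) and its traffic values are made up for illustration.

```python
def link_traffic(A, x):
    """Compute y = A x: the traffic on each link from the OD flow volumes."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

# Hypothetical routing matrix: rows are links, columns are OD flows.
A = [
    [1, 1, 0, 0],  # link 1 carries OD flows 1 and 2
    [0, 1, 1, 0],  # link 2 carries OD flows 2 and 3
    [0, 0, 1, 1],  # link 3 carries OD flows 3 and 4
]
x_t = [100, 250, 80, 40]     # OD flow volumes in one time interval
y_t = link_traffic(A, x_t)   # [350, 330, 120]
```

Each link measurement is a sum of several OD flows, so a change in one small flow can be masked by the larger flows sharing the same link.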


Network Anomalies Detection Problem


Why care about OD Flows?

• A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)
• If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as the columns of matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a residual vector is large, then it is an "anomaly"
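The five steps can be sketched with NumPy. This is an illustration under stated assumptions rather than the implementation from [22]: the data matrix stores one time step per row (so the projection is applied from the right), the synthetic traffic is generated from two hypothetical latent patterns, and the "mean plus three standard deviations" threshold is a simple stand-in for a formal test.

```python
import numpy as np

def residual_norms(Z, k):
    """Norm of each row's projection onto the abnormal (residual) subspace."""
    Zc = Z - Z.mean(axis=0)                   # center each link's time series
    cov = np.cov(Zc, rowvar=False)            # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]
    P = eigvecs[:, order[:k]]                 # Step 2: top-k principal components
    C = P @ P.T
    Z_hat = Zc @ C                            # Step 3: normal subspace
    Z_tilde = Zc - Z_hat                      # Step 4: residual, Zc (I - P P^T)
    return np.linalg.norm(Z_tilde, axis=1)    # Step 5: norm per time step

# Synthetic traffic: 200 time steps on 6 links driven by 2 latent patterns.
rng = np.random.default_rng(0)
mixing = np.array([[1, 1, 1, 1, 1, 1],
                   [1, -1, 1, -1, 1, -1]], dtype=float)
latent = rng.normal(scale=10.0, size=(200, 2))
Z = latent @ mixing + rng.normal(scale=0.05, size=(200, 6))
Z[120, 3] += 5.0                              # inject an anomaly on one link

norms = residual_norms(Z, k=2)
threshold = norms.mean() + 3 * norms.std()    # assumed rule of thumb
flagged = np.flatnonzero(norms > threshold)
```

The injected spike is small compared with the normal traffic variation (standard deviation around 10 per latent pattern), yet it dominates the residual subspace and is the only time step flagged.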


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

• Most anomalies induce a change in distributional aspects of packet header fields (called features)
• The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

[Figure (source: [22]): a port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure]

Port scan anomaly: traffic features change, however traffic volume remains more or less the same

We can use entropy to capture the variations in the traffic features:
• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed
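The two extreme values of the entropy are easy to check numerically. The sketch below computes the sample entropy of one traffic feature (destination port here); the function name and the two toy traffic windows are hypothetical.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Sample entropy H = sum_i p_i * log2(1 / p_i) of the observed values."""
    counts = Counter(values)
    n = len(values)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Two windows with the same volume (8 packets each).
normal_ports = [80, 80, 80, 80, 80, 80, 443, 443]   # traffic to a few services
scan_ports = list(range(1000, 1008))                # port scan: every port differs

h_normal = feature_entropy(normal_ports)
h_scan = feature_entropy(scan_ports)       # log2(8) = 3.0, maximally dispersed
h_single = feature_entropy([80] * 8)       # 0.0, maximally concentrated
```

Both windows contain the same number of packets, so a volume metric cannot separate them, while the destination port entropy jumps from about 0.81 to 3.0 during the scan.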


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
• Scan anomaly: a spike in traffic from a dominant source IP
• Flash crowd anomaly: again a spike in traffic to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: Identify fraudulent activities or players from observed transaction data

Data:
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges:
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domain                        Players
Insurance (auto, health [a])  Claimant, Provider, Payer
Telecommunications            Customer, Provider
Credit Cards                  Customer, Supplier, Bank
Web Advertising               User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: User X makes n calls on average in January
2. Compare current behavior with historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses:
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
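The activity monitoring idea (build a profile, then compare current behavior against it) can be sketched as a simple z-score test. The profile definition (mean and standard deviation of past January call counts), the threshold of 3, and the numbers are assumptions for illustration, not the method of [13].

```python
from statistics import mean, stdev

def build_profile(history):
    """Profile = (mean, standard deviation) of historic values; needs >= 2 values."""
    return mean(history), stdev(history)

def deviates(profile, current, z_threshold=3.0):
    """True if the current value deviates significantly from the profile."""
    mu, sigma = profile
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

# Hypothetical history: calls made by user X in January over five past years.
january_calls = [40, 45, 38, 42, 41]
profile = build_profile(january_calls)

alert_high = deviates(profile, 120)   # far outside the usual range
alert_usual = deviates(profile, 44)   # within normal variation
```

The check itself is a single comparison per user, which is why this style of detection is fast and easy to explain, but it requires one maintained profile per individual.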


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:
1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is thereby transformed into a point in a multi-dimensional feature space
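The first three egonet features can be computed directly from an edge list. In this sketch the graph is undirected and stored as a dictionary from `frozenset({u, v})` to edge weight; that representation, the function name, and the toy graph are assumptions for illustration. The principal eigenvalue feature is omitted to keep the example dependency-free.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    edges: dict mapping frozenset({u, v}) -> weight for an undirected graph.
    The egonet is the node, its neighbors, and all edges among them.
    """
    neighbors = {v for e in edges if node in e for v in e if v != node}
    ego = neighbors | {node}
    ego_edges = {e: w for e, w in edges.items() if e <= ego}
    return len(neighbors), len(ego_edges), sum(ego_edges.values())

# Toy graph: node "c" has four neighbors, two of which are also connected.
edges = {
    frozenset(("c", "l1")): 1.0,
    frozenset(("c", "l2")): 1.0,
    frozenset(("c", "l3")): 1.0,
    frozenset(("c", "l4")): 1.0,
    frozenset(("l1", "l2")): 2.5,
    frozenset(("l4", "x")): 7.0,   # leaves the egonet of "c", so not counted
}
n_c, e_c, w_c = egonet_features(edges, "c")
```

For node "c" this yields N = 4 neighbors, E = 5 egonet edges, and total weight W = 6.5.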


Identifying Anomalies

• Traditional anomaly detection (LOF)
  • Can be slow but can identify any type of anomalous structure
• Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and power law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detect near-cliques and stars

    $E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

• Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$):

  $S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
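The score can be computed once C and θ are known. The sketch below estimates them with an ordinary least-squares fit in log-log space, which is an assumption of this example; the function names and the synthetic feature pairs are likewise hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return math.exp(my - theta * mx), theta

def anomaly_scores(xs, ys):
    """S_i = max(y_i, C x_i^theta) / min(y_i, C x_i^theta) * log(|y_i - C x_i^theta| + 1)."""
    C, theta = fit_power_law(xs, ys)
    scores = []
    for x, y in zip(xs, ys):
        expected = C * x ** theta
        ratio = max(y, expected) / min(y, expected)
        scores.append(ratio * math.log(abs(y - expected) + 1))
    return scores

# Synthetic (N_i, E_i) pairs: twenty nodes follow E = N^1.5 exactly, and the
# last node is star-like (10 neighbors but only 10 egonet edges).
xs = [float(n) for n in range(2, 42, 2)] + [10.0]
ys = [n ** 1.5 for n in range(2, 42, 2)] + [10.0]
scores = anomaly_scores(xs, ys)
most_anomalous = scores.index(max(scores))
```

The score combines a ratio (how many times the node deviates from the fitted line) with a logarithmic distance term, so the star-like node receives by far the highest score.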


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (beneficiaries vs. providers)
  Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

RelevanceScore(a, b) $\propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  Normality score = mean of the non-diagonal entries of the similarity matrix

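[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by the edge set $E$]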
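The procedure can be sketched with a random walk with restart. The slide does not say how the walk is run or how the similarity matrix is filled, so the following are assumptions for illustration only: the restart probability of 0.15, the fixed number of iterations, and the choice to use the relevance of $b$ from $a$ as the similarity entry $(a, b)$. All function names are hypothetical.

```python
def build_adjacency(edges):
    """edges: iterable of (a, t) pairs with a in V1 and t in V2."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, []).append(t)
        adj.setdefault(t, []).append(a)
    return adj

def relevance_vector(adj, start, restart=0.15, iters=100):
    """Approximate stationary visit probabilities of a walk that restarts at `start`."""
    p = {n: 0.0 for n in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in p.items():
            share = (1 - restart) * mass / len(adj[n])
            for m in adj[n]:
                nxt[m] += share          # spread mass evenly over neighbors
        nxt[start] += restart            # teleport back to the query node
        p = nxt
    return p

def normality_score(adj, t):
    """Mean off-diagonal relevance among the V1 nodes linked to t."""
    s_t = adj[t]
    if len(s_t) < 2:
        raise ValueError("need at least two neighbors to compare")
    rel = {a: relevance_vector(adj, a) for a in s_t}
    off_diag = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diag) / len(off_diag)
```

A node $t$ whose neighbors belong to the same tightly linked group gets a high normality score; a node that bridges two otherwise unrelated groups gets a low one, which marks its links as candidates for inspection.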


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
  Anthrax attack

Data
  Emergency department (ED) visits
  Grocery data (example: [14])
  Clinical visits
  Weather/climate data

Challenges
  Weak signals in the data
    ED cases involving cough ⇒ flu or SARS
  Integration of multiple signals (lag analysis)
  Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
     Environmental and response variables

2. Sample from the BN ($DB_{baseline}$ | current environment)

3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
     Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions

4. Find the p-value for rules using the $\chi^2$-test
     Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent

5. Find the rule with the smallest p-value (the strongest evidence against independence). Repeat Step 2
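The scoring of a single rule can be illustrated with a 2x2 contingency table. This is a sketch only: the table layout (rows are $DB_{current}$ and $DB_{baseline}$, columns are records that match or do not match the rule) and the function name `chi2_p_value_2x2` are assumptions, and the p-value uses the closed form for one degree of freedom. It assumes every row and column total is non-zero.

```python
import math

def chi2_p_value_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (chi2 statistic, p-value) with 1 degree of freedom.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A table in which the rule matches the same fraction of records in both data sets gives a p-value of 1; the more the two fractions differ, the smaller the p-value.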


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Need more data to train


Anomaly Detection in Climate and Weather

Science questions
  Identify natural and anthropogenic disasters
  Identify long time-scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location $i$
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
    30-year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
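Two of the listed choices for the base, monthly means and monthly z-scores, can be sketched as follows. The function name `monthly_anomalies` and the data layout (one value per month for a single location, starting in the first month of a year, with at least one full year of data) are assumptions made for this example.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, use_zscore=False):
    """Subtract the per-calendar-month mean; optionally divide by the monthly std.

    values[t] is the observation for calendar month t % 12 (assumed layout).
    """
    by_month = [values[m::12] for m in range(12)]
    base = [mean(v) for v in by_month]        # b_i, one value per calendar month
    scale = [pstdev(v) for v in by_month]
    anomalies = []
    for t, v in enumerate(values):
        m = t % 12
        a = v - base[m]                       # v' = v - b
        if use_zscore:
            a = a / scale[m] if scale[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies
```

Because each calendar month has its own base, the seasonal cycle is removed: a series that repeats the same yearly pattern with a uniform shift in the second year yields a constant negative anomaly in the first year and a constant positive one in the second.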


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatio-temporal anomalies in precipitation data

A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
       Matlab: bwlabel/bwlabeln

Follow-up science question: explain the cause

Figure: (drought_1901_2006_1.avi) Video courtesy of Dr. Arindam Banerjee
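The two-step approach can be sketched without Matlab as a threshold followed by a breadth-first search over a (time, y, x) grid. This is an illustrative stand-in, not the authors' code: the function name `drought_components`, the rule that a cell is dry when its value is below the threshold, and the 6-connectivity (neighbors along time and the two spatial axes only) are all assumptions.

```python
from collections import deque

def drought_components(precip, threshold, min_size=1):
    """precip[t][y][x] holds precipitation anomalies on a regular grid.

    Step 1: flag cells below `threshold` as dry.
    Step 2: group dry cells into connected components across space and time.
    Returns the components (lists of (t, y, x)) with at least `min_size` cells.
    """
    T, H, W = len(precip), len(precip[0]), len(precip[0][0])
    dry = {(t, y, x)
           for t in range(T) for y in range(H) for x in range(W)
           if precip[t][y][x] < threshold}
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    for cell in sorted(dry):
        if cell in seen:
            continue
        seen.add(cell)
        queue, comp = deque([cell]), []
        while queue:
            t, y, x = queue.popleft()
            comp.append((t, y, x))
            for dt, dy, dx in steps:
                nb = (t + dt, y + dy, x + dx)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(comp) >= min_size:
            components.append(comp)
    return components
```

Raising `min_size` keeps only the large, persistent components, which is how isolated dry cells are separated from drought-scale events in this sketch.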


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water vapor content
  Anomalies using a threshold
  Connected components
  Example. Src: http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields
  Methodology
    1. Compute features for every grid
    2. Cluster grids into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Inference and Testing for FGM

Inference and learning parameters
  Approximate inference of latent variables (Gibbs sampling)
  Use samples to learn parameters (single-step Monte Carlo EM)

[Figure: Topic 1, Topic 2, Topic 3]

Anomaly detection
  Infer the topic distribution $\theta_m$
  Compute the negative log likelihood with respect to $\alpha_t$
  Rationale: an anomalous group will be unlikely to be generated from any genre
  Geometric interpretation: mapping each group into a $T$-dimensional space and finding anomalies


Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
  Confusion matrix
  Traditional evaluation metrics
    Class imbalance
  ROC curve

Validation set does not exist
  Use domain expertise to find TP and FP
  FN is harder to estimate
    Pseudo false negative estimation techniques [25]

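Confusion matrix (a = anomalous, n = normal):

              Predicted a    Predicted n
  Actual a        TP             FN
  Actual n        FP             TN

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\mathrm{Rec}\;(R) = \frac{TP}{TP + FN}$$

$$\mathrm{Prec}\;(P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$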
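The four metrics follow directly from the confusion matrix counts. The sketch below uses a hypothetical function name, `evaluation_metrics`, and returns 0 for any ratio whose denominator is zero, which is a convention chosen here rather than something stated in the tutorial.

```python
def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, and F-measure from confusion matrix counts."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total
    rec = tp / (tp + fn) if tp + fn else 0.0
    prec = tp / (tp + fp) if tp + fp else 0.0
    f = 2 * rec * prec / (rec + prec) if rec + prec else 0.0
    return {"accuracy": acc, "recall": rec, "precision": prec, "f_measure": f}
```

With 10 true anomalies among 1000 points (TP = 5, FN = 5, FP = 15, TN = 975), accuracy is 0.98 while recall is 0.5 and precision is 0.25. This is the class imbalance problem: accuracy is dominated by the normal class and says little about how well anomalies are found.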


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
  Use threshold $\delta$ on score (scale issues [21])
  Take top $x$ as anomalies

ROC curve by varying $x$ or $\delta$

Quality of output
  Does the output "suggest" $x$ or $\delta$?
  Which output is better?
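Sweeping the threshold $\delta$ over the observed scores traces the ROC curve. The following sketch assumes that a higher score means more anomalous and that the labeled set contains at least one anomaly and one normal point; the function names `roc_points` and `auc` are hypothetical.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    labels[i] is True for a true anomaly; a point is flagged when score >= delta.
    """
    pos = sum(1 for l in labels if l)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= delta and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= delta and not l)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

The area under the curve gives a single number for comparing two score outputs without committing to a particular $x$ or $\delta$.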


Unifying Scores

Different methods assign scores in different ranges
  kNN-based scores: [0, 1]
  LOF scores: [1, 6]
  ABOD scores: [0, 80000]
    Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized procedure for normalizing outlier scores [21]
  Regularity: $S(o) \ge 0 \;\; \forall o$, with $S(o) \approx 0$ if $o$ is normal and $S(o) \gg 0$ if $o$ is anomalous
  Normality: $S$ is regular and $S(o) \in [0, 1] \;\; \forall o$


Regularization and Normalization of Scores

Regularization
  1. $R(o) = \max\{0,\; S(o) - base_S\}$
  2. $R(o) = S_{max} - S(o)$
  3. $R(o) = -\log\left(S(o) / S_{max}\right)$

Normalization
  1. $N(o) = S(o) / S_{max}$
  2. $N(o) = \max\left(0,\; \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
       Suited for high dimensional data
  3. $N(o) = \max\left(0,\; \frac{cdf^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)
       where $cdf^{\gamma}_S(o) = P(k,\; S(o)/\theta)$ and $P$ is the regularized Gamma function
       Suited for low dimensional data
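Gaussian scaling needs only the mean and standard deviation of the scores and the error function. The sketch below estimates $\mu_S$ and $\sigma_S$ from the scores themselves (population standard deviation), which is an assumption of this example, as is the function name `gaussian_scaling`. Gamma scaling is omitted because the regularized Gamma function is not available in the Python standard library.

```python
import math
from statistics import mean, pstdev

def gaussian_scaling(scores):
    """Map regular outlier scores to [0, 1] using max(0, erf((S - mu) / (sigma * sqrt(2))))."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]      # all scores equal: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

Scores at or below the mean are mapped to 0, so the normal range is shrunk, while scores far above the mean are stretched toward 1.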


Generating Labeled Data for Validation

Generating both normal and anomalous data
  Use generative models for normal and anomalous behavior
  Several generators available
    Multivariate continuous data [27]
    Multivariate categorical data [5]
    Discrete sequences using HMM [12]
  Drawbacks: might not capture the domain characteristics

Injecting anomalies: random perturbation [30]
  Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
  Take a random sample $D$ of the entire data set
  Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
  Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ plotted on $x$ and $y$ axes, with the injected anomaly marked]
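The perturbation step can be sketched as below. The code follows the steps as listed on this slide; the function name `inject_anomaly`, the representation of each point as a pair of numeric tuples, and the use of Euclidean distance are assumptions for illustration.

```python
import math
import random

def inject_anomaly(data, index, sample_size, rng=random):
    """Build an injected anomaly from data[index] by swapping in a mismatched x part.

    data: list of (x, y) pairs, where x and y are tuples of numbers.
    Returns the perturbed point (x', y); the caller adds it back to the data set.
    """
    _, y = data[index]
    sample = rng.sample(data, sample_size)            # random sample D
    # z' = (x', y') in D whose y' is farthest from y
    x_far, _ = max(sample, key=lambda z: math.dist(y, z[1]))
    return (x_far, y)
```

Each half of the result is individually realistic because it comes from a real record; only the combination is unusual, which is what makes the injected point a useful test case.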


Applications Overview

How to set up an anomaly detection solution for a given application domain?
  Available data
  Define anomalies, define normal behavior
  Identify requirements and constraints (online, real-time, limited resources)
  What domain knowledge is available?
    Feature identification
    Defining normal and anomalous behavior
    Tuning parameters
  Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly detection or intrusion detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature; in addition, they profile the normal behavior of the network and address suspected incidents

Network anomaly detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of counts per period. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009-June 2010: 2,704,836]


Network Topology

A typical network

[Figure: routers 1 to 4 with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$]

Origin-Destination (OD) flow: the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)

Link measurement: the traffic that enters at a node during an interval ($y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
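A small worked example may help. It assumes the common formulation in which link traffic is $y = Ax$, with $A_{ij} = 1$ when OD flow $j$ is routed over link $i$ and 0 otherwise; the three-link, four-flow matrix and the function name `link_traffic` are made up for illustration.

```python
def link_traffic(A, x):
    """Compute link loads y = A x from OD flow volumes x and routing matrix A."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

# Hypothetical routing: rows are links, columns are OD flows.
A = [
    [1, 1, 0, 0],   # link 1 carries OD flows 1 and 2
    [0, 1, 1, 0],   # link 2 carries OD flows 2 and 3
    [0, 0, 1, 1],   # link 3 carries OD flows 3 and 4
]
x = [10, 20, 30, 40]       # OD flow volumes in one time interval
y = link_traffic(A, x)     # link measurements for the same interval
```

Because each link sums several OD flows, a change in a single OD flow shows up only as a smaller relative change on every link that carries it.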


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$

$$\hat{Z} = PP^{T}Z = CZ$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$

$$\tilde{Z} = (I - PP^{T})Z = \tilde{C}Z$$

Step 5: If the norm of a residual vector is large, then it is an "anomaly"
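The five steps can be sketched in plain Python for the simplest case, $k = 1$. Several things here are assumptions rather than part of the method as presented: the top principal component is found by power iteration instead of a full eigenvalue decomposition, the data is mean-centered first, and the function names are hypothetical. The sketch assumes the data is not all zeros and that the starting vector is not orthogonal to the leading component.

```python
import math

def top_component(rows, iters=200):
    """Leading eigenvector of X^T X for mean-centered rows X, by power iteration."""
    d = len(rows[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        proj = [sum(r[j] * v[j] for j in range(d)) for r in rows]              # X v
        w = [sum(p * r[j] for p, r in zip(proj, rows)) for j in range(d)]      # X^T (X v)
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    return v

def residual_norms(data):
    """Norm of each row after removing its projection on the top component (k = 1)."""
    d = len(data[0])
    means = [sum(r[j] for r in data) / len(data) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in data]
    p = top_component(centered)                                  # Steps 1 and 2
    norms = []
    for z in centered:
        coef = sum(z[j] * p[j] for j in range(d))                # P^T z
        resid = [z[j] - coef * p[j] for j in range(d)]           # Step 4: (I - P P^T) z
        norms.append(math.sqrt(sum(c * c for c in resid)))       # Step 5
    return norms
```

Rows that follow the dominant traffic pattern have a residual norm close to zero, while a row that breaks the pattern keeps most of its magnitude in the residual and can be flagged with a threshold.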


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus, anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure: Port scan anomaly. The traffic features change, but the traffic volume remains more or less the same: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure. Figure source: [22]

We can use entropy to capture the variations in the traffic features:

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed
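The two extreme values of entropy can be checked with a minimal sketch. The function name `feature_entropy` and the sample IP addresses are illustrative assumptions, not part of the tutorial.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical Shannon entropy (in bits) of one traffic feature,
    e.g. the destination IPs observed in a time bin."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical samples: every packet goes to one address vs. N = 8 distinct addresses.
concentrated = ["10.0.0.1"] * 8
dispersed = [f"10.0.0.{i}" for i in range(8)]

h_concentrated = feature_entropy(concentrated)  # maximally concentrated
h_dispersed = feature_entropy(dispersed)        # maximally dispersed, log2(8) bits
```

A port scan that barely changes the packet count can still move the entropy of the destination-port feature from near 0 toward $\log_2 N$, which is why the entropy series exposes it.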


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
• Scan anomaly: a spike in traffic from a dominant source IP
• Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domains and players
• Insurance (auto, health^a): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

^a http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   » E.g., user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
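The two activity-monitoring steps can be sketched as follows. This is only one possible reading: the profile here is a mean and standard deviation, and the z-score rule, the threshold of 3, and the call counts are all assumptions made for illustration, not the method of [13].

```python
import statistics

def build_profile(history):
    """Step 1: summarize historic behavior (e.g. calls made each January)."""
    return statistics.mean(history), statistics.stdev(history)

def is_significant_deviation(current, profile, z_threshold=3.0):
    """Step 2: flag current behavior that lies far from the profile.
    The z-score test and its threshold are illustrative assumptions."""
    mean, std = profile
    if std == 0:
        return current != mean
    return abs(current - mean) / std > z_threshold

# Hypothetical history for user X: number of calls in past Januaries.
profile_x = build_profile([48, 52, 50, 49, 51])
flag_normal = is_significant_deviation(51, profile_x)
flag_fraud = is_significant_deviation(120, profile_x)
```

The sketch also makes the listed weaknesses visible: one profile per user must be stored, and a short history gives an unreliable standard deviation.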


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node $i$
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional feature space


Identifying Anomalies

• Traditional anomaly detection (LOF)
  » Can be slow, but can identify any type of anomalous structure
• Faster methods identify specific types of anomalous structures
  » Identify relevant feature pairs and their power-law relationship
  » E.g., Egonet Density Power Law ($N_i$ vs. $E_i$): detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

• Anomaly score for node $i$ with respect to a pair of features following $y = C x^{\theta}$:

$$S_i = \frac{\max(y_i,\, C x_i^{\theta})}{\min(y_i,\, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
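The score can be computed directly once $C$ and $\theta$ are known. The sketch below fits them by ordinary least squares in log-log space, which is an assumption about the fitting step; the function names and the toy feature values are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    theta = sxy / sxx
    return math.exp(my - theta * mx), theta

def anomaly_score(x, y, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Toy (N_i, E_i) pairs lying exactly on E = 2 N.
C, theta = fit_power_law([1, 2, 4, 8], [2, 4, 8, 16])
score_on_line = anomaly_score(10, 20, 2.0, 1.0)   # node that follows the law
score_clique = anomaly_score(10, 45, 2.0, 1.0)    # far more edges than expected
```

A node on the fitted line scores 0, and the score grows with both the ratio and the absolute distance to the line, so near-cliques (too many edges) and stars (too few) both receive high scores.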


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
• RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
• Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$.
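A minimal sketch of the relevance and normality computation follows. It assumes the random walk is a random walk with restart (restart probability 0.15) and that the similarity entry for a pair $(a, b)$ is simply the relevance of $b$ to $a$; both choices, along with the function names and the toy graph, are assumptions made for illustration.

```python
def relevance_scores(edges, query, restart=0.15, iters=200):
    """Steady-state visit probabilities of a random walk with restart
    that starts at V1 node `query`; returns scores for V1 nodes only."""
    adj = {}
    for a, t in edges:
        adj.setdefault(("V1", a), []).append(("V2", t))
        adj.setdefault(("V2", t), []).append(("V1", a))
    start = ("V1", query)
    p = {node: 0.0 for node in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in adj}
        for node, mass in p.items():
            share = (1 - restart) * mass / len(adj[node])
            for neighbor in adj[node]:
                nxt[neighbor] += share
        nxt[start] += restart  # jump back to the query node
        p = nxt
    return {name: s for (side, name), s in p.items() if side == "V1"}

def normality_score(edges, t):
    """Mean of the off-diagonal relevance entries among the neighbors of t."""
    s_t = sorted({a for a, u in edges if u == t})
    rel = {a: relevance_scores(edges, a) for a in s_t}
    off_diag = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diag) / len(off_diag)

# Toy graph: two dense communities plus one node t7 that bridges them.
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2", "t3")]
edges += [(a, t) for a in ("a4", "a5", "a6") for t in ("t4", "t5", "t6")]
edges += [("a1", "t7"), ("a4", "t7")]
normal_t1 = normality_score(edges, "t1")
normal_t7 = normality_score(edges, "t7")
```

The bridging node `t7` links two nodes that are otherwise unrelated, so its normality score is lower than that of `t1`, whose neighbors all belong to one community.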


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
• E.g., an anthrax attack

Data
• Emergency department (ED) visits
• Grocery data (example: [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data
  » ED cases involving cough ⇒ flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   » Environmental and response variables
2. Sample from the BN to obtain $DB_{baseline}$, conditioned on the current environment
3. Compute contingency tables for rules on $DB_{baseline}$ and $DB_{current}$
   » Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value of each rule using the $\chi^2$ test
   » Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant difference); repeat from Step 2
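Steps 3 to 5 can be sketched for single rules with a Pearson $\chi^2$ test on a 2x2 table (one degree of freedom, no continuity correction). The table layout, the rule names, and the counts are hypothetical; the sketch picks the rule with the smallest p-value, since a small p-value is what signals a significant difference between the two data sets.

```python
import math

def chi_square_p_value(table):
    """p-value of the Pearson chi-square test for a 2x2 contingency table.
    Rows: DB_current, DB_baseline. Columns: rule matches, rule does not match."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

def most_significant_rule(rule_tables):
    """Return (rule, p_value) for the rule with the smallest p-value."""
    scored = {rule: chi_square_p_value(t) for rule, t in rule_tables.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]

# Hypothetical counts: 100 current records, 1000 baseline records.
rule_tables = {
    "symptom = cough": [[40, 60], [200, 800]],  # 40% today vs. 20% baseline
    "gender = female": [[20, 80], [200, 800]],  # same proportion in both
}
best_rule, best_p = most_significant_rule(rule_tables)
```

With small cell counts an exact test would be more appropriate than the $\chi^2$ approximation; the sketch ignores this.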


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Needs more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference).

How to choose $b_i$?
• Mean of all data for location $i$
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  » 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
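As a minimal sketch of $v'_i = v_i - b_i$ with the monthly-mean choice of base, the function below subtracts from each observation the mean of all observations taken in the same calendar month. The function name and the toy temperature values are illustrative assumptions.

```python
def monthly_anomalies(values, months):
    """Anomaly series v' = v - b, where b is the mean of all values
    observed in the same calendar month (removes seasonality)."""
    sums, counts = {}, {}
    for v, m in zip(values, months):
        sums[m] = sums.get(m, 0.0) + v
        counts[m] = counts.get(m, 0) + 1
    base = {m: sums[m] / counts[m] for m in sums}
    return [v - base[m] for v, m in zip(values, months)]

# Hypothetical temperatures at one location: January and July of two years.
values = [10.0, 20.0, 12.0, 22.0]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

Replacing the mean with a median, or restricting the base to a reference period, only changes how `base` is computed, which is why the different strategies are easy to compare on the same data.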


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   » Matlab: bwlabel, bwlabeln

Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
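The two-step approach can be sketched in plain Python on a small (time, latitude, longitude) grid. The sketch uses 6-connectivity (neighbors differ by one step along a single axis), which is an assumption and not necessarily the connectivity used in the tutorial's Matlab workflow; the threshold and grid values are hypothetical.

```python
from collections import deque

def drought_mask(anomalies, threshold):
    """Step 1: mark cells whose precipitation anomaly is below the threshold."""
    return [[[v < threshold for v in row] for row in grid] for grid in anomalies]

def connected_components(mask):
    """Step 2: group marked cells that touch in time or space (6-connectivity)."""
    T, H, W = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for t in range(T):
        for i in range(H):
            for j in range(W):
                if not mask[t][i][j] or (t, i, j) in seen:
                    continue
                seen.add((t, i, j))
                queue, comp = deque([(t, i, j)]), []
                while queue:  # breadth-first flood fill
                    ct, ci, cj = queue.popleft()
                    comp.append((ct, ci, cj))
                    for dt, di, dj in steps:
                        nt, ni, nj = ct + dt, ci + di, cj + dj
                        if (0 <= nt < T and 0 <= ni < H and 0 <= nj < W
                                and mask[nt][ni][nj] and (nt, ni, nj) not in seen):
                            seen.add((nt, ni, nj))
                            queue.append((nt, ni, nj))
                components.append(comp)
    return components

# Hypothetical anomalies: 2 time steps on a 2 x 3 grid.
anoms = [[[-2, -2, 0], [0, 0, 0]],
         [[0, -2, 0], [0, 0, -2]]]
components = connected_components(drought_mask(anoms, threshold=-1))
sizes = sorted(len(c) for c in components)
```

Keeping only components above a minimum size then leaves the persistent, spatially coherent dry regions as drought candidates.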


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields
• Methodology:
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  » Not identified in a timely fashion
  » Resolution is not fine enough
  » Cause not explained
• Anomalies are lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and its distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Evaluating Anomaly Detection Methods - Labels

Labeled validation data set exists
• Confusion matrix
• Traditional evaluation metrics
  » Class imbalance
• ROC curve

Validation set does not exist
• Use domain expertise to find TP and FP
• FN is harder to estimate
  » Pseudo false negative estimation techniques [25]

Confusion matrix (a = anomalous, n = normal):

              Predicted a   Predicted n
Actual a      TP            FN
Actual n      FP            TN

$$\mathrm{Acc} = \frac{TP + TN}{TP + FN + FP + TN}$$

$$\mathrm{Rec}\ (R) = \frac{TP}{TP + FN}$$

$$\mathrm{Prec}\ (P) = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot R \cdot P}{R + P}$$
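The four metrics follow directly from the confusion matrix, as in this minimal sketch. The label encoding (1 = anomalous, 0 = normal), the function names, and the example labels are illustrative assumptions.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) with 1 = anomalous and 0 = normal."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fn, fp, tn

def evaluation_metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision and F-measure from the confusion matrix."""
    acc = (tp + tn) / (tp + fn + fp + tn)
    rec = tp / (tp + fn)
    prec = tp / (tp + fp)
    f = 2 * rec * prec / (rec + prec)
    return acc, rec, prec, f

# Hypothetical validation set: 3 anomalies among 10 records.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
counts = confusion_counts(actual, predicted)
acc, rec, prec, f = evaluation_metrics(*counts)
```

Under class imbalance accuracy is misleading: labeling every record as normal in this example would still score 0.7 while finding no anomalies, which is why recall, precision, and F are reported alongside it.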


Evaluating Anomaly Detection Methods - Scores

Convert to binary output
• Use a threshold δ on the score (scale issues [21])
• Take the top x as anomalies

ROC curve by varying x or δ

Quality of output
• Does the output "suggest" x or δ?
• Which output is better?
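Sweeping the cut-off over a ranked list of scores traces the ROC curve, as the sketch below shows. It assumes higher scores mean more anomalous and that all scores are distinct; the area is computed with the trapezoid rule. Function names and example scores are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by lowering the threshold one record at a time.
    labels: 1 = anomalous, 0 = normal. Assumes distinct scores."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

Each point corresponds to taking the top x records as anomalies, so varying x and varying δ yield the same curve.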


Unifying Scores

Different methods assign scores in different ranges
• kNN-based scores: [0, 1]
• LOF scores: [1, 6]
• ABOD scores: [0, 80000]
  » Anomalies have lower scores

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]
• Regularity: $S(o) \ge 0\ \forall o$, $S(o) \approx 0$ if $o$ is normal, and $S(o) \gg 0$ if $o$ is anomalous
• Normality: $S$ is regular and $S(o) \in [0, 1]\ \forall o$


Regularization and Normalization of Scores

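Regularization

1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\left(\dfrac{S(o)}{S_{max}}\right)$

Normalization

1. $N(o) = \dfrac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0,\ \mathrm{erf}\left(\dfrac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
   » Suited for high dimensional data
3. $N(o) = \max\left(0,\ \dfrac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
   » Where $\mathrm{cdf}^{\gamma}_S(o) = P(k,\ S(o)/\theta)$ and $P$ is the regularized Gamma function
   » Suited for low dimensional data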
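A few of these transformations are simple enough to sketch with the standard library. The functions below cover the logarithmic inversion (regularization 3) and Gaussian scaling (normalization 2); estimating $\mu_S$ and $\sigma_S$ from the score sample itself, and the example scores, are assumptions made for illustration.

```python
import math

def log_inversion(scores):
    """Regularization 3: R(o) = -log(S(o) / S_max), for positive scores
    where LOW raw values indicate anomalies (as with ABOD)."""
    s_max = max(scores)
    return [-math.log(s / s_max) for s in scores]

def gaussian_scaling(scores):
    """Normalization 2: N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    if sigma == 0:
        return [0.0] * n  # all scores identical: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]

# Hypothetical LOF-like scores: four normal points and one outlier.
lof_scores = [1.0, 1.1, 0.9, 1.0, 6.0]
normalized = gaussian_scaling(lof_scores)

# Hypothetical ABOD-like scores: the smallest value is the most anomalous.
regularized = log_inversion([1.0, 10.0, 100.0])
```

Gaussian scaling maps everything at or below the mean to 0 and stretches the upper tail toward 1, which is the "stretch interesting ranges, shrink irrelevant ones" behavior asked for above.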


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

• Use generative models for normal and anomalous behavior
• Several generators available
  » Multivariate continuous data [27]
  » Multivariate categorical data [5]
  » Discrete sequences using HMM [12]
• Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

• Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
• Take a random sample $D$ of the entire data set
• Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
• Replace $x$ with $x'$ and add $z$ back to the data set

Figure: points $z$ and $z'$ in the $(x, y)$ feature space, with the injected anomaly marked.
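The random perturbation steps can be sketched as follows, reading the slide literally: the injected point keeps $y$ from $z$ and takes $x'$ from the sampled point whose $y'$ is farthest from $y$. Euclidean distance, the function name, and the toy data are assumptions; `math.dist` requires Python 3.8 or later.

```python
import math
import random

def inject_anomaly(data, z_index, sample_size, rng):
    """Build an injected anomaly from data[z_index] = (x, y).
    data: list of (x, y) pairs, each a tuple of floats."""
    x, y = data[z_index]
    sample = rng.sample(data, sample_size)  # random sample D
    # z' = (x', y') in D whose y' is farthest from y (Euclidean distance assumed)
    x_far, y_far = max(sample, key=lambda point: math.dist(point[1], y))
    return (x_far, y)  # x replaced by x', y kept

# Toy data in which y always equals x, so any mismatched pair is anomalous.
data = [((float(i),), (float(i),)) for i in range(10)]
rng = random.Random(0)
injected = inject_anomaly(data, z_index=0, sample_size=len(data), rng=rng)
```

Because each partition of the injected point is taken from real data, the point looks normal in $x$ alone and in $y$ alone; only the combination is unusual.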


Applications Overview

How to set up an anomaly detection solution for a given application domain?

• Available data
• Define anomalies, define normal behavior
• Identify requirements and constraints (online, real-time, limited resources)
• What domain knowledge is available?
  » Feature identification
  » Defining normal and anomalous behavior
  » Tuning parameters
• Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that correspond to a predefined set of patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

Figure: bar chart of yearly counts (vertical axis from 0 to 3000000). 2002: 20547; 2003: 18827; 2004: 69107; 2005: 113025; 2006: 140690; 2007: 624267; 2008: 1656227; July 2009 - June 2010: 2704836.


Network Topology

A typical network

Figure: a router connecting nodes 1 to 4, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$.

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters at a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$


Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).

If we only monitor traffic on network links, the volume arising from a single OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available.

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset.

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P.

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$$\hat{Z} = P P^T Z = C Z$$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$$

Step 5: If the norm of a residual vector is large, then it is an "anomaly".
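The five steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation from [22]: the function name `subspace_residual_norms`, the synthetic traffic matrix, and the choice k = 1 are all made up for the example. Rows of Z are time bins and columns are links, so the residual projector is applied on the right, and the data is centered before the decomposition.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Squared residual norm of each row of Z after removing the top-k PC projection."""
    Zc = Z - Z.mean(axis=0)                        # center each column (link)
    cov = np.cov(Zc, rowvar=False)                 # Step 1: covariance across links
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]            # Step 2: indices of the k largest
    P = eigvecs[:, top]                            # top-k PCs as columns
    C = P @ P.T                                    # Step 3: projector onto normal subspace
    residual = Zc - Zc @ C                         # Step 4: abnormal (residual) component
    return np.sum(residual ** 2, axis=1)           # Step 5: squared norm per time bin

# Hypothetical data: one shared traffic pattern on 4 links plus small noise.
rng = np.random.default_rng(0)
t = np.arange(200)
base = np.sin(2 * np.pi * t / 50)
Z = np.outer(base, [1.0, 2.0, 0.5, 1.5]) + 0.01 * rng.standard_normal((200, 4))
Z[120, 2] += 1.0                                   # inject a spike on one link

spe = subspace_residual_norms(Z, k=1)
flagged = int(np.argmax(spe))                      # time bin with the largest residual
```

In practice a threshold on the residual norm, rather than a single argmax, would decide which time bins are reported.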


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably.

Thus anomalies can be detected by setting a threshold on the residual component.

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features).

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

Port scan anomaly: the traffic feature distribution changes, but the traffic volume remains more or less the same. The port scan is dwarfed in volume metrics, but it stands out in feature entropy, which also reveals its structure.

We can use entropy to capture the variations in the traffic features:

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure source: [22]
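The two extreme values of entropy can be checked with a few lines of Python. This is an illustrative sketch: the function name `sample_entropy` and the example addresses are assumptions, and N is taken to be the number of distinct feature values observed.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy (in bits) of the distribution of observed feature values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination IPs seen in one time bin.
concentrated = ["10.0.0.1"] * 8                    # all packets hit one destination
dispersed = [f"10.0.0.{i}" for i in range(8)]      # 8 distinct destinations, one packet each

h_concentrated = sample_entropy(concentrated)      # maximally concentrated
h_dispersed = sample_entropy(dispersed)            # maximally dispersed, N = 8
```

A port scan moves the destination-port entropy toward the dispersed end even when the packet count barely changes.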


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data.

Data:
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges:
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domains and players:
• Insurance (auto/health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data. Example: user X makes n calls on average in January.
2. Compare current behavior with the historical profile for significant deviations.

Clustering based [4]
1. Cluster historical profiles of customers.
2. Identify small clusters or outlying profiles as anomalies.

Strengths:
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses:
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
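As a rough illustration of the activity monitoring idea, the sketch below compares a current observation with a per-user historical profile. The z-score test, the threshold of 3, and the call counts are assumptions chosen for the example; [13] does not prescribe this particular rule.

```python
from statistics import mean, stdev

def deviates_from_profile(history, current, z_threshold=3.0):
    """Flag `current` if it lies more than z_threshold sample std devs from the profile mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu                       # degenerate profile: any change is a deviation
    return abs(current - mu) / sigma > z_threshold

# Hypothetical profile: calls made by user X in past Januaries.
january_calls = [48, 52, 50, 47, 53]

normal_month = deviates_from_profile(january_calls, 51)
suspicious_month = deviates_from_profile(january_calls, 120)
```

One such profile per user and per period is what makes the approach fast at detection time but costly to maintain.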


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: identify anomalous nodes.

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1].

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node i
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space.
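The first three egonet features can be computed directly from an adjacency structure. The sketch below assumes an undirected graph stored as a dict of dicts (a hypothetical representation, not one taken from [1]) and leaves out the eigenvalue feature for brevity.

```python
from itertools import combinations

def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    adj maps each node to a dict {neighbor: weight}; the graph is assumed
    undirected, so every edge appears in both directions.
    """
    members = {node} | set(adj[node])              # the node plus its neighbors
    n_i = len(adj[node])                           # degree
    e_i, w_i = 0, 0.0
    for u, v in combinations(sorted(members), 2):  # each unordered pair once
        if v in adj[u]:
            e_i += 1
            w_i += adj[u][v]
    return n_i, e_i, w_i

# Hypothetical graphs: a star around "h" and a 4-node clique with edge weight 2.
star = {"h": {"a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0},
        "a": {"h": 1.0}, "b": {"h": 1.0}, "c": {"h": 1.0}, "d": {"h": 1.0}}
clique = {u: {v: 2.0 for v in "pqrs" if v != u} for u in "pqrs"}

star_features = egonet_features(star, "h")
clique_features = egonet_features(clique, "p")
```

A star has as many egonet edges as neighbors, while a clique has far more, which is the contrast the feature pairs are meant to expose.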


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure.

Faster method to identify specific types of anomalous structures:

Identify relevant feature pairs and their power-law relationship.

E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
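The score can be evaluated directly once C and θ are known. In the sketch below the function name `power_law_score` and the fitted values C = 1, θ = 1 are assumptions for illustration; fitting the power law (for example by regression in log-log space) is not shown.

```python
import math

def power_law_score(x, y, C, theta):
    """Anomaly score of a point (x, y) against the fitted power law y = C * x**theta.

    Assumes y and the expected value are both positive.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)    # how many times off the fit
    return ratio * math.log(abs(y - expected) + 1) # scaled by the log of the absolute gap

# Hypothetical fit E = N (C = 1, theta = 1), i.e. star-like egonets are the norm.
on_the_line = power_law_score(x=4, y=4, C=1.0, theta=1.0)    # star: N = 4, E = 4
near_clique = power_law_score(x=4, y=10, C=1.0, theta=1.0)   # clique: N = 4, E = 10
```

A node that lies on the fitted line scores zero, and the score grows with both the ratio and the absolute distance from the fit.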


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
• Healthcare data (beneficiaries vs. providers)
• Insider trading (traders vs. stocks)

Objective: identify anomalous links.

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a:

RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$:
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
• Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges E]
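The normality score can be illustrated on a small graph. The sketch below makes a simplifying assumption: relevance is the exact probability that a two-step walk a → t → b ends at b, which stands in for the longer random walks described above. All function names and the toy edge list are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """edges: iterable of (a, t) pairs with a in V1 and t in V2."""
    v1_to_v2, v2_to_v1 = defaultdict(list), defaultdict(list)
    for a, t in edges:
        v1_to_v2[a].append(t)
        v2_to_v1[t].append(a)
    return v1_to_v2, v2_to_v1

def relevance(v1_to_v2, v2_to_v1, a):
    """Probability that a two-step walk a -> t -> b ends at each b in V1."""
    rel = defaultdict(float)
    for t in v1_to_v2[a]:
        for b in v2_to_v1[t]:
            rel[b] += 1.0 / (len(v1_to_v2[a]) * len(v2_to_v1[t]))
    return rel

def normality(v1_to_v2, v2_to_v1, t):
    """Mean of the off-diagonal relevance scores among the neighbors of t.

    Assumes t has at least two neighbors.
    """
    s_t = v2_to_v1[t]
    scores = [relevance(v1_to_v2, v2_to_v1, a)[b] for a in s_t for b in s_t if a != b]
    return sum(scores) / len(scores)

# Two communities (a* with t1, t2; b* with t3, t4) and one bridging node t5.
edges = [(a, t) for t in ("t1", "t2") for a in ("a1", "a2", "a3")]
edges += [(b, t) for t in ("t3", "t4") for b in ("b1", "b2", "b3")]
edges += [("a1", "t5"), ("b1", "t5")]

v1_to_v2, v2_to_v1 = build_adjacency(edges)
score_t1 = normality(v1_to_v2, v2_to_v1, "t1")     # neighbors share a community
score_t5 = normality(v1_to_v2, v2_to_v1, "t5")     # neighbors come from different communities
```

The bridging node t5 receives the lower normality score because its two neighbors are only weakly relevant to each other.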


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
• Anthrax attack

Data:
• Emergency department visits
• Grocery data (example: [14])
• Clinical visits
• Weather/climate data

Challenges:
• Weak signals in the data: ED cases involving cough ⇒ flu or SARS?
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN: ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$ test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one). Repeat Step 2.
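Steps 3 to 5 can be sketched for single rules with 2×2 contingency tables. The example uses the closed-form $\chi^2$ statistic without continuity correction and the identity that the upper tail of a $\chi^2$ distribution with one degree of freedom equals erfc(√(χ²/2)). The rule names and counts are invented, and learning or sampling the Bayesian network is not shown.

```python
import math

def chi2_p_value_2x2(table):
    """p-value of the chi-square independence test for a 2x2 table (1 degree of freedom).

    table = ((a, b), (c, d)): rows are DB_current and DB_baseline,
    columns are "rule matches" and "rule does not match".
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.erfc(math.sqrt(chi2 / 2))          # survival function for 1 dof

# Hypothetical contingency tables for two candidate rules.
rules = {
    "symptom=cough": ((40, 60), (200, 800)),       # 40% today vs 20% in the baseline
    "age=adult": ((20, 80), (200, 800)),           # same proportion in both
}
p_values = {rule: chi2_p_value_2x2(tbl) for rule, tbl in rules.items()}
best_rule = min(p_values, key=p_values.get)        # most significant rule
```

Because many rules are scored, the smallest p-value still needs a correction for multiple testing before it is reported.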


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
• Add nodes for temporal and spatial dependencies
• Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena:
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference).

How to choose $b_i$?
• Mean of all data for location i
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period": 30-year moving window

Different methods show statistically significant differences [19]. What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
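One of the listed choices, the monthly mean base, can be sketched as follows. The function name `monthly_anomalies` and the temperature values are assumptions for illustration.

```python
from statistics import mean

def monthly_anomalies(values, months):
    """v'_i = v_i - b_i, where b_i is the mean of all values in the same calendar month."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]

# Hypothetical series for one location: January and July of two consecutive years.
temperatures = [0.0, 20.0, 2.0, 24.0]
months = [1, 7, 1, 7]

anomalies = monthly_anomalies(temperatures, months)
```

Subtracting a per-month base removes the seasonal cycle, so the January and July values become directly comparable.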


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both.

Find persistent spatiotemporal anomalies in precipitation data.

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause.

Figure: video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
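The two-step approach can be sketched in plain Python as a threshold followed by a breadth-first labeling over a (time, row, column) grid. The threshold of -1.0, the 6-neighbor connectivity, and the toy anomaly values are assumptions; the slides point to Matlab's bwlabel/bwlabeln for the labeling step.

```python
from collections import deque

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def connected_components(mask):
    """Group True cells of mask[t][r][c] into 6-connected space-time components."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    seen, components = set(), []
    for start in ((t, r, c) for t in range(T) for r in range(R) for c in range(C)):
        t, r, c = start
        if not mask[t][r][c] or start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            ct, cr, cc = queue.popleft()
            component.append((ct, cr, cc))
            for dt, dr, dc in NEIGHBORS:
                nxt = (ct + dt, cr + dr, cc + dc)
                nt, nr, nc = nxt
                if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                        and mask[nt][nr][nc] and nxt not in seen):
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components

# Hypothetical precipitation anomalies: 3 time steps on a 2 x 3 grid.
anomaly = [
    [[-2.0, -1.5, 0.0], [0.1, 0.2, 0.3]],
    [[-1.8, 0.0, 0.0], [0.0, 0.0, -3.0]],
    [[-1.2, 0.0, 0.0], [0.0, 0.0, 0.1]],
]
dry = [[[v < -1.0 for v in row] for row in grid] for grid in anomaly]   # Step 1: threshold
components = connected_components(dry)                                  # Step 2: label
persistent = [c for c in components if len(c) >= 3]                     # keep large ones
```

The isolated dry cell at the second time step is discarded, while the dry region that persists across all three time steps is kept.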


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
• Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  - Not identified in a timely fashion
  - Resolution is not fine enough
  - Cause not explained
• Anomalies lost among false positives

Solution:
• Good validation data during design
• Clear definition of a domain anomaly and its distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Evaluating Anomaly Detection Methods - Scores

Convert to binary output:
• Use a threshold δ on the score (scale issues [21])
• Take the top x as anomalies

ROC curve by varying x or δ

Quality of output:
• Does the output "suggest" x or δ?
• Which output is better?
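Sweeping the threshold δ over the observed scores traces out the ROC curve. The sketch below assumes that higher scores mean more anomalous and that labels are 1 for anomalies and 0 for normal points; the function names and the example scores are hypothetical.

```python
def roc_points(scores, labels):
    """(FPR, TPR) pairs obtained by thresholding at every distinct score, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for delta in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= delta and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical detector output: two true anomalies among five points.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)
```

Taking the top x points instead of thresholding at δ visits the same operating points when the scores are distinct.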


Unifying Scores

Different methods assign scores in differentranges

kNN-based scores [0 1] Lof scores [1 6] ABOD scores [0 80000]

Anomalies have lower scores

Direct scaling to [0 1] might lose distinctionbetween normal and anomalies

Desired scaling Stretch interesting rangesand shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21] Regularity rArr S(o) ge 0forallo S(o) asymp 0 if o is normal andS(o) 0 if o is anomalous

Normality S is regular and S(o) isin [0 1]forallo

Chawla and Chandola Anomaly Detection

Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0, S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{max} - S(o)$
3. $R(o) = -\log\frac{S(o)}{S_{max}}$

Normalization

1. $N(o) = \frac{S(o)}{S_{max}}$
2. $N(o) = \max\left(0, \operatorname{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian scaling)
   • Suited for high dimensional data
3. $N(o) = \max\left(0, \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma scaling)
   • Where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
   • Suited for low dimensional data
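A small sketch of baseline regularization followed by Gaussian scaling, using only the standard library. The baseline value of 1.0 for LOF-like scores, the sample scores, and the function names are assumptions for illustration; the code also assumes the scores are not all identical (otherwise the standard deviation is zero).

```python
import math


def regularize_baseline(scores, base):
    """R(o) = max{0, S(o) - base_S}."""
    return [max(0.0, s - base) for s in scores]


def regularize_linear_inversion(scores):
    """R(o) = S_max - S(o), for methods where anomalies score low."""
    s_max = max(scores)
    return [s_max - s for s in scores]


def normalize_gaussian(scores):
    """N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    n = len(scores)
    mu = sum(scores) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / n)
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]


# Hypothetical LOF-like scores: five normal points and one clear outlier.
lof_scores = [1.0, 1.1, 0.9, 1.2, 1.0, 6.0]
regular = regularize_baseline(lof_scores, base=1.0)
normalized = normalize_gaussian(regular)
```

After scaling, every value lies in [0, 1]; points at or below the mean score map to 0, and only the outlier receives a value close to 1.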

Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

• Use generative models for normal and anomalous behavior
• Several generators available
  • Multivariate continuous data [27]
  • Multivariate categorical data [5]
  • Discrete sequences using HMM [12]
• Drawback: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

• Given a data point $z = \{x, y\}$, where $x$ and $y$ are partitions of the feature space
• Take a random sample $D$ of the entire data set
• Let $z' = \{x', y'\} \in D$ be such that the distance between $y$ and $y'$ is maximum
• Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: points $z$ and $z'$ in the $(x, y)$ plane, with the injected anomaly marked]
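The random perturbation steps can be sketched as below. This is an illustrative reading of the procedure, not the reference implementation from [30]: points are modeled as `(x, y)` pairs of numeric tuples, Euclidean distance is assumed, and the function name, sample size, and toy data are hypothetical.

```python
import math
import random


def inject_anomaly(z, data, sample_size, rng):
    """Swap the x part of z with x' from the sampled point whose y' is farthest from y."""
    x, y = z
    sample = rng.sample(data, sample_size)
    # Pick z' = (x', y') in the sample that maximizes the distance between y and y'.
    x_far, _ = max(sample, key=lambda p: math.dist(y, p[1]))
    return (x_far, y)


rng = random.Random(0)
# Toy data set in which y is tightly coupled to x (y = 2x).
data = [((float(i),), (2.0 * i,)) for i in range(20)]
z = data[0]
injected = inject_anomaly(z, data, sample_size=10, rng=rng)
data.append(injected)  # add the perturbed point back to the data set
```

The injected point keeps its own `y` but carries an `x` borrowed from a distant record, so it violates the relationship that the normal points follow.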

Applications Overview

How to set up an anomaly detection solution for a given application domain?

• Available data
• Define anomalies, define normal behavior
• Identify requirements and constraints (online, real-time, limited resources)
• What domain knowledge is available?
  • Feature identification
  • Defining normal and anomalous behavior
  • Tuning parameters
• Available ground truth (training, validation)

What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

• Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns known as signatures; such a system is therefore vulnerable to zero-day attacks.
• Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents.
• Network anomaly detection: finding unusual and large changes in the network traffic.
• Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds).

Motivation: How Serious?

• According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

[Figure: bar chart of yearly counts, vertical axis from 0 to 3,000,000. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836]

Network Topology

A typical network

[Figure: a router connecting four numbered nodes, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$]

• An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$).
• A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$).
• The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$.
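A toy illustration of how a routing matrix relates OD flows to link measurements, assuming the usual convention that entry $(i, j)$ of $A$ is 1 when OD flow $j$ traverses link $i$, so that link loads are $y_t = A x_t$. The topology and traffic values below are invented for the example.

```python
def link_loads(routing, od_flows):
    """Compute y = A x: each link carries the sum of the OD flows routed over it."""
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing]


# Hypothetical routing matrix: 3 links (rows) and 4 OD flows (columns).
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]

x_t = [10.0, 5.0, 7.0, 3.0]      # OD flow volumes in one time interval
y_t = link_loads(A, x_t)

x_spike = [10.0, 25.0, 7.0, 3.0]  # a volume spike on the second OD flow
y_spike = link_loads(A, x_spike)
```

A spike in one OD flow shows up on every link that the flow crosses, which is why the same anomaly is visible in several link measurements at once.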

Network Anomalies Detection Problem

Why care about OD Flows?

• A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).
• If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available.

Figure source: [22]

PCA and Subspace Method

Subspace Method Algorithm

• Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set.
• Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix $P$.
• Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

    $\hat{Z} = P P^T Z = C Z$

• Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

    $\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

• Step 5: If the norm of a vector in the abnormal subspace is large, it is an "anomaly".
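The five steps can be sketched with the standard library alone for the special case k = 1, using power iteration in place of a full eigendecomposition. The synthetic link traffic, the choice k = 1, and the function names are assumptions for illustration; a real deployment would typically use a linear algebra library and a larger k.

```python
import math


def leading_eigenvector(cov, iters=200):
    """Power iteration; assumes the start vector is not orthogonal to the top PC."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_sq_norms(rows):
    """Squared norm of (I - P P^T) z for each mean-centered row z, with k = 1."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    Z = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(d)] for a in range(d)]
    p = leading_eigenvector(cov)               # Steps 1-2: top principal component
    scores = []
    for z in Z:
        coeff = sum(zi * pi for zi, pi in zip(z, p))
        resid = [zi - coeff * pi for zi, pi in zip(z, p)]  # Step 4: residual part
        scores.append(sum(x * x for x in resid))           # Step 5: its squared norm
    return scores


# Synthetic traffic on 3 links that rises and falls together, plus one spike.
rows = []
for t in range(40):
    level = 100.0 + 30.0 * math.sin(2.0 * math.pi * t / 20.0)
    rows.append([level * 1.0, level * 2.0, level * 1.5])
rows[25][2] += 40.0  # injected volume anomaly on the third link at t = 25

scores = residual_sq_norms(rows)
anomalous_t = scores.index(max(scores))
```

Because the normal daily pattern is captured by the first principal component, the spike is the only row with a large residual norm.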

Subspace Analysis Results

• During an anomaly, the normal component changes little, while the residual component changes substantially.
• Thus anomalies can be detected by applying a threshold to the residual component.

Figure source: [22]

Discussion: Typical Characteristics of an Anomaly

• Most anomalies induce a change in the distributional aspects of packet header fields (called features).
• The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.

Limitation of Volume

• Port scan anomaly: the traffic feature changes, but the traffic volume remains more or less the same.
• We can use entropy to capture the variations in the traffic feature:
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure.]

Figure source: [22]
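The two boundary cases of the entropy measure can be checked with a short sketch. It assumes the standard sample entropy $H = -\sum_i p_i \log_2 p_i$ over the empirical distribution of a feature; the IP addresses are placeholders.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Entropy (in bits) of the empirical distribution of a traffic feature."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical destination IPs observed in one time bin.
concentrated = ["10.0.0.1"] * 8                  # all traffic to a single address
dispersed = [f"10.0.0.{i}" for i in range(8)]    # N = 8 distinct addresses

h_concentrated = sample_entropy(concentrated)    # 0 bits
h_dispersed = sample_entropy(dispersed)          # log2(8) = 3 bits
```

Both bins contain the same number of packets, so a volume metric cannot tell them apart, while the entropy differs by the full range from 0 to $\log_2 N$.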

Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
• Scan anomaly: a spike in traffic from a dominant source IP
• Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port

Fraud Detection

• Domain question: identify fraudulent activities or players from observed transaction data
• Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure
• Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and players

• Insurance (auto, health [a]): claimant, provider, payer
• Telecommunications: customer, provider
• Credit cards: customer, supplier, bank
• Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud

A Generic Fraud Detection Method

• Activity monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     • Example: user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations
• Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies
• Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain
• Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
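A minimal sketch of the activity monitoring idea: a per-user profile built from historical counts, and a deviation measure for the current period. Using the mean and standard deviation as the profile and a z-score as the deviation is an assumption made for illustration; the call counts are invented.

```python
import statistics


def deviation_from_profile(history, current):
    """Number of standard deviations by which `current` departs from the profile."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)  # needs at least two historical values
    return (current - mean) / spread


# Hypothetical profile: calls made by user X in January of five past years.
january_calls = [100, 110, 95, 105, 90]

typical = deviation_from_profile(january_calls, 100)   # matches the profile
suspicious = deviation_from_profile(january_calls, 300)  # far above the profile
```

Scoring a new observation only needs the stored profile, which is why detection is fast; the cost is that one such profile has to be maintained per individual.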

Exploiting Graph Structure - Weighted Graphs

• Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks
• Objective: identify anomalous nodes
• For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]
• Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet
• Each node is transformed into a point in a multi-dimensional space
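The first three egonet features can be computed directly from an edge list, as in the sketch below. It assumes an undirected weighted graph stored as a dictionary from node pairs to weights, and takes the egonet of a node to be the node, its neighbors, and all edges among them; the eigenvalue feature is omitted to keep the example dependency-free. The graph is made up.

```python
from collections import defaultdict


def egonet_features(edges):
    """edges: {(u, v): weight} for an undirected graph. Returns N, E, W per node."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    feats = {}
    for node, adj in nbrs.items():
        ego = adj | {node}
        # Edges with both endpoints inside the egonet.
        inside = [(u, v) for (u, v) in edges if u in ego and v in ego]
        feats[node] = {
            "N": len(adj),                         # degree of the node
            "E": len(inside),                      # number of edges in the egonet
            "W": sum(edges[e] for e in inside),    # total weight of the egonet
        }
    return feats


# Hypothetical weighted graph.
edges = {
    ("a", "b"): 1.0,
    ("a", "c"): 2.0,
    ("a", "d"): 1.0,
    ("b", "c"): 3.0,
    ("d", "e"): 1.0,
}
features = egonet_features(edges)
```

Each node is now a point such as `(N, E, W)`, which is the representation the feature-pair analysis works on.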

Identifying Anomalies

• Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure
• Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

    $E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

• Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

    $S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
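The anomaly score for a feature pair can be evaluated as below. The fitted constants `C` and `theta` are hypothetical (in practice they come from fitting the power law to all nodes), the natural logarithm is assumed, and both the observed and the expected value are assumed to be positive.

```python
import math


def power_law_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)


# Hypothetical fit of the egonet density power law: E = 1.0 * N^1.5.
C, theta = 1.0, 1.5

on_the_line = power_law_score(16, 64, C, theta)   # E matches the fit exactly
star_like = power_law_score(16, 16, C, theta)     # far fewer edges than expected
```

A node that lies on the fitted line scores 0, while a star-like node with 16 neighbors and only 16 egonet edges is penalized both by the ratio term and by the log-distance term.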

Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)
• Objective: identify anomalous links
• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$
• Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \dots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \dots, t_n\}$ joined by edges $E$]
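One possible way to realize the relevance and normality scores is sketched below. Several choices here are assumptions rather than details from the slides: the visit counts are approximated by the stationary probabilities of a random walk with restart (restart probability 0.15, computed by power iteration), the similarity between relevance vectors is cosine similarity, and the toy graph is invented. The sketch also assumes $|S_t| \ge 2$.

```python
import math


def relevance(adj, query, restart=0.15, iters=100):
    """Visit probabilities of a random walk that restarts at `query`."""
    p = {n: 1.0 if n == query else 0.0 for n in adj}
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, neighbors in adj.items():
            share = (1.0 - restart) * p[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        nxt[query] += restart
        p = nxt
    return p


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def normality_score(edges, t):
    """Mean off-diagonal similarity of the relevance vectors of t's neighbors."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    v1 = sorted({a for a, _ in edges})
    s_t = [a for a in v1 if t in adj[a]]
    vectors = []
    for a in s_t:
        rel = relevance(adj, a)
        vectors.append([rel[b] for b in v1])
    sims = [cosine(vectors[i], vectors[j])
            for i in range(len(vectors)) for j in range(len(vectors)) if i != j]
    return sum(sims) / len(sims)


# Two communities; t5 links one member of each, which is unusual.
edges = [("a1", "t1"), ("a2", "t1"), ("a3", "t1"),
         ("a1", "t2"), ("a2", "t2"), ("a3", "t2"),
         ("a4", "t3"), ("a5", "t3"), ("a6", "t3"),
         ("a4", "t4"), ("a5", "t4"), ("a6", "t4"),
         ("a1", "t5"), ("a4", "t5")]

score_t1 = normality_score(edges, "t1")
score_t5 = normality_score(edges, "t5")
```

The node `t5` connects members of two otherwise separate groups, so the relevance vectors of its neighbors are dissimilar and its normality score is lower than that of `t1`.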

Detecting Disease Outbreaks

• Domain question
  • Early detection of disease outbreaks
  • Anthrax attack
• Data
  • Emergency department (ED) visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data
• Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations

What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant deviation). Repeat Step 2.
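The p-value computation for a single rule can be sketched for a 2x2 contingency table (rule matched or not, current or baseline). The sketch uses the $\chi^2$ statistic without continuity correction and the fact that, for one degree of freedom, the tail probability equals $\operatorname{erfc}(\sqrt{\chi^2 / 2})$. The counts are invented, and all row and column totals are assumed to be non-zero.

```python
import math


def chi2_p_value_2x2(a, b, c, d):
    """p-value of the chi-squared independence test on the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))


# Hypothetical rule "symptom = respiratory":
# rows = (current, baseline), columns = (rule matches, rule does not match).
p_outbreak = chi2_p_value_2x2(40, 60, 200, 800)   # 40% today vs 20% in baseline
p_nothing = chi2_p_value_2x2(20, 80, 200, 800)    # same proportion in both
```

When the proportion of matching records is the same in both data sets, the statistic is 0 and the p-value is 1; a doubling of the proportion gives a very small p-value.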

Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?
• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train

Anomaly Detection in Climate and Weather

• Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often has a height dimension as well)
• Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes
• Climate and weather extreme events are well defined
• Key challenge is to find significant events and explain the cause

Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality
• Understand climate and weather phenomena
  • Southern Oscillation Index (SOI)
    • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk

Constructing Anomalies from Raw Data

• Anomaly time series for a given location $i$:

    $v'_i = v_i - b_i$

  where $b_i$ is the base (reference)
• How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30 year moving window
• Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
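As an illustration of one choice of base, the sketch below subtracts the monthly mean from a monthly series at a single location. The series is synthetic, is assumed to start in January and to cover whole years, and the function name is hypothetical.

```python
def monthly_anomalies(values):
    """v' = v - b, where b is the mean of all values for the same calendar month."""
    base = []
    for month in range(12):
        month_values = values[month::12]
        base.append(sum(month_values) / len(month_values))
    return [v - base[i % 12] for i, v in enumerate(values)]


# Three identical synthetic years of monthly temperatures, with one warm July.
seasonal_cycle = [10, 12, 15, 18, 22, 25, 27, 26, 22, 18, 13, 10]
series = [float(v) for v in seasonal_cycle] * 3
series[18] += 5.0  # July of the second year is 5 degrees warmer

anomalies = monthly_anomalies(series)
```

The seasonal cycle disappears from the anomaly series, and the warm July stands out. Its anomaly is smaller than 5 because the event also raises the July base value, which is one reason the choice of base matters.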

Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatiotemporal anomalies in precipitation data
• A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     • Matlab: bwlabel, bwlabeln
• Follow-up science question: explain the cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy of Dr. Arindam Banerjee
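The two-step approach can be sketched in plain Python on a small (time, row, column) grid. This is a stand-in for the Matlab labeling functions, not a port of them: it uses 6-connectivity (face neighbors in space and time), which is an assumption, and the precipitation values and threshold are invented.

```python
from collections import deque

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def label_components(mask):
    """Label 6-connected True cells of a 3-D grid; returns (labels, sizes)."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * C for _ in range(R)] for _ in range(T)]
    sizes = {}
    current = 0
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or labels[t][r][c]:
                    continue
                current += 1
                labels[t][r][c] = current
                queue = deque([(t, r, c)])
                size = 0
                while queue:  # breadth-first flood fill of one component
                    a, b, d = queue.popleft()
                    size += 1
                    for da, db, dd in NEIGHBORS:
                        na, nb, nd = a + da, b + db, d + dd
                        inside = 0 <= na < T and 0 <= nb < R and 0 <= nd < C
                        if inside and mask[na][nb][nd] and not labels[na][nb][nd]:
                            labels[na][nb][nd] = current
                            queue.append((na, nb, nd))
                sizes[current] = size
    return labels, sizes


# Hypothetical precipitation on a 3x3 grid over 3 time steps.
precip = [
    [[5, 5, 5], [5, 1, 5], [5, 5, 5]],
    [[5, 5, 5], [5, 1, 1], [5, 5, 5]],
    [[1, 5, 5], [5, 5, 1], [5, 5, 5]],
]
threshold = 2  # Step 1: cells below the threshold are precipitation anomalies
mask = [[[v < threshold for v in row] for row in grid] for grid in precip]
labels, sizes = label_components(mask)  # Step 2: connected components
```

The dry cell that persists and spreads over three time steps forms one component of size 4, while the isolated dry cell forms a component of size 1 that a size filter would discard.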

More Climate Extreme Events using Anomaly Detection

• Atmospheric rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov
• Cold fronts [23]
  • Surface winds and potential temperature fields
  • Methodology
    1. Compute features for every grid cell
    2. Cluster grid cells into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge

Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives
• Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Unifying Scores

Different methods assign scores in different ranges:

• kNN-based scores: [0, 1]
• LOF scores: [1, 6]
• ABOD scores: [0, 80000] (anomalies have lower scores)

Direct scaling to [0, 1] might lose the distinction between normal points and anomalies

Desired scaling: stretch interesting ranges and shrink irrelevant ones

Generalized Procedure for Normalizing Outlier Scores [21]

• Regularity: $S(o) \ge 0 \;\forall o$, with $S(o) \approx 0$ if $o$ is normal and $S(o) \gg 0$ if $o$ is anomalous
• Normality: $S$ is regular and $S(o) \in [0, 1] \;\forall o$


Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\; S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{\max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{\max}\right)$

Normalization

1. $N(o) = S(o) / S_{\max}$
2. $N(o) = \max\left(0,\; \mathrm{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling). Suited for high dimensional data
3. $N(o) = \max\left(0,\; \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling), where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function. Suited for low dimensional data
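As an illustration of the slide above, the following minimal sketch chains baseline regularization with Gaussian scaling. The function names and the example scores are hypothetical, and the mean and standard deviation are estimated from the scores themselves, which is an assumption of this sketch rather than something the slides prescribe.

```python
import math
from statistics import mean, pstdev


def regularize_baseline(scores, base):
    """Baseline regularization: R(o) = max(0, S(o) - base_S)."""
    return [max(0.0, s - base) for s in scores]


def normalize_gaussian(scores):
    """Gaussian scaling: N(o) = max(0, erf((S(o) - mu_S) / (sigma_S * sqrt(2))))."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # identical scores: nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]


# Hypothetical LOF-style scores: values near 1 are normal, 6.0 is an outlier.
lof_scores = [1.0, 1.1, 0.9, 1.05, 1.2, 6.0]
regular = regularize_baseline(lof_scores, base=1.0)
normalized = normalize_gaussian(regular)
```

After scaling, every value lies in [0, 1]; scores at or below the mean collapse to 0, while the single large score is stretched toward 1.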


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators available:

• Multivariate continuous data [27]
• Multivariate categorical data [5]
• Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

• Given a data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
• Take a random sample $D$ of the entire data set
• Let $z' = (x', y') \in D$ be the sample point for which the distance between $y$ and $y'$ is maximum
• Replace $x$ with $x'$ and add $z$ back to the data set

[Figure: the injected anomaly $z$ and the sampled point $z'$ in the $x$-$y$ feature space]
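The random perturbation procedure can be sketched as follows. This is a minimal illustration, assuming data points are tuples whose first `split` features form x and whose remaining features form y, and assuming Euclidean distance on the y part; the function name, the parameters, and the toy data are hypothetical.

```python
import math
import random


def inject_anomaly(z, data, split, sample_size, rng):
    """Return a perturbed copy of z = (x, y); features before `split` form x."""
    y = z[split:]
    sample = rng.sample(data, min(sample_size, len(data)))
    # z' is the sampled point whose y part is farthest from y.
    z_prime = max(sample, key=lambda p: math.dist(y, p[split:]))
    # Replace x with x' while keeping the original y.
    return z_prime[:split] + y


rng = random.Random(0)
# Toy data in which y closely follows x, so mismatched (x', y) pairs look anomalous.
data = [(float(i), float(i) + rng.random()) for i in range(50)]
z = data[10]
injected = inject_anomaly(z, data, split=1, sample_size=20, rng=rng)
```

The injected point keeps marginally plausible values for both partitions but breaks the relationship between them, which is the kind of anomaly this scheme is meant to produce.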


Applications Overview

How to set up an anomaly detection solution for a given application domain?

• Available data
• Define anomalies, define normal behavior
• Identify requirements and constraints (online, real-time, limited resources)
• What domain knowledge is available?
  » Feature identification
  » Defining normal and anomalous behavior
  » Tuning parameters
• Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g. Distributed Denial of Service, DDoS) or unusual network traffic (e.g. flash crowds)


Motivation: How Serious Is the Problem?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

[Figure: bar chart of yearly counts. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 to June 2010: 2,704,836]


Network Topology

A typical network

[Figure: a router connected to nodes 1 to 4, annotated with the OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and the link measurement $y_{1t}$]

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network, e.g. $x_{1t}$

A link measurement is the traffic that enters at a node during an interval, e.g. $y_{1t}$

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
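To make the role of the routing matrix concrete, here is a small sketch that assumes the usual linear model y_t = A x_t, where A[i][j] = 1 if OD flow j traverses link i. The topology and the traffic volumes are hypothetical and chosen only for illustration.

```python
def link_loads(routing, od_flows):
    """Compute y_t = A x_t: each link load sums the OD flows routed over it."""
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing]


# Hypothetical topology: 3 links (rows) carrying 4 OD flows (columns).
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
x_t = [10.0, 20.0, 5.0, 1.0]  # OD flow volumes in one time interval
y_t = link_loads(A, x_t)      # link measurements for the same interval
```

Because each link aggregates several OD flows, a change in one small flow can be hard to see in the link totals alone.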


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, called $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, called $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a vector in $\tilde{Z}$ is large, then it is an "anomaly"
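The five steps can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions: each row of `Z` is one measurement vector (one time bin), the data is mean-centered first, and the squared residual norm is used as the score. The function name `residual_norms` and the synthetic traffic are hypothetical.

```python
import numpy as np


def residual_norms(Z, k):
    """Squared norm of each row of Z after removing its top-k PCA component."""
    Zc = Z - Z.mean(axis=0)                        # center each column (link)
    cov = np.cov(Zc, rowvar=False)                 # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    P = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # Step 2: top-k PCs as columns
    C_normal = P @ P.T                             # Step 3: normal-subspace projector
    C_residual = np.eye(Z.shape[1]) - C_normal     # Step 4: residual projector
    residual = Zc @ C_residual                     # projector is symmetric
    return np.sum(residual ** 2, axis=1)           # Step 5: squared residual norm


rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 1))                 # one latent traffic pattern
Z = 10.0 * shared @ np.ones((1, 5)) + 0.1 * rng.normal(size=(200, 5))
Z[100, 2] += 5.0                                   # synthetic anomaly on one link
scores = residual_norms(Z, k=1)
suspect = int(np.argmax(scores))
```

In this toy setup the shared pattern is absorbed by the first principal component, so the injected spike dominates the residual and time bin 100 gets the largest score. Choosing k and the detection threshold is left open here.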


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by applying a threshold to the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, the work discussed here aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]

Port scan anomaly: the traffic feature distribution changes, while the traffic volume remains more or less the same

We can use entropy to capture the variations in a traffic feature

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure
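The two boundary values quoted above can be checked with a short sketch. It assumes the empirical (sample) entropy $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of observations taking the i-th distinct value of a feature; the function name and the example values are hypothetical.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Empirical entropy (in bits) of the observed values of one traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# All packets go to one destination port: maximally concentrated.
concentrated = sample_entropy([80] * 8)
# Eight packets to eight distinct ports (as in a port scan): maximally dispersed.
dispersed = sample_entropy(list(range(1000, 1008)))
```

With N = 8 distinct values the dispersed case reaches log2(8) = 3 bits, even though both examples contain the same number of packets, which is why volume alone cannot separate them.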


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data:
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges:
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Domains and players:
• Insurance (auto/health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]

1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]

1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses:
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
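A minimal sketch of the activity monitoring idea follows. It assumes a profile consisting of the mean and standard deviation of a single quantity (calls per month) and a z-score style deviation; the threshold of 3 and all names and numbers are hypothetical choices, not taken from [13].

```python
from statistics import mean, stdev


def build_profile(history):
    """Step 1: summarize historic behavior as (mean, standard deviation)."""
    return mean(history), stdev(history)


def deviation_score(profile, current):
    """Step 2: number of standard deviations between current and usual behavior."""
    mu, sigma = profile
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


january_calls = [40, 42, 38, 41, 39]  # hypothetical history for user X
profile = build_profile(january_calls)
THRESHOLD = 3.0                       # assumed cut-off for "significant"
typical = deviation_score(profile, 41) > THRESHOLD
unusual = deviation_score(profile, 90) > THRESHOLD
```

Scoring a new observation is a constant-time lookup against the stored profile, which illustrates both the speed and the profile maintenance cost listed above.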


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node $i$
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space
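The first three egonet features can be computed as in the sketch below. It assumes an undirected weighted graph stored as a symmetric dictionary of dictionaries with comparable node labels; the representation, the function name, and the toy graph are hypothetical. The eigenvalue feature is omitted to keep the example dependency-free.

```python
def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    adj maps each node to a dict {neighbor: edge weight} and must be symmetric.
    """
    members = set(adj[node]) | {node}
    n_i = len(adj[node])                # degree of the ego
    e_i, w_i = 0, 0.0
    for u in members:
        for v, weight in adj[u].items():
            if v in members and u < v:  # count each undirected edge once
                e_i += 1
                w_i += weight
    return n_i, e_i, w_i


# Toy graph: "a" has neighbors b, c, d; b and c are also connected; e hangs off d.
adj = {
    "a": {"b": 1, "c": 2, "d": 1},
    "b": {"a": 1, "c": 5},
    "c": {"a": 2, "b": 5},
    "d": {"a": 1, "e": 3},
    "e": {"d": 3},
}
features_a = egonet_features(adj, "a")
features_e = egonet_features(adj, "e")
```

Edges that leave the egonet, such as d to e when the ego is a, are excluded from the edge count and the total weight.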


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:

• Identify relevant feature pairs and their power law relationship
• E.g. Egonet Density Power Law, $N_i$ vs $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i,\; C x_i^{\theta})}{\min(y_i,\; C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
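The score above translates directly into code. In this sketch the power law parameters C and θ are supplied by hand (in practice they would be fitted to the data, which is not shown), and feature values are assumed to be strictly positive so that the ratio is defined. The function name and the example values are hypothetical.

```python
import math


def power_law_score(x_i, y_i, C, theta):
    """S_i = max(y_i, C x_i^theta) / min(y_i, C x_i^theta) * log(|y_i - C x_i^theta| + 1)."""
    expected = C * x_i ** theta
    ratio = max(y_i, expected) / min(y_i, expected)
    return ratio * math.log(abs(y_i - expected) + 1)


# Assumed fit E = N (C = 1, theta = 1), i.e. star-like egonets are the norm.
on_the_line = power_law_score(10, 10, C=1.0, theta=1.0)   # matches the fit
slightly_off = power_law_score(10, 12, C=1.0, theta=1.0)  # a few extra edges
near_clique = power_law_score(10, 55, C=1.0, theta=1.0)   # 11 fully connected nodes
```

The ratio term measures how many times larger the observed value is than the fitted one (or vice versa), and the logarithmic term adds the absolute gap, so points on the fitted line score exactly 0.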


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
• Healthcare data (Beneficiaries vs Providers)
• Insider trading (Traders vs Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$

• RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality scores for a node $t \in V_2$

• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
• Normality Score = mean of non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$]
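One way to realize the relevance and normality scores is sketched below. It assumes that relevance is computed by a random walk with restart (restart probability 0.15, iterated a fixed number of times) and that the similarity entry for a pair (a, b) is simply the relevance of b with respect to a. These choices, the function names, and the toy trader/stock graph are assumptions of this sketch.

```python
def rwr_relevance(adj, query, restart=0.15, iters=100):
    """Random walk with restart from `query`; returns node -> visit probability."""
    r = {n: 0.0 for n in adj}
    r[query] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for u, neighbors in adj.items():
            share = (1.0 - restart) * r[u] / len(neighbors)
            for v in neighbors:
                nxt[v] += share
        nxt[query] += restart  # with probability `restart`, jump back to the query
        r = nxt
    return r


def normality_score(adj, t):
    """Mean pairwise relevance among S_t, the V1 nodes linked to t (needs |S_t| >= 2)."""
    s_t = sorted(adj[t])
    rel = {a: rwr_relevance(adj, a) for a in s_t}
    pairs = [(a, b) for a in s_t for b in s_t if a != b]
    return sum(rel[a][b] for a, b in pairs) / len(pairs)


def to_adjacency(edges):
    """Build a symmetric adjacency map from (a, t) edges of the bipartite graph."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, set()).add(t)
        adj.setdefault(t, set()).add(a)
    return adj


# Two tight groups of traders, plus stock t5 held by one trader from each group.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"), ("a2", "t2"), ("a3", "t1"),
         ("a3", "t2"), ("a4", "t3"), ("a4", "t4"), ("a5", "t3"), ("a5", "t4"),
         ("a6", "t3"), ("a6", "t4"), ("a1", "t5"), ("a4", "t5")]
adj = to_adjacency(edges)
score_t1 = normality_score(adj, "t1")
score_t5 = normality_score(adj, "t5")
```

In this toy graph t5 connects traders from otherwise separate groups, so its neighbors are less relevant to each other and it receives the lower normality score.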


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
• Anthrax attack

Data:
• Emergency department (ED) visits
• Grocery data (Example [14])
• Clinical visits
• Weather/climate data

Challenges:
• Weak signals in the data
  » ED cases involving cough ⇒ Flu or SARS?
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | Current Environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for rules using the $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant difference). Repeat from Step 2
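The rule scoring step can be illustrated with a small sketch that covers only the contingency table and the test. Each rule gets a 2×2 table (rows: current vs baseline records; columns: rule matched vs not matched) and a Pearson χ² p-value with one degree of freedom, computed as erfc(√(χ²/2)). The most significant rule is taken to be the one with the smallest p-value, which is an assumption of this sketch; the rule names and counts are hypothetical, and the Bayesian network sampling is not shown.

```python
import math


def chi2_p_value(table):
    """Pearson chi-square p-value (1 degree of freedom) for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


# Rows: [DB_current, DB_baseline]; columns: [rule matches, rule does not match].
rule_tables = {
    "symptom=cough": [[30, 70], [100, 900]],  # 30% today vs 10% in the baseline
    "gender=F": [[52, 48], [510, 490]],       # 52% today vs 51% in the baseline
}
p_values = {rule: chi2_p_value(t) for rule, t in rule_tables.items()}
best_rule = min(p_values, key=p_values.get)
```

Under the null hypothesis of independence, a small p-value indicates that the proportion of matching records today differs from the baseline by more than chance would explain.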


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships

• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]

• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions:
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges:
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series

• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g. a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena

• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?

• Mean of all data for location $i$
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  » 30 year moving window

Different methods show statistically significant differences [19]

• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
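As a worked instance of $v'_i = v_i - b_i$, the sketch below uses the monthly mean as the base, one of the options listed above. The function name and the toy temperature values are hypothetical.

```python
from statistics import mean


def monthly_anomalies(values, months):
    """Subtract from each value the mean of all values observed in the same month."""
    by_month = {}
    for value, month in zip(values, months):
        by_month.setdefault(month, []).append(value)
    base = {month: mean(vals) for month, vals in by_month.items()}
    return [value - base[month] for value, month in zip(values, months)]


# Two years of January and July temperatures at one location.
values = [10.0, 30.0, 12.0, 34.0]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

The large seasonal gap between January and July disappears from the anomaly series, leaving only the year-to-year departures from each month's usual value.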


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel / bwlabeln

Follow-up science question: Explain the cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy of Dr. Arindam Banerjee
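The two step approach can be sketched in plain Python as a threshold followed by a breadth-first connected component search over a (time, row, column) grid. The sketch assumes 6-connectivity (neighbors differ by one step along exactly one axis), which may differ from the connectivity used in the original analysis; the function name, threshold, and toy grid are hypothetical.

```python
from collections import deque

# 6-connectivity: one step along time, row, or column.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def drought_components(anomaly, threshold, min_size):
    """anomaly[t][i][j] is the precipitation anomaly at time t, grid row i, column j."""
    # Step 1: threshold the anomalies to mark "dry" cells.
    dry = {
        (t, i, j)
        for t, grid in enumerate(anomaly)
        for i, row in enumerate(grid)
        for j, value in enumerate(row)
        if value < threshold
    }
    # Step 2: group dry cells into connected components across space and time.
    seen, components = set(), []
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), [start]
        while queue:
            t, i, j = queue.popleft()
            for dt, di, dj in NEIGHBORS:
                nb = (t + dt, i + di, j + dj)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
                    component.append(nb)
        if len(component) >= min_size:
            components.append(sorted(component))
    return components


# Two time steps on a 3x3 grid; values below -1 count as dry.
grid = [
    [[-2, -2, 0], [0, 0, 0], [0, 0, -2]],
    [[-2, 0, 0], [0, 0, 0], [0, 0, 0]],
]
events = drought_components(grid, threshold=-1, min_size=2)
```

The isolated dry cell at (0, 2, 2) is discarded by the size filter, while the three cells that touch in space or persist in time form a single event.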


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

• Water vapor content
• Anomalies using a threshold
• Connected components
• Example src: http://newscenter.lbl.gov

Cold Fronts [23]

• Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  » Not identified in a timely fashion
  » Resolution is not fine enough
  » Cause not explained
• Anomalies lost among false positives

Solution:
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

V Chandola A Banerjee and V Kumar

Anomaly detection for discrete sequences A surveyIEEE Transactions on Knowledge and Data Engineering 99(PrePrints) 2010

V Chandola S Boriah and V Kumar

A framework for exploring categorical dataIn Proceedings of the ninth SIAM International Conference on Data Mining 2009

V Chandola V Mithal and V Kumar

A comparative evaluation of anomaly detection techniques for sequence dataIn Proceedings of International Conference on Data Mining 2008

V Chandola V Mithal and V Kumar

Understanding anomaly detection techniques for sequence dataTechnical Report 09-001 University of Minnesota Computer Science Department January 2009

T Fawcett and F Provost

Activity monitoring noticing interesting changes in behaviorIn Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and DataMining pages 53ndash62 ACM Press 1999

A Goldenberg G Shmueli R A Caruana and S E Fienberg

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication salesProceedings of the National Academy of Sciences of the United States of America 99(8)pp 5237ndash52402002

D Hawkins

Identification of outliersMonographs on Applied Probability and Statistics May 1980

S A Hofmeyr S Forrest and A Somayaji

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Intrusion detection using sequences of system callsJournal of Computer Security 6(3)151ndash180 1998

X Jiang and G F Cooper

A bayesian spatio-temporal method for disease outbreak detectionJAMIA pages 462ndash471 2010

J S Kaustav Das

Detecting anomalous records in categorical datasetsIn Proc of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data miningAug 2007

J Kawale S Chatterjee A Kumar S Liess M Steinbach and V Kumar

Anomaly construction in climate data Issues and challengesIn Proceedings of NASA Conference on Intelligent Data Understanding 2011

E M Knorr and R T Ng

Algorithms for mining distance-based outliers in large datasetsIn Proceedings of the 24rd International Conference on Very Large Data Bases pages 392ndash403 MorganKaufmann Publishers Inc 1998

H-P Kriegel P Kroger E Schubert and A Zimek

Interpreting and unifying outlier scoresIn SDM pages 13ndash24 2011

A Lakhina M Crovella and C Diot

Diagnosing network-wide traffic anomaliesIn Proceedings of ACM SIGCOMM pages 219ndash230 2004

X Li R Ramachandran S Graves S Movva B Akkiraju D Emmitt S Greco R Atlas J Terry and

J-C JusemAutomated detection of frontal systems from numerical model-generated dataIn Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in datamining KDD rsquo05 pages 782ndash787 New York NY USA 2005 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

M V Mahoney and P K Chan

Learning rules for anomaly detection of hostile network trafficIn ICDM pages 601ndash604 2003

S V Mane

False negative estimation theory techniques and applicationsProQuest UMI Dissertation Publishing 2011

A Moore and M S Lee

Cached sufficient statistics for efficient machine learning with large datasetsJ Artif Int Res 867ndash91 March 1998

Y Pei and O Zaane

A synthetic data generator for clustering and outlier analysisTechnical report University of Alberta 2006

S Ramaswamy R Rastogi and K Shim

Efficient algorithms for mining outliers from large data setsIn Proceedings of the 2000 ACM SIGMOD international conference on Management of data pages427ndash438 ACM Press 2000

N Satish M Harris and M Garland

Designing efficient sorting algorithms for manycore gpusIn IPDPS pages 1ndash10 2009

X Song M Wu C Jermaine and S Ranka

Conditional anomaly detectionIEEE Trans on Knowl and Data Eng 19631ndash645 May 2007

D A Spielman and N Srivastava

Graph sparsification by effective resistancesIn Proceedings of the 40th annual ACM symposium on Theory of computing STOC rsquo08 pages 563ndash568New York NY USA 2008 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

P Sun S Chawla and B Arunasalam

Mining for outliers in sequential databasesIn In SIAM International Conference on Data Mining 2006

W-K Wong A Moore G Cooper and M Wagner

Bayesian network anomaly pattern detection for disease outbreaksIn T Fawcett and N Mishra editors Proceedings of the Twentieth International Conference on MachineLearning pages 808ndash815 Menlo Park California August 2003 AAAI Press

L Xiong B Poczos and J Schneider

Group anomaly detection using flexible genre modelsIn NIPS 2011


Acknowledgements

  • Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)
  • Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Regularization and Normalization of Scores

Regularization

1. $R(o) = \max\{0,\ S(o) - \mathrm{base}_S\}$
2. $R(o) = S_{\max} - S(o)$
3. $R(o) = -\log\left(S(o) / S_{\max}\right)$

Normalization

1. $N(o) = S(o) / S_{\max}$
2. $N(o) = \max\left(0,\ \operatorname{erf}\left(\frac{S(o) - \mu_S}{\sigma_S \sqrt{2}}\right)\right)$ (Gaussian Scaling)
  • Suited for high dimensional data
3. $N(o) = \max\left(0,\ \frac{\mathrm{cdf}^{\gamma}_S(o) - \mu_{\gamma}}{1 - \mu_{\gamma}}\right)$ (Gamma Scaling)
  • Where $\mathrm{cdf}^{\gamma}_S(o) = P(k, S(o)/\theta)$ and $P$ is the regularized Gamma function
  • Suited for low dimensional data
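A minimal sketch of two of the regularization rules and of Gaussian scaling, using only the Python standard library. The function names are hypothetical, and estimating the mean and standard deviation from the full score list (population form) is an assumption of this sketch rather than something the slide prescribes.

```python
import math
from statistics import mean, pstdev


def baseline_regularize(scores, base):
    """Baseline regularization: R(o) = max(0, S(o) - base)."""
    return [max(0.0, s - base) for s in scores]


def linear_inversion(scores):
    """Linear inversion: R(o) = S_max - S(o), for scores where low means anomalous."""
    s_max = max(scores)
    return [s_max - s for s in scores]


def gaussian_scale(scores):
    """Gaussian scaling: N(o) = max(0, erf((S(o) - mu) / (sigma * sqrt(2))))."""
    mu = mean(scores)
    sigma = pstdev(scores)  # assumption: population standard deviation of the scores
    if sigma == 0:
        return [0.0] * len(scores)  # all scores identical, nothing stands out
    return [max(0.0, math.erf((s - mu) / (sigma * math.sqrt(2)))) for s in scores]
```

Scores at or below the mean map to 0 under Gaussian scaling, so only unusually high scores receive a value close to 1.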


Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

  • Use generative models for normal and anomalous behavior
  • Several generators available
    • Multivariate continuous data [27]
    • Multivariate categorical data [5]
    • Discrete sequences using HMM [12]
  • Drawbacks
    • Might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

  • Given data point $z = (x, y)$, where $x$ and $y$ are partitions of the feature space
  • Take a random sample $D$ of the entire data set
  • Let $z' = (x', y') \in D$ such that the distance between $y$ and $y'$ is maximum
  • Replace $x$ with $x'$ and add $z$ back to the data set

Figure: Injected anomaly (axes $x$ and $y$; points $z$ and $z'$)
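The random perturbation steps can be sketched as follows. This is an illustration under stated assumptions: records are tuples whose first `split` features form x and whose remaining features form y, distance is squared Euclidean, and the names `inject_anomaly`, `split`, and `sample_size` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def inject_anomaly(data, split, sample_size, rng):
    """Return one perturbed record built from a randomly chosen z = (x, y).

    data: list of numeric tuples; the first `split` features are x, the rest y.
    """
    z = rng.choice(data)
    y = z[split:]
    # Random sample D of the data set.
    sample = rng.sample(data, min(sample_size, len(data)))
    # z' = (x', y') in D whose y' is farthest from y.
    z_far = max(sample, key=lambda record: squared_distance(record[split:], y))
    # Replace x with x' while keeping the original y.
    return z_far[:split] + y


rng = random.Random(0)
data = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
anomaly = inject_anomaly(data, split=1, sample_size=3, rng=rng)
```

The returned record pairs an x taken from a distant part of the data with the original y, so it breaks the usual relationship between the two partitions.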


Applications Overview

How to set up an anomaly detection solution for a given application domain?

  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
    • Feature identification
    • Defining normal and anomalous behavior
    • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection?

Anomaly Detection or Intrusion Detection?

  • Traditional Intrusion Detection Systems (IDS): find attacks that correspond to predefined patterns, known as signatures; the system is therefore vulnerable to zero-day attacks
  • Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: bar chart of counts per period (y-axis from 0 to 3000000): 2002: 20547; 2003: 18827; 2004: 69107; 2005: 113025; 2006: 140690; 2007: 624267; 2008: 1656227; July 2009 to June 2010: 2704836


Network Topology

A typical network

Figure: a router connecting nodes 1 to 4, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$

  • Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$)
  • Link measurement is the traffic that enters a node during an interval ($y_{1t}$)
  • The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
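As a small illustration of how a routing matrix ties the two kinds of measurement together, the sketch below assumes the common formulation y = A x, where entry A[i][j] is 1 if OD flow j traverses link i and 0 otherwise. The function name and the example values are hypothetical.

```python
def link_loads(routing, od_flows):
    """Compute link traffic y = A x from OD flow traffic x.

    routing: list of rows, one per link; routing[i][j] is 1 if OD flow j uses link i.
    od_flows: traffic volume of each OD flow in one time interval.
    """
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing]


# Two links, three OD flows: flow 1 crosses both links.
A = [
    [1, 1, 0],
    [0, 1, 1],
]
x_t = [10.0, 5.0, 2.0]
y_t = link_loads(A, x_t)
```

Because each link sums several OD flows, a change in one flow is diluted in the link totals, which is why OD-level structure matters for detection.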


Network Anomalies Detection Problem


Why care about OD Flows?

  • A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)
  • If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus, a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the PCs based on eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as matrix $P$

Step 3: Normal traffic subspace, called $\hat{Z}$

$\hat{Z} = P P^{T} Z = C Z$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
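The five steps can be sketched in pure Python as below. This is a sketch under stated assumptions, not the implementation from [22]: the leading eigenvectors are found by power iteration with deflation (a stand-in for a full eigenvalue decomposition), the start vector is assumed not to be orthogonal to the leading eigenvector, data are mean-centered before projection, and all function names are hypothetical.

```python
import math


def covariance(rows):
    """Return (feature means, sample covariance matrix) for a list of numeric tuples."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return means, cov


def top_components(cov, k, iters=500):
    """Approximate the k leading eigenvectors by power iteration with deflation."""
    d = len(cov)
    mat = [row[:] for row in cov]
    components = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d  # assumption: not orthogonal to the leading eigenvector
        for _ in range(iters):
            w = [sum(mat[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        eigenvalue = sum(v[a] * sum(mat[a][b] * v[b] for b in range(d)) for a in range(d))
        components.append(v)
        # Deflate: remove the component just found before searching for the next one.
        mat = [[mat[a][b] - eigenvalue * v[a] * v[b] for b in range(d)] for a in range(d)]
    return components


def residual_norm(z, means, components):
    """Norm of the part of z that lies outside the span of the top components."""
    centered = [zi - m for zi, m in zip(z, means)]
    residual = centered[:]
    for p in components:
        proj = sum(c * pi for c, pi in zip(centered, p))
        residual = [r - proj * pi for r, pi in zip(residual, p)]
    return math.sqrt(sum(r * r for r in residual))
```

A measurement vector would then be flagged when `residual_norm` exceeds a chosen threshold; how that threshold is set is left open here.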


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus, anomalies can be detected by setting a threshold on the residual component

Figure source [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

  • DoS attack: multiple source IP addresses concentrated on a single destination IP address
  • Network scan: dispersed distribution of destination addresses
  • Most worms/viruses also induce some change in the distribution of certain features
  • However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this approach aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly: the traffic feature changes, but the traffic volume remains more or less the same

We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure

Figure source [22]
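The two entropy extremes can be checked with a short sketch. It computes the empirical Shannon entropy (base 2) of the values observed for one traffic feature in a time bin; the function name and the sample addresses are hypothetical.

```python
import math
from collections import Counter


def sample_entropy(values):
    """Empirical Shannon entropy (in bits) of a list of observed feature values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Maximally concentrated: every packet targets the same destination port.
concentrated = sample_entropy([80] * 8)

# Maximally dispersed: 8 packets, 8 distinct destination ports (as in a port scan).
dispersed = sample_entropy(list(range(1024, 1032)))
```

With N = 8 distinct, equally frequent values the entropy equals log2(8) = 3 bits, while a single repeated value gives 0.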


Entropy Based versus Volume Based

  • DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
  • Scan anomaly: a spike in traffic from a dominant source IP
  • Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
  • Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and players
  • Insurance (auto, health (a)): Claimant, Provider, Payer
  • Telecommunications: Customer, Provider
  • Credit Cards: Customer, Supplier, Bank
  • Web Advertising: User, Advertiser, Publisher

(a) http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
  • User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
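A minimal sketch of the activity monitoring idea: build a per-user profile from historical counts and flag a current value that deviates strongly from it. Using the mean and standard deviation as the profile and a z-score threshold of 3 as the notion of "significant deviation" are assumptions of this sketch; [13] does not prescribe them, and the function names are hypothetical.

```python
from statistics import mean, stdev


def build_profile(history):
    """Profile = (mean, sample standard deviation) of historical counts."""
    return mean(history), stdev(history)


def is_deviation(profile, current, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard deviations from the mean."""
    mu, sigma = profile
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold


# Calls made by user X in January of five previous years (illustrative numbers).
profile_x = build_profile([40, 42, 38, 41, 39])
normal_month = is_deviation(profile_x, 41)
suspicious_month = is_deviation(profile_x, 120)
```

The check itself is a constant-time comparison, which matches the strength noted above, while the need to keep one profile per user reflects the listed weakness.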


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
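The first three egonet features can be computed with a few lines of Python. The sketch assumes an undirected weighted graph stored as a symmetric dictionary of dictionaries; the representation and the function name are hypothetical, and the eigenvalue feature is omitted to keep the example dependency-free.

```python
def egonet_features(graph, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    graph: dict mapping each node to a dict {neighbor: edge weight}, symmetric.
    """
    neighbors = set(graph[node])
    members = neighbors | {node}
    seen_edges = set()
    total_weight = 0.0
    for u in members:
        for v, weight in graph[u].items():
            if v in members:
                edge = frozenset((u, v))
                if edge not in seen_edges:  # count each undirected edge once
                    seen_edges.add(edge)
                    total_weight += weight
    return len(neighbors), len(seen_edges), total_weight


# A star centered on "c" and a triangle, as two contrasting egonets.
star = {"c": {"a": 1.0, "b": 1.0, "d": 1.0},
        "a": {"c": 1.0}, "b": {"c": 1.0}, "d": {"c": 1.0}}
triangle = {"a": {"b": 2.0, "c": 2.0},
            "b": {"a": 2.0, "c": 2.0},
            "c": {"a": 2.0, "b": 2.0}}
```

In the star, the number of egonet edges equals the degree of the center, whereas in the triangle (a clique) every pair of members is connected.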


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and power law relationship
  • E.g., Egonet Density Power Law: $N_i$ vs $E_i$, detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node $i$ w.r.t. a pair of features ($y = C x^{\theta}$)

$S_i = \frac{\max(y_i,\ C x_i^{\theta})}{\min(y_i,\ C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
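The score can be evaluated directly once the power law has been fitted. In this sketch the fitted constants `c` and `theta` are passed in as given (fitting them is out of scope), the logarithm is taken as the natural logarithm (an assumption, since the slide does not state the base), and both the observed and expected values are assumed to be positive.

```python
import math


def power_law_score(x, y, c, theta):
    """Deviation of the observed y from the fitted power law y = c * x**theta."""
    expected = c * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)


# Illustrative fit with c = 1 and theta = 1 (edges grow linearly with degree).
on_the_line = power_law_score(x=4.0, y=4.0, c=1.0, theta=1.0)
near_clique = power_law_score(x=10.0, y=45.0, c=1.0, theta=1.0)
```

A node that sits exactly on the fitted line scores 0, and the score grows with both the ratio and the absolute distance from the line.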


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality scores for a node $t \in V_2$
  • Find set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \dots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \dots, t_n\}$ joined by edges $E$
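One way to make the relevance and normality scores concrete is a random walk with restart computed by fixed-point iteration. The restart probability, the iteration count, the use of the relevance value itself as the similarity entry, and the function names are all assumptions of this sketch; it also assumes every node has at least one neighbor and that the scored node has at least two neighbors.

```python
def relevance(adj, query, restart=0.15, iters=100):
    """Random walk with restart from `query`; returns a visit-probability per node.

    adj: dict mapping each node (from V1 or V2) to the list of its neighbors.
    """
    score = {n: 0.0 for n in adj}
    score[query] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, neighbors in adj.items():
            share = (1 - restart) * score[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        nxt[query] += restart  # with probability `restart`, jump back to the query node
        score = nxt
    return score


def normality(adj, t):
    """Mean pairwise relevance among the V1 neighbors of node t (off-diagonal entries)."""
    s_t = adj[t]
    rel = {a: relevance(adj, a) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs)


# Two tight communities {a1, a2} and {a3, a4}; t5 links a1 and a3 across them.
adj = {
    "a1": ["t1", "t2", "t5"], "a2": ["t1", "t2"],
    "a3": ["t3", "t4", "t5"], "a4": ["t3", "t4"],
    "t1": ["a1", "a2"], "t2": ["a1", "a2"],
    "t3": ["a3", "a4"], "t4": ["a3", "a4"],
    "t5": ["a1", "a3"],
}
```

In this toy graph, `normality(adj, "t5")` comes out lower than `normality(adj, "t1")`, because the two neighbors of `t5` belong to different communities and are only weakly relevant to each other.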


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department visits
  • Grocery data (Example: [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ Flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn Bayesian network from historical data
  • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | Current Environment)
3. Compute contingency table for rules for $DB_{baseline}$ and $DB_{current}$
  • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find p-value for rules using the $\chi^2$-test
  • Null Hypothesis: rows and columns of tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat from Step 2
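The scoring of candidate rules in Steps 3 to 5 can be sketched with a chi-square test on a 2x2 contingency table. The table layout (rows: record matches the rule or not; columns: current or baseline data), the rule names, and the counts are illustrative assumptions; the p-value uses the closed form for one degree of freedom, and the most significant rule is the one with the lowest p-value.

```python
import math


def chi2_p_value(table):
    """p-value of the chi-square independence test for a 2x2 table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # a degenerate table carries no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical rules with [matches, non-matches] for current vs. baseline data.
rule_tables = {
    "Symptom = cough": [[30, 10], [70, 90]],
    "Gender = male": [[50, 50], [50, 50]],
}
p_values = {rule: chi2_p_value(t) for rule, t in rule_tables.items()}
most_significant = min(p_values, key=p_values.get)
```

The rule whose proportion differs most between the current and baseline data gets the lowest p-value and is reported first.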


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI)
    • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30 year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
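As an example of one of the base choices, the sketch below subtracts the monthly mean from each observation at a single location, which removes seasonality. The input format (a list of month and value pairs) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def monthly_anomalies(series):
    """Return v' = v - b, where b is the mean of all values for the same calendar month.

    series: list of (month, value) pairs for one location, in time order.
    """
    by_month = defaultdict(list)
    for month, value in series:
        by_month[month].append(value)
    base = {month: mean(values) for month, values in by_month.items()}
    return [value - base[month] for month, value in series]


# Two Januaries and two Julys of temperature readings (illustrative numbers).
readings = [(1, 10.0), (7, 30.0), (1, 12.0), (7, 34.0)]
anomalies = monthly_anomalies(readings)
```

After subtraction, the large January versus July gap disappears and only the year-to-year departures remain.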


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time (Matlab: bwlabel / bwlabeln)
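The two steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the threshold (-1.0), the minimum component size (3), the 6-connectivity rule, and the toy grid are all made up, and the hypothetical helper `label_components` only mimics what a labeling routine such as `bwlabeln` provides.

```python
from collections import deque


def label_components(mask):
    """Label 6-connected True cells of mask[t][y][x] and return
    {label: size}. Connectivity choice is an assumption."""
    T, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * X for _ in range(Y)] for _ in range(T)]
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    sizes = {}
    next_label = 0
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not mask[t][y][x] or labels[t][y][x]:
                    continue
                next_label += 1
                labels[t][y][x] = next_label
                queue = deque([(t, y, x)])
                size = 0
                while queue:  # breadth-first flood fill
                    ct, cy, cx = queue.popleft()
                    size += 1
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                                and mask[nt][ny][nx] and not labels[nt][ny][nx]):
                            labels[nt][ny][nx] = next_label
                            queue.append((nt, ny, nx))
                sizes[next_label] = size
    return sizes


# Toy precipitation anomalies precip[t][y][x] (made-up values).
precip = [
    [[-2.0, -1.5], [0.3, 0.1]],
    [[-1.8, 0.2], [0.0, -1.9]],
]
threshold = -1.0  # step 1: cells drier than this count as anomalous
mask = [[[v < threshold for v in row] for row in frame] for frame in precip]

sizes = label_components(mask)  # step 2: connected components
large = {label: s for label, s in sizes.items() if s >= 3}
```

Treating time as a third axis is what makes a component "persistent": a dry cell only joins a drought if it touches another dry cell in space or in the adjacent time step.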

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components

Example (src: http://newscenter.lbl.gov)

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
• Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in a timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution:
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Generating Labeled Data for Validation

Generating Both Normal and Anomalous Data

Use generative models for normal and anomalous behavior

Several generators are available:
• Multivariate continuous data [27]
• Multivariate categorical data [5]
• Discrete sequences using HMM [12]

Drawbacks: might not capture the domain characteristics

Injecting Anomalies - Random Perturbation [30]

Given a data point $z = (x, y)$, where x and y are partitions of the feature space

Take a random sample D of the entire data set

Let $z' = (x', y') \in D$ such that the distance between y and y' is maximum

Replace x with x' and add z back to the data set

Figure: scatter plot (axes x and y) showing z, z', and the injected anomaly
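The perturbation steps above can be sketched as follows. This is a minimal illustration assuming a scalar y and absolute difference as the distance; the name `inject_anomaly`, the toy data, and the sample size are hypothetical and not taken from [30].

```python
import random


def inject_anomaly(z, data, sample_size, rng):
    """Perturb z = (x, y) as described on the slide (hypothetical helper).

    1. Draw a random sample D from the data set.
    2. Pick z' = (x', y') in D whose y' is farthest from y.
    3. Replace x with x' and return the perturbed point.
    """
    x, y = z
    sample = rng.sample(data, sample_size)
    x_far, _ = max(sample, key=lambda point: abs(point[1] - y))
    return (x_far, y)


# Toy data set (made up): x and y move together, so (9, 0.0) breaks the pattern.
data = [(i, float(i)) for i in range(10)]
rng = random.Random(0)

# Sampling the whole data set here keeps the example deterministic.
injected = inject_anomaly((0, 0.0), data, sample_size=len(data), rng=rng)
data.append(injected)  # add the perturbed point back to the data set
```

Each half of the injected point is individually plausible because it comes from real data; only the combination is unusual, which is what makes such points useful as labeled anomalies.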


Applications Overview

How to set up an anomaly detection solution for a given application domain?

• Available data
• Define anomalies, define normal behavior
• Identify requirements and constraints (online, real-time, limited resources)
• What domain knowledge is available?
  • Feature identification
  • Defining normal and anomalous behavior
  • Tuning parameters
• Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such a system is therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature; in addition, they profile the normal behavior of the network and address suspected incidents

Network Anomaly Detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service - DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: bar chart of yearly counts (vertical axis from 0 to 3,000,000). 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836


Network Topology

A typical network

Figure: a router connected to four nodes (1-4), with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$

Origin-Destination (OD) flow: the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

Link measurement: the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
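A small worked example may help here. Under the usual formulation in [22], link traffic is $y = A x$, where $A_{ij} = 1$ if OD flow j traverses link i and 0 otherwise. The topology and traffic numbers below are made up for illustration.

```python
def link_loads(A, x):
    """Compute y = A x: the load on each link is the sum of the OD flows
    routed over it (hypothetical helper)."""
    return [sum(a * flow for a, flow in zip(row, x)) for row in A]


# Hypothetical routing matrix: 2 links (rows) and 3 OD flows (columns).
A = [
    [1, 1, 0],  # link 1 carries flows 1 and 2
    [0, 1, 1],  # link 2 carries flows 2 and 3
]
x = [10, 5, 2]        # OD flow volumes in one time interval
y = link_loads(A, x)  # link measurements for that interval
```

Because each link sums several flows, a spike in a small OD flow can be masked by the larger flows sharing the same link.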


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the data set

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
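The five steps above can be sketched with NumPy as follows. This is an illustration under assumptions, not the procedure of [22]: rows of `X` are observations (so the projection is written as `Xc @ P @ P.T` rather than $P P^T Z$), the data is synthetic, and the anomaly is simply taken to be the row with the largest residual norm instead of using a statistical threshold.

```python
import numpy as np


def subspace_residual_norms(X, k):
    """Squared norm of each observation's residual after removing its
    projection onto the top-k principal components (hypothetical helper)."""
    Xc = X - X.mean(axis=0)                     # center the data
    cov = np.cov(Xc, rowvar=False)              # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                         # step 2: d x k matrix of top PCs
    normal = Xc @ P @ P.T                       # step 3: normal subspace part
    residual = Xc - normal                      # step 4: abnormal subspace part
    return np.sum(residual ** 2, axis=1)        # step 5: squared residual norm


# Synthetic traffic (made up): three links driven by one common trend.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(200, 3))
X[50] += np.array([1.0, -1.0, 1.0])  # inject a change off the common trend

spe = subspace_residual_norms(X, k=1)
flagged = int(np.argmax(spe))  # index of the most anomalous time bin
```

The injected row has ordinary magnitude on each link, so per-link thresholds would likely miss it; it stands out only because it breaks the correlation captured by the normal subspace.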


Subspace Analysis Results

Note that during an anomaly, the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure: port scan anomaly (the traffic feature distribution changes, but traffic volume remains more or less the same). The port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Figure source: [22]

We can use entropy to capture the variations in the traffic feature
• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed
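The two limiting values of entropy can be checked with a short sketch. The function name `sample_entropy` and the IP addresses are illustrative assumptions; the formula is the standard Shannon entropy $H = -\sum_i p_i \log_2 p_i$ over the empirical distribution of a feature.

```python
from collections import Counter
from math import log2


def sample_entropy(observations):
    """Shannon entropy (in bits) of the empirical distribution of a
    traffic feature, e.g. destination IPs seen in one time bin."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())


# Made-up feature values for two time bins with the same packet count.
concentrated = ["10.0.0.1"] * 8                # all packets hit one address
dispersed = [f"10.0.0.{i}" for i in range(8)]  # 8 distinct addresses, N = 8

h_low = sample_entropy(concentrated)   # maximally concentrated: 0 bits
h_high = sample_entropy(dispersed)     # maximally dispersed: log2(8) bits
```

Both bins contain eight packets, so a volume metric cannot tell them apart, while the entropy differs by three bits.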


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: Identify fraudulent activities or players from observed transaction data

Data:
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges:
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Insurance (auto, health [a]): Claimant, Provider, Payer
Telecommunications: Customer, Provider
Credit Cards: Customer, Supplier, Bank
Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • Example: user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses:
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
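The activity-monitoring idea can be illustrated with a very small profile comparison. This sketch assumes the profile is just the mean and standard deviation of past monthly call counts and that deviation is measured as a z-score; the function name, the counts, and the cut-off of 3 are all hypothetical and are not taken from [13].

```python
from statistics import mean, stdev


def deviation_score(history, current):
    """Number of standard deviations between current behavior and the
    historical profile (hypothetical helper)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma


# Made-up profile: calls made by user X in January over five past years.
january_calls = [40, 42, 38, 41, 39]

score_normal = deviation_score(january_calls, 43)   # slightly above usual
score_fraud = deviation_score(january_calls, 120)   # far outside the profile

cutoff = 3.0  # assumed alert threshold
alerts = [s > cutoff for s in (score_normal, score_fraud)]
```

Scoring is a constant-time lookup per user, which matches the "fast" strength above, but one such profile must be stored and refreshed for every user, which matches the first weakness.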


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
• Communication networks (phone, email, SMS)
• Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node
2. $E_i$: number of edges in egonet
3. $W_i$: total weight of egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is thus transformed into a point in a multi-dimensional space
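The first three egonet features can be computed directly from an adjacency structure. The sketch below assumes an undirected graph stored as a dict of dicts with symmetric weights and string node names; the helper name `egonet_features` and the toy graph are illustrative, not from [1].

```python
def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`: the node, its
    neighbors, and all edges among them (hypothetical helper)."""
    members = {node} | set(adj[node])
    n_i = len(adj[node])  # degree of the node
    e_i = 0
    w_i = 0.0
    for u in members:
        for v, weight in adj[u].items():
            if v in members and u < v:  # count each undirected edge once
                e_i += 1
                w_i += weight
    return n_i, e_i, w_i


# Made-up weighted graph: a is connected to b, c, d; b and c are also linked.
adj = {
    "a": {"b": 1.0, "c": 1.0, "d": 1.0},
    "b": {"a": 1.0, "c": 2.0},
    "c": {"a": 1.0, "b": 2.0},
    "d": {"a": 1.0},
}
features_a = egonet_features(adj, "a")
features_d = egonet_features(adj, "d")
```

The principal eigenvalue $\lambda_i$ is omitted here; it would require an eigen-decomposition of the egonet's weighted adjacency matrix.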


Identifying Anomalies

Traditional anomaly detection (LOF): can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures:

Identify relevant feature pairs and their power-law relationship

E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
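The score above translates almost directly into code. In this sketch the fitted constants C and θ are simply given (in practice they would be fitted to all nodes), the natural logarithm is an assumption about the base, and the function name and example values are hypothetical.

```python
from math import log


def power_law_score(x, y, C, theta):
    """Anomaly score S_i of a node with features (x, y) relative to the
    fitted power law y = C * x**theta (hypothetical helper)."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)  # how many times off the fit
    return ratio * log(abs(y - expected) + 1)    # weighted by absolute gap


# Made-up fit: E = N**1.5, i.e. C = 1 and theta = 1.5.
C, theta = 1.0, 1.5

on_fit = power_law_score(x=4, y=8, C=C, theta=theta)     # E matches the fit
star_like = power_law_score(x=4, y=4, C=C, theta=theta)  # far fewer edges
```

A node lying on the fitted line scores 0; the score grows with both the relative and the absolute distance from the line, so star-like nodes (too few edges) and near-cliques (too many) are both flagged.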


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (beneficiaries vs. providers)
• Insider trading (traders vs. stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
• RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
• Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges E
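The relevance and normality computations above can be sketched as follows. Several choices here are assumptions made for illustration: relevance is approximated by the stationary probability of a random walk with restart (restart probability 0.15), the similarity matrix entries are taken to be the pairwise relevance scores themselves, and the graph and all names are made up.

```python
def relevance(edges, v1, v2, query, restart=0.15, iters=200):
    """Random walk with restart from `query`; returns the visit
    probability of every node in V1 (hypothetical helper)."""
    nodes = list(v1) + list(v2)
    nbrs = {n: [] for n in nodes}
    for a, t in edges:
        nbrs[a].append(t)
        nbrs[t].append(a)
    p = {n: 0.0 for n in nodes}
    p[query] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in nodes}
        for n, mass in p.items():
            for m in nbrs[n]:  # spread mass uniformly over neighbors
                nxt[m] += (1 - restart) * mass / len(nbrs[n])
        nxt[query] += restart  # jump back to the query node
        p = nxt
    return {a: p[a] for a in v1}


def normality(edges, v1, v2, t):
    """Mean pairwise relevance among the V1 nodes linked to t."""
    s_t = [a for a, tt in edges if tt == t]
    rel = {a: relevance(edges, v1, v2, a) for a in s_t}
    entries = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(entries) / len(entries)


# Made-up graph: a1 and a2 share t1 and t2; a3 and a4 share only t3;
# the link (a1, t3) connects a1 to an otherwise unrelated group.
v1 = ["a1", "a2", "a3", "a4"]
v2 = ["t1", "t2", "t3"]
edges = [("a1", "t1"), ("a2", "t1"), ("a1", "t2"), ("a2", "t2"),
         ("a3", "t3"), ("a4", "t3"), ("a1", "t3")]

rel_a1 = relevance(edges, v1, v2, "a1")
norm_t1 = normality(edges, v1, v2, "t1")
norm_t3 = normality(edges, v1, v2, "t3")
```

In this toy graph t3 receives a lower normality score than t1 because its neighbors a1, a3, and a4 are only weakly relevant to each other, which is the signal used to flag links into t3 as suspicious.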


Detecting Disease Outbreaks

Domain question: Early detection of disease outbreaks
• Anthrax attack

Data:
• Emergency department visits
• Grocery data (example: [14])
• Clinical visits
• Weather/climate data

Challenges:
• Weak signals in the data
  • ED cases involving cough ⇒ flu or SARS?
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value. Repeat Step 2
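Steps 3 to 5 can be illustrated for single-assignment rules with a $2 \times 2$ contingency table. This is a simplified sketch, not the WSARE implementation: it uses the plain Pearson $\chi^2$ statistic with one degree of freedom and no continuity correction, picks the most significant rule as the one with the lowest p-value, and all counts and rule names are made up.

```python
from math import erfc, sqrt


def chi2_p_value(table):
    """p-value of the chi-square independence test for a 2x2 table
    [[a, b], [c, d]] (hypothetical helper).

    Rows: DB_current, DB_baseline. Columns: rule matches, rule does not.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = [a + b, c + d]
    cols = [a + c, b + d]
    chi2 = 0.0
    for i, r in enumerate(rows):
        for j, col in enumerate(cols):
            expected = r * col / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


# Made-up counts: 100 current records and 1000 baseline records.
rules = {
    "symptom = respiratory": [[30, 70], [100, 900]],  # 30% now vs 10% before
    "gender = female": [[10, 90], [100, 900]],        # same rate as baseline
}
p_values = {rule: chi2_p_value(table) for rule, table in rules.items()}
best_rule = min(p_values, key=p_values.get)  # most significant rule
```

A low p-value means the rule matches current records at a rate that is hard to explain by the baseline, which is why the rule with the lowest p-value is reported.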


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science questions:
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges:
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference).

How to choose $b_i$?
  • Mean of all data for location i
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
      • 30 year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
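As a small illustration of one of the choices of base, the sketch below uses the monthly mean as $b_i$. The function name and the temperature values are hypothetical.

```python
from statistics import mean

def monthly_anomalies(values, months):
    """Return v'_i = v_i - b_i, where b_i is the mean of all values that
    fall in the same calendar month (a monthly climatology)."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]

# Hypothetical temperatures for two Januaries and two Julys at one location.
values = [2.0, 4.0, 20.0, 24.0]
months = [1, 1, 7, 7]
anoms = monthly_anomalies(values, months)   # [-1.0, 1.0, -2.0, 2.0]
```

Subtracting a per-month base removes the seasonal cycle, so a warm January and a warm July both show up as positive anomalies.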


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both.

Find persistent spatiotemporal anomalies in precipitation data.

A two-step approach:
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel/bwlabeln

Follow-up science question: explain the cause.

Figure: Video courtesy of Dr. Arindam Banerjee (drought_1901_2006_1.avi)
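The two-step approach can be sketched without Matlab as a threshold followed by a breadth-first search over a (time, y, x) grid. The grid values, the threshold, the minimum size, and the choice to connect only face-adjacent cells are all assumptions made for illustration.

```python
from collections import deque

def drought_components(precip, threshold, min_size):
    """precip[t][y][x] is a precipitation anomaly; cells below `threshold`
    are flagged dry. Returns the face-connected space-time components
    that contain at least `min_size` flagged cells."""
    T, H, W = len(precip), len(precip[0]), len(precip[0][0])
    dry = {(t, y, x) for t in range(T) for y in range(H) for x in range(W)
           if precip[t][y][x] < threshold}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen, components = set(), []
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            t, y, x = queue.popleft()
            comp.append((t, y, x))
            for dt, dy, dx in steps:
                nb = (t + dt, y + dy, x + dx)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(comp) >= min_size:
            components.append(sorted(comp))
    return components

# Hypothetical 3-step, 2x3 grid: a dry patch persists around (y=0, x=0..1).
precip = [
    [[-2.0, -1.5, 0.3], [0.1, 0.2, 0.0]],
    [[-1.8, -1.2, 0.4], [0.0, 0.1, -1.1]],
    [[-1.6, 0.2, 0.1], [0.3, 0.0, 0.2]],
]
events = drought_components(precip, threshold=-1.0, min_size=3)
```

The isolated dry cell at time step 1 is dropped by the size filter, while the patch that persists over three time steps is kept as one event.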


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example src: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields

Methodology
1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Applications Overview

How to set up an anomaly detection solution for a given application domain?
  • Available data
  • Define anomalies, define normal behavior
  • Identify requirements and constraints (online, real-time, limited resources)
  • What domain knowledge is available?
      • Feature identification
      • Defining normal and anomalous behavior
      • Tuning parameters
  • Available ground truth (training, validation)


What is Network Anomaly Detection

Anomaly Detection or Intrusion Detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures. Such a system is therefore vulnerable to zero-day attacks.

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents.

Network Anomaly Detection: finding unusual and large changes in the network traffic.

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds).


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

Figure: bar chart of yearly counts (vertical axis from 0 to 3,000,000)

Period                   Count
2002                     20,547
2003                     18,827
2004                     69,107
2005                     113,025
2006                     140,690
2007                     624,267
2008                     1,656,227
July 2009 - June 2010    2,704,836


Network Topology

A typical network

[Figure: network diagram with a router, nodes 1 to 4, OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$, and link measurement $y_{1t}$]

Origin-Destination (OD) flow: the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$).

Link measurement: the traffic that enters at a node during an interval ($y_{1t}$).

The relationship between link traffic and OD flow traffic is captured by the routing matrix A.
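As a minimal sketch of this relationship, link loads can be written as y = A x, where entry A[i][j] is 1 when OD flow j traverses link i. The 3-link, 4-flow routing matrix and the volumes below are hypothetical.

```python
def link_loads(A, x):
    """Compute y = A x: A[i][j] is 1 if OD flow j is routed over link i."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]

# Hypothetical 3-link network carrying 4 OD flows.
A = [
    [1, 1, 0, 0],   # link 1 carries OD flows 1 and 2
    [0, 1, 1, 0],   # link 2 carries OD flows 2 and 3
    [0, 0, 1, 1],   # link 3 carries OD flows 3 and 4
]
x_t = [100, 250, 80, 40]      # OD flow volumes in one interval
y_t = link_loads(A, x_t)      # [350, 330, 120]
```

Because each link sums several OD flows, a change in one flow is spread across every link on its path.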


Network Anomalies Detection Problem


Why care about OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus the naive approach won't work if OD flow information is not available.

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset.

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the matrix P.

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly".
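The five steps can be sketched in plain Python for the special case k = 1, using power iteration in place of a full eigenvalue decomposition. The function names, the two-link traffic values, and the fixed starting vector are assumptions; a real deployment would use a linear algebra library and several components.

```python
import math

def top_component(rows, iters=200):
    """Return (mean, p): the column means and the leading principal
    component of `rows`, found by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Assumed start vector; it must not be orthogonal to the leading component.
    p = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        q = [sum(cov[a][b] * p[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in q))
        p = [v / norm for v in q]
    return mean, p

def residual_norm(z, mean, p):
    """Norm of (I - p p^T)(z - mean), the part of z outside the normal subspace."""
    c = [zj - mj for zj, mj in zip(z, mean)]
    proj = sum(cj * pj for cj, pj in zip(c, p))
    resid = [cj - proj * pj for cj, pj in zip(c, p)]
    return math.sqrt(sum(v * v for v in resid))

# Hypothetical traffic on two links that normally rise and fall together.
normal = [[10, 20], [12, 24], [11, 22.5], [13, 25.5], [9, 18.5], [14, 28]]
mean, p = top_component(normal)
baseline = max(residual_norm(z, mean, p) for z in normal)
suspect = residual_norm([11, 30], mean, p)   # link 2 jumps while link 1 does not
```

The suspect measurement breaks the usual correlation between the two links, so most of it falls in the residual subspace and its residual norm is several times the largest value seen in the normal data.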


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes a lot.

Thus anomalies can be detected by setting a threshold on the residual component.

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features).

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

Figure source: [22]. Port scan anomaly: the traffic feature changes, but traffic volume remains more or less the same.

Figure annotations: "Port scan dwarfed in volume metrics…" and "But stands out in feature entropy, which also reveals its structure".

We can use entropy to capture the variations in the traffic feature:
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed
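The two extreme values of entropy can be checked with a short sketch that computes the empirical entropy of one traffic feature in a time bin. The function name and the port values are hypothetical.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy (in bits) of the observed feature values, e.g. the
    destination ports seen in one time bin."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

concentrated = sample_entropy([80] * 1000)   # all packets go to one port
dispersed = sample_entropy(range(1024))      # scan-like: 1024 distinct ports, once each
```

With N = 1024 equally likely values the entropy reaches log2(1024) = 10 bits, while a single repeated value gives 0, even though both bins could carry a similar number of packets.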


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data.

Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domain                          Players
Insurance (auto, health [a])    Claimant, Provider, Payer
Telecommunications              Customer, Provider
Credit Cards                    Customer, Supplier, Bank
Web Advertising                 User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historical data
   • Example: User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
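A minimal sketch of the activity monitoring idea follows, assuming the profile is simply the mean and standard deviation of a user's past counts and that a z-score above 3 counts as a significant deviation. The call counts and the cutoff are hypothetical.

```python
from statistics import mean, stdev

def deviation_score(history, current):
    """Z-score of the current value against a user's historical profile."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return (current - mu) / sigma

# Hypothetical profile: calls made by user X in each of the past six Januaries.
january_calls = [40, 44, 38, 42, 41, 41]
score_normal = deviation_score(january_calls, 43)
score_suspicious = deviation_score(january_calls, 120)
flagged = abs(score_suspicious) > 3.0
```

The check is a single comparison per user, which is why this style of detection is fast and easy to explain, and also why every user needs a maintained profile.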


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1].

Choose features that can highlight anomalous nodes:
1. $N_i$: degree of the node
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space.
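The first three features can be computed directly from an edge list, as in the sketch below. The graph, the edge weights, and the function name are hypothetical, and the eigenvalue feature is left out because it needs a linear algebra routine.

```python
def egonet_features(edges, ego):
    """edges: dict mapping an undirected edge (u, v) to its weight.
    Returns (N_i, E_i, W_i) for the egonet of `ego`: the ego, its
    neighbors, and every edge among them."""
    neighbors = set()
    for u, v in edges:
        if u == ego:
            neighbors.add(v)
        elif v == ego:
            neighbors.add(u)
    members = neighbors | {ego}
    inside = {e: w for e, w in edges.items() if e[0] in members and e[1] in members}
    return len(neighbors), len(inside), sum(inside.values())

# Hypothetical call graph: weights are call counts.
edges = {("a", "b"): 3, ("a", "c"): 1, ("a", "d"): 2, ("b", "c"): 5, ("d", "e"): 4}
features_a = egonet_features(edges, "a")    # (3, 4, 11)
features_e = egonet_features(edges, "e")    # (1, 1, 4)
```

Node "a" has three neighbors and four egonet edges because the edge between "b" and "c" also lies inside its egonet.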


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
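The score can be evaluated with a few lines of Python. The fitted constants C and θ below are hypothetical, and the natural logarithm is an assumption since the base is not stated.

```python
import math

def oddball_score(x_i, y_i, C, theta):
    """Distance of node i from the fitted power law y = C * x**theta."""
    expected = C * x_i ** theta
    ratio = max(y_i, expected) / min(y_i, expected)
    return ratio * math.log(abs(y_i - expected) + 1.0)

# Hypothetical fit E = 1.0 * N**1.5 for the (N_i, E_i) feature pair.
on_curve = oddball_score(x_i=4, y_i=8.0, C=1.0, theta=1.5)
near_clique = oddball_score(x_i=4, y_i=10.0, C=1.0, theta=1.5)
```

A node that sits on the fitted curve scores 0, while a node with 4 neighbors and all 10 possible egonet edges (a clique) scores above 0 because it has more edges than the power law predicts.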


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  • RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges E]


Detecting Disease Outbreaks

Domain Question: Early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department (ED) visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data
      • ED cases involving cough ⇒ flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
  • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
  • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$ test
  • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant difference between baseline and current data). Repeat Step 2.
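Steps 3 to 5 can be illustrated with a small sketch. It assumes single-assignment rules only, a 2x2 contingency table per rule (baseline vs. current, rule matches vs. does not match), and a $\chi^2$ test with one degree of freedom, whose p-value equals erfc(sqrt(chi2 / 2)). The Bayesian-network sampling of Steps 1 and 2 is replaced here by a hand-made baseline; the function names and the records are hypothetical.

```python
import math

def chi2_p_value(a, b, c, d):
    """p-value of the chi-square independence test on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2.0))  # exact for 1 degree of freedom

def score_rules(baseline, current, rules):
    """Return (rule, p-value) pairs, most significant (smallest p-value) first."""
    scored = []
    for attr, value in rules:
        a = sum(1 for r in baseline if r[attr] == value)
        c = sum(1 for r in current if r[attr] == value)
        p = chi2_p_value(a, len(baseline) - a, c, len(current) - c)
        scored.append(((attr, value), p))
    return sorted(scored, key=lambda item: item[1])

def make_records(n_cough):
    """Hypothetical ED records: n_cough of 100 report cough; regions alternate."""
    return [{"symptom": "cough" if i < n_cough else "rash",
             "region": "north" if i % 2 == 0 else "south"} for i in range(100)]

baseline, current = make_records(10), make_records(30)
ranked = score_rules(baseline, current, [("symptom", "cough"), ("region", "north")])
print(ranked[0])
```

A full implementation would also correct for testing many rules at once, which this sketch omits.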


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Needs more data to train


Anomaly Detection in Climate and Weather

Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time-scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  • Ground observations
  • Remote sensing data (satellites, airborne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location i
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30-year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
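As a concrete instance of $v'_i = v_i - b_i$, the sketch below uses the monthly-mean choice of base, so that each observation is compared with the average of the same calendar month. The function name `monthly_anomalies` and the sample values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(values, months):
    """Subtract from each value the mean of all values in the same calendar month."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}  # b_i, one per month
    return [v - base[m] for v, m in zip(values, months)]

# Two hypothetical years of January-March temperatures at one location.
values = [10, 20, 30, 12, 22, 35]
months = [1, 2, 3, 1, 2, 3]
anomalies = monthly_anomalies(values, months)
print(anomalies)
```

Swapping `mean` for `statistics.median` gives the more robust median base listed above.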


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
    • Matlab: bwlabel, bwlabeln

Follow-up science question: explain the cause

Figure: Video courtesy of Dr. Arindam Banerjee
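The two-step approach can be sketched without MATLAB. The example below assumes a precipitation array indexed as [time][row][column], a fixed threshold, and 6-connectivity (cells that share a face in space or time); the connectivity choice is an assumption of this sketch, not something stated in the slides. `drought_components` is a hypothetical name.

```python
from collections import deque

# Face neighbors in (time, row, column).
NEIGHBORS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def drought_components(precip, threshold):
    """Threshold the grid, then return connected dry regions, largest first."""
    dry = {(t, y, x)
           for t, grid in enumerate(precip)
           for y, row in enumerate(grid)
           for x, value in enumerate(row)
           if value < threshold}
    seen, components = set(), []
    for cell in sorted(dry):
        if cell in seen:
            continue
        seen.add(cell)
        queue, component = deque([cell]), []
        while queue:  # breadth-first flood fill across space and time
            t, y, x = queue.popleft()
            component.append((t, y, x))
            for dt, dy, dx in NEIGHBORS:
                nb = (t + dt, y + dy, x + dx)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Three time steps of a 3x3 grid; values below 2 count as dry.
precip = [
    [[5, 5, 5], [5, 1, 5], [5, 5, 5]],
    [[5, 5, 5], [5, 1, 1], [5, 5, 5]],
    [[1, 5, 5], [5, 5, 1], [5, 5, 5]],
]
components = drought_components(precip, threshold=2)
print([len(c) for c in components])
```

Keeping only components above a minimum size or duration would correspond to the "large" and "persistent" requirement in the slide.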


More Climate Extreme Events using Anomaly Detection

Atmospheric rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example src: http://newscenter.lbl.gov

Cold fronts [23]
  • Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid
  2. Cluster grids into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

What is Network Anomaly Detection?

Anomaly detection or intrusion detection?

Traditional Intrusion Detection Systems (IDS): find attacks that match a predefined set of patterns, known as signatures; such systems are therefore vulnerable to zero-day attacks

Network Anomaly Detection Systems (NADS): detect zero-day attacks without any pre-identified signature by profiling the normal behavior of the network and flagging suspected incidents

Network anomaly detection: finding unusual and large changes in the network traffic

Examples: intentional attacks (e.g., Distributed Denial of Service, DDoS) or unusual network traffic (e.g., flash crowds)


Motivation: How Serious?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010

Figure: Bar chart of counts per period (y-axis from 0 to 3,000,000): 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836


Network Topology

A typical network

Figure: A router connected to nodes 1 to 4, with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$

Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

Link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix A
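To make the role of the routing matrix concrete, the sketch below assumes the usual linear model in which the vector of link loads is y = A x, where entry A[i][j] is 1 if OD flow j traverses link i and 0 otherwise. The topology, the flow volumes, and the name `link_loads` are invented for illustration.

```python
def link_loads(A, x):
    """Link traffic y = A x for routing matrix A (links x OD flows) and OD volumes x."""
    return [sum(a * flow for a, flow in zip(row, x)) for row in A]

# Hypothetical network: 2 links, 3 OD flows. Flow 1 crosses both links.
A = [
    [1, 1, 0],  # link 0 carries OD flows 0 and 1
    [0, 1, 1],  # link 1 carries OD flows 1 and 2
]
x_normal = [100, 50, 200]
x_spike = [100, 90, 200]  # OD flow 1 nearly doubles

y_normal = link_loads(A, x_normal)
y_spike = link_loads(A, x_spike)
print(y_normal, y_spike)
```

Here an 80% jump in one OD flow shows up as only a 16% change on link 1, which illustrates why volume anomalies can be hard to see from link counts alone.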


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) based on the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components with the highest eigenvalues as matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$

$\hat{Z} = P P^T Z = C Z$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of a vector in the abnormal subspace $\tilde{Z}$ is large, then it is an "anomaly"
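The five steps can be sketched in plain Python for the simplest case k = 1. This is an illustration under stated assumptions, not the implementation from [22]: the top principal component is found by power iteration instead of a full eigendecomposition, each row of the input is one time step of link measurements, and columns are mean-centered first. The name `residual_norms` and the traffic matrix are hypothetical.

```python
import math

def residual_norms(Z, iters=200):
    """Norm of (I - p p^T) z for every centered row z, with p the top principal component."""
    n, m = len(Z), len(Z[0])
    means = [sum(row[j] for row in Z) / n for j in range(m)]
    X = [[row[j] - means[j] for j in range(m)] for row in Z]
    cov = [[sum(r[i] * r[j] for r in X) / (n - 1) for j in range(m)] for i in range(m)]
    p = [1.0] * m
    for _ in range(iters):  # power iteration converges to the dominant eigenvector
        q = [sum(cov[i][j] * p[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in q))
        p = [v / norm for v in q]
    norms = []
    for r in X:
        proj = sum(a * b for a, b in zip(r, p))       # coordinate in the normal subspace
        resid = [a - proj * b for a, b in zip(r, p)]  # component in the abnormal subspace
        norms.append(math.sqrt(sum(v * v for v in resid)))
    return norms

# Hypothetical traffic on 3 links that normally scales together as (1, 2, 3) * load.
loads = [10, 20, 30, 40, 50, 60, 70, 80]
Z = [[l, 2 * l, 3 * l] for l in loads]
Z[3][0] += 25  # inject a spike on link 0 at time step 3

norms = residual_norms(Z)
suspect = max(range(len(norms)), key=norms.__getitem__)
print(suspect, [round(v, 2) for v in norms])
```

In practice k is larger than 1 and the threshold on the residual norm is chosen statistically; both are left out of this sketch.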


Subspace Analysis Results

Note that during an anomaly, the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]

Figure: Port scan anomaly (the traffic feature changes, but the traffic volume remains more or less the same). The port scan is dwarfed in volume metrics… but stands out in feature entropy, which also reveals its structure

We can use entropy to capture the variations in a traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed
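The two boundary cases of the entropy can be checked directly. The sketch below computes the empirical Shannon entropy (base 2) of one traffic feature, here destination ports; the function name `feature_entropy` and the sample traffic are made up for illustration.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical Shannon entropy (in bits) of the observed values of one feature."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Hypothetical destination ports seen in two time bins with the same packet count.
normal_ports = [80] * 90 + [443] * 10    # concentrated on two services
scan_ports = list(range(1024, 1124))     # port scan: 100 distinct ports

print(feature_entropy([80] * 100))       # maximally concentrated
print(feature_entropy(list(range(8))))   # maximally dispersed over N = 8 values
print(feature_entropy(normal_ports), feature_entropy(scan_ports))
```

Both bins contain 100 packets, so a volume metric cannot tell them apart, while the entropy of the destination port rises sharply during the scan.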


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data
  • Transactions between different players in the system
  • Meta-information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Insurance (auto, health [a]): claimant, provider, payer

Telecommunications: customer, provider

Credit cards: customer, supplier, bank

Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historical data
    • User X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
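A minimal sketch of the activity-monitoring idea follows. It assumes that a profile is just the mean and sample standard deviation of a user's historical monthly call counts, and that a "significant deviation" means more than k standard deviations from the mean; both choices, the function names, and the data are assumptions made for illustration.

```python
from statistics import mean, stdev

def build_profiles(history):
    """Profile per user: (mean, sample standard deviation) of historical counts."""
    return {user: (mean(counts), stdev(counts)) for user, counts in history.items()}

def flag_deviations(profiles, current, k=3.0):
    """Users whose current count is more than k standard deviations from their mean."""
    flagged = []
    for user, count in current.items():
        mu, sigma = profiles[user]
        if sigma > 0 and abs(count - mu) > k * sigma:
            flagged.append(user)
    return flagged

# Hypothetical monthly call counts for two users.
history = {"x": [100, 110, 90, 105, 95], "y": [20, 22, 18, 21, 19]}
current = {"x": 104, "y": 60}

profiles = build_profiles(history)
flagged = flag_deviations(profiles, current)
print(flagged)
```

The sketch also shows the weaknesses listed above: one profile must be stored per user, and the profile does not adapt unless it is rebuilt.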


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node i
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space
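The first three egonet features can be computed with a few lines of code. The sketch assumes an undirected weighted graph without self-loops, stored as a dictionary of neighbor-to-weight dictionaries, and it omits the spectral feature $\lambda_i$ to stay dependency-free. `build_weighted_graph`, `egonet_features`, and the toy edges are hypothetical.

```python
from collections import defaultdict

def build_weighted_graph(edges):
    """Undirected weighted graph from (u, v, weight) triples."""
    graph = defaultdict(dict)
    for u, v, w in edges:
        graph[u][v] = w
        graph[v][u] = w
    return graph

def egonet_features(graph, i):
    """Return (N_i, E_i, W_i) for the egonet of node i: i, its neighbors, and edges among them."""
    members = set(graph[i]) | {i}
    edges = {}
    for u in members:
        for v, w in graph[u].items():
            if v in members:
                edges[frozenset((u, v))] = w  # each undirected edge counted once
    return len(graph[i]), len(edges), sum(edges.values())

# A star around "hub" (weight 1) and a 4-clique c1..c4 (weight 2).
edges = [("hub", f"l{k}", 1) for k in range(1, 5)]
edges += [("c1", "c2", 2), ("c1", "c3", 2), ("c1", "c4", 2),
          ("c2", "c3", 2), ("c2", "c4", 2), ("c3", "c4", 2)]
graph = build_weighted_graph(edges)

star = egonet_features(graph, "hub")
clique = egonet_features(graph, "c1")
print(star, clique)
```

The star has as many egonet edges as neighbors, while the clique has many more, which is the contrast the feature pairs are meant to expose.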


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ (detects near-cliques and stars)

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
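The score can be evaluated directly once the power law has been fitted. In the sketch below the fit parameters C and theta are simply assumed (C = 1, theta = 1.5), the logarithm is taken as the natural log, and `out_score` is a hypothetical name; in practice C and theta would be estimated from all nodes.

```python
import math

def out_score(y, x, C, theta):
    """Distance of an observed y from the power-law prediction C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1.0)

C, theta = 1.0, 1.5  # assumed fit of the Egonet Density Power Law

on_line = out_score(8.0, 4.0, C, theta)       # E = N**1.5 exactly
star = out_score(10.0, 10.0, C, theta)        # 10 neighbors, only 10 egonet edges
near_clique = out_score(55.0, 10.0, C, theta) # 10 neighbors, 55 egonet edges
print(on_line, star, near_clique)
```

A node that lies on the fitted line scores zero, while both the star (too few edges) and the near-clique (too many edges) receive positive scores.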


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph: healthcare data (beneficiaries vs. providers), insider trading (traders vs. stocks).

Objective: identify anomalous links.

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$.

$\mathrm{RelevanceScore}(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$.

Use the relevance scores to compute the normality score for a node $t \in V_2$:

Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$.

Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$.

Normality score = mean of the non-diagonal entries of the similarity matrix.
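A minimal sketch of this procedure follows. The slide does not say how the random walk is run, so the sketch assumes a random walk with restart (restart probability 0.3, an arbitrary choice) solved by power iteration, and it takes the similarity between two nodes of $S_t$ to be the relevance score itself. The names `relevance_vector` and `normality_score` are hypothetical.

```python
import numpy as np

def relevance_vector(adj, query, restart=0.3, n_iter=200):
    """Relevance of every V1 node to the V1 node `query`.

    adj is an (n1 x n2) 0/1 matrix: adj[a, t] = 1 if edge <a, t> exists.
    Uses a random walk with restart on the full bipartite graph
    (an assumption of this sketch).
    """
    n1, n2 = adj.shape
    full = np.zeros((n1 + n2, n1 + n2))
    full[:n1, n1:] = adj
    full[n1:, :n1] = adj.T
    col_sums = full.sum(axis=0)
    col_sums[col_sums == 0] = 1.0          # isolated nodes: avoid divide by zero
    trans = full / col_sums                # column-stochastic transition matrix
    e = np.zeros(n1 + n2)
    e[query] = 1.0
    r = e.copy()
    for _ in range(n_iter):
        r = (1 - restart) * (trans @ r) + restart * e
    return r[:n1]

def normality_score(adj, t):
    """Mean off-diagonal relevance among the V1 neighbours of V2 node t."""
    members = np.flatnonzero(adj[:, t])
    if len(members) < 2:
        return float("nan")                # undefined for fewer than two neighbours
    sim = np.array([relevance_vector(adj, a)[members] for a in members])
    off_diagonal = ~np.eye(len(members), dtype=bool)
    return float(sim[off_diagonal].mean())
```

A node $t$ whose neighbours come from otherwise unrelated groups gets a low normality score, which marks its links as candidates for inspection.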

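Figure: Bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \dots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \dots, t_n\}$ connected by edges $E$.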


Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks (e.g., an anthrax attack).

Data: emergency department (ED) visits, grocery data (example: [14]), clinical visits, weather/climate data.

Challenges:

Weak signals in the data: ED cases involving cough $\Rightarrow$ flu or SARS.

Integration of multiple signals (lag analysis).

Account for spatial and temporal correlations.


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, using environmental and response variables.

2. Sample $DB_{baseline}$ from the BN, conditioned on the current environment.

3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$. Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions.

4. Find the p-value for each rule using a $\chi^2$-test. Null hypothesis: the rows and columns of the table for $DB_{baseline}$ and $DB_{current}$ are independent.

5. Find the rule with the smallest p-value (the most significant deviation). Repeat from Step 2.
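The rule-scoring steps can be sketched as follows. This is a simplified illustration, not the WSARE implementation: it scores each rule with a plain $2 \times 2$ $\chi^2$ test (one degree of freedom, no continuity correction) and treats the rule with the smallest p-value as the most significant. The function names and the example rules are hypothetical.

```python
import math

def chi2_p_value(cur_match, cur_total, base_match, base_total):
    """p-value of a 2x2 chi-square test of independence.

    Rows: current vs. baseline records. Columns: rule matched or not.
    For one degree of freedom the survival function is erfc(sqrt(chi2 / 2)).
    """
    a, b = cur_match, cur_total - cur_match
    c, d = base_match, base_total - base_match
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 1.0                      # degenerate table: no evidence either way
    chi2 = n * (a * d - b * c) ** 2 / denom
    return math.erfc(math.sqrt(chi2 / 2.0))

def most_significant_rule(rule_counts, cur_total, base_total):
    """rule_counts maps rule -> (matches in current, matches in baseline)."""
    scored = {
        rule: chi2_p_value(cm, cur_total, bm, base_total)
        for rule, (cm, bm) in rule_counts.items()
    }
    best = min(scored, key=scored.get)
    return best, scored[best]

# Hypothetical counts: 100 current records vs. 1000 sampled baseline records.
rules = {"cough=yes": (40, 100), "age=adult": (50, 500)}
print(most_significant_rule(rules, 100, 1000))
```

Because many rules are tested, a single small p-value overstates significance; repeating the procedure on freshly sampled baselines is one way to calibrate it.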


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:

What happened yesterday?

What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:

Add nodes for temporal and spatial dependencies.

Needs more data to train.


Anomaly Detection in Climate and Weather

Science questions: identify natural and anthropogenic disasters; identify long-time-scale events (droughts, atmospheric rivers, cold fronts, etc.).

Data: ground observations; remote sensing data (satellites, air-borne); climate model simulation outputs; multiple variables, spatio-temporal (often with a height dimension as well).

Challenges: model spatio-temporal relationships across multiple variables; explain the cause of anomalies; massive data sizes.

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:

Difference from a "base period" (too simplistic).

Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures).

Removes seasonality.

Understand climate and weather phenomena:

Southern Oscillation Index (SOI): difference between sea level pressure (SLP) anomalies for Tahiti and Darwin, Australia.

Figure: Global average temperature anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference).

How to choose $b_i$?

Mean of all data for location $i$.

Monthly mean values (account for seasonality).

Monthly z-score values.

Median (more robust).

Using a shorter "reference period".

30-year moving window.

Different methods show statistically significant differences [19]. What is the right strategy?

Weighted mean of different strategies (pick weights using Monte Carlo sampling).
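Two of the base choices listed above, monthly means and monthly z-scores, can be sketched as follows. The function name `monthly_anomalies` and its argument layout (parallel lists of values and month labels for one location) are assumptions made for this illustration.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, months, zscore=False):
    """Anomaly series v' = v - b for one location.

    The base b is the mean of all values observed in the same calendar
    month. With zscore=True each difference is also divided by that
    month's (population) standard deviation.
    """
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    if not zscore:
        return [v - base[m] for v, m in zip(values, months)]
    spread = {m: pstdev(vs) for m, vs in by_month.items()}
    return [
        (v - base[m]) / spread[m] if spread[m] > 0 else 0.0
        for v, m in zip(values, months)
    ]

# Two years of a three-month toy record: the seasonal cycle is removed,
# leaving only the year-to-year difference.
print(monthly_anomalies([10, 20, 30, 12, 22, 32], [1, 2, 3, 1, 2, 3]))
```

Swapping `mean` for `median`, or restricting `by_month` to a reference period, gives the other variants in the list.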


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both.

Find persistent spatiotemporal anomalies in precipitation data.

A two-step approach:

1. Find precipitation anomalies using thresholds.

2. Find large connected components across space and time (Matlab: bwlabel, bwlabeln).

Follow-up science question: explain the cause.

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee.
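The two-step approach can be sketched without Matlab as a breadth-first search over a (time, row, column) grid. The sketch assumes face connectivity (6 neighbours per cell), which is a simplification chosen here and not necessarily the connectivity used in the original analysis; `drought_components` and its parameters are hypothetical names.

```python
from collections import deque

def drought_components(precip_anom, threshold, min_size):
    """Step 1: mark cells whose precipitation anomaly is below `threshold`.
    Step 2: return connected groups of marked cells with at least
    `min_size` members, linked across time, rows and columns."""
    T, R, C = len(precip_anom), len(precip_anom[0]), len(precip_anom[0][0])
    dry = [[[precip_anom[t][r][c] < threshold for c in range(C)]
            for r in range(R)] for t in range(T)]
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not dry[t][r][c] or (t, r, c) in seen:
                    continue
                seen.add((t, r, c))
                queue, comp = deque([(t, r, c)]), []
                while queue:
                    ct, cr, cc = queue.popleft()
                    comp.append((ct, cr, cc))
                    for dt, dr, dc in steps:
                        nxt = (ct + dt, cr + dr, cc + dc)
                        nt, nr, nc = nxt
                        if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                                and dry[nt][nr][nc] and nxt not in seen):
                            seen.add(nxt)
                            queue.append(nxt)
                if len(comp) >= min_size:
                    components.append(comp)
    return components
```

The `min_size` filter is what makes the result "persistent": an isolated dry cell in one month is dropped, while a dry region that spans several consecutive time steps is kept.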


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water vapor content.

Anomalies using a threshold.

Connected components.

Example source: http://newscenter.lbl.gov

Cold Fronts [23]

Surface winds and potential temperature fields.

Methodology:

1. Compute features for every grid.

2. Cluster grids into $K$ clusters.

3. Label clusters as anomalous or not using thresholds.

4. Filter out false positives using domain knowledge.


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:

Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic).

Anomalies are not actionable: not identified in a timely fashion; resolution is not fine enough; cause not explained.

Anomalies are lost among false positives.

Solution:

Good validation data during design.

Clear definition of a domain anomaly and distinction from other potential competitors.


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Motivation: How Serious Is It?

According to a Symantec report released in early 2011, more than 286 million new threats were detected in 2010.

Figure: Bar chart of counts by period. 2002: 20,547; 2003: 18,827; 2004: 69,107; 2005: 113,025; 2006: 140,690; 2007: 624,267; 2008: 1,656,227; July 2009 - June 2010: 2,704,836.


Network Topology

Figure: A typical network. A router connected to nodes 1 to 4, annotated with OD flows $x_{1t}$, $x_{2t}$, $x_{3t}$, $x_{4t}$ and link measurement $y_{1t}$.

An origin-destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network ($x_{1t}$).

A link measurement is the traffic that enters at a node during an interval ($y_{1t}$).

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$.
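As a small illustration of the routing matrix, assume the common convention that $A_{ij} = 1$ when OD flow $j$ traverses link $i$ and 0 otherwise, so that the link traffic is $y = A x$. The topology and traffic values below are made up for the example, and `link_traffic` is a hypothetical helper name.

```python
def link_traffic(routing, od_flows):
    """Compute y = A x: each link carries the sum of the OD flows routed over it."""
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing]

# Hypothetical network: 3 links (rows) and 4 OD flows (columns).
A = [
    [1, 1, 0, 0],   # link 1 carries OD flows 1 and 2
    [0, 1, 1, 0],   # link 2 carries OD flows 2 and 3
    [0, 0, 1, 1],   # link 3 carries OD flows 3 and 4
]
x_t = [10, 20, 30, 40]          # OD flow volumes in one interval
print(link_traffic(A, x_t))
```

Because every link aggregates several OD flows, a change in a single flow is diluted in each link measurement it touches.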


Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node).

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus a naive approach will not work if OD flow information is not available.

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset.

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the columns of matrix $P$.

Step 3: Project onto the normal traffic subspace, called $\hat{Z}$:

$\hat{Z} = P P^T Z = C Z$

Step 4: Project onto the abnormal traffic subspace, called $\tilde{Z}$:

$\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly".
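The five steps can be sketched with NumPy as below. This is an illustration under assumptions: each row of `Z` is one measurement vector (one time interval), the data is mean-centered before the decomposition, and the detection threshold is an empirical quantile of the residual norms rather than a statistically derived limit. The function names are hypothetical.

```python
import numpy as np

def residual_norms(Z, k):
    """Squared norm of each row's projection onto the abnormal subspace."""
    Zc = Z - Z.mean(axis=0)                      # center each column
    cov = np.cov(Zc, rowvar=False)               # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigendecomposition (ascending)
    top = np.argsort(eigvals)[::-1][:k]          # Step 2: k largest eigenvalues
    P = eigvecs[:, top]
    C_resid = np.eye(Z.shape[1]) - P @ P.T       # Step 4: I - P P^T
    resid = Zc @ C_resid                         # rows are residual vectors
    return np.sum(resid ** 2, axis=1)            # Step 5: squared norms

def flag_anomalies(spe, quantile=0.99):
    """Indices whose residual norm exceeds an empirical quantile (assumed rule)."""
    return np.flatnonzero(spe > np.quantile(spe, quantile))
```

On traffic that is well described by a few dominant patterns, a spike confined to one link leaves the normal component almost unchanged and shows up almost entirely in the residual.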


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably.

Thus anomalies can be detected by setting a threshold on the residual component.

Figure source: [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features).

The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.

» DoS attack: multiple source IP addresses concentrated on a single destination IP address

» Network scan: dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

Port scan anomaly: the traffic feature changes, but traffic volume remains more or less the same.

We can use entropy to capture the variations in the traffic feature:

• Takes value 0 when the distribution is maximally concentrated

• Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure: The port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure. Figure source: [22]
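The two boundary values can be checked with a short sketch of the sample entropy of a traffic feature. The function name `feature_entropy` and the example values are assumptions made for illustration; the input is simply the list of feature values (for instance destination ports) seen in one time bin.

```python
import math
from collections import Counter

def feature_entropy(observations):
    """Shannon entropy (base 2) of the empirical distribution of a feature."""
    counts = Counter(observations)
    total = sum(counts.values())
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# All packets go to one port: maximally concentrated.
print(feature_entropy([80] * 8))
# Packets spread evenly over N = 4 ports: maximally dispersed, log2(4) bits.
print(feature_entropy([80, 443, 22, 25]))
```

A port scan adds few packets but touches many distinct ports, so the destination-port entropy rises sharply even though the packet count barely moves.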


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP.

Scan anomaly: a spike in traffic from a dominant source IP.

Flash crowd anomaly: again a spike in traffic to a dominant destination IP.

Worm anomaly: a spike in traffic with a dominant port.


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data.

Data: transactions between different players in the system; meta information about the individuals; an underlying graph structure.

Challenges: track and model human behavior; anomalies caused by adaptive human adversaries; massive data sizes.

Players by domain:

Insurance (auto, health [a]): Claimant, Provider, Payer

Telecommunications: Customer, Provider

Credit Cards: Customer, Supplier, Bank

Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]:

1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g., user X makes n calls on average in January.

2. Compare current behavior with the historical profile for significant deviations.

Clustering based [4]:

1. Cluster historical profiles of customers.

2. Identify small clusters or outlying profiles as anomalies.

Strengths: anomaly detection is fast (good for real time); results are easy to explain.

Weaknesses: need to create and maintain a large number of profiles; not dynamic; adequate historical data might not be available; too many false positives.
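The activity-monitoring idea can be sketched with a single-number profile. This is a deliberately simple stand-in, not the method of [13]: the profile is the mean and sample standard deviation of a user's historical counts, and "significant deviation" is assumed to mean a z-score above a fixed threshold. The name `deviates_from_profile` and the default threshold of 3 are hypothetical.

```python
from statistics import mean, stdev

def deviates_from_profile(history, current, z_threshold=3.0):
    """Flag `current` if it is far from the user's historical profile.

    history: past per-period counts for one user (at least two values).
    """
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu            # constant history: any change deviates
    return abs(current - mu) / sigma > z_threshold

# Hypothetical user: about 50 calls per month historically.
january_history = [48, 52, 50, 49, 51]
print(deviates_from_profile(january_history, 90))
print(deviates_from_profile(january_history, 51))
```

The sketch also shows the listed weaknesses in miniature: one profile must be stored per user, and the check is only as good as the amount of history available.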


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph: communication networks (phone, email, SMS), provider referral networks.

Objective: identify anomalous nodes.

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1].

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of the node

2. $E_i$: number of edges in the egonet

3. $W_i$: total weight of the egonet

4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional feature space.
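The first three features can be computed directly from an undirected weighted edge list, as in the sketch below. The function name `egonet_features` is hypothetical, the graph is assumed to have no self-loops or duplicate edges, and the eigenvalue feature is omitted to keep the example dependency-free.

```python
def egonet_features(weighted_edges, node):
    """Return N (degree), E (edges in egonet) and W (total egonet weight).

    weighted_edges: iterable of (u, v, weight) for an undirected graph.
    The egonet of `node` is the sub-graph induced by the node and its
    neighbours.
    """
    weights = {frozenset((u, v)): w for u, v, w in weighted_edges}
    neighbors = {v for edge in weights if node in edge for v in edge if v != node}
    members = neighbors | {node}
    ego_edges = [edge for edge in weights if edge <= members]
    return {
        "N": len(neighbors),
        "E": len(ego_edges),
        "W": sum(weights[edge] for edge in ego_edges),
    }

# A star centred on node 0: E equals N, the lower end of the power law.
star = [(0, i, 1.0) for i in range(1, 5)]
print(egonet_features(star, 0))
```

For a clique the egonet contains every pair of its members, so $E_i$ grows quadratically with $N_i$, which is the opposite extreme from the star.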


Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
• Healthcare data (Beneficiaries vs. Providers)
• Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
• RelevanceScore($a$, $b$) $\propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
• Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
• Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
• Normality Score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \dots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \dots, t_n\}$ joined by edges $E$
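The relevance and normality computation can be sketched as follows. Several choices here are assumptions rather than details given on the slide: relevance is computed with a random walk with restart (restart probability `restart`), entry (a, b) of the similarity matrix is taken to be the relevance of b to a, and the function names are hypothetical.

```python
def relevance(edges, query, restart=0.3, iters=200):
    """Random walk with restart from `query`; returns {node: visit probability}.

    `edges` is a list of (a, t) pairs of an undirected bipartite graph.
    """
    neighbours = {}
    for a, t in edges:
        neighbours.setdefault(a, set()).add(t)
        neighbours.setdefault(t, set()).add(a)
    prob = {node: 0.0 for node in neighbours}
    prob[query] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in neighbours}
        for node, p in prob.items():
            # Spread the non-restarting mass evenly over the neighbours.
            share = (1.0 - restart) * p / len(neighbours[node])
            for nb in neighbours[node]:
                nxt[nb] += share
        nxt[query] += restart   # restarting mass jumps back to the query
        prob = nxt
    return prob

def normality(edges, t, restart=0.3):
    """Mean off-diagonal relevance among the V1 nodes linked to `t`."""
    s_t = sorted(a for a, target in edges if target == t)
    if len(s_t) < 2:
        return None             # undefined for fewer than two neighbours
    total = 0.0
    for a in s_t:
        rel = relevance(edges, a, restart)
        total += sum(rel[b] for b in s_t if b != a)
    return total / (len(s_t) * (len(s_t) - 1))
```

A node in $V_2$ whose neighbors are mutually irrelevant, for example one that bridges two otherwise separate communities, receives a low normality score.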


Detecting Disease Outbreaks

Domain Question
• Early detection of disease outbreaks
• Anthrax attack

Data
• Emergency department visits
• Grocery data (Example [14])
• Clinical visits
• Weather/climate data

Challenges
• Weak signals in the data: ED cases involving cough ⇒ Flu or SARS
• Integration of multiple signals (lag analysis)
• Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | Current Environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
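Steps 3 and 4 can be illustrated for a single rule with a 2x2 contingency table (dataset by rule match). This is a sketch under stated assumptions: records are plain dictionaries, `rule` is any predicate on a record, and the p-value uses the chi-square distribution with one degree of freedom, for which the tail probability equals `erfc(sqrt(chi2 / 2))`. The function name `rule_p_value` is hypothetical.

```python
import math

def rule_p_value(baseline, current, rule):
    """Chi-square p-value (1 degree of freedom) for one rule.

    Rows of the table: current vs. baseline dataset.
    Columns: records matching the rule vs. not matching.
    """
    c_hit = sum(1 for rec in current if rule(rec))
    b_hit = sum(1 for rec in baseline if rule(rec))
    table = [[c_hit, len(current) - c_hit],
             [b_hit, len(baseline) - b_hit]]
    total = len(current) + len(baseline)
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    # Tail probability of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))
```

A rule whose match rate is the same in both datasets yields a p-value of 1, while a rule that matches far more often in the current data yields a p-value close to 0.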


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
• What happened yesterday?
• What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
• Add nodes for temporal and spatial dependencies
• Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
• Identify natural and anthropogenic disasters
• Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
• Ground observations
• Remote sensing data (satellites, air-borne)
• Climate model simulation outputs
• Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges
• Model spatio-temporal relationships across multiple variables
• Explain the cause of anomalies
• Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
• Difference from a "base period" (too simplistic)
• Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
• Removes seasonality

Understand climate and weather phenomena
• Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
• Mean of all data for location $i$
• Monthly mean values (account for seasonality)
• Monthly z-score values
• Median (more robust)
• Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]
• What is the right strategy?
• Weighted mean of different strategies (pick weights using Monte Carlo sampling)
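Two of the base choices listed above, monthly means and monthly z-scores, can be sketched for a single location. The function name `monthly_anomalies` and the convention that the series is monthly and starts in January are assumptions made for this example.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, zscore=False):
    """Subtract the calendar-month mean from a monthly series.

    `values` is assumed to start in January. With zscore=True the
    difference is also divided by the calendar-month standard deviation.
    """
    by_month = [[v for k, v in enumerate(values) if k % 12 == m]
                for m in range(12)]
    base = [mean(group) if group else 0.0 for group in by_month]
    spread = [pstdev(group) if group else 0.0 for group in by_month]
    anomalies = []
    for k, v in enumerate(values):
        m = k % 12
        a = v - base[m]              # v'_i = v_i - b_i with a monthly base
        if zscore:
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies
```

Because the base is computed per calendar month, the seasonal cycle is removed and only the departure from the typical value for that month remains.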


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel / bwlabeln

Follow-up science question: Explain cause

Figure: Video (drought_1901_2006_1.avi) courtesy Dr. Arindam Banerjee
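The two step approach can be sketched in Python as a threshold followed by connected-component labeling over a (time, latitude, longitude) grid. This is an illustration only: the function names, the nested-list data layout, the choice of 6-connectivity, and the `min_size` filter for "large" components are assumptions, not details from the slide.

```python
from collections import deque

# Neighbours that differ by one step along exactly one axis (6-connectivity).
STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

def label_components(mask):
    """Return connected components of True cells in mask[t][y][x]."""
    n_t, n_y, n_x = len(mask), len(mask[0]), len(mask[0][0])
    seen = set()
    components = []
    for t in range(n_t):
        for y in range(n_y):
            for x in range(n_x):
                if not mask[t][y][x] or (t, y, x) in seen:
                    continue
                # Breadth-first search from an unvisited anomalous cell.
                queue = deque([(t, y, x)])
                seen.add((t, y, x))
                cells = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    cells.append((ct, cy, cx))
                    for dt, dy, dx in STEPS:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < n_t and 0 <= ny < n_y and 0 <= nx < n_x
                                and mask[nt][ny][nx]
                                and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                components.append(cells)
    return components

def drought_events(precip_anomaly, threshold, min_size):
    """Step 1: threshold the anomalies. Step 2: keep large components."""
    mask = [[[v < threshold for v in row] for row in grid]
            for grid in precip_anomaly]
    return [c for c in label_components(mask) if len(c) >= min_size]
```

Treating time as a third grid axis means a dry region that persists over consecutive time steps forms one component, while an isolated dry cell is filtered out by the size threshold.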


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
• Water vapor content
• Anomalies using a threshold
• Connected components
• Example. Src: http://newscenter.lbl.gov

Cold Fronts [23]
• Surface winds and potential temperature fields

Methodology
1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
• Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
• Anomalies are not actionable
  • Not identified in timely fashion
  • Resolution is not fine enough
  • Cause not explained
• Anomalies lost among false positives

Solution
• Good validation data during design
• Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Network Topology

A typical network

Figure: a typical network of four nodes (routers 1 to 4) with OD flows $x_{1t}, x_{2t}, x_{3t}, x_{4t}$ and link measurement $y_{1t}$

An Origin-Destination (OD) flow is the traffic that enters at an origin node and exits at a destination node of a backbone network (e.g., $x_{1t}$)

A link measurement is the traffic that enters a node during an interval (e.g., $y_{1t}$)

The relationship between link traffic and OD flow traffic is captured by the routing matrix $A$
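A small sketch of how a routing matrix relates the two kinds of measurement. It assumes the usual linear form, where entry (i, j) of the routing matrix is 1 if OD flow j traverses link i and 0 otherwise, so that each link load is the sum of the OD flows routed over it. The function name `link_loads` and the example topology are hypothetical.

```python
def link_loads(routing, od_flows):
    """Compute link traffic y = A x for one time interval.

    routing[i][j] is 1 if OD flow j crosses link i, else 0 (assumed layout).
    """
    return [sum(a * x for a, x in zip(row, od_flows)) for row in routing]

# Three links carrying four OD flows (illustrative values only).
routing = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
normal = link_loads(routing, [10, 20, 30, 40])
```

Since several OD flows share each link, a change in a single OD flow shows up only as a partial change in the loads of the links it crosses.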


Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, a volume anomaly arising from an OD flow may not be noticeable. Thus, a naive approach won't work if OD flow information is not available

Figure source [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top $k$ principal components, those with the highest eigenvalues, as the matrix $P$

Step 3: Project the data $Z$ onto the normal traffic subspace, called $\hat{Z}$

$$\hat{Z} = PP^{T}Z = CZ$$

Step 4: Project the data $Z$ onto the abnormal traffic subspace, called $\tilde{Z}$

$$\tilde{Z} = (I - PP^{T})Z = \tilde{C}Z$$

Step 5: If the norm of a vector in the abnormal subspace is large, then it is an "anomaly"
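The five steps can be sketched in pure Python. This is an illustrative version under stated assumptions: each row of the data is one measurement vector, the top-k eigenvectors are found by power iteration with deflation (a stand-in for a full eigenvalue decomposition), and the function names `fit_normal_subspace` and `residual_norm` are hypothetical.

```python
import math
import random

def fit_normal_subspace(rows, k, iters=200, seed=0):
    """Steps 1-2: return (mean, components), the top-k principal directions."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        components.append(v)
        # Deflation: remove the component just found from the covariance.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return mean, components

def residual_norm(row, mean, components):
    """Steps 3-5: norm of (I - P P^T) applied to the centered vector."""
    z = [x - m for x, m in zip(row, mean)]
    residual = z[:]
    for p in components:
        coef = sum(a * b for a, b in zip(z, p))
        residual = [r - coef * q for r, q in zip(residual, p)]
    return math.sqrt(sum(x * x for x in residual))
```

A measurement is flagged when its residual norm exceeds a chosen threshold; how the threshold is set is left open here.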


Subspace Analysis Results

Note that during an anomaly, the normal component does not change that much, while the residual component changes quite a lot

Thus, anomalies can be detected by setting some threshold

Figure source [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack – multiple source IP addresses concentrated on a single destination IP address

» Network scan – dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source [22]

Port scan anomaly (traffic feature changes; however, traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic feature
• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure annotations: Port scan dwarfed in volume metrics… But stands out in feature entropy, which also reveals its structure
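The entropy of a traffic feature can be computed from the empirical distribution of its observed values. The sketch below is a minimal illustration; the function name `feature_entropy` and the idea of passing one feature value per packet are assumptions for the example.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# All packets go to one destination port: maximally concentrated.
concentrated = feature_entropy([80] * 100)
# Packets spread evenly over 8 ports: maximally dispersed for N = 8.
dispersed = feature_entropy(list(range(8)) * 10)
```

A port scan spreads traffic over many destination ports, which raises the entropy of the destination-port feature even when the total volume barely changes.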


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP

Scan anomaly: a spike in traffic data from a dominant source IP

Flash crowd anomaly: again, a spike in traffic data to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
• Transactions between different players in the system
• Meta information about the individuals
• An underlying graph structure

Challenges
• Track and model human behavior
• Anomalies caused by adaptive human adversaries
• Massive data sizes

Insurance (auto, health [a]): Claimant, Provider, Payer
Telecommunications: Customer, Provider
Credit Cards: Customer, Supplier, Bank
Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historic data
   • User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths
• Anomaly detection is fast (good for real time)
• Results are easy to explain

Weaknesses
• Need to create and maintain a large number of profiles
• Not dynamic
• Adequate historical data might not be available
• Too many false positives
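The activity monitoring idea can be sketched as a per-user profile (mean and standard deviation of historical counts) followed by a deviation check. The z-score rule, the threshold of 3, and the function names are assumptions made for this example; [13] describes the general framework rather than this particular test.

```python
from statistics import mean, stdev

def build_profiles(history):
    """history: {user: [count per past period]} -> {user: (mean, stdev)}."""
    return {user: (mean(counts), stdev(counts))
            for user, counts in history.items() if len(counts) >= 2}

def flag_deviations(profiles, current, z_threshold=3.0):
    """Return {user: z-score} for users whose current count deviates strongly."""
    flagged = {}
    for user, count in current.items():
        if user not in profiles:
            continue  # no historical profile available for this user
        mu, sigma = profiles[user]
        if sigma > 0:
            z = (count - mu) / sigma
        else:
            z = 0.0 if count == mu else float("inf")
        if abs(z) > z_threshold:
            flagged[user] = z
    return flagged
```

For example, with a history of `{"x": [20, 22, 19, 21, 18], "y": [5, 6, 5, 4, 5]}` and current counts `{"x": 21, "y": 40}`, only user `y` is flagged. The sketch also makes the listed weaknesses visible: every user needs a stored profile, and users without enough history cannot be scored.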

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph Communication networks (phone email SMS) Provider referral networks

Objective Identify anomalous nodes

For each node extract several features based on the propertiesof the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1 Ni degree of node2 Ei number of edges in egonet3 Wi total weight of egonet4 λi principal eigen vector of weighted adjacency matrix of

egonet

Data is transformed into a point in a multi-dimensional space

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Identifying Anomalies

Traditional anomaly detection (lof) Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalousstructures

Identify relevant feature pairs and power law relationship

Eg Egonet Density Power Law Ni vs Ei - detect nearcliques and stars

Ei prop Nαi 1 le α le 2

Anomaly score for node i wrt a pair of features (y = Cxθ)

Si =max(yi Cx

θi )

min(yi Cxθi )lowast log (|yi minus Cxθi |+ 1)

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph:
Healthcare data (beneficiaries vs. providers)
Insider trading (traders vs. stocks)

Objective: identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a

RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node $t \in V_2$:

Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} joined by edges E
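The normality score can be sketched as below. The slide only says "random walk"; using a random walk with restart, the restart probability of 0.15, and the fixed iteration count are all assumptions made here, as are the function names. Every node is assumed to have at least one edge.

```python
import numpy as np

def relevance_vector(B, a, restart=0.15, iters=100):
    """Visiting probabilities over V1 for a walk that restarts at node a.

    Assumption: B is the |V1| x |V2| biadjacency matrix with no empty rows or columns.
    """
    P12 = B / B.sum(axis=1, keepdims=True)      # step from V1 to V2
    P21 = B.T / B.T.sum(axis=1, keepdims=True)  # step from V2 back to V1
    e = np.zeros(B.shape[0])
    e[a] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = (1.0 - restart) * (r @ P12 @ P21) + restart * e
    return r

def normality_score(B, t, **kwargs):
    """Mean off-diagonal relevance among the V1 neighbors of t (nan if fewer than two)."""
    s_t = np.flatnonzero(B[:, t])
    if len(s_t) < 2:
        return float("nan")
    R = np.array([relevance_vector(B, a, **kwargs) for a in s_t])
    sim = R[:, s_t]                              # |S_t| x |S_t| similarity matrix
    off_diagonal = ~np.eye(len(s_t), dtype=bool)
    return float(sim[off_diagonal].mean())

# Two V1 nodes sharing a single V2 node.
B = np.array([[1.0], [1.0]])
score = normality_score(B, 0)
```

A low normality score for t means its neighbors in V1 are not relevant to one another, which is what marks the links into t as suspicious.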


Detecting Disease Outbreaks

Domain Question: early detection of disease outbreaks
Anthrax attack

Data:
Emergency department visits
Grocery data (example: [14])
Clinical visits
Weather/climate data

Challenges:
Weak signals in the data (ED cases involving cough ⇒ flu or SARS)
Integration of multiple signals (lag analysis)
Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, over environmental and response variables
2. Sample from the BN: $DB_{baseline}$ | current environment
3. Compute the contingency table for rules for $DB_{baseline}$ and $DB_{current}$
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for rules using the $\chi^2$ test
   Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one). Repeat Step 2
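Steps 3 to 5 can be illustrated for single-assignment rules with a plain Pearson χ² test on a 2×2 table. This is a hedged sketch only: the function names and the example counts are hypothetical, no continuity correction is applied, and the p-value uses the closed form for one degree of freedom.

```python
import math

def rule_p_value(cur_match, cur_total, base_match, base_total):
    """Chi-square p-value (1 degree of freedom) for one rule.

    Rows: current vs. baseline; columns: records matching vs. not matching the rule.
    """
    table = [[cur_match, cur_total - cur_match],
             [base_match, base_total - base_match]]
    total = cur_total + base_total
    rows = [sum(r) for r in table]
    cols = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            if expected > 0:
                chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 dof: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

def most_significant_rule(rule_counts):
    """Return (rule, p-value) with the smallest p-value."""
    scored = {rule: rule_p_value(*counts) for rule, counts in rule_counts.items()}
    best = min(scored, key=scored.get)
    return best, scored[best]

# Hypothetical counts: (current matches, current total, baseline matches, baseline total).
rules = {
    "symptom=cough": (40, 100, 100, 1000),
    "symptom=rash": (10, 100, 100, 1000),
}
best_rule, best_p = most_significant_rule(rules)
```

Because many rules are scored, the smallest p-value should not be read as a significance level by itself; some correction for multiple testing is needed before raising an alert.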


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships:
What happened yesterday?
What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]:
Add nodes for temporal and spatial dependencies
Need more data to train


Anomaly Detection in Climate and Weather

Science Questions:
Identify natural and anthropogenic disasters
Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
Ground observations
Remote sensing data (satellites, air-borne)
Climate model simulation outputs
Multiple variables, spatio-temporal (often has a height dimension as well)

Challenges:
Model spatio-temporal relationships across multiple variables
Explain the cause of anomalies
Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series:

Difference from a "base period" (too simplistic)
Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
Removes seasonality

Understand climate and weather phenomena:

Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
Mean of all data for location i
Monthly mean values (account for seasonality)
Monthly z-score values
Median (more robust)
Using a shorter "reference period" (30 year moving window)

Different methods show statistically significant differences [19]
What is the right strategy?
Weighted mean of different strategies (pick weights using Monte Carlo sampling)
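Two of the base choices listed above, the monthly mean and the monthly z-score, can be sketched for a single location. The array layout (one row per year, one column per calendar month) and the function name are assumptions made for this illustration.

```python
import numpy as np

def monthly_anomalies(v, zscore=False):
    """Anomalies v' = v - b, where b is the mean of each calendar month.

    Assumption: v has shape (n_years, 12). With zscore=True the anomaly is
    also divided by that month's standard deviation.
    """
    v = np.asarray(v, dtype=float)
    base = v.mean(axis=0)               # b_i: one value per calendar month
    anomalies = v - base
    if zscore:
        anomalies = anomalies / v.std(axis=0)
    return anomalies

# Two years of synthetic monthly values 0..23.
values = np.arange(24).reshape(2, 12)
plain = monthly_anomalies(values)
scaled = monthly_anomalies(values, zscore=True)
```

Subtracting a per-month base removes the seasonal cycle, which a single all-data mean would leave in the anomaly series.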


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time (Matlab: bwlabel, bwlabeln)

Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
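The two-step approach can be sketched without Matlab as a threshold followed by a breadth-first search over a (time, latitude, longitude) grid. The array layout, the 6-connectivity, the function name, and the parameter names are assumptions made for illustration.

```python
from collections import deque
import numpy as np

def drought_components(precip_anomaly, threshold, min_size):
    """Return connected groups of cells whose anomaly is below the threshold.

    Assumption: precip_anomaly has shape (time, lat, lon); cells are connected
    when they share a face in space or are the same location in adjacent time steps.
    """
    mask = precip_anomaly < threshold            # step 1: threshold
    seen = np.zeros(mask.shape, dtype=bool)
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    components = []
    for start in zip(*np.nonzero(mask)):         # step 2: connected components
        if seen[start]:
            continue
        seen[start] = True
        queue = deque([start])
        cells = []
        while queue:
            cell = queue.popleft()
            cells.append(tuple(int(c) for c in cell))
            for step in steps:
                nxt = tuple(c + s for c, s in zip(cell, step))
                inside = all(0 <= n < m for n, m in zip(nxt, mask.shape))
                if inside and mask[nxt] and not seen[nxt]:
                    seen[nxt] = True
                    queue.append(nxt)
        if len(cells) >= min_size:               # keep only large components
            components.append(sorted(cells))
    return components

# One dry cell persisting over three time steps, plus one isolated dry cell.
grid = np.zeros((3, 3, 3))
grid[:, 0, 0] = -5.0
grid[0, 2, 2] = -5.0
found = drought_components(grid, threshold=-1.0, min_size=2)
```

The `min_size` filter is what turns isolated dry cells into noise and leaves only persistent events.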


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]:
Water vapor content
Anomalies using a threshold
Connected components
Example: Src - http://newscenter.lbl.gov

Cold Fronts [23]:
Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid
2. Cluster grids into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
Anomalies are not actionable:
  Not identified in a timely fashion
  Resolution is not fine enough
  Cause not explained
Anomalies lost among false positives

Solution:
Good validation data during design
Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Network Anomalies Detection Problem


Why care about OD Flows

A volume anomaly typically arises on an OD (origin-destination) flow: traffic arriving at one node and destined for another node

If we only monitor traffic on network links, the volume arising from a single OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix P

Step 3: Normal traffic subspace, called $\hat{Z}$:

$\hat{Z} = PP^T Z = CZ$

Step 4: Abnormal traffic subspace, called $\tilde{Z}$:

$\tilde{Z} = (I - PP^T) Z = \tilde{C} Z$

Step 5: If the norm of a residual vector in $\tilde{Z}$ is large, then it is an "anomaly"
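The five steps can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the code of [22]: measurements are stored one per row (so the projection is applied from the right), the data are mean-centered before the covariance is computed, and the function name is hypothetical.

```python
import numpy as np

def residual_norms(Z, k):
    """Squared norm of the residual (abnormal-subspace) part of each row of Z.

    Assumption: Z has shape (n_measurements, n_features).
    """
    Zc = Z - Z.mean(axis=0)                        # center the data
    cov = np.cov(Zc, rowvar=False)                 # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    P = eigvecs[:, ::-1][:, :k]                    # step 2: top-k PCs as columns
    C = P @ P.T                                    # step 3: normal-subspace projector
    residual = Zc @ (np.eye(Z.shape[1]) - C)       # step 4: (I - P P^T) applied per row
    return np.sum(residual ** 2, axis=1)           # step 5: large value = anomaly

# Twenty points on the line y = x plus one point off the line.
t = np.arange(-10, 10, dtype=float)
traffic = np.vstack([np.column_stack([t, t]), [[0.0, 6.0]]])
norms = residual_norms(traffic, k=1)
suspect = int(np.argmax(norms))
```

In practice a threshold on this squared norm, rather than an argmax, decides which measurements are flagged.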


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack – multiple source IP addresses concentrated on a single destination IP address

» Network scan – dispersed distribution of destination addresses

» Most worms/viruses also induce some change in the distribution of certain features

» However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]. Port scan anomaly (the traffic feature changes, however the traffic volume remains more or less the same). Figure annotations: "Port scan dwarfed in volume metrics…" and "But stands out in feature entropy, which also reveals its structure"

We can use entropy to capture the variations in the traffic feature:

• Takes value 0 when the distribution is maximally concentrated
• Takes value $\log_2 N$ when the distribution is maximally dispersed
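The two boundary cases of the entropy can be checked with a short sketch. It assumes the standard empirical (Shannon) entropy of the observed values of one feature, such as destination ports in a time bin; the function name is hypothetical.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical entropy, in bits, of the observed values of one traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    # p * log2(1 / p), summed over the distinct values.
    return sum((n / total) * math.log2(total / n) for n in counts.values())

# All packets go to one port: maximally concentrated.
concentrated = feature_entropy([80] * 8)
# Each packet goes to a different port, as in a port scan: maximally dispersed.
dispersed = feature_entropy([21, 22, 23, 25])
```

With four distinct, equally frequent values the entropy reaches log2(4) = 2 bits, the upper bound stated above for N = 4.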


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP

Scan anomaly: a spike in traffic data from a dominant source IP

Flash crowd anomaly: again a spike in traffic data to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain Question: identify fraudulent activities or players from observed transaction data

Data:
Transactions between different players in the system
Meta information about the individuals
An underlying graph structure

Challenges:
Track and model human behavior
Anomalies caused by adaptive human adversaries
Massive data sizes

Domains and players:
Insurance (auto, health [a]): Claimant, Provider, Payer
Telecommunications: Customer, Provider
Credit Cards: Customer, Supplier, Bank
Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]:
1. Build profiles for individuals (customers, users, etc.) based on historic data, e.g., User X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]:
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
Anomaly detection is fast (good for real time)
Results are easy to explain

Weaknesses:
Need to create and maintain a large number of profiles
Not dynamic
Adequate historical data might not be available
Too many false positives
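The activity-monitoring idea can be sketched with a per-user profile made of a mean and a standard deviation. The z-score test, the threshold of 3, and the function names are assumptions chosen for illustration; [13] does not prescribe them.

```python
import statistics

def build_profile(history):
    """Profile = (mean, sample standard deviation) of a user's past activity counts."""
    return statistics.mean(history), statistics.stdev(history)

def is_significant_deviation(current, profile, z_threshold=3.0):
    """Flag current activity that lies more than z_threshold deviations from the mean."""
    mean, std = profile
    if std == 0:
        return current != mean
    return abs(current - mean) / std > z_threshold

# Hypothetical monthly call counts for one user.
profile = build_profile([10, 12, 11, 9, 10])
flag_spike = is_significant_deviation(50, profile)
flag_normal = is_significant_deviation(11, profile)
```

One profile per user is what makes detection fast and easy to explain, and it is also the source of the maintenance cost listed under weaknesses.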


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
Communication networks (phone, email, SMS)
Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:

1. $N_i$: degree of node i
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thus transformed into a point in a multi-dimensional space

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Identifying Anomalies

Traditional anomaly detection (lof) Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalousstructures

Identify relevant feature pairs and power law relationship

Eg Egonet Density Power Law Ni vs Ei - detect nearcliques and stars

Ei prop Nαi 1 le α le 2

Anomaly score for node i wrt a pair of features (y = Cxθ)

Si =max(yi Cx

θi )

min(yi Cxθi )lowast log (|yi minus Cxθi |+ 1)

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph Healthcare Data (Beneficiaries vs Providers) Insider trading (Traders vs Stocks)

Objective Identify anomalous links

Given a query node a isin V1 find the ldquorelevancerdquoof all other nodes in V1 to a

RelevanceScore(a b) prop Number of times aldquorandom walkrdquo from a reaches b

Use the relevance scores to compute thenormality scores for a node t isin V2

Find set St = a|〈a t〉 isin E Compute |St| times |St| similarity matrix using

relevance vectors for a isin St Normality Score = mean of non-diagonal entries

of similarity matrix

V1 E V2a1

a2

a3

an

t1

t2

t3

tn

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Detecting Disease Outbreaks

Domain question: early detection of disease outbreaks
  • Anthrax attack

Data:
  • Emergency department (ED) visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges:
  • Weak signals in the data (ED cases involving cough ⇒ flu or SARS)
  • Integration of multiple signals (lag analysis)
  • Accounting for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute the contingency table for each rule over $DB_{baseline}$ and $DB_{current}$
   • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value of each rule using a $\chi^2$ test
   • Null hypothesis: the rows and columns of the table for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant one); repeat from Step 2
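Steps 3 to 5 can be illustrated with a small sketch, assuming the baseline sample has already been drawn. It scores each rule with a Pearson $\chi^2$ test on a 2x2 table (rule matched or not, current or baseline) and ranks rules by p-value. The record layout, rule names, and data are hypothetical, and the Bayesian network sampling of Steps 1 and 2 is not shown.

```python
import math

def chi2_pvalue_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (1 degree of freedom)."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:   # an empty row or column contributes nothing
                chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def rank_rules(rules, db_baseline, db_current):
    """Return (p-value, rule name) pairs, most significant rule first."""
    scored = []
    for name, rule in rules.items():
        cur = sum(1 for r in db_current if rule(r))
        base = sum(1 for r in db_baseline if rule(r))
        table = [[cur, len(db_current) - cur], [base, len(db_baseline) - base]]
        scored.append((chi2_pvalue_2x2(table)[1], name))
    return sorted(scored)

# Hypothetical records: cough is far more frequent today than in the baseline sample.
db_baseline = [{"symptom": "cough" if i < 10 else "none", "gender": "F" if i % 2 == 0 else "M"}
               for i in range(100)]
db_current = [{"symptom": "cough" if i < 25 else "none", "gender": "F" if i % 2 == 0 else "M"}
              for i in range(50)]
rules = {
    "cough": lambda r: r["symptom"] == "cough",
    "female": lambda r: r["gender"] == "F",
}
ranked = rank_rules(rules, db_baseline, db_current)
```

Here the `cough` rule is ranked first because its frequency differs sharply between the two data sets, while the `female` rule has identical proportions in both and therefore a p-value of 1.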


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Adds nodes for temporal and spatial dependencies
  • Needs more data to train


Anomaly Detection in Climate and Weather

Science questions:
  • Identify natural and anthropogenic disasters
  • Identify long-time-scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  • Ground observations
  • Remote sensing data (satellites, airborne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between the Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global average temperature anomaly (1975-2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (accounts for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
  • 30-year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick the weights using Monte Carlo sampling)
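Two of the base choices listed above, monthly means and monthly z-scores, can be sketched as follows. The function name `monthly_anomalies`, the use of the population standard deviation, and the toy series are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev

def monthly_anomalies(values, months, zscore=False):
    """Subtract each month's mean from its observations; optionally divide by the month's std."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    base = {m: mean(vs) for m, vs in by_month.items()}
    spread = {m: pstdev(vs) for m, vs in by_month.items()}
    anomalies = []
    for v, m in zip(values, months):
        a = v - base[m]
        if zscore:
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies

# Two years of a toy series with a strong seasonal cycle (month 1 is cold, month 2 is warm).
values = [10.0, 20.0, 12.0, 22.0]
months = [1, 2, 1, 2]
anoms = monthly_anomalies(values, months)                 # base b_i = monthly mean
z_anoms = monthly_anomalies(values, months, zscore=True)  # base and scale per month
```

Subtracting the monthly mean removes the seasonal cycle: the raw values differ by 10 between months, while the anomalies only show that the second year was warmer than the first.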


Anomaly Detection for Identifying Droughts

Science question: identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   • Matlab: bwlabel / bwlabeln

Follow-up science question: explain the cause

Figure: drought video (drought_1901_2006_1.avi). Video courtesy of Dr. Arindam Banerjee
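The two-step approach can be sketched in plain Python as a stand-in for the Matlab labeling functions named on the slide. The grid layout `[time][row][col]`, the threshold value, the minimum component size, and the 6-neighbor connectivity (two neighbors in time, four in space) are assumptions for this example; other neighborhoods are equally possible.

```python
from collections import deque

NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def drought_components(anom, threshold, min_size=1):
    """Connected components of cells with anomaly below `threshold` in a [time][row][col] grid."""
    nt, nr, nc = len(anom), len(anom[0]), len(anom[0][0])
    seen = set()
    components = []
    for t in range(nt):
        for r in range(nr):
            for c in range(nc):
                if anom[t][r][c] >= threshold or (t, r, c) in seen:
                    continue
                # Breadth-first search from an unvisited dry cell
                queue = deque([(t, r, c)])
                seen.add((t, r, c))
                cells = []
                while queue:
                    ct, cr, cc = queue.popleft()
                    cells.append((ct, cr, cc))
                    for dt, dr, dc in NEIGHBORS:
                        n = (ct + dt, cr + dr, cc + dc)
                        if (0 <= n[0] < nt and 0 <= n[1] < nr and 0 <= n[2] < nc
                                and n not in seen and anom[n[0]][n[1]][n[2]] < threshold):
                            seen.add(n)
                            queue.append(n)
                if len(cells) >= min_size:
                    components.append(cells)
    return components

# Toy precipitation anomalies: 3 time steps on a 2x2 grid.
anom = [
    [[-2.0, 0.0], [0.0, 0.0]],
    [[-2.0, -2.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, -2.0]],
]
droughts = drought_components(anom, threshold=-1.0, min_size=2)
```

The dry cell at the last time step is isolated and is dropped by `min_size`, while the three dry cells that touch across time and space form one persistent component.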


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields

Methodology:
1. Compute features for every grid point
2. Cluster the grid points into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
  • Anomalies are lost among false positives

Solution:
  • Good validation data during design
  • Clear definition of a domain anomaly and its distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
    • Anomalies in Climate
    • Validation of Anomalies

Network Anomalies Detection Problem


Why Care About OD Flows?

A volume anomaly typically arises on an origin-destination (OD) flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume change arising from an OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components (those with the highest eigenvalues) as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = PP^{T}Z = CZ$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - PP^{T})Z = \tilde{C}Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
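The five steps can be sketched with the standard library alone. To avoid a linear algebra dependency, this example approximates the eigenvalue decomposition with power iteration and deflation, which is a substitution made here and not part of the slide; in practice a library eigen-solver would normally be used. The function names, the starting vector, and the toy link-traffic data are assumptions.

```python
import math

def top_components(rows, k, iters=200):
    """Mean and top-k eigenvectors of the sample covariance (power iteration with deflation)."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(x[i] * x[j] for x in centered) / (n - 1) for j in range(d)] for i in range(d)]
    comps = []
    for _ in range(k):
        v = [float(i + 1) for i in range(d)]   # arbitrary start; must not be orthogonal to the PC
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d))
        comps.append(v)
        # Deflate so that the next pass converges to the following component
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return mean, comps

def residual_norm(z, mean, comps):
    """Norm of (I - P P^T)(z - mean), the part of z outside the normal subspace."""
    x = [zi - m for zi, m in zip(z, mean)]
    for p in comps:
        coef = sum(pi * xi for pi, xi in zip(p, x))
        x = [xi - coef * pi for xi, pi in zip(x, p)]
    return math.sqrt(sum(xi * xi for xi in x))

# Toy traffic on three links that normally rise and fall together.
history = [[10.0, 10.0, 10.0], [12.0, 12.0, 12.0], [14.0, 14.0, 14.0],
           [16.0, 16.0, 16.0], [18.0, 18.0, 18.0]]
mean, P = top_components(history, k=1)
normal_residual = residual_norm([20.0, 20.0, 20.0], mean, P)      # busier, but same pattern
anomalous_residual = residual_norm([14.0, 14.0, 30.0], mean, P)   # one link spikes alone
```

The busy but well-behaved measurement lies entirely in the normal subspace, so its residual is essentially zero, whereas the measurement with a single spiking link leaves a large residual that a threshold would flag.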


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features)

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Port scan anomaly (the traffic feature distribution changes, while the traffic volume remains more or less the same)

We can use entropy to capture the variations in a traffic feature:
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure

Figure source: [22]
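The two extreme values of the entropy can be checked with a short sketch. The function name `sample_entropy` and the toy destination-port samples are assumptions; the point is only that two time bins with the same packet count can have very different feature entropy.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Same volume (100 packets) in both bins, very different destination-port distributions.
normal_ports = [80] * 90 + [443] * 10
scan_ports = list(range(1000, 1100))      # a port scan touches 100 distinct ports

h_normal = sample_entropy(normal_ports)
h_scan = sample_entropy(scan_ports)
h_concentrated = sample_entropy([80] * 100)
```

With all packets on one port the entropy is 0, and with 100 equally likely ports it reaches $\log_2 100$, so the scan stands out even though the packet count is unchanged.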


Entropy Based versus Volume Based

DoS/DDoS attack: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: identify fraudulent activities or players from observed transaction data

Data:
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges:
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and players:
  • Insurance (auto, health [a]): claimant, provider, payer
  • Telecommunications: customer, provider
  • Credit cards: customer, supplier, bank
  • Web advertising: user, advertiser, publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]
1. Build profiles for individuals (customers, users, etc.) based on historical data
   • E.g., user X makes n calls on average in January
2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
1. Cluster historical profiles of customers
2. Identify small clusters or outlying profiles as anomalies

Strengths:
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses:
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
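The activity monitoring steps can be sketched as follows. The profile here is just a mean and standard deviation per user, and "significant deviation" is taken to be a z-score above a fixed threshold; both are assumptions chosen for illustration, as are the function names and the call counts.

```python
from statistics import mean, stdev

def build_profiles(history):
    """Profile each user by the mean and standard deviation of past activity counts."""
    return {user: (mean(counts), stdev(counts)) for user, counts in history.items()}

def flag_deviations(profiles, current, z_threshold=3.0):
    """Return users whose current count is more than z_threshold deviations from their profile."""
    flagged = {}
    for user, count in current.items():
        mu, sigma = profiles[user]
        if sigma == 0:
            continue   # no variability in the history; a z-score is undefined
        z = (count - mu) / sigma
        if abs(z) > z_threshold:
            flagged[user] = z
    return flagged

# Hypothetical call counts per user for five past Januaries and for the current one.
history = {"X": [20, 22, 19, 21, 18], "Y": [5, 6, 4, 5, 5]}
current = {"X": 60, "Y": 5}
profiles = build_profiles(history)
flagged = flag_deviations(profiles, current)
```

Scoring a new observation is a dictionary lookup and one division, which is why this style of detection is fast and easy to explain; the cost is one profile per user that must be kept up to date.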


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph:
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: identify anomalous nodes

For each node, extract several features based on the properties of the induced subgraph (egonet) of its neighboring nodes [1]

Choose features that can highlight anomalous nodes:
1. $N_i$: degree of the node
2. $E_i$: number of edges in the egonet
3. $W_i$: total weight of the egonet
4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thus transformed into a point in a multi-dimensional feature space
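The first three egonet features can be computed directly from a weighted edge list, as in the sketch below. The function name, the undirected treatment of edges, and the toy graph are assumptions; the eigenvalue feature is omitted because it needs an eigen-solver.

```python
from collections import defaultdict

def egonet_features(weighted_edges):
    """Map each node to (N_i, E_i, W_i) for its egonet in an undirected weighted graph."""
    adj = defaultdict(dict)
    for u, v, w in weighted_edges:
        adj[u][v] = w
        adj[v][u] = w
    features = {}
    for ego, nbrs in adj.items():
        members = set(nbrs) | {ego}          # the ego plus its direct neighbors
        n_edges, total_weight = 0, 0.0
        for u in members:
            for v, w in adj[u].items():
                if v in members and u < v:   # count each undirected edge once
                    n_edges += 1
                    total_weight += w
        features[ego] = (len(nbrs), n_edges, total_weight)
    return features

# Toy graph: a star centered on "c" and a fully connected triangle x-y-z.
edges = [("c", "a", 1.0), ("c", "b", 1.0), ("c", "d", 1.0),
         ("x", "y", 2.0), ("y", "z", 2.0), ("x", "z", 2.0)]
features = egonet_features(edges)
```

The star center has as many egonet edges as neighbors, while each triangle node has the maximum possible number of edges among its egonet members; these are the two extremes that egonet features are meant to separate.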


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure




What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
     Environmental and response variables
2. Sample from the BN (DB_baseline | Current Environment)
3. Compute contingency tables for rules for DB_baseline and DB_current
     Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a χ²-test
     Null Hypothesis: Rows and columns of the tables for DB_baseline and DB_current are independent
5. Find the rule with the smallest p-value (the most significant deviation from the baseline). Repeat Step 2
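Steps 3 to 5 can be illustrated with a small sketch. It assumes each rule has already been reduced to a 2x2 contingency table (rows: DB_current and DB_baseline; columns: records matching and not matching the rule), and it treats the rule with the smallest p-value as the most significant. The rule names and counts are hypothetical, and the Bayesian network sampling of Steps 1 and 2 is not shown.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    table = [[a, b], [c, d]]; returns (statistic, p_value), 1 degree of freedom.
    Assumes every expected cell count is positive.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of chi-square with 1 dof: P(X > stat) = erfc(sqrt(stat / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def most_significant_rule(rule_tables):
    """Return (rule, p_value) for the rule whose table has the smallest p-value."""
    p_values = {rule: chi_square_2x2(t)[1] for rule, t in rule_tables.items()}
    best = min(p_values, key=p_values.get)
    return best, p_values[best]

# Hypothetical counts: rows = (DB_current, DB_baseline),
# columns = (records matching the rule, records not matching).
rule_tables = {
    "symptom=cough": [[40, 60], [200, 800]],
    "age=child": [[22, 78], [210, 790]],
}
best_rule, best_p = most_significant_rule(rule_tables)
```

Because many rules are tested, a raw p-value overstates significance; some correction for multiple testing would be needed in practice, which this sketch omits.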


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  What happened yesterday?
  What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  Add nodes for temporal and spatial dependencies
  Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
  Identify natural and anthropogenic disasters
  Identify long time scale events - droughts, atmospheric rivers, cold fronts, etc.

Data
  Ground observations
  Remote sensing data (satellites, air-borne)
  Climate model simulation outputs
  Multiple variables, spatio-temporal (often has height dimension as well)

Challenges
  Model spatio-temporal relationships across multiple variables
  Explain the cause of anomalies
  Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis done on "anomaly" time series
  Difference from a "base period" (Too simplistic)
  Brings spatial smoothness (e.g., a mountain top and nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena
  Southern Oscillation Index (SOI)
    Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

  $v'_i = v_i - b_i$

where $b_i$ is the base (reference)

How to choose $b_i$?
  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
    30 year moving window

Different methods show statistically significant differences [19]
  What is the right strategy?
  Weighted mean of different strategies (Pick weights using Monte Carlo sampling)
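As an illustration of two of the base choices listed above (monthly mean and monthly z-score), here is a minimal sketch. It assumes a single location with monthly observations that start in January; the function name and the temperature values are hypothetical.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, use_zscore=False):
    """Anomaly series v'_i = v_i - b_i for one location.

    values: monthly observations, assumed to start in January.
    b_i is the mean of all observations from the same calendar month;
    with use_zscore=True the difference is also divided by that month's
    (population) standard deviation.
    """
    by_month = {m: [v for k, v in enumerate(values) if k % 12 == m]
                for m in range(12)}
    base = {m: mean(vs) for m, vs in by_month.items() if vs}
    spread = {m: pstdev(vs) for m, vs in by_month.items() if vs}
    anomalies = []
    for k, v in enumerate(values):
        m = k % 12
        a = v - base[m]
        if use_zscore:
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies

# Two hypothetical years with the same seasonal cycle, except that the
# second July (index 18) is 3 degrees warmer than the first (index 6).
year = [2, 4, 8, 13, 18, 22, 25, 24, 20, 14, 8, 3]
series = year + year[:6] + [28] + year[7:]
anoms = monthly_anomalies(series)
```

Subtracting the per-month mean removes the seasonal cycle, so only the two July values carry a non-zero anomaly in this example.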


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
       Matlab - bwlabel, bwlabeln

Followup science question: Explain cause

Figure: (drought_1901_2006_1.avi) Video courtesy Dr. Arindam Banerjee
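The two step approach can be sketched in plain Python as a stand-in for the Matlab labeling functions. This is an illustrative sketch only: it assumes a (time, latitude, longitude) grid of precipitation anomalies, uses 6-connectivity (an assumption; the connectivity used in the original analysis is not stated), and the threshold, minimum size, and data are hypothetical.

```python
from collections import deque

def label_components(mask):
    """Connected components of True cells in mask[t][y][x].

    Uses 6-connectivity: neighbors differ by one step along one axis.
    Returns a list of components, each a list of (t, y, x) cells.
    """
    T, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    components = []
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not mask[t][y][x] or (t, y, x) in seen:
                    continue
                # Breadth-first flood fill from this unvisited anomalous cell
                queue = deque([(t, y, x)])
                seen.add((t, y, x))
                cells = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    cells.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        inside = 0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                        if inside and mask[nt][ny][nx] and (nt, ny, nx) not in seen:
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                components.append(cells)
    return components

def drought_candidates(precip_anomaly, threshold, min_size):
    """Step 1: threshold the anomalies. Step 2: keep large components."""
    mask = [[[v < threshold for v in row] for row in grid]
            for grid in precip_anomaly]
    return [c for c in label_components(mask) if len(c) >= min_size]

# Hypothetical data: 3 time steps of a 2x3 grid of precipitation anomalies.
precip_anomaly = [
    [[-2.0, -1.5, 0.3], [0.1, 0.2, 0.4]],
    [[-1.8, 0.0, 0.2], [0.3, 0.1, -1.2]],
    [[-1.6, 0.1, 0.0], [0.2, 0.3, 0.1]],
]
events = drought_candidates(precip_anomaly, threshold=-1.0, min_size=3)
```

The isolated dry cell at time step 1 is discarded by the size filter, while the cell that stays dry across all three time steps forms a persistent component.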


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  Water Vapor Content
  Anomalies using a threshold
  Connected components
  Example Src - http://newscenter.lbl.gov

Cold Fronts [23]
  Surface winds and potential temperature fields
  Methodology
    1. Compute features for every grid
    2. Cluster grids into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution
  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



Why care about OD Flows

A volume anomaly typically arises on an OD (origin-destination) flow (traffic arriving at one node and destined for another node)

If we only monitor traffic on network links, the volume arising from a single OD flow may not be noticeable. Thus a naive approach won't work if OD flow information is not available

Figure source: [22]


PCA and Subspace Method


Subspace Method Algorithm

Step 1 - Determine the PCs based on eigenvalue decomposition of the covariance matrix of the dataset

Step 2 - Choose the top k principal components with the highest eigenvalues as matrix P

Step 3 - Project onto the normal traffic subspace, called $\hat{Z}$:

  $\hat{Z} = P P^{T} Z = C Z$

Step 4 - Project onto the abnormal traffic subspace, called $\tilde{Z}$:

  $\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5 - If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
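The five steps can be sketched with NumPy as follows. This is a minimal illustration rather than the implementation from [22]: it assumes Z holds one column per time step and one row per link, it centers the data before projecting (an assumption not spelled out on the slide), and the synthetic traffic, the value of k, and the injected anomaly are all hypothetical.

```python
import numpy as np

def subspace_residual_norms(Z, k):
    """Squared norm of the residual (abnormal-subspace) part of each column.

    Z: array of shape (n_links, n_timesteps).
    k: number of principal components that span the normal subspace.
    """
    Zc = Z - Z.mean(axis=1, keepdims=True)          # center each link
    cov = Zc @ Zc.T / (Zc.shape[1] - 1)             # Step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]
    P = eigvecs[:, top]                             # Step 2: top-k PCs as columns
    C = P @ P.T                                     # Step 3: normal projector
    residual = (np.eye(Z.shape[0]) - C) @ Zc        # Step 4: abnormal component
    return np.sum(residual ** 2, axis=0)            # Step 5: squared norms

# Hypothetical traffic on 3 links driven by one shared periodic pattern.
rng = np.random.default_rng(0)
t = np.arange(200)
latent = np.sin(2 * np.pi * t / 50)
Z = np.outer([1.0, 2.0, 3.0], latent) + 0.01 * rng.standard_normal((3, 200))
Z[:, 120] += np.array([3.0, -1.5, 0.0])             # injected anomaly at t = 120

spe = subspace_residual_norms(Z, k=1)
flagged = int(np.argmax(spe))
```

In this toy setup the residual norm at the injected time step is orders of magnitude larger than at the other time steps, so a simple threshold on `spe` would separate it.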


Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably

Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomaly

Most anomalies induce a change in distributional aspects of packet header fields (called features)

The most important features include the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol

» DoS attack – multiple source IP addresses concentrated on a single destination IP address
» Network scan – dispersed distribution of destination addresses
» Most worms/viruses also induce some change in distribution of certain features
» However, these changes can be very subtle and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

Figure source: [22]. Port scan anomaly (traffic feature changes, however traffic volume remains more or less the same). Figure annotations: "Port scan dwarfed in volume metrics…" and "But stands out in feature entropy, which also reveals its structure"

We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when distribution is maximally concentrated
  • Takes value $\log_2 N$ when distribution is maximally dispersed
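The two extreme cases of the entropy measure can be checked with a short sketch. It computes the empirical (sample) entropy of one traffic feature over a window of packets; the function name and the IP addresses are hypothetical.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Empirical entropy (in bits) of the observed values of one feature."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sum of p * log2(1 / p) over the distinct observed values
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# All packets go to one destination: maximally concentrated.
concentrated = ["10.0.0.1"] * 8
# Every packet goes to a different destination: maximally dispersed, N = 8.
dispersed = [f"10.0.0.{i}" for i in range(8)]

h_low = sample_entropy(concentrated)
h_high = sample_entropy(dispersed)
```

Both windows contain the same number of packets, so a volume metric cannot tell them apart, while the entropy moves from 0 to log2(8) = 3 bits.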


Entropy Based versus Volume Based

DoS/DDoS attacks - a spike in traffic data toward a dominant destination IP

Scan anomaly - a spike in traffic data from a dominant source IP

Flash crowd anomaly - again a spike in traffic data to a dominant destination IP

Worm anomaly - a spike in traffic with a dominant port


Fraud Detection

Domain Question: Identify fraudulent activities or players from observed transaction data

Data
  Transactions between different players in the system
  Meta information about the individuals
  An underlying graph structure

Challenges
  Track and model human behavior
  Anomalies caused by adaptive human adversaries
  Massive data sizes

Insurance (auto, health [a]): Claimant, Provider, Payer
Telecommunications: Customer, Provider
Credit Cards: Customer, Supplier, Bank
Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
       User X makes n calls on an average in January
  2. Compare current behavior with historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  Anomaly detection is fast (good for real time)
  Results are easy to explain

Weaknesses
  Need to create and maintain a large number of profiles
  Not dynamic
  Adequate historical data might not be available
  Too many false positives
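The activity monitoring idea can be sketched as a per-user comparison against a historical profile. This is only one possible reading of "significant deviation": it assumes the profile is the mean and standard deviation of past monthly call counts and flags a user when the current count is more than three standard deviations away. The user names, counts, and cutoff are hypothetical.

```python
from statistics import mean, stdev

def deviation_score(history, current):
    """Number of standard deviations between current behavior and the profile."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

# Step 1: hypothetical profiles, calls made in January of previous years.
profiles = {
    "user_x": [30, 28, 33, 31, 29],
    "user_y": [5, 7, 6, 4, 8],
}
# Step 2: compare this January's count with each profile.
current = {"user_x": 32, "user_y": 40}
flagged = [u for u in profiles if deviation_score(profiles[u], current[u]) > 3.0]
```

The sketch also makes two of the listed weaknesses visible: one profile must be stored per user, and a user with little history yields an unreliable standard deviation.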


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  Communication networks (phone, email, SMS)
  Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
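The first three egonet features can be computed directly from an adjacency structure, as in the sketch below. It assumes an undirected weighted graph stored as a dictionary of dictionaries with comparable node labels; the example graph is hypothetical, and the eigenvalue feature is left out to keep the example dependency-free.

```python
def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    adj: {u: {v: weight}} for an undirected graph, with every edge stored
    in both directions. The egonet contains the node, its neighbors, and
    all edges among them.
    """
    members = set(adj[node]) | {node}
    n_i = len(adj[node])               # N_i: degree of the node
    e_i = 0                            # E_i: number of edges in the egonet
    w_i = 0.0                          # W_i: total weight of the egonet
    for u in members:
        for v, w in adj[u].items():
            if v in members and u < v:  # count each undirected edge once
                e_i += 1
                w_i += w
    return n_i, e_i, w_i

# Hypothetical graph: "c" is connected to a, b, d; a and b are also connected.
adj = {
    "a": {"c": 1, "b": 5},
    "b": {"c": 2, "a": 5},
    "c": {"a": 1, "b": 2, "d": 1},
    "d": {"c": 1},
}
features = {node: egonet_features(adj, node) for node in adj}
```

Each node then becomes one point (N_i, E_i, W_i) in feature space, which is the representation the anomaly scoring operates on.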


Identifying Anomalies

Traditional anomaly detection (LOF)
  Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  Identify relevant feature pairs and power law relationship
  E.g., Egonet Density Power Law: $N_i$ vs $E_i$ - detect near cliques and stars

    $E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = C x^{\theta}$):

    $S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
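The score can be evaluated once C and θ are known. The sketch below assumes they are obtained by an ordinary least-squares fit in log-log space, which is one common way to fit a power law and may differ in detail from the fitting procedure in [1]; the feature values are hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    C = math.exp(my - theta * mx)
    return C, theta

def anomaly_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical (N_i, E_i) pairs that follow E = 2 N exactly.
degrees = [1, 2, 4, 8]
edges = [2, 4, 8, 16]
C, theta = fit_power_law(degrees, edges)

on_the_line = anomaly_score(4, 8, C, theta)      # node that matches the fit
near_clique = anomaly_score(4, 10, 2.0, 1.0)     # 4 neighbors, all connected
```

A node on the fitted line scores approximately zero, while the fully connected egonet (10 edges where 8 are expected) scores 1.25 · log 3.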


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  Healthcare data (Beneficiaries vs Providers)
  Insider trading (Traders vs Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node $t \in V_2$
  Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  Normality Score = mean of non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} joined by edges E
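The relevance and normality scores can be sketched with NumPy under two assumptions that the slide leaves open: the random walk is a random walk with restart (restart probability 0.15, chosen arbitrarily), and the similarity between two nodes of $S_t$ is taken to be their relevance score itself. The graph is hypothetical.

```python
import numpy as np

def relevance_vectors(B, restart=0.15, iters=200):
    """R[a, b]: steady-state probability that a random walk with restart
    from V1 node a is at V1 node b.

    B: (n1, n2) biadjacency matrix, rows = V1 nodes, columns = V2 nodes.
    """
    n1, n2 = B.shape
    A = np.zeros((n1 + n2, n1 + n2))
    A[:n1, n1:] = B
    A[n1:, :n1] = B.T
    col_sums = A.sum(axis=0)
    col_sums[col_sums == 0] = 1.0
    Pm = A / col_sums                      # column-stochastic transition matrix
    R = np.zeros((n1, n1))
    for a in range(n1):
        q = np.zeros(n1 + n2)
        q[a] = 1.0                         # restart distribution: always node a
        u = q.copy()
        for _ in range(iters):             # power iteration to the fixed point
            u = (1 - restart) * (Pm @ u) + restart * q
        R[a] = u[:n1]
    return R

def normality_score(B, R, t):
    """Mean off-diagonal relevance among the V1 neighbors of V2 node t."""
    S_t = np.flatnonzero(B[:, t])
    if len(S_t) < 2:
        return float("nan")
    sub = R[np.ix_(S_t, S_t)]
    return float(sub[~np.eye(len(S_t), dtype=bool)].mean())

# Hypothetical graph: a0, a1 share t0, t1; a2, a3 share t2, t3;
# t4 is the only node linking the two groups (via a1 and a2).
B = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
], dtype=float)
R = relevance_vectors(B)
scores = [normality_score(B, R, t) for t in range(B.shape[1])]
least_normal = int(np.argmin(scores))
```

Node t4 connects two V1 nodes that are otherwise only weakly related, so it receives the lowest normality score in this example.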

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Detecting Disease Outbreaks

Domain Question Early detection of disease

outbreaks Anthrax attack

Data Emergency department visits Grocery data (Example [14]) Clinical visits Weatherclimate data

Challenges Weak signals in the data

ED cases involving cough rArrFlu or SARS

Integration of multiple signals(lag analysis)

Account for spatial and temporalcorrelations

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Whatrsquos Significant About Recent Events (WSARE) [33]

1 Learn Bayesian network from historical data Environmental and response variables

2 Sample from the BN (DBbaseline|Current Environment)3 Compute contingency table for rules for DBbaseline andDBcurrent

Rules are single assignment rules (Xi = Y ji ) or conjunctions4 Find p-value for rules using χ2-test

Null Hypothesis Rows and columns of tables for DBbaselineand DBcurrent are independent

5 Find rule with largest p-value Repeat Step 2

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporalrelationships

What happened yesterday What happened in the adjoining neighborhood (yesterday)

Bayesian Network Spatio-Temporal (BNST) modelingframework [17]

Add nodes for temporal and spatial dependencies Need more data to train

Anomaly Detection in Climate and Weather

• Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  • Ground observations
  • Remote sensing data (satellites, airborne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)
• Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain their cause

Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality
• Used to understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975–2007). Source: www.metoffice.gov.uk

Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference) value

• How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30-year moving window
• Different methods show statistically significant differences [19]
• What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
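
As a small illustration of one of the base choices above (monthly mean values), the sketch below computes $v'_i = v_i - b_i$ for a single location. The function name and the sample observations are hypothetical, and the input layout (a list of month/value pairs) is an assumption made for brevity.

```python
from collections import defaultdict
from statistics import mean

def monthly_anomalies(series):
    """series: list of (month, value) pairs for one location, month in 1..12.

    Returns v' = v - b for each observation, where b is the mean of all
    values observed in the same calendar month.
    """
    by_month = defaultdict(list)
    for month, value in series:
        by_month[month].append(value)
    base = {m: mean(vals) for m, vals in by_month.items()}
    return [value - base[month] for month, value in series]

# Hypothetical temperatures: two Januaries and two Julys.
obs = [(1, 2.0), (7, 20.0), (1, 4.0), (7, 24.0)]
print(monthly_anomalies(obs))  # [-1.0, -2.0, 1.0, 2.0]
```

Swapping `mean` for `statistics.median` gives the more robust median base.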

Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatio-temporal anomalies in precipitation data
• A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     • Matlab: bwlabel, bwlabeln
• Follow-up science question: explain the cause

Figure: (drought_1901_2006_1.avi) Video courtesy of Dr. Arindam Banerjee
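
The two-step approach can be sketched in plain Python as below. This is only an illustration under stated assumptions: the grid is a nested list indexed as (time, row, column), a cell counts as dry when its precipitation anomaly is below a threshold, and neighbors are 6-connected (one step in time, row, or column). The function name, threshold, and data are hypothetical, and Matlab's bwlabeln may use a different default connectivity.

```python
from collections import deque

def drought_components(anomaly, threshold, min_size):
    """Return connected sets of (t, row, col) cells whose anomaly is below `threshold`.

    Step 1 is the threshold test; step 2 is a breadth-first search that groups
    dry cells adjacent in time or space. Components with fewer than `min_size`
    cells are dropped.
    """
    T, R, C = len(anomaly), len(anomaly[0]), len(anomaly[0][0])
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    for start in ((t, r, c) for t in range(T) for r in range(R) for c in range(C)):
        if start in seen or anomaly[start[0]][start[1]][start[2]] >= threshold:
            continue
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            t, r, c = queue.popleft()
            comp.append((t, r, c))
            for dt, dr, dc in steps:
                nxt = (t + dt, r + dr, c + dc)
                inside = 0 <= nxt[0] < T and 0 <= nxt[1] < R and 0 <= nxt[2] < C
                if inside and nxt not in seen and anomaly[nxt[0]][nxt[1]][nxt[2]] < threshold:
                    seen.add(nxt)
                    queue.append(nxt)
        if len(comp) >= min_size:
            components.append(comp)
    return components

# Hypothetical anomalies: 2 time steps on a 2x3 grid.
grid = [
    [[-2.0, -2.0, 0.5], [0.1, 0.3, -3.0]],
    [[-1.5, 0.2, 0.4], [0.0, 0.1, 0.2]],
]
print(drought_components(grid, threshold=-1.0, min_size=2))
```

The isolated dry cell at (0, 1, 2) is filtered out by `min_size`, which is how "large" and "persistent" are approximated here.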

More Climate Extreme Events using Anomaly Detection

• Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov
• Cold Fronts [23]
  • Surface winds and potential temperature fields
• Methodology
  1. Compute features for every grid cell
  2. Cluster grid cells into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge

Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies are lost among false positives
• Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and its distinction from other potential competitors

References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.
[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.
[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.
[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.
[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.
[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.
[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.
[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.
[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.
[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.
[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.
[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.
[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.
[15] D. Hawkins. Identification of Outliers. Monographs on Applied Probability and Statistics, May 1980.
[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.
[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.
[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.
[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.
[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.
[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.
[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.
[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.
[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.
[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.
[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.
[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.
[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.
[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.
[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.
[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.
[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.
[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.
[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

• Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)
• Arindam Banerjee (University of Minnesota)


PCA and Subspace Method


Subspace Method Algorithm

Step 1: Determine the principal components (PCs) by eigenvalue decomposition of the covariance matrix of the data set
Step 2: Choose the top $k$ principal components (those with the highest eigenvalues) as the columns of matrix $P$
Step 3: Project onto the normal traffic subspace to obtain $\hat{Z}$
  $\hat{Z} = P P^T Z = C Z$
Step 4: Project onto the abnormal (residual) traffic subspace to obtain $\tilde{Z}$
  $\tilde{Z} = (I - P P^T) Z = \tilde{C} Z$
Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
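
The steps above can be sketched for the simplest case $k = 1$. This is a standard-library illustration under loud assumptions, not the method of [22] as implemented there: the leading principal component is found by power iteration instead of a full eigendecomposition, the data are a handful of hypothetical 2-dimensional points, and the function names are made up for the example.

```python
import math

def top_component(rows, iters=200):
    """Leading eigenvector of the sample covariance of mean-centered rows (power iteration)."""
    n, d = len(rows), len(rows[0])
    cov = [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)] for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def residual_norms(data):
    """Squared norm of (I - p p^T) z for every mean-centered row z, with k = 1."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    p = top_component(centered)
    norms = []
    for z in centered:
        coef = sum(z[j] * p[j] for j in range(d))        # p^T z (normal component)
        resid = [z[j] - coef * p[j] for j in range(d)]   # z - p p^T z (residual)
        norms.append(sum(x * x for x in resid))
    return norms

# Hypothetical data: nine points on the line y = x plus one point off the line.
data = [[float(i), float(i)] for i in range(9)] + [[4.0, 6.0]]
norms = residual_norms(data)
print(norms.index(max(norms)))  # index of the row with the largest residual
```

In practice a threshold on the residual norm (Step 5) would replace the `max` used here.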

Subspace Analysis Results

• Note that during an anomaly the normal component does not change much, while the residual component changes considerably
• Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]

Discussion: Typical Characteristics of Anomalies

• Most anomalies induce a change in the distribution of packet header fields (called features)
• The most important features form the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms and viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume

Limitation of Volume

• Port scan anomaly: the traffic feature distribution changes, while traffic volume remains more or less the same
• The port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure
• We can use entropy to capture the variations in the traffic features
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure source: [22]
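
The two extreme values of entropy can be checked with a short sketch. The function name and the packet samples are hypothetical; both samples contain the same number of packets, so a volume metric would not separate them.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical entropy (in bits) of a list of feature values, e.g. destination ports."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

normal = [80] * 8            # eight packets, all to one port: maximally concentrated
scan = list(range(1, 9))     # eight packets to eight distinct ports: maximally dispersed
print(feature_entropy(normal), feature_entropy(scan))  # 0.0 3.0
```

Here $N = 8$, so the dispersed case reaches $\log_2 8 = 3$ bits.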

Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
• Scan anomaly: a spike in traffic from a dominant source IP
• Flash crowd anomaly: again a spike in traffic to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port

Fraud Detection

• Domain question: identify fraudulent activities or players from observed transaction data
• Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure
• Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and their players:
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit cards: Customer, Supplier, Bank
• Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud

A Generic Fraud Detection Method

• Activity monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historical data
     • Example: user X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations
• Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies
• Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain
• Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
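
A minimal sketch of the activity-monitoring idea follows. It assumes a profile is just the mean and standard deviation of a user's historical counts, and that "significant deviation" means a z-score above a chosen threshold; both are illustrative choices rather than the method of [13], and all names and numbers are hypothetical.

```python
from statistics import mean, stdev

def build_profile(history):
    """history: past per-period counts for one user, e.g. calls made in each past January."""
    return mean(history), stdev(history)

def is_significant_deviation(profile, current, z_threshold=3.0):
    """Flag the current count if it lies more than z_threshold deviations from the profile mean."""
    mu, sigma = profile
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold

profile = build_profile([40, 42, 38, 41, 39])
print(is_significant_deviation(profile, 43))   # False
print(is_significant_deviation(profile, 120))  # True
```

The check is a single comparison per user, which is why this style of detection suits real-time use, and the profile itself is the explanation.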

Exploiting Graph Structure - Weighted Graphs

• Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks
• Objective: identify anomalous nodes
• For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]
• Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet
• Each node is transformed into a point in a multi-dimensional space
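
The first three egonet features can be computed with a few lines of Python. This is a sketch under assumptions: the graph is undirected and stored as a dictionary from `frozenset({u, v})` to edge weight, the function name is hypothetical, and the eigenvalue feature $\lambda_i$ is omitted to keep the example dependency-free.

```python
def egonet_features(edges, node):
    """edges: dict mapping frozenset({u, v}) -> weight for an undirected graph.

    Returns (N, E, W): the number of neighbors of `node`, and the number and
    total weight of edges among `node` and its neighbors (its egonet).
    """
    neighbors = {v for e in edges if node in e for v in e if v != node}
    ego = neighbors | {node}
    ego_edges = [e for e in edges if e <= ego]
    return len(neighbors), len(ego_edges), sum(edges[e] for e in ego_edges)

# Hypothetical weighted graph: a is linked to b, c, d; b-c lies inside a's egonet, d-e does not.
edges = {
    frozenset({"a", "b"}): 1.0,
    frozenset({"a", "c"}): 2.0,
    frozenset({"a", "d"}): 3.0,
    frozenset({"b", "c"}): 5.0,
    frozenset({"d", "e"}): 7.0,
}
print(egonet_features(edges, "a"))  # (3, 4, 11.0)
```

Applying the function to every node yields one feature vector per node, which is the multi-dimensional point referred to above.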

Identifying Anomalies

• Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure
• Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

    $E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

• Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

    $S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1)$
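
The score $S_i$ can be evaluated directly once $C$ and $\theta$ are known. The sketch below assumes they have already been fitted (for example by a regression in log-log space, which is not shown), uses the natural logarithm, and requires positive feature values; the function name and the numbers are hypothetical.

```python
import math

def powerlaw_score(x, y, C, theta):
    """Anomaly score of a node with features (x, y) against the fitted law y = C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# Hypothetical fit C = 1, theta = 1 (egonets that look like stars: E = N).
print(powerlaw_score(10, 10, 1.0, 1.0))  # 0.0, exactly on the fitted line
print(powerlaw_score(10, 12, 1.0, 1.0))  # small positive score
print(powerlaw_score(10, 45, 1.0, 1.0))  # much larger score (near-clique)
```

The ratio term measures how many times larger or smaller the observation is than expected, and the log term damps the absolute distance.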

Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)
• Objective: identify anomalous links
• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore($a$, $b$) $\propto$ number of times a "random walk" from $a$ reaches $b$
• Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$
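
A rough sketch of the relevance and normality computation is given below. Several choices here are assumptions rather than details from the slide: the random walk is a random walk with restart (restart probability 0.15) evaluated by iterating the visit probabilities instead of simulating walks, and the similarity matrix entry for a pair $(a, b)$ is taken to be the relevance of $b$ to $a$. The graph and all names are hypothetical.

```python
def relevance(adj, query, restart=0.15, iters=100):
    """Visit probabilities of a random walk with restart started at `query`.

    adj: dict node -> list of neighbors (undirected, every node has a neighbor).
    """
    score = {n: 0.0 for n in adj}
    score[query] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, s in score.items():
            for m in adj[n]:
                nxt[m] += (1 - restart) * s / len(adj[n])
        nxt[query] += restart
        score = nxt
    return score

def normality(adj, t):
    """Mean off-diagonal relevance among the V1 nodes linked to t (needs at least two)."""
    members = adj[t]
    rel = {a: relevance(adj, a) for a in members}
    pairs = [rel[a][b] for a in members for b in members if a != b]
    return sum(pairs) / len(pairs)

# Hypothetical graph: {a1, a2} and {a3, a4} form two groups; tx bridges a1 and a3.
links = [("a1", "t1"), ("a1", "t2"), ("a1", "tx"), ("a2", "t1"), ("a2", "t2"),
         ("a3", "t3"), ("a3", "t4"), ("a3", "tx"), ("a4", "t3"), ("a4", "t4")]
adj = {}
for a, t in links:
    adj.setdefault(a, []).append(t)
    adj.setdefault(t, []).append(a)

print(normality(adj, "t1"), normality(adj, "tx"))
```

In this toy graph the bridging node tx connects two V1 nodes that are otherwise unrelated, so its normality score comes out lower than that of t1.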

Detecting Disease Outbreaks

• Domain question: early detection of disease outbreaks
  • Anthrax attack
• Data
  • Emergency department (ED) visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data
• Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Whatrsquos Significant About Recent Events (WSARE) [33]

1 Learn Bayesian network from historical data Environmental and response variables

2 Sample from the BN (DBbaseline|Current Environment)3 Compute contingency table for rules for DBbaseline andDBcurrent

Rules are single assignment rules (Xi = Y ji ) or conjunctions4 Find p-value for rules using χ2-test

Null Hypothesis Rows and columns of tables for DBbaselineand DBcurrent are independent

5 Find rule with largest p-value Repeat Step 2

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporalrelationships

What happened yesterday What happened in the adjoining neighborhood (yesterday)

Bayesian Network Spatio-Temporal (BNST) modelingframework [17]

Add nodes for temporal and spatial dependencies Need more data to train

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomaly Detection in Climate and Weather

Science Questions Identify natural and anthropogenic disasters Identify long time scale events - droughts atmospheric rivers

cold fronts etc

Data Ground observations Remote sensing data (satellites

air-borne) Climate model simulation outputs Multiple variables spatio-temporal (often has height dimension

as well)

Challenges Model spatio-temporal relationships across multiple variables Explain the cause of anomalies Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomalies are Widely Used in Climate

Most analysis done on ldquoanomalyrdquo time series

Difference from a ldquobase periodrdquo (Toosimplistic)

Brings spatial smoothness (eg a mountaintop and nearby valley can have very differenttemperatures) Removes seasonality

Understand climate and weatherphenomenon

Southern Oscillation Index (SOI) Difference between Sea Level Pressure

(SLP) anomalies for Tahiti and DarwinAustralia

Figure Global AverageTemperature Anomaly(1975 - 2007) Srcwwwmetofficegovuk

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$v'_i = v_i - b_i$

where $b_i$ is the base (reference).

• How to choose $b_i$?
  • Mean of all data for location i
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30 year moving window
• Different methods show statistically significant differences [19]
• What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
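As an illustration of the "monthly mean" choice of base, here is a minimal sketch for a single location. The function name `monthly_anomalies` and the toy values are assumptions made for this example, not part of the tutorial.

```python
from collections import defaultdict
from statistics import mean


def monthly_anomalies(values, months):
    """Return v' = v - b, where b is the mean of all values
    observed in the same calendar month (one location)."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    # One base value per calendar month: this removes seasonality.
    base = {m: mean(vs) for m, vs in by_month.items()}
    return [v - base[m] for v, m in zip(values, months)]


# Hypothetical data: two Januaries (month 1) and two Julys (month 7).
values = [2.0, 20.0, 4.0, 24.0]
months = [1, 7, 1, 7]
anomalies = monthly_anomalies(values, months)
```

Swapping `mean` for `statistics.median` gives the more robust median base listed above.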


Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatiotemporal anomalies in precipitation data
• A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     • Matlab: bwlabel/bwlabeln
• Follow-up science question: explain the cause

Figure: drought_1901_2006_1.avi. Video courtesy of Dr. Arindam Banerjee
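The two-step approach can be sketched in plain Python as follows. This is a stand-in for Matlab's bwlabeln, not a port of it: it assumes 6-connectivity on a [time][lat][lon] grid, and the names `connected_components`, `drought_components`, `threshold`, and `min_size` are hypothetical.

```python
from collections import deque

# Face-adjacent neighbours in (time, lat, lon); an assumed connectivity.
NEIGHBOURS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def connected_components(mask):
    """Return {label: size} for connected True cells of a 3-D nested list."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * C for _ in range(R)] for _ in range(T)]
    sizes = {}
    next_label = 0
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or labels[t][r][c]:
                    continue
                next_label += 1
                labels[t][r][c] = next_label
                queue = deque([(t, r, c)])
                size = 0
                while queue:  # breadth-first flood fill
                    a, b, d = queue.popleft()
                    size += 1
                    for da, db, dd in NEIGHBOURS:
                        x, y, z = a + da, b + db, d + dd
                        if (0 <= x < T and 0 <= y < R and 0 <= z < C
                                and mask[x][y][z] and not labels[x][y][z]):
                            labels[x][y][z] = next_label
                            queue.append((x, y, z))
                sizes[next_label] = size
    return sizes


def drought_components(anomaly, threshold, min_size):
    """Step 1: threshold precipitation anomalies. Step 2: keep large components."""
    mask = [[[v < threshold for v in row] for row in grid] for grid in anomaly]
    sizes = connected_components(mask)
    return {lab: n for lab, n in sizes.items() if n >= min_size}


# Hypothetical precipitation anomalies: 2 time steps on a 2 x 3 grid.
anomaly = [
    [[-2.0, -2.0, 0.0], [0.0, 0.0, 0.0]],
    [[-2.0, 0.0, 0.0], [0.0, 0.0, -2.0]],
]
large = drought_components(anomaly, threshold=-1.0, min_size=2)
```

Because time is one of the axes, a component that survives the size filter is persistent in time as well as extended in space.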


More Climate Extreme Events using Anomaly Detection

• Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Figure: Example. Source: http://newscenter.lbl.gov
• Cold Fronts [23]
  • Surface winds and potential temperature fields
  • Methodology
    1. Compute features for every grid
    2. Cluster grids into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge


Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
  • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives
• Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Subspace Method Algorithm

Step 1: Determine the principal components (PCs) from the eigenvalue decomposition of the covariance matrix of the dataset

Step 2: Choose the top k principal components, those with the highest eigenvalues, as the columns of matrix $P$

Step 3: Project onto the normal traffic subspace, giving $\hat{Z}$:

$\hat{Z} = P P^{T} Z = C Z$

Step 4: Project onto the abnormal traffic subspace, giving $\tilde{Z}$:

$\tilde{Z} = (I - P P^{T}) Z = \tilde{C} Z$

Step 5: If the norm of the residual vector $\tilde{Z}$ is large, then it is an "anomaly"
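The steps above can be sketched in plain Python under two simplifying assumptions that are not in the tutorial: only k = 1 principal component is kept, and it is found by power iteration instead of a full eigenvalue decomposition. The function names and the toy data are hypothetical.

```python
import math


def top_principal_component(centered, iters=200):
    """Leading eigenvector of the sample covariance matrix (power iteration)."""
    n, d = len(centered), len(centered[0])
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def residual_norms(data):
    """Norm of (I - P P^T) z for every centered row z, with k = 1."""
    d = len(data[0])
    means = [sum(r[j] for r in data) / len(data) for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in data]
    p = top_principal_component(centered)
    norms = []
    for z in centered:
        coeff = sum(pi * zi for pi, zi in zip(p, z))       # P^T z
        resid = [zi - coeff * pi for pi, zi in zip(p, z)]  # z - P P^T z
        norms.append(math.sqrt(sum(x * x for x in resid)))
    return norms


# Hypothetical traffic on two links that normally rise and fall together;
# the last observation breaks that pattern.
data = [[1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [3, 6]]
norms = residual_norms(data)
suspect = norms.index(max(norms))
```

In practice the residual norm would be compared against a threshold chosen for the data rather than simply taking the maximum.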


Subspace Analysis Results

• Note that during an anomaly the normal component does not change much, while the residual component changes considerably
• Thus anomalies can be detected by setting a threshold on the residual component

Figure source: [22]


Discussion: Typical Characteristics of Anomaly

• Most anomalies induce a change in distributional aspects of packet header fields (called features)
• The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack: multiple source IP addresses concentrated on a single destination IP address
  » Network scan: dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume


Limitation of Volume

• Port scan anomaly: the traffic feature changes, but traffic volume remains more or less the same
• We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure: The port scan is dwarfed in volume metrics but stands out in feature entropy, which also reveals its structure. Source: [22]
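The two limiting values of entropy quoted above can be checked with a short sketch. The function name `feature_entropy` and the sample port numbers are assumptions for illustration.

```python
import math
from collections import Counter


def feature_entropy(observations):
    """Sample entropy (in bits) of the empirical distribution of a traffic feature."""
    counts = Counter(observations)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical destination ports seen in one time bin.
concentrated = [80] * 8        # all packets go to one port
dispersed = [21, 22, 80, 443]  # N = 4 ports, each seen once

h_concentrated = feature_entropy(concentrated)
h_dispersed = feature_entropy(dispersed)
```

A port scan spreads packets over many destination ports, so it pushes this value toward the dispersed end even when the packet count barely changes.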


Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic data toward a dominant destination IP
• Scan anomaly: a spike in traffic data from a dominant source IP
• Flash crowd anomaly: again, a spike in traffic data to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

• Domain question: identify fraudulent activities or players from observed transaction data
• Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure
• Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and their players:
• Insurance (auto, health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit cards: Customer, Supplier, Bank
• Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

• Activity monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     • User X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations
• Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies
• Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain
• Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
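A minimal sketch of the activity-monitoring idea, assuming a profile is just the mean and standard deviation of a user's historical monthly call counts and that a "significant deviation" means a z-score above 3. Both choices, and all names below, are assumptions for illustration rather than the method of [13].

```python
from statistics import mean, stdev


def build_profile(history):
    """Summarize historical behavior as (mean, sample standard deviation)."""
    return mean(history), stdev(history)


def is_deviation(profile, current, z_threshold=3.0):
    """Flag current behavior that lies far from the historical profile."""
    mu, sigma = profile
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold


# Hypothetical call counts of user X in five past Januaries.
profile = build_profile([40, 42, 38, 41, 39])
flag_high = is_deviation(profile, 90)    # far above the usual volume
flag_normal = is_deviation(profile, 41)  # within the usual range
```

The check itself is a constant-time comparison, which is why this style of detection suits real-time use, while the cost sits in maintaining one profile per individual.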


Exploiting Graph Structure - Weighted Graphs

• Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks
• Objective: identify anomalous nodes
• For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]
• Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet
• Each node is transformed into a point in a multi-dimensional feature space
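The first three egonet features can be computed with a short sketch. It assumes an undirected graph stored as a dict of dicts (`graph[u][v]` is the weight of edge u-v, stored in both directions) and skips the eigenvalue feature; the function name and toy graph are hypothetical.

```python
def egonet_features(graph, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    The egonet contains the node, its neighbors, and every edge among them.
    """
    neighbors = set(graph[node])
    members = neighbors | {node}
    seen = set()
    edges, weight = 0, 0.0
    for u in members:
        for v, w in graph[u].items():
            # Count each undirected edge once, whichever end we meet first.
            if v in members and (v, u) not in seen:
                seen.add((u, v))
                edges += 1
                weight += w
    return len(neighbors), edges, weight


# Hypothetical graph: triangle a-b-c, plus d attached to a and e attached to b.
graph = {
    "a": {"b": 1.0, "c": 2.0, "d": 1.0},
    "b": {"a": 1.0, "c": 3.0, "e": 1.0},
    "c": {"a": 2.0, "b": 3.0},
    "d": {"a": 1.0},
    "e": {"b": 1.0},
}
features_a = egonet_features(graph, "a")
```

Here the edge b-c counts toward the egonet of a because both endpoints are neighbors of a, while b-e does not.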


Identifying Anomalies

• Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure
• Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detect near-cliques and stars

$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

• Anomaly score for node i with respect to a pair of features ($y = C x^{\theta}$):

$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
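The anomaly score can be evaluated directly once the power law has been fitted. In this sketch the fitted constants `C` and `theta` are supplied by hand, the logarithm is taken as the natural log, and both the observed and the expected value are assumed to be positive; these are assumptions for illustration.

```python
import math


def oddball_score(x, y, C, theta):
    """Distance of the point (x, y) from the fitted power law y = C * x**theta."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)


# Hypothetical fit E = N**1.5, and two nodes that both have N = 4 neighbors.
on_line = oddball_score(4, 8.0, C=1.0, theta=1.5)    # E matches the fit
star_like = oddball_score(4, 4.0, C=1.0, theta=1.5)  # far fewer edges than expected
```

The ratio term measures how many times the observation deviates from the fit, and the log term scales that by the absolute size of the deviation, so a node exactly on the line scores 0.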


Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)
• Objective: identify anomalous links
• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore(a, b) $\propto$ number of times a "random walk" from $a$ reaches $b$
• Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: Bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$
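One way to realize the relevance and normality scores is a random walk with restart, sketched below. Several details are assumptions rather than anything stated in the tutorial: the restart probability of 0.15, the fixed number of iterations, and the choice to fill the similarity matrix with pairwise relevance scores. All names and the toy graph are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency sets for a bipartite edge list of (a, t) pairs."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, set()).add(t)
        adj.setdefault(t, set()).add(a)
    return adj


def relevance_scores(adj, start, restart=0.15, iters=100):
    """Steady-state visit probabilities of a random walk with restart at `start`."""
    p = {n: 0.0 for n in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in p.items():
            share = (1 - restart) * mass / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        nxt[start] += restart  # jump back to the query node
        p = nxt
    return p


def normality_score(adj, t):
    """Mean pairwise relevance among the neighbors S_t of node t."""
    s_t = sorted(adj[t])
    if len(s_t) < 2:
        return None
    rel = {a: relevance_scores(adj, a) for a in s_t}
    off_diag = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diag) / len(off_diag)


# Hypothetical graph: two tight communities, plus t5 linking a1 and a4 across them.
edges = [("a1", "t1"), ("a2", "t1"), ("a1", "t2"), ("a2", "t2"),
         ("a3", "t3"), ("a4", "t3"), ("a3", "t4"), ("a4", "t4"),
         ("a1", "t5"), ("a4", "t5")]
adj = build_adjacency(edges)
score_t1 = normality_score(adj, "t1")
score_t5 = normality_score(adj, "t5")
```

The neighbors of t5 come from different communities and are less relevant to each other, so t5 receives the lower normality score.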


Detecting Disease Outbreaks

• Domain question: early detection of disease outbreaks
  • Anthrax attack
• Data
  • Emergency department visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data
• Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ Flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using a $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
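The scoring of a single rule can be sketched as a Pearson $\chi^2$-test on a 2x2 contingency table whose rows are the current and baseline databases and whose columns count records that match or do not match the rule. The table layout, the function name, and the counts are assumptions for illustration, and every expected count is assumed to be nonzero.

```python
import math


def chi_square_p_value(table):
    """Chi-square test of independence for a 2x2 table (1 degree of freedom).

    Returns (statistic, p_value).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


# Hypothetical counts: [matches rule, does not match] for current and baseline.
rules = {
    "symptom = respiratory": [[30, 70], [100, 900]],
    "day = weekend": [[10, 90], [100, 900]],
}
p_values = {name: chi_square_p_value(tab)[1] for name, tab in rules.items()}
most_significant = min(p_values, key=p_values.get)
```

A small p-value means the rule's match rate today differs from the baseline by more than chance would explain, which is what makes the rule worth reporting.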


Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?
• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


Anomaly Detection in Climate and Weather

• Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often has a height dimension as well)
• Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes
• Climate and weather extreme events are well defined
• Key challenge is to find significant events and explain the cause

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomalies are Widely Used in Climate

Most analysis done on ldquoanomalyrdquo time series

Difference from a ldquobase periodrdquo (Toosimplistic)

Brings spatial smoothness (eg a mountaintop and nearby valley can have very differenttemperatures) Removes seasonality

Understand climate and weatherphenomenon

Southern Oscillation Index (SOI) Difference between Sea Level Pressure

(SLP) anomalies for Tahiti and DarwinAustralia

Figure Global AverageTemperature Anomaly(1975 - 2007) Srcwwwmetofficegovuk

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Constructing Anomalies from Raw Data

Anomaly time series for a given location i

vprimei = vi minus bi

where bi is the base (reference)

How to choose bi Mean of all data for location i Monthly mean values (account for seasonality) Monthly z-score values Median (more robust) Using a shorter ldquoreference periodrdquo

30 year moving window

Different methods show statistically significant differences[19] What is the right strategy Weighted mean of different strategies (Pick weights using

Monte Carlo sampling)

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomaly Detection for Identifying Droughts

Science question Identify significant drought patterns usinghistorical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1 Find precipitationanomalies usingthresholds

2 Find large connectedcomponents across spaceand time

Matlab - bwlabelbwlabeln

Followup Science questionExplain cause

(drought190120061avi)

Figure Video courtsey Dr ArindamBanerjee

Chawla and Chandola Anomaly Detection

drought_1901_2006_1avi
Media File (videoavi)

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water Vapor Content

Anomalies using a threshold

Connected components Example Src -

http newscenter lbl gov

Cold Fronts [23]

Surface winds and Potential temperature fields

Methodology

1 Compute features for every grid2 Cluster grids into K clusters3 Label clusters as anomalous or not using thresholds4 Filter out false positives using domain knowledge

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:
  • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
  • Anomalies are lost among false positives

Solution:
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies

Subspace Analysis Results

Note that during an anomaly the normal component does not change much, while the residual component changes considerably.

Thus, anomalies can be detected by setting a threshold on the residual component.

Figure source: [22]


Discussion: Typical Characteristics of Anomalies

Most anomalies induce a change in the distributional aspects of packet header fields (called features).

The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol.

  » DoS attack: multiple source IP addresses concentrated on a single destination IP address

  » Network scan: dispersed distribution of destination addresses

  » Most worms/viruses also induce some change in the distribution of certain features

  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack

Unlike many previous approaches, this paper aims to detect events that disturb the distribution of traffic features rather than traffic volume.


Limitation of Volume

Port scan anomaly: the traffic feature distribution changes, but traffic volume remains more or less the same.

Figure callouts: the port scan is dwarfed in volume metrics, but it stands out in feature entropy, which also reveals its structure.

We can use entropy to capture the variations in the traffic feature distribution:
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure source: [22]
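A small Python sketch illustrates the two extreme values of entropy. The function name and the port lists are hypothetical; the example assumes each packet contributes one observation of the feature (here, destination port) within a time bin.

```python
import math
from collections import Counter

def feature_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of a traffic feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports observed in one time bin (100 packets each).
concentrated_ports = [80] * 100                 # all traffic to a single port
normal_ports = [80] * 90 + [443] * 10           # typical mix dominated by one port
scan_ports = list(range(1, 101))                # port scan: 100 distinct ports, once each

h_concentrated = feature_entropy(concentrated_ports)  # 0: maximally concentrated
h_normal = feature_entropy(normal_ports)
h_scan = feature_entropy(scan_ports)                  # log2(100): maximally dispersed
```

All three bins carry the same volume (100 packets), so a volume metric cannot separate them, while the entropy of the scan bin is far higher than that of the normal bin.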


Entropy Based versus Volume Based

DoS/DDoS attacks: a spike in traffic toward a dominant destination IP

Scan anomaly: a spike in traffic from a dominant source IP

Flash crowd anomaly: again, a spike in traffic to a dominant destination IP

Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: Identify fraudulent activities or players from observed transaction data.

Data:
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges:
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Insurance (auto, health [a]): Claimant, Provider, Payer

Telecommunications: Customer, Provider

Credit cards: Customer, Supplier, Bank

Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
       • Example: User X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths:
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses:
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
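The activity monitoring idea can be illustrated with a minimal Python sketch. It is not the method of [13]: the profile here is assumed to be just the mean and standard deviation of a historical count, the deviation is a z-score, and the function names and the threshold of 3.0 are hypothetical.

```python
import statistics

def build_profile(history):
    """Summarize historical activity counts as (mean, population standard deviation)."""
    return statistics.mean(history), statistics.pstdev(history)

def deviation_score(profile, current):
    """Number of standard deviations between current activity and the profile mean."""
    mean, std = profile
    if std == 0:
        # A constant history: any change at all is an unbounded deviation.
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / std

def is_anomalous(profile, current, threshold=3.0):
    """Flag current behavior that deviates from the profile by more than `threshold`."""
    return deviation_score(profile, current) > threshold

# Hypothetical data: calls made by user X in January over five previous years.
january_calls = [40, 42, 38, 41, 39]
profile = build_profile(january_calls)

typical = is_anomalous(profile, 41)    # close to the historical mean
suspicious = is_anomalous(profile, 120)  # far above anything seen before
```

Scoring a new observation is a constant-time comparison, which reflects the speed and explainability noted above; the cost is that one such profile must be stored and refreshed per individual.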


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes:
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thus represented as a point in a multi-dimensional feature space
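The first three egonet features can be computed directly from an adjacency structure. The sketch below is an illustration under stated assumptions: the graph is undirected, it is stored as a hypothetical dict-of-dicts `{u: {v: weight}}`, and the egonet of a node is taken to be the node, its neighbors, and all edges among them. The eigenvalue feature is omitted to keep the example dependency-free.

```python
def build_graph(edges):
    """Build a symmetric adjacency dict {u: {v: weight}} from (u, v, weight) triples."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def egonet_features(graph, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`."""
    neighbors = set(graph[node])
    members = neighbors | {node}
    seen = set()
    n_edges = 0
    total_weight = 0.0
    for u in members:
        for v, w in graph[u].items():
            if v not in members:
                continue  # edge leaves the egonet
            edge = frozenset((u, v))
            if edge in seen:
                continue  # undirected edge already counted from the other endpoint
            seen.add(edge)
            n_edges += 1
            total_weight += w
    return len(neighbors), n_edges, total_weight

# Hypothetical star: a hub connected to four leaves that do not talk to each other.
star = build_graph([("hub", f"leaf{k}", 1.0) for k in range(4)])
star_features = egonet_features(star, "hub")

# Hypothetical clique: four nodes, every pair connected with weight 2.0.
names = ["a", "b", "c", "d"]
clique = build_graph([(u, v, 2.0) for i, u in enumerate(names) for v in names[i + 1:]])
clique_features = egonet_features(clique, "a")
```

The star and the clique mark the two extremes of egonet density: in the star the edge count equals the degree, while in the clique it grows quadratically with the degree.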


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

    $$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

    $$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
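The anomaly score can be evaluated with a short Python function. This is a sketch under assumptions: the fitted power-law parameters `C` and `theta` are supplied by the caller (fitting them is not shown), the logarithm is taken to be natural, and both the observed and expected values are assumed to be positive. The function name and the numbers are hypothetical.

```python
import math

def power_law_score(x, y, C, theta):
    """Score how far the observed y is from the power-law prediction C * x**theta.

    The ratio term penalizes relative deviation; the log term penalizes absolute deviation.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# Hypothetical fit E = N (C = 1, theta = 1), i.e. star-like egonets are the norm.
on_the_line = power_law_score(x=10, y=10, C=1.0, theta=1.0)    # matches the fit exactly
slightly_off = power_law_score(x=10, y=12, C=1.0, theta=1.0)   # a few extra edges
near_clique = power_law_score(x=10, y=55, C=1.0, theta=1.0)    # 10 neighbors, fully connected
```

A node that lies exactly on the fitted line scores 0, and the score grows with both the relative and the absolute distance from the line, so the near-clique egonet ranks well above the slightly denser one.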


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$.
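The relevance and normality scores can be sketched in Python as follows. Several choices here are assumptions, not taken from the slides: relevance is computed by a random walk with restart (restart probability 0.15) solved by power iteration, the similarity between two neighbors of $t$ is taken to be the relevance of one with respect to the other, and the graph, node names, and function names are hypothetical.

```python
def build_bipartite(edges):
    """Build an undirected adjacency dict from (a, t) pairs with a in V1 and t in V2."""
    adj = {}
    for a, t in edges:
        adj.setdefault(a, set()).add(t)
        adj.setdefault(t, set()).add(a)
    return adj

def relevance(adj, start, restart=0.15, iters=100):
    """Visit probabilities of a random walk that jumps back to `start` with prob. `restart`."""
    prob = {node: 0.0 for node in adj}
    prob[start] = 1.0
    for _ in range(iters):
        nxt = {node: 0.0 for node in adj}
        for node, p in prob.items():
            share = (1.0 - restart) * p / len(adj[node])
            for neighbor in adj[node]:
                nxt[neighbor] += share
        nxt[start] += restart
        prob = nxt
    return prob

def normality_score(adj, t, restart=0.15):
    """Mean pairwise relevance among the neighbors S_t of node t (off-diagonal entries)."""
    members = sorted(adj[t])
    if len(members) < 2:
        return None  # undefined for fewer than two neighbors
    rel = {a: relevance(adj, a, restart) for a in members}
    pairs = [rel[a][b] for a in members for b in members if a != b]
    return sum(pairs) / len(pairs)

# Hypothetical trader-stock graph: two tight communities plus one stock (t5)
# held by one trader from each community.
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2")]
edges += [(a, t) for a in ("a4", "a5", "a6") for t in ("t3", "t4")]
edges += [("a1", "t5"), ("a4", "t5")]
adj = build_bipartite(edges)

score_t1 = normality_score(adj, "t1")  # neighbors belong to one community
score_t5 = normality_score(adj, "t5")  # neighbors come from different communities
```

In this toy graph the stock whose holders otherwise have little in common receives the lower normality score, which is the kind of node the method is meant to surface.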


Detecting Disease Outbreaks

Domain question: Early detection of disease outbreaks
  • E.g., an anthrax attack

Data:
  • Emergency department visits
  • Grocery data (Example [14])
  • Clinical visits
  • Weather/climate data

Challenges:
  • Weak signals in the data
      • ED cases involving cough ⇒ flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
     • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
     • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
     • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the strongest evidence against independence); repeat from Step 2
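Steps 3 to 5 can be illustrated for 2 x 2 contingency tables with a small Python sketch. It is a simplification rather than WSARE itself: the Bayesian network sampling is not shown, no continuity correction is applied, all row and column totals are assumed to be nonzero, and the rules and counts are hypothetical. With one degree of freedom, the p-value of the $\chi^2$ statistic $s$ is $\mathrm{erfc}(\sqrt{s/2})$.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2 x 2 table; returns (statistic, p-value).

    Rows: current vs. baseline records. Columns: rule matches vs. does not match.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical rules with counts [[current match, current other], [baseline match, baseline other]].
rules = {
    "symptom = respiratory": [[30, 70], [100, 900]],  # 30% today vs. 10% in the baseline
    "age = senior": [[10, 90], [100, 900]],           # 10% today vs. 10% in the baseline
}
p_values = {rule: chi_square_2x2(table)[1] for rule, table in rules.items()}
most_significant = min(p_values, key=p_values.get)
```

The rule whose proportion is unchanged from the baseline gets a p-value of 1, while the rule whose proportion tripled gets a very small p-value and is selected.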


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


Anomaly Detection in Climate and Weather

Science questions:
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data:
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges:
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

The key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

    $$v'_i = v_i - b_i$$

where $b_i$ is the base (reference).

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
      • 30-year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
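Two of the base choices above, monthly means and monthly z-scores, can be sketched in Python. The function name, the input layout (a list of `(month, value)` pairs for one location), and the toy temperatures are assumptions made for illustration; the z-score variant uses the population standard deviation of each calendar month.

```python
import statistics
from collections import defaultdict

def monthly_anomalies(series, use_zscore=False):
    """Compute v' = v - b for one location, where b is the mean of the same calendar month.

    series: list of (month, value) pairs with month in 1..12.
    With use_zscore=True the difference is also divided by the monthly standard deviation.
    """
    by_month = defaultdict(list)
    for month, value in series:
        by_month[month].append(value)
    base = {m: statistics.mean(vals) for m, vals in by_month.items()}
    spread = {m: statistics.pstdev(vals) for m, vals in by_month.items()}
    anomalies = []
    for month, value in series:
        diff = value - base[month]
        if use_zscore:
            diff = diff / spread[month] if spread[month] > 0 else 0.0
        anomalies.append(diff)
    return anomalies

# Hypothetical temperatures for one location: January and July of two years.
series = [(1, 2.0), (7, 20.0), (1, 4.0), (7, 26.0)]
mean_based = monthly_anomalies(series)
zscore_based = monthly_anomalies(series, use_zscore=True)
```

Subtracting the monthly mean removes the seasonal cycle (January and July are each compared with their own base), and the z-score variant additionally puts months with different variability on a common scale.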


J-C JusemAutomated detection of frontal systems from numerical model-generated dataIn Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in datamining KDD rsquo05 pages 782ndash787 New York NY USA 2005 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

M V Mahoney and P K Chan

Learning rules for anomaly detection of hostile network trafficIn ICDM pages 601ndash604 2003

S V Mane

False negative estimation theory techniques and applicationsProQuest UMI Dissertation Publishing 2011

A Moore and M S Lee

Cached sufficient statistics for efficient machine learning with large datasetsJ Artif Int Res 867ndash91 March 1998

Y Pei and O Zaane

A synthetic data generator for clustering and outlier analysisTechnical report University of Alberta 2006

S Ramaswamy R Rastogi and K Shim

Efficient algorithms for mining outliers from large data setsIn Proceedings of the 2000 ACM SIGMOD international conference on Management of data pages427ndash438 ACM Press 2000

N Satish M Harris and M Garland

Designing efficient sorting algorithms for manycore gpusIn IPDPS pages 1ndash10 2009

X Song M Wu C Jermaine and S Ranka

Conditional anomaly detectionIEEE Trans on Knowl and Data Eng 19631ndash645 May 2007

D A Spielman and N Srivastava

Graph sparsification by effective resistancesIn Proceedings of the 40th annual ACM symposium on Theory of computing STOC rsquo08 pages 563ndash568New York NY USA 2008 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

P Sun S Chawla and B Arunasalam

Mining for outliers in sequential databasesIn In SIAM International Conference on Data Mining 2006

W-K Wong A Moore G Cooper and M Wagner

Bayesian network anomaly pattern detection for disease outbreaksIn T Fawcett and N Mishra editors Proceedings of the Twentieth International Conference on MachineLearning pages 808ndash815 Menlo Park California August 2003 AAAI Press

L Xiong B Poczos and J Schneider

Group anomaly detection using flexible genre modelsIn NIPS 2011

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Acknowledgements

Linsey Pang Tara Babie and Khoa Nguyen (University ofSydney)

Arindam Banerjee (University of Minnesota)

Chawla and Chandola Anomaly Detection

Discussion: Typical Characteristics of Anomaly

• Most anomalies induce a change in distributional aspects of packet header fields (called features)
• The most important features are the 5-tuple: source and destination IP addresses, source and destination port numbers, and IP protocol
  » DoS attack – multiple source IP addresses concentrated on a single destination IP address
  » Network scan – dispersed distribution of destination addresses
  » Most worms/viruses also induce some change in the distribution of certain features
  » However, these changes can be very subtle, and mining them is like searching for needles in a haystack
• Unlike many previous approaches, this paper aims to detect events which disturb the distribution of traffic features rather than traffic volume

Limitation of Volume

• Port scan anomaly: the traffic feature changes, however traffic volume remains more or less the same
• We can use entropy to capture the variations in the traffic feature
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

[Figure: a port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure. Figure source: [22]]
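The two boundary values of entropy quoted on this slide can be checked with a minimal sketch. The port lists below are hypothetical and are not taken from [22]; `sample_entropy` is an illustrative helper name. Both time bins carry the same number of packets, so a volume metric cannot tell them apart, while the entropy of the destination-port feature does.

```python
import math
from collections import Counter

def sample_entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical destination ports observed in one time bin (100 packets each).
normal_ports = [80] * 90 + [443] * 10   # traffic concentrated on two ports
scan_ports = list(range(1000, 1100))    # port scan: 100 distinct ports
single_port = [80] * 100                # maximally concentrated

h_normal = sample_entropy(normal_ports)
h_scan = sample_entropy(scan_ports)     # equals log2(100), maximally dispersed
h_single = sample_entropy(single_port)  # equals 0

print(len(normal_ports), len(scan_ports))  # identical volume
print(h_normal, h_scan, h_single)
```

In practice the entropy would be computed per time bin and per feature (source IP, destination IP, source port, destination port), giving one time series per feature.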

Entropy Based versus Volume Based

• DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
• Scan anomaly: a spike in traffic from a dominant source IP
• Flash crowd anomaly: again, a spike in traffic to a dominant destination IP
• Worm anomaly: a spike in traffic with a dominant port

Fraud Detection

• Domain question: Identify fraudulent activities or players from observed transaction data
• Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure
• Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Domains and their players:
• Insurance (auto/health [a]): Claimant, Provider, Payer
• Telecommunications: Customer, Provider
• Credit Cards: Customer, Supplier, Bank
• Web Advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud

A Generic Fraud Detection Method

• Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     • User X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations
• Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies
• Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain
• Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
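The two activity-monitoring steps can be sketched as follows. This is only an illustration under stated assumptions: the profile is reduced to a mean and standard deviation, the deviation test is a simple z-score, and the threshold of 3 as well as the call counts are hypothetical. The method in [13] is more general than this.

```python
import statistics

def build_profile(history):
    """Step 1: summarise historical activity as (mean, standard deviation)."""
    return statistics.mean(history), statistics.stdev(history)

def deviation_score(current, profile):
    """Step 2: distance of current activity from the profile, in standard deviations."""
    mean, std = profile
    if std == 0:
        return 0.0 if current == mean else float("inf")
    return abs(current - mean) / std

# Hypothetical data: calls made by user X in January of previous years.
january_calls = [40, 42, 38, 41, 39]
profile = build_profile(january_calls)

THRESHOLD = 3.0  # assumed cut-off, not taken from the tutorial
flag_typical = deviation_score(41, profile) > THRESHOLD    # close to the profile
flag_unusual = deviation_score(120, profile) > THRESHOLD   # far from the profile
print(flag_typical, flag_unusual)
```

Scoring is a constant-time lookup per user, which matches the "fast" strength listed above; the cost is one stored profile per individual.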

Exploiting Graph Structure - Weighted Graphs

• Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks
• Objective: Identify anomalous nodes
• For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]
• Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node $i$
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet
• Each node is transformed into a point in a multi-dimensional space
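A minimal sketch of the first three egonet features, assuming an undirected graph stored as a dictionary from edge to weight. The function name `egonet_features` and the toy graph are hypothetical; the principal eigenvalue is left out here to keep the example free of numerical libraries.

```python
def egonet_features(edges, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    `edges` maps an undirected edge (u, v) to its weight. The egonet is the
    node, its direct neighbours, and every edge among those nodes.
    """
    neighbors = set()
    for u, v in edges:
        if u == node and v != node:
            neighbors.add(v)
        elif v == node and u != node:
            neighbors.add(u)
    members = neighbors | {node}
    ego_edges = {e: w for e, w in edges.items()
                 if e[0] in members and e[1] in members}
    return len(neighbors), len(ego_edges), sum(ego_edges.values())

# Hypothetical weighted communication graph.
edges = {
    ("a", "b"): 2.0, ("a", "c"): 1.0, ("a", "d"): 4.0,
    ("b", "c"): 3.0,
    ("d", "e"): 5.0,
}
features = {n: egonet_features(edges, n) for n in "abcde"}
print(features["a"])  # degree, edge count and total weight of a's egonet
```

Each tuple is the point in feature space that the slide refers to, so any point-based detector can be run on `features.values()`.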

Identifying Anomalies

• Traditional anomaly detection (LOF)
  • Can be slow but can identify any type of anomalous structure
• Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detect near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

• Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
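The score above can be sketched in a few lines. The fit used here is ordinary least squares in log-log space, which is an assumption made for this sketch; the slide does not say how $C$ and $\theta$ are estimated. The data points are hypothetical and follow $E = N^{1.5}$ exactly so that the fitted parameters are easy to verify.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    theta = (sum((a - mx) * (b - my) for a, b in zip(lx, ly))
             / sum((a - mx) ** 2 for a in lx))
    return math.exp(my - theta * mx), theta

def anomaly_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical (N_i, E_i) pairs lying exactly on E = N^1.5.
degrees = [2, 4, 8, 16, 32]
edge_counts = [n ** 1.5 for n in degrees]
C, theta = fit_power_law(degrees, edge_counts)

score_typical = anomaly_score(10, 10 ** 1.5, C, theta)  # on the fitted line
score_star = anomaly_score(10, 10, C, theta)    # star: only the 10 spokes exist
score_clique = anomaly_score(10, 55, C, theta)  # clique on all 11 egonet nodes
print(score_typical, score_star, score_clique)
```

A node on the fitted line scores close to zero, while both the star (too few edges) and the clique (too many) score well above it, because the ratio and the log-distance both grow with the deviation in either direction.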

Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  • Healthcare data (Beneficiaries vs. Providers)
  • Insider trading (Traders vs. Stocks)
• Objective: Identify anomalous links
• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore(a, b) $\propto$ number of times a "random walk" from $a$ reaches $b$
• Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

[Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn}, joined by the edge set E]
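The procedure on this slide can be sketched as below, under several assumptions that the slide does not fix: the random walk is a random walk with restart (restart probability 0.15, run for a fixed number of iterations), and entry $(a, b)$ of the similarity matrix is taken to be the relevance of $b$ with respect to $a$. The graph, the node names and both function names are hypothetical.

```python
def relevance_scores(adj, start, restart=0.15, iters=100):
    """Visit probabilities of a random walk with restart from `start`.

    `adj` maps every node to the list of its neighbours (no isolated nodes).
    """
    scores = {n: 0.0 for n in adj}
    scores[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, p in scores.items():
            for m in adj[n]:
                nxt[m] += (1 - restart) * p / len(adj[n])
        nxt[start] += restart
        scores = nxt
    return scores

def normality_score(adj, t):
    """Mean off-diagonal relevance among the neighbours S_t of node t."""
    s_t = sorted(adj[t])
    if len(s_t) < 2:
        return None  # undefined with fewer than two neighbours
    rel = {a: relevance_scores(adj, a) for a in s_t}
    off_diag = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diag) / len(off_diag)

# Hypothetical bipartite graph: two tight communities bridged only by t5.
links = {
    "a1": ["t1", "t2", "t5"], "a2": ["t1", "t2"], "a3": ["t1", "t2"],
    "a4": ["t3", "t4", "t5"], "a5": ["t3", "t4"],
}
adj = {a: list(ts) for a, ts in links.items()}
for a, ts in links.items():
    for t in ts:
        adj.setdefault(t, []).append(a)

score_t1 = normality_score(adj, "t1")  # neighbours share one community
score_t5 = normality_score(adj, "t5")  # neighbours come from two communities
print(score_t1, score_t5)
```

Node `t5` receives the lower normality score because its two neighbours are only weakly relevant to each other, which is the pattern the slide flags as anomalous.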

Detecting Disease Outbreaks

• Domain question: Early detection of disease outbreaks
  • Anthrax attack
• Data
  • Emergency department visits
  • Grocery data (Example [14])
  • Clinical visits
  • Weather/climate data
• Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ Flu or SARS
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations

What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   • Environmental and response variables
2. Sample from the BN ($DB_{baseline}$ | current environment)
3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$
   • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
4. Find the p-value for each rule using the $\chi^2$-test
   • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
5. Find the rule with the smallest p-value (the most significant difference). Repeat from Step 2.
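Steps 3 to 5 can be illustrated with a minimal sketch for single-assignment rules. The counts and rule names are hypothetical, and the p-value comes from Pearson's chi-square test on a 2x2 table without continuity correction, which is an assumption of this sketch rather than a statement about [33]. For one degree of freedom the chi-square survival function equals `erfc(sqrt(x / 2))`, so only the standard library is needed.

```python
import math

def chi_square_p_value(table):
    """p-value of Pearson's chi-square independence test for a 2x2 table.

    Rows: records that match / do not match the rule.
    Columns: DB_current / DB_baseline. All expected counts must be non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    chi2 = 0.0
    for i, r in enumerate(rows):
        for j, col in enumerate(cols):
            expected = r * col / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))

# Hypothetical counts: 100 current records, 1000 sampled baseline records.
rules = {
    "symptom = respiratory": [[45, 200], [55, 800]],   # 45% now vs 20% baseline
    "age = senior":          [[20, 200], [80, 800]],   # 20% in both
}
p_values = {rule: chi_square_p_value(t) for rule, t in rules.items()}
best_rule = min(p_values, key=p_values.get)  # smallest p-value is most significant
print(best_rule, p_values[best_rule])
```

Because many rules are scored and only the best one is kept, the raw p-value of the winner overstates its significance; some correction for multiple testing is needed before raising an alert.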

Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?
• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train

Anomaly Detection in Climate and Weather

• Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)
• Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes
• Climate and weather extreme events are well defined
• Key challenge is to find significant events and explain the cause

Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic?)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality
• Understand climate and weather phenomena
  • Southern Oscillation Index (SOI)
    • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk

Constructing Anomalies from Raw Data

• Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

  where $b_i$ is the base (reference)
• How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30 year moving window
• Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
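Two of the choices of $b_i$ listed above can be compared with a minimal sketch: the mean of all data versus monthly means. The temperature values and both function names are hypothetical and only serve to show why the monthly base removes seasonality while the overall mean does not.

```python
def overall_mean_anomalies(values):
    """Base b_i = mean of all observations at the location."""
    base = sum(values) / len(values)
    return [v - base for v in values]

def monthly_mean_anomalies(values, months):
    """Base b_i = mean of the observations from the same calendar month."""
    sums, counts = {}, {}
    for v, m in zip(values, months):
        sums[m] = sums.get(m, 0.0) + v
        counts[m] = counts.get(m, 0) + 1
    base = {m: sums[m] / counts[m] for m in sums}
    return [v - base[m] for v, m in zip(values, months)]

# Hypothetical temperatures at one location: January and July of two years.
temps = [0.0, 20.0, 2.0, 26.0]
months = [1, 7, 1, 7]

overall = overall_mean_anomalies(temps)          # seasonal cycle still dominates
monthly = monthly_mean_anomalies(temps, months)  # seasonal cycle removed
print(overall)
print(monthly)
```

With the overall mean, every January looks anomalously cold and every July anomalously warm. With monthly means, only the year-to-year differences remain, which is what the anomaly series is meant to expose.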

Anomaly Detection for Identifying Droughts

• Science question: Identify significant drought patterns using historical observation data or future simulation data or both
• Find persistent spatiotemporal anomalies in precipitation data
• A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     • Matlab: bwlabel/bwlabeln
• Follow-up science question: Explain cause

Figure: (drought_1901_2006_1.avi) Video courtesy Dr. Arindam Banerjee
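The two-step approach can be sketched without Matlab as a threshold followed by a breadth-first search over a (time, row, column) grid. The threshold, the minimum component size, the use of 6-connectivity and the anomaly values are all assumptions made for this sketch, and `label_components` is a hypothetical helper, not a port of `bwlabeln`.

```python
from collections import deque

STEPS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))

def label_components(mask):
    """Return the 6-connected components of True cells in mask[t][y][x]."""
    T, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    seen, components = set(), []
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not mask[t][y][x] or (t, y, x) in seen:
                    continue
                seen.add((t, y, x))
                queue, comp = deque([(t, y, x)]), []
                while queue:
                    ct, cy, cx = queue.popleft()
                    comp.append((ct, cy, cx))
                    for dt, dy, dx in STEPS:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                                and mask[nt][ny][nx]
                                and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                components.append(comp)
    return components

# Hypothetical precipitation anomalies: 3 time steps on a 2 x 3 grid.
anomalies = [
    [[-2.0, -1.5, 0.1], [0.2, 0.0, 0.3]],
    [[-1.8, 0.1, 0.0], [0.1, 0.2, -1.2]],
    [[-1.6, 0.0, 0.2], [0.3, 0.1, 0.0]],
]
THRESHOLD = -1.0  # assumed: a cell is dry when its anomaly is below this value
MIN_SIZE = 3      # assumed: smaller components are treated as noise

# Step 1: threshold. Step 2: connected components across space and time.
dry = [[[v < THRESHOLD for v in row] for row in grid] for grid in anomalies]
components = label_components(dry)
droughts = [c for c in components if len(c) >= MIN_SIZE]
print(sorted(len(c) for c in components), len(droughts))
```

Treating time as a third grid axis is what makes a component "persistent": the dry cell at row 0, column 0 is linked across all three time steps, while the isolated dry cell at the second time step is discarded.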

More Climate Extreme Events using Anomaly Detection

• Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov
• Cold Fronts [23]
  • Surface winds and potential temperature fields
  • Methodology
    1. Compute features for every grid cell
    2. Cluster grid cells into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge

Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
  • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives
• Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Acknowledgements

Linsey Pang Tara Babie and Khoa Nguyen (University ofSydney)

Arindam Banerjee (University of Minnesota)

Chawla and Chandola Anomaly Detection

  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemeology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies
Page 59: Anomaly Detection: A Tutorial - webdocs.cs.ualberta.cawebdocs.cs.ualberta.ca/~icdm2011/downloads/ICDM2011_anomaly_detection... · IntroductionData Mining MethodsDensity-based anomalyCommute

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Limitation of Volume

Figure source: [22]

Port scan anomaly (traffic features change, but traffic volume remains more or less the same)

We can use entropy to capture the variations in the traffic features
  • Takes value 0 when the distribution is maximally concentrated
  • Takes value $\log_2 N$ when the distribution is maximally dispersed

Figure annotations: the port scan is dwarfed in volume metrics, but stands out in feature entropy, which also reveals its structure.
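As a rough illustration of the entropy idea on this slide, the sketch below computes the empirical entropy of one traffic feature (destination port) for two time bins that carry the same number of packets. The function name `feature_entropy` and the toy port lists are assumptions made for this example; they do not come from the tutorial or from [22].

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical entropy (in bits) of the distribution of a traffic feature."""
    counts = Counter(values)
    total = len(values)
    # Sum of p * log2(1/p) over the observed feature values.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# Two hypothetical time bins with identical volume (8 packets each).
normal_bin = [80, 80, 80, 80, 80, 80, 80, 80]   # all traffic goes to one port
scan_bin = [21, 22, 23, 25, 53, 80, 110, 443]   # port scan: every port differs

print(feature_entropy(normal_bin))  # maximally concentrated: 0 bits
print(feature_entropy(scan_bin))    # maximally dispersed: log2(8) = 3 bits
```

Both bins have the same volume, so a volume metric cannot separate them, while the entropy moves from its minimum to its maximum.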


Entropy Based versus Volume Based

  • DoS/DDoS attacks: a spike in traffic toward a dominant destination IP
  • Scan anomaly: a spike in traffic from a dominant source IP
  • Flash crowd anomaly: again a spike in traffic to a dominant destination IP
  • Worm anomaly: a spike in traffic with a dominant port


Fraud Detection

Domain question: Identify fraudulent activities or players from observed transaction data

Data
  • Transactions between different players in the system
  • Meta information about the individuals
  • An underlying graph structure

Challenges
  • Track and model human behavior
  • Anomalies caused by adaptive human adversaries
  • Massive data sizes

Example domains and their players
  • Insurance (auto, health [a]): Claimant, Provider, Payer
  • Telecommunications: Customer, Provider
  • Credit cards: Customer, Supplier, Bank
  • Web advertising: User, Advertiser, Publisher

[a] http://www.cheatingculture.com/past-cases-of-medicare-fraud


A Generic Fraud Detection Method

Activity monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
    • User X makes n calls on average in January
  2. Compare current behavior with the historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
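The two activity-monitoring steps can be sketched as follows. This is a minimal example that assumes a profile is just the mean and standard deviation of a user's historical daily call counts and that "significant deviation" means a z-score above a fixed threshold; the names `build_profile` and `is_significant_deviation`, the threshold of 3, and the sample data are all hypothetical and are not taken from [13].

```python
from statistics import mean, stdev

def build_profile(history):
    """Step 1: summarize a user's historical activity (e.g. daily call counts)."""
    return {"mean": mean(history), "std": stdev(history)}

def is_significant_deviation(profile, current, z_threshold=3.0):
    """Step 2: flag current behavior that lies far from the historical profile."""
    if profile["std"] == 0:
        return current != profile["mean"]
    z = abs(current - profile["mean"]) / profile["std"]
    return z > z_threshold

january_calls = [10, 12, 9, 11, 10, 13, 11, 10]  # hypothetical history for user X
profile = build_profile(january_calls)

print(is_significant_deviation(profile, 11))  # a typical day
print(is_significant_deviation(profile, 60))  # a sudden burst of calls
```

The per-user dictionary also makes the listed weaknesses concrete: one such profile has to be stored and refreshed for every user, and a user with little history yields an unreliable standard deviation.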


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of the node
  2. $E_i$: number of edges in the egonet
  3. $W_i$: total weight of the egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is transformed into a point in a multi-dimensional space
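A small sketch of the egonet feature extraction, under the assumption that the egonet is the subgraph induced by a node and its direct neighbors and that the graph is undirected. The adjacency format (a dict of dicts), the function name `egonet_features`, and the use of shifted power iteration for the principal eigenvalue are choices made for this example, not the implementation from [1].

```python
from itertools import combinations

def egonet_features(adj, ego, iters=200):
    """Return (N_i, E_i, W_i, lambda_i) for the egonet of `ego`.

    adj maps each node to a dict {neighbor: weight}; the graph is assumed
    undirected, so adj[u][v] == adj[v][u].
    """
    members = [ego] + sorted(adj[ego])
    # Edges of the subgraph induced by the ego and its neighbors.
    edges = [(u, v, adj[u][v]) for u, v in combinations(members, 2) if v in adj[u]]
    n_i = len(adj[ego])
    e_i = len(edges)
    w_i = sum(w for _, _, w in edges)

    # Power iteration on A + I; the identity shift avoids oscillation on
    # bipartite egonets such as stars.
    pos = {u: k for k, u in enumerate(members)}
    x = [1.0] * len(members)
    for _ in range(iters):
        y = x[:]
        for u, v, w in edges:
            y[pos[u]] += w * x[pos[v]]
            y[pos[v]] += w * x[pos[u]]
        norm = sum(val * val for val in y) ** 0.5
        x = [val / norm for val in y]
    # Rayleigh quotient x^T A x for the unit vector x.
    lam = sum(2 * w * x[pos[u]] * x[pos[v]] for u, v, w in edges)
    return n_i, e_i, w_i, lam

# Hypothetical toy graphs: a star centered at "c" and a 3-clique.
star = {"c": {"l1": 1.0, "l2": 1.0, "l3": 1.0},
        "l1": {"c": 1.0}, "l2": {"c": 1.0}, "l3": {"c": 1.0}}
clique = {"a": {"b": 1.0, "c": 1.0},
          "b": {"a": 1.0, "c": 1.0},
          "c": {"a": 1.0, "b": 1.0}}

print(egonet_features(star, "c"))
print(egonet_features(clique, "a"))
```

For the star the egonet has as many edges as neighbors, while for the clique it has every possible edge among its members; this contrast is what the feature pairs are meant to expose.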


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and their power-law relationship
  • E.g., Egonet Density Power Law, $N_i$ vs. $E_i$: detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node $i$ w.r.t. a pair of features ($y = Cx^{\theta}$):

$$S_i = \frac{\max(y_i, Cx_i^{\theta})}{\min(y_i, Cx_i^{\theta})} \cdot \log\left(|y_i - Cx_i^{\theta}| + 1\right)$$
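The score above can be tried out with a short sketch. It assumes that $C$ and $\theta$ are obtained by an ordinary least-squares fit in log-log space, which is one common way to fit a power law and is an assumption of this example; the function names and the synthetic data are likewise hypothetical.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    theta = sxy / sxx
    return math.exp(my - theta * mx), theta

def anomaly_score(x, y, C, theta):
    """S_i = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Synthetic (N_i, E_i) pairs that follow E = 2 * N^1.5 exactly.
degrees = [1, 2, 4, 8, 16, 32]
edge_counts = [2 * n ** 1.5 for n in degrees]
C, theta = fit_power_law(degrees, edge_counts)

print(C, theta)                           # close to 2 and 1.5
print(anomaly_score(4, 16, C, theta))     # on the fitted line: score near 0
print(anomaly_score(4, 64, C, theta))     # far above the line: large score
```

The ratio term measures how many times larger or smaller the observed value is than the fitted one, and the logarithmic term adds the absolute distance, so a point on the line scores zero regardless of its scale.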


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • RelevanceScore$(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$

Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$.
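The normality score can be sketched as below. The slide only says that relevance is proportional to how often a random walk from $a$ reaches $b$; this example assumes a random walk with restart (restart probability 0.15) evaluated by fixed-point iteration rather than by simulation, and it assumes every node has at least one neighbor. All function names and the toy graph are hypothetical.

```python
def build_neighbors(edges):
    """Undirected neighbor sets for a bipartite edge list of (a, t) pairs."""
    neighbors = {}
    for a, t in edges:
        neighbors.setdefault(a, set()).add(t)
        neighbors.setdefault(t, set()).add(a)
    return neighbors

def relevance(neighbors, start, restart=0.15, iters=100):
    """Visit probabilities of a random walk with restart from `start`."""
    p = {u: 0.0 for u in neighbors}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in neighbors}
        for u, mass in p.items():
            share = (1 - restart) * mass / len(neighbors[u])
            for v in neighbors[u]:
                nxt[v] += share
        nxt[start] += restart  # the walker jumps back to the query node
        p = nxt
    return p

def normality_score(neighbors, t):
    """Mean off-diagonal relevance among the V1 nodes linked to t."""
    s_t = sorted(neighbors[t])
    if len(s_t) < 2:
        return None  # undefined for nodes with a single link
    rel = {a: relevance(neighbors, a) for a in s_t}
    total = sum(rel[a][b] for a in s_t for b in s_t if a != b)
    return total / (len(s_t) * (len(s_t) - 1))

# Hypothetical graph: a1 and a2 share t1 and t2; t5 bridges a1 and a3.
edges = [("a1", "t1"), ("a1", "t2"), ("a2", "t1"), ("a2", "t2"),
         ("a3", "t3"), ("a3", "t4"), ("a1", "t5"), ("a3", "t5")]
neighbors = build_neighbors(edges)

print(normality_score(neighbors, "t1"))  # linked nodes are closely related
print(normality_score(neighbors, "t5"))  # linked nodes are weakly related
```

A low normality score for `t5` reflects that it connects nodes from otherwise separate neighborhoods, which is the kind of link pattern this slide targets.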


Detecting Disease Outbreaks

Domain question: Early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department visits
  • Grocery data (example: [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

  1. Learn a Bayesian network from historical data
    • Environmental and response variables
  2. Sample from the BN ($DB_{baseline}$ | current environment)
  3. Compute contingency tables for rules for $DB_{baseline}$ and $DB_{current}$
    • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
  4. Find the p-value for rules using the $\chi^2$-test
    • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
  5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
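To make the rule-scoring step concrete, here is a minimal sketch of a Pearson $\chi^2$ test on a single 2x2 contingency table (current vs. baseline records, rule matches vs. does not match). It assumes no continuity correction and non-zero row and column totals; the function name and the counts are invented for illustration and are not taken from [33].

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value for the table [[a, b], [c, d]].

    Rows: current day / baseline. Columns: rule matches / rule does not match.
    With one degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat, math.erfc(math.sqrt(stat / 2))

# Hypothetical rule "symptom = cough": 30 of 100 current cases match,
# against 100 of 1000 baseline cases.
stat, p = chi2_2x2(30, 70, 100, 900)
print(stat, p)

# A rule that matches the same fraction in both sets is not significant.
print(chi2_2x2(10, 10, 10, 10))
```

A small p-value means the rule's match rate today differs from the baseline by more than chance would explain, so rules are ranked by ascending p-value.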


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


Anomaly Detection in Climate and Weather

Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality

Understand climate and weather phenomena
  • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location $i$:

$$v'_i = v_i - b_i$$

where $b_i$ is the base (reference)

How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30 year moving window

Different methods show statistically significant differences [19]
  • What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
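One of the base choices listed above, the monthly mean, can be sketched in a few lines. The example assumes a single location with one value per month, starting in January and covering whole years; the function name `monthly_anomalies` and the temperature values are made up for illustration.

```python
from statistics import mean

def monthly_anomalies(values):
    """Subtract each calendar month's long-term mean from a monthly series.

    `values` is assumed to start in January and hold one value per month.
    """
    base = [mean(values[m::12]) for m in range(12)]  # b_i, one per calendar month
    return [v - base[k % 12] for k, v in enumerate(values)]

# Two hypothetical years with the same seasonal cycle, except that July
# of the second year is 2 degrees warmer.
year1 = [0.0, 1.0, 3.0, 6.0, 10.0, 14.0, 16.0, 15.0, 11.0, 7.0, 3.0, 1.0]
year2 = list(year1)
year2[6] += 2.0
anomalies = monthly_anomalies(year1 + year2)

print(anomalies[6], anomalies[18])  # the two Julys deviate in opposite directions
print(anomalies[0], anomalies[12])  # January is identical in both years
```

The seasonal cycle disappears from the output, leaving only the departure of the second July from its monthly base; swapping `mean` for `median` gives the more robust variant mentioned in the list.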


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
    • Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
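The two-step approach can be sketched in plain Python as a stand-in for Matlab's `bwlabeln`. The example assumes the data is a nested list indexed as `[time][row][column]`, that a drought cell is one whose precipitation anomaly falls below a threshold, and that cells are connected through shared faces (6-connectivity); the connectivity choice, function names, and toy values are assumptions of this sketch.

```python
from collections import deque

def drought_mask(anomalies, threshold):
    """Step 1: mark cells whose precipitation anomaly is below the threshold."""
    return [[[v < threshold for v in row] for row in grid] for grid in anomalies]

def label_components(mask):
    """Step 2: label face-connected components across time and space.

    Returns (labels, sizes), where labels has the same shape as mask
    (0 = background) and sizes maps each label to its cell count.
    """
    T, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * X for _ in range(Y)] for _ in range(T)]
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    sizes = {}
    current = 0
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not mask[t][y][x] or labels[t][y][x]:
                    continue
                current += 1
                labels[t][y][x] = current
                queue = deque([(t, y, x)])
                size = 0
                while queue:  # breadth-first flood fill
                    ct, cy, cx = queue.popleft()
                    size += 1
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                                and mask[nt][ny][nx] and not labels[nt][ny][nx]):
                            labels[nt][ny][nx] = current
                            queue.append((nt, ny, nx))
                sizes[current] = size
    return labels, sizes

# Hypothetical anomalies for 2 time steps on a 2 x 3 grid.
anomalies = [
    [[-2.0, -2.0, 0.0], [0.0, 0.0, 0.0]],
    [[-2.0, 0.0, 0.0], [0.0, 0.0, -2.0]],
]
labels, sizes = label_components(drought_mask(anomalies, threshold=-1.0))
print(sizes)
```

Keeping only the components whose size exceeds some minimum then retains the persistent, spatially extended events and discards isolated dry cells.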


More Climate Extreme Events using Anomaly Detection

Atmospheric rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example source: http://newscenter.lbl.gov

Cold fronts [23]
  • Surface winds and potential temperature fields

Methodology
  1. Compute features for every grid
  2. Cluster grids into K clusters
  3. Label clusters as anomalous or not using thresholds
  4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors

References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Entropy Based versus Volume Based

DoSDDoS Attacks- a spike in traffic data toward adominant destination IP

Scan anomaly-a spike in traffic data from a dominant sourceIP

Flash Crowd anomaly- again a spike in traffic data to adominant destination IP

Worm anomaly-a Spike in traffic with a dominant port

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Fraud Detection

Domain Question Identifyfraudulent activities or players fromobserved transaction data

Data Transactions between different

players in the system Meta information about the

individuals An underlying graph structure

Challenges Track and model human behavior Anomalies caused by adaptive

human adversaries Massive data sizes

Insurance (autohealtha)

ClaimantProvider Payer

Telecommunications Customer

Provider

Credit Cards Customer

Supplier Bank

Web Advertising User Advertiser

Publisher

ahttpwwwcheatingculturecompast-cases-of-medicare-fraud

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

A Generic Fraud Detection Method

Activity Monitoring [13]1 Build profiles for individuals (customers users etc) based on

historic data User X makes n calls on an average in January

2 Compare current behavior with historical profile for significantdeviations

Clustering based [4]1 Cluster historical profiles of customers2 Identify small clusters or outlying profiles as anomalies

Strengths Anomaly detection is fast (good for real time) Results are easy to explain

Weaknesses Need to create and maintain a large number of profiles Not dynamic Adequate historical data might not be available Too many false positives

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
- Communication networks (phone, email, SMS)
- Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
1. N_i: degree of node i
2. E_i: number of edges in the egonet
3. W_i: total weight of the egonet
4. λ_i: principal eigenvalue of the weighted adjacency matrix of the egonet

Each node is thereby transformed into a point in a multi-dimensional space
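As a rough illustration of the feature extraction, the sketch below computes N_i, E_i, and W_i for one node. The graph representation (a dict of dicts of weights) and the helper names `add_edge` and `egonet_features` are assumptions made for this example; the principal eigenvalue λ_i is left out to keep the sketch dependency-free.

```python
def add_edge(adj, u, v, w=1.0):
    """Insert an undirected weighted edge into adj: {node: {neighbor: weight}}."""
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

def egonet_features(adj, node):
    """Return (N_i, E_i, W_i) for the egonet of `node`.

    The egonet is the sub-graph induced by the node and its direct neighbors.
    """
    members = set(adj[node]) | {node}
    n_neighbors = len(adj[node])
    n_edges, total_weight = 0, 0.0
    seen = set()
    for u in members:
        for v, w in adj[u].items():
            edge = frozenset((u, v))
            if v in members and edge not in seen:
                seen.add(edge)  # count each undirected edge once
                n_edges += 1
                total_weight += w
    return n_neighbors, n_edges, total_weight

adj = {}
# A star: center "s" with four leaves that are not connected to each other.
for leaf in ["a", "b", "c", "d"]:
    add_edge(adj, "s", leaf, 1.0)
# A clique: four nodes, all pairs connected with weight 2.
clique = ["k1", "k2", "k3", "k4"]
for i, u in enumerate(clique):
    for v in clique[i + 1:]:
        add_edge(adj, u, v, 2.0)

star_features = egonet_features(adj, "s")
clique_features = egonet_features(adj, "k1")
print(star_features, clique_features)
```

In the star the edge count equals the neighbor count, while in the clique it grows quadratically with the neighbor count, which is the contrast the feature pairs are meant to expose.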


Identifying Anomalies

Traditional anomaly detection (e.g., LOF)
- Can be slow, but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
- Identify relevant feature pairs and their power-law relationship
- E.g., Egonet Density Power Law: N_i vs. E_i, which detects near-cliques and stars

$$E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$$

Anomaly score for node i with respect to a pair of features related by $y = C x^{\theta}$:

$$S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$$
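The score can be tried out on a toy feature pair. In the sketch below, C and θ are estimated by ordinary least squares in log-log space, which is an assumption of this example rather than something the slide specifies; the data values and the function names `fit_power_law` and `anomaly_score` are likewise made up for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    theta = sxy / sxx
    return math.exp(my - theta * mx), theta

def anomaly_score(x, y, C, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = C * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)

# Hypothetical (N_i, E_i) pairs; the last node is a near-clique (10 neighbors, 45 edges).
N = [2, 3, 4, 5, 6, 8, 10, 10]
E = [3, 4, 6, 8, 10, 15, 20, 45]

C, theta = fit_power_law(N, E)
scores = [anomaly_score(n, e, C, theta) for n, e in zip(N, E)]
most_anomalous = max(range(len(scores)), key=scores.__getitem__)
print(round(theta, 2), most_anomalous)
```

The ratio term measures how many times larger the observed value is than the fitted one (or the reverse), and the log term adds the size of the absolute gap, so a node scores high only when it is far from the fitted line in both senses.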


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
- Healthcare data (Beneficiaries vs. Providers)
- Insider trading (Traders vs. Stocks)

Objective: Identify anomalous links

Given a query node a ∈ V1, find the "relevance" of all other nodes in V1 to a
- RelevanceScore(a, b) ∝ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality score for a node t ∈ V2
- Find the set S_t = {a | ⟨a, t⟩ ∈ E}
- Compute the |S_t| × |S_t| similarity matrix using the relevance vectors for a ∈ S_t
- Normality Score = mean of the non-diagonal entries of the similarity matrix

Figure: bipartite graph with node sets V1 = {a1, a2, a3, ..., an} and V2 = {t1, t2, t3, ..., tn} connected by edges E
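A small sketch of the normality score follows. Several details are assumptions of this example and are not stated on the slide: relevance is computed by a random walk with restart (restart probability 0.15, fixed iteration count), and entry (a, b) of the similarity matrix is taken to be the relevance of b with respect to a. The function names and the toy graph are hypothetical.

```python
def rwr_relevance(adj, query, restart=0.15, iters=100):
    """Random walk with restart from `query` on an undirected graph.

    adj maps each node to a non-empty list of neighbors. Returns the
    approximate visiting probability of every node.
    """
    p = {u: 0.0 for u in adj}
    p[query] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        for u, mass in p.items():
            share = (1.0 - restart) * mass / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        nxt[query] += restart  # the walker jumps back to the query node
        p = nxt
    return p

def normality_score(adj, t):
    """Mean pairwise relevance among S_t, the V1 nodes linked to t."""
    s_t = adj[t]
    if len(s_t) < 2:
        return None  # undefined when there are no off-diagonal entries
    rel = {a: rwr_relevance(adj, a) for a in s_t}
    off_diag = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diag) / len(off_diag)

# Two tight groups (a1-a3 with t1, t2; a4-a6 with t3, t4) and one node t5
# that links a1 and a4 across the groups.
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2")]
edges += [(a, t) for a in ("a4", "a5", "a6") for t in ("t3", "t4")]
edges += [("a1", "t5"), ("a4", "t5")]
adj = {}
for a, t in edges:
    adj.setdefault(a, []).append(t)
    adj.setdefault(t, []).append(a)

score_t1 = normality_score(adj, "t1")
score_t5 = normality_score(adj, "t5")
print(score_t1 > score_t5)
```

Here t1 is linked only to nodes that are highly relevant to one another, while t5 links two nodes from otherwise separate groups and therefore receives the lower normality score.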


Detecting Disease Outbreaks

Domain Question
- Early detection of disease outbreaks
- Anthrax attack

Data
- Emergency department (ED) visits
- Grocery data (Example [14])
- Clinical visits
- Weather/climate data

Challenges
- Weak signals in the data
  - ED cases involving cough ⇒ Flu or SARS?
- Integration of multiple signals (lag analysis)
- Account for spatial and temporal correlations


What's Strange About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data
   - Environmental and response variables
2. Sample from the BN (DB_baseline | Current Environment)
3. Compute contingency tables for rules over DB_baseline and DB_current
   - Rules are single-assignment rules (X_i = Y_i^j) or conjunctions
4. Find the p-value for each rule using a χ²-test
   - Null hypothesis: rows and columns of the contingency table for DB_baseline and DB_current are independent
5. Find the rule with the smallest p-value (the most significant rule). Repeat Step 2
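Steps 3 to 5 can be sketched for single-assignment rules as below. The Bayesian-network sampling of steps 1 and 2 is not shown: the baseline counts are simply assumed to be given. The rule names and counts are invented, and the 2x2 χ² statistic is used without a continuity correction.

```python
import math

def chi2_p_value(table):
    """p-value of a chi-square independence test on a 2x2 contingency table.

    table = [[a, b], [c, d]]: rows are DB_baseline and DB_current, columns are
    records that match the rule and records that do not.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    margins = (a + b, c + d, a + c, b + d)
    if 0 in margins:
        return 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / (margins[0] * margins[1] * margins[2] * margins[3])
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical rules with counts [matches, non-matches] for baseline and current data.
rule_tables = {
    "symptom = cough": [[100, 900], [30, 70]],   # 10% in baseline vs 30% today
    "symptom = fever": [[50, 950], [6, 94]],     # 5% in baseline vs 6% today
}

p_values = {rule: chi2_p_value(tbl) for rule, tbl in rule_tables.items()}
best_rule = min(p_values, key=p_values.get)
print(best_rule, p_values[best_rule])
```

Because many rules are scored at once, the smallest p-value is optimistic; a correction for multiple testing would be needed before reporting it, and that step is omitted here.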


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model spatial and temporal relationships
- What happened yesterday?
- What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
- Add nodes for temporal and spatial dependencies
- Need more data to train


Anomaly Detection in Climate and Weather

Science Questions
- Identify natural and anthropogenic disasters
- Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.

Data
- Ground observations
- Remote sensing data (satellites, airborne)
- Climate model simulation outputs
- Multiple variables, spatio-temporal (often with a height dimension as well)

Challenges
- Model spatio-temporal relationships across multiple variables
- Explain the cause of anomalies
- Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series
- Difference from a "base period" (Too simplistic)
- Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
- Removes seasonality

Understand climate and weather phenomena
- Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

$$v'_i = v_i - b_i$$

where b_i is the base (reference)

How to choose b_i?
- Mean of all data for location i
- Monthly mean values (account for seasonality)
- Monthly z-score values
- Median (more robust)
- Using a shorter "reference period"
  - 30 year moving window

Different methods show statistically significant differences [19]
- What is the right strategy?
- Weighted mean of different strategies (pick weights using Monte Carlo sampling)
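Two of the base choices listed above, monthly means and monthly z-scores, can be sketched as follows. The function name `monthly_anomalies`, the assumption that the series starts in January and covers at least one full year, and the toy data are all illustrative.

```python
from statistics import mean, pstdev

def monthly_anomalies(values, zscore=False):
    """Subtract the per-calendar-month mean from a monthly series.

    values: monthly observations for one location, starting in January and
    covering at least twelve months. With zscore=True each anomaly is also
    divided by that month's standard deviation.
    """
    by_month = [[] for _ in range(12)]
    for t, v in enumerate(values):
        by_month[t % 12].append(v)
    base = [mean(m) for m in by_month]
    # A zero standard deviation is replaced by 1.0 to avoid division by zero.
    scale = [(pstdev(m) or 1.0) if zscore else 1.0 for m in by_month]
    return [(v - base[t % 12]) / scale[t % 12] for t, v in enumerate(values)]

# Two years with the same seasonal cycle; the second year is uniformly 2 units higher.
year1 = [float(m) for m in range(12)]
year2 = [v + 2.0 for v in year1]
anomalies = monthly_anomalies(year1 + year2)
z_anomalies = monthly_anomalies(year1 + year2, zscore=True)
print(anomalies[:3], anomalies[12:15])
```

The seasonal cycle disappears from the output and only the year-to-year offset remains, which is the seasonality removal that motivates working with anomaly series.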


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
1. Find precipitation anomalies using thresholds
2. Find large connected components across space and time
   - Matlab: bwlabel, bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
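The two-step approach can be sketched without Matlab as below. The sketch assumes a precipitation anomaly array indexed as [time][row][column], uses 6-connectivity (neighbors along time and the two spatial axes only), and keeps components above a minimum size; the function name, threshold, and toy data are hypothetical.

```python
from collections import deque

def drought_components(anomaly, threshold, min_size):
    """Threshold a [time][row][col] anomaly grid, then label connected components."""
    # Step 1: cells whose precipitation anomaly falls below the threshold.
    dry = {
        (t, i, j)
        for t, grid in enumerate(anomaly)
        for i, row in enumerate(grid)
        for j, value in enumerate(row)
        if value < threshold
    }
    # Step 2: breadth-first search over space and time.
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    seen, components = set(), []
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            t, i, j = queue.popleft()
            component.append((t, i, j))
            for dt, di, dj in steps:
                nb = (t + dt, i + di, j + dj)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        if len(component) >= min_size:
            components.append(component)
    return components

# Three time steps on a 3x3 grid; 0.0 means normal precipitation.
anomaly = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
for t in range(3):
    anomaly[t][0][0] = -2.0      # one cell stays dry for all three time steps
anomaly[1][0][1] = -2.0          # the dry area spreads to a neighbor at t = 1
anomaly[0][2][2] = -2.0          # an isolated, short-lived dry cell

components = drought_components(anomaly, threshold=-1.0, min_size=3)
print(len(components), sorted(components[0]))
```

The isolated cell is discarded by the size filter, so only the persistent, spatially connected dry region is reported.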


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
- Water vapor content
- Anomalies using a threshold
- Connected components
- Example src: http://newscenter.lbl.gov

Cold Fronts [23]
- Surface winds and potential temperature fields

Methodology
1. Compute features for every grid cell
2. Cluster grid cells into K clusters
3. Label clusters as anomalous or not using thresholds
4. Filter out false positives using domain knowledge
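Steps 1 to 3 of the methodology can be illustrated with a deliberately reduced example. It assumes a single scalar feature per grid cell (for instance a temperature-gradient magnitude, which is a stand-in and not the feature set of [23]), a plain one-dimensional k-means with deterministic initialization, and a made-up threshold on the cluster centers. Step 4, the domain-knowledge filter, is not shown.

```python
def kmeans_1d(values, k, iters=50):
    """Plain k-means on scalar features; returns (centers, cluster index per value)."""
    ordered = sorted(values)
    # Deterministic initialization: evenly spaced order statistics.
    centers = [ordered[(2 * c + 1) * len(ordered) // (2 * k)] for c in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centers[c]))

    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            groups[nearest(v)].append(v)
        # Keep the old center when a cluster ends up empty.
        centers = [sum(g) / len(g) if g else centers[c] for c, g in enumerate(groups)]
    return centers, [nearest(v) for v in values]

# Step 1: one hypothetical feature value per grid cell.
features = [0.1, 0.2, 0.15, 0.1, 0.2, 2.0, 2.2, 2.1, 0.1]

# Step 2: cluster the grid cells into K clusters.
centers, assignment = kmeans_1d(features, k=2)

# Step 3: label a cluster as anomalous when its center exceeds a threshold.
threshold = 1.0
anomalous_clusters = {c for c, center in enumerate(centers) if center > threshold}
flagged_cells = [idx for idx, c in enumerate(assignment) if c in anomalous_clusters]
print(flagged_cells)
```

Thresholding cluster centers rather than individual cells means a cell is flagged only together with the group of similar cells it belongs to.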


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
- Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
- Anomalies are not actionable
  - Not identified in a timely fashion
  - Resolution is not fine enough
  - Cause not explained
- Anomalies lost among false positives

Solution
- Good validation data during design
- Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the Ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8): pp. 5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

A Generic Fraud Detection Method

Activity Monitoring [13]1 Build profiles for individuals (customers users etc) based on

historic data User X makes n calls on an average in January

2 Compare current behavior with historical profile for significantdeviations

Clustering based [4]1 Cluster historical profiles of customers2 Identify small clusters or outlying profiles as anomalies

Strengths Anomaly detection is fast (good for real time) Results are easy to explain

Weaknesses Need to create and maintain a large number of profiles Not dynamic Adequate historical data might not be available Too many false positives

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph Communication networks (phone email SMS) Provider referral networks

Objective Identify anomalous nodes

For each node extract several features based on the propertiesof the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes

1 Ni degree of node2 Ei number of edges in egonet3 Wi total weight of egonet4 λi principal eigen vector of weighted adjacency matrix of

egonet

Data is transformed into a point in a multi-dimensional space

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Identifying Anomalies

Traditional anomaly detection (lof) Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalousstructures

Identify relevant feature pairs and power law relationship

Eg Egonet Density Power Law Ni vs Ei - detect nearcliques and stars

Ei prop Nαi 1 le α le 2

Anomaly score for node i wrt a pair of features (y = Cxθ)

Si =max(yi Cx

θi )

min(yi Cxθi )lowast log (|yi minus Cxθi |+ 1)

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph Healthcare Data (Beneficiaries vs Providers) Insider trading (Traders vs Stocks)

Objective Identify anomalous links

Given a query node a isin V1 find the ldquorelevancerdquoof all other nodes in V1 to a

RelevanceScore(a b) prop Number of times aldquorandom walkrdquo from a reaches b

Use the relevance scores to compute thenormality scores for a node t isin V2

Find set St = a|〈a t〉 isin E Compute |St| times |St| similarity matrix using

relevance vectors for a isin St Normality Score = mean of non-diagonal entries

of similarity matrix

V1 E V2a1

a2

a3

an

t1

t2

t3

tn

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Detecting Disease Outbreaks

Domain Question Early detection of disease

outbreaks Anthrax attack

Data Emergency department visits Grocery data (Example [14]) Clinical visits Weatherclimate data

Challenges Weak signals in the data

ED cases involving cough rArrFlu or SARS

Integration of multiple signals(lag analysis)

Account for spatial and temporalcorrelations

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Whatrsquos Significant About Recent Events (WSARE) [33]

1 Learn Bayesian network from historical data Environmental and response variables

2 Sample from the BN (DBbaseline|Current Environment)3 Compute contingency table for rules for DBbaseline andDBcurrent

Rules are single assignment rules (Xi = Y ji ) or conjunctions4 Find p-value for rules using χ2-test

Null Hypothesis Rows and columns of tables for DBbaselineand DBcurrent are independent

5 Find rule with largest p-value Repeat Step 2

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporalrelationships

What happened yesterday What happened in the adjoining neighborhood (yesterday)

Bayesian Network Spatio-Temporal (BNST) modelingframework [17]

Add nodes for temporal and spatial dependencies Need more data to train

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomaly Detection in Climate and Weather

Science Questions Identify natural and anthropogenic disasters Identify long time scale events - droughts atmospheric rivers

cold fronts etc

Data Ground observations Remote sensing data (satellites

air-borne) Climate model simulation outputs Multiple variables spatio-temporal (often has height dimension

as well)

Challenges Model spatio-temporal relationships across multiple variables Explain the cause of anomalies Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomalies are Widely Used in Climate

Most analysis done on ldquoanomalyrdquo time series

Difference from a ldquobase periodrdquo (Toosimplistic)

Brings spatial smoothness (eg a mountaintop and nearby valley can have very differenttemperatures) Removes seasonality

Understand climate and weatherphenomenon

Southern Oscillation Index (SOI) Difference between Sea Level Pressure

(SLP) anomalies for Tahiti and DarwinAustralia

Figure Global AverageTemperature Anomaly(1975 - 2007) Srcwwwmetofficegovuk


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

  $v'_i = v_i - b_i$

where $b_i$ is the base (reference).

How to choose $b_i$?
  • Mean of all data for location i
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
  • 30 year moving window

Different methods show statistically significant differences [19]

What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
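The monthly-mean choice of base can be sketched in a few lines. This is only an illustration for a single location: the function name `monthly_anomalies`, the toy temperatures, and the option of passing a different `base` function (for example the median) are assumptions made for the example, not part of the tutorial.

```python
from collections import defaultdict
from statistics import mean, median

def monthly_anomalies(values, months, base=mean):
    """Return v' = v - b, where b is the base value of each sample's calendar month."""
    by_month = defaultdict(list)
    for v, m in zip(values, months):
        by_month[m].append(v)
    # One base value per calendar month (accounts for seasonality).
    base_by_month = {m: base(vs) for m, vs in by_month.items()}
    return [v - base_by_month[m] for v, m in zip(values, months)]

# Toy series: two Januaries (month 1) and two Julys (month 7).
values = [2.0, 20.0, 4.0, 24.0]
months = [1, 7, 1, 7]

anoms = monthly_anomalies(values, months)            # monthly mean as base
robust = monthly_anomalies(values, months, median)   # median as base (more robust)
print(anoms)
```

Subtracting a per-month base removes the seasonal cycle, so the warm July values and cold January values end up on the same scale.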


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data or future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
     • Matlab: bwlabel, bwlabeln

Follow-up science question: Explain cause

Figure: drought_1901_2006_1.avi (video courtesy of Dr. Arindam Banerjee)
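The two steps can be sketched in plain Python as a stand-in for Matlab's bwlabeln. The names `label_components` and `drought_events`, the toy grid, the threshold, and the use of 6-connectivity (face neighbours in time, row and column) are all assumptions for illustration; a real analysis would choose the connectivity and thresholds from domain knowledge.

```python
from collections import deque

def label_components(mask):
    """Label 6-connected components of True cells in a [time][row][col] grid."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * C for _ in range(R)] for _ in range(T)]
    sizes = {}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    next_label = 0
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or labels[t][r][c]:
                    continue
                # Start a new component and flood-fill it breadth-first.
                next_label += 1
                labels[t][r][c] = next_label
                queue = deque([(t, r, c)])
                size = 0
                while queue:
                    ct, cr, cc = queue.popleft()
                    size += 1
                    for dt, dr, dc in steps:
                        nt, nr, nc = ct + dt, cr + dr, cc + dc
                        if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                                and mask[nt][nr][nc] and not labels[nt][nr][nc]):
                            labels[nt][nr][nc] = next_label
                            queue.append((nt, nr, nc))
                sizes[next_label] = size
    return labels, sizes

def drought_events(precip_anomaly, threshold, min_size):
    """Step 1: threshold the anomalies. Step 2: keep large space-time components."""
    mask = [[[v < threshold for v in row] for row in grid] for grid in precip_anomaly]
    labels, sizes = label_components(mask)
    keep = {lab for lab, n in sizes.items() if n >= min_size}
    return labels, sizes, keep

# Toy precipitation anomalies: 2 time steps on a 2 x 3 grid.
precip = [
    [[-2.0, -2.0, 0.0], [0.0, 0.0, 0.0]],
    [[-2.0, 0.0, 0.0], [0.0, 0.0, -2.0]],
]
labels, sizes, keep = drought_events(precip, threshold=-1.0, min_size=2)
print(sizes, keep)
```

In the toy data the dry cells in the top-left corner persist across both time steps and form one component, while the isolated dry cell at the second time step is dropped by the size filter.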


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example. Src: http://newscenter.lbl.gov

Cold Fronts [23]
  • Surface winds and potential temperature fields
  • Methodology
    1. Compute features for every grid
    2. Cluster grids into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls
  • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives

Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)



A Generic Fraud Detection Method

Activity Monitoring [13]
  1. Build profiles for individuals (customers, users, etc.) based on historic data
     • User X makes n calls on average in January
  2. Compare current behavior with historical profile for significant deviations

Clustering based [4]
  1. Cluster historical profiles of customers
  2. Identify small clusters or outlying profiles as anomalies

Strengths
  • Anomaly detection is fast (good for real time)
  • Results are easy to explain

Weaknesses
  • Need to create and maintain a large number of profiles
  • Not dynamic
  • Adequate historical data might not be available
  • Too many false positives
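A minimal sketch of the activity-monitoring idea, assuming a profile is just the mean and sample standard deviation of a user's past counts for the same period, and that a "significant deviation" is a z-score above a chosen cutoff. The names `build_profile` and `deviation_score`, the cutoff of 3, and the call counts are hypothetical.

```python
from statistics import mean, stdev

def build_profile(history):
    """history: {user: [count in the same period of past years]} -> {user: (mean, std)}.
    Each user needs at least two historical counts for the sample std."""
    return {user: (mean(counts), stdev(counts)) for user, counts in history.items()}

def deviation_score(profile, user, current):
    """Absolute z-score of the current count against the user's historical profile."""
    mu, sigma = profile[user]
    if sigma == 0:
        return 0.0 if current == mu else float("inf")
    return abs(current - mu) / sigma

# Hypothetical: calls made by user X in January over five past years.
history = {"X": [10, 12, 11, 9, 13]}
profile = build_profile(history)

CUTOFF = 3.0  # assumed threshold for a "significant" deviation
for current in (12, 30):
    score = deviation_score(profile, "X", current)
    print(current, round(score, 2), "ANOMALY" if score > CUTOFF else "ok")
```

Scoring is a dictionary lookup and one division per event, which is why this style of detection suits real-time use; the cost is in keeping every profile current.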


Exploiting Graph Structure - Weighted Graphs

Represent data as a weighted graph
  • Communication networks (phone, email, SMS)
  • Provider referral networks

Objective: Identify anomalous nodes

For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]

Choose features that can highlight anomalous nodes
  1. $N_i$: degree of node
  2. $E_i$: number of edges in egonet
  3. $W_i$: total weight of egonet
  4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of egonet

Each node is transformed into a point in a multi-dimensional space
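The first three egonet features can be computed directly from an adjacency map, as in the sketch below. The helper names `build_graph` and `egonet_features` and the toy edge list are assumptions; the eigenvalue feature is left out because it needs an eigen-solver.

```python
def build_graph(edges):
    """Build an undirected weighted graph {node: {neighbor: weight}} from (u, v, w) triples."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, {})[v] = w
        adj.setdefault(v, {})[u] = w
    return adj

def egonet_features(adj, ego):
    """Return (N_i, E_i, W_i) for the egonet of `ego`: the ego, its neighbors,
    and every edge among them."""
    nodes = {ego} | set(adj[ego])
    n_i = len(adj[ego])              # degree of the ego
    e_i, w_i = 0, 0.0
    for u in nodes:
        for v, w in adj[u].items():
            if v in nodes and u < v:  # count each undirected edge once
                e_i += 1
                w_i += w
    return n_i, e_i, w_i

# Toy graph: node "e" lies outside the egonet of "a".
adj = build_graph([
    ("a", "b", 1.0), ("a", "c", 2.0), ("a", "d", 3.0),
    ("b", "c", 4.0), ("d", "e", 5.0),
])
features = egonet_features(adj, "a")
print(features)
```

Running this for every node yields one feature vector per node, which is the multi-dimensional representation that later scoring works on.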


Identifying Anomalies

Traditional anomaly detection (LOF)
  • Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalous structures
  • Identify relevant feature pairs and power law relationship
  • E.g., Egonet Density Power Law: $N_i$ vs. $E_i$ (detects near-cliques and stars)

  $E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

Anomaly score for node i w.r.t. a pair of features ($y = Cx^{\theta}$):

  $S_i = \frac{\max(y_i, Cx_i^{\theta})}{\min(y_i, Cx_i^{\theta})} \cdot \log(|y_i - Cx_i^{\theta}| + 1)$
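The score is easy to evaluate once the power law has been fitted. In this sketch the fitted constants `C` and `theta` are simply supplied by hand, the function name `outlier_score` is hypothetical, and the logarithm is taken to be the natural log, which the slide does not specify.

```python
import math

def outlier_score(x, y, C, theta):
    """S = max(y, C*x^theta) / min(y, C*x^theta) * log(|y - C*x^theta| + 1).
    Assumes y > 0 and C * x**theta > 0."""
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)

# Assumed fit E = N (C = 1, theta = 1), i.e. the law followed by star-like egonets.
on_the_line = outlier_score(x=10, y=10, C=1.0, theta=1.0)   # follows the law
off_the_line = outlier_score(x=10, y=45, C=1.0, theta=1.0)  # far denser than expected
print(on_the_line, off_the_line)
```

The ratio term measures how many times the observed value differs from the fitted one, and the log term adds the absolute distance, so a node exactly on the fitted line scores zero.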


Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)

Objective: Identify anomalous links

Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to a
  • RelevanceScore(a, b) $\propto$ number of times a "random walk" from a reaches b

Use the relevance scores to compute the normality scores for a node $t \in V_2$
  • Find set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality Score = mean of non-diagonal entries of similarity matrix

Figure: bipartite graph with node sets $V_1$ (a1, a2, a3, ..., an) and $V_2$ (t1, t2, t3, ..., tn) joined by edges E
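One way to realise the relevance and normality scores is a random walk with restart, sketched below. Several choices here are assumptions rather than the tutorial's specification: the restart probability of 0.15, the fixed number of power iterations, taking the similarity entry for (a, b) to be the relevance of b with respect to a, and the toy graph. Every node is assumed to have at least one neighbor.

```python
def relevance(adj, query, restart=0.15, iters=100):
    """Stationary visit probabilities of a random walk that restarts at `query`."""
    p = {u: 0.0 for u in adj}
    p[query] = 1.0
    for _ in range(iters):
        nxt = {u: 0.0 for u in adj}
        for u, mass in p.items():
            if mass == 0.0:
                continue
            share = (1.0 - restart) * mass / len(adj[u])  # move to a uniform random neighbor
            for v in adj[u]:
                nxt[v] += share
        nxt[query] += restart                             # jump back to the query node
        p = nxt
    return p

def normality_score(adj, t, restart=0.15):
    """Mean off-diagonal relevance among the neighbors S_t of node t."""
    s_t = sorted(adj[t])
    if len(s_t) < 2:
        return None
    rel = {a: relevance(adj, a, restart) for a in s_t}
    pairs = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(pairs) / len(pairs)

# Toy bipartite graph: t1 and t2 link a1-a2, t3 and t4 link a3-a4, tx bridges the two groups.
links = {"t1": ["a1", "a2"], "t2": ["a1", "a2"],
         "t3": ["a3", "a4"], "t4": ["a3", "a4"], "tx": ["a1", "a4"]}
adj = {}
for t, members in links.items():
    for a in members:
        adj.setdefault(t, set()).add(a)
        adj.setdefault(a, set()).add(t)

print(normality_score(adj, "t1"), normality_score(adj, "tx"))
```

In the toy graph tx connects two nodes from otherwise separate groups, so its neighbors are less relevant to each other and it receives a lower normality score than t1.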


Detecting Disease Outbreaks

Domain Question
  • Early detection of disease outbreaks
  • Anthrax attack

Data
  • Emergency department visits
  • Grocery data (Example [14])
  • Clinical visits
  • Weather/climate data

Challenges
  • Weak signals in the data
    • ED cases involving cough ⇒ Flu or SARS?
  • Integration of multiple signals (lag analysis)
  • Account for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

  1. Learn Bayesian network from historical data
     • Environmental and response variables
  2. Sample from the BN ($DB_{baseline}$ | Current Environment)
  3. Compute contingency table for rules for $DB_{baseline}$ and $DB_{current}$
     • Rules are single assignment rules ($X_i = Y_i^j$) or conjunctions
  4. Find p-value for rules using $\chi^2$-test
     • Null Hypothesis: rows and columns of tables for $DB_{baseline}$ and $DB_{current}$ are independent
  5. Find the most significant rule (smallest p-value). Repeat Step 2
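Step 4 can be illustrated for a single rule with a 2 × 2 contingency table. This is a sketch of the plain Pearson χ² test with one degree of freedom and no continuity correction; the function name `chi2_pvalue_2x2` and the counts are made up for the example, and all expected cell counts are assumed to be positive.

```python
import math

def chi2_pvalue_2x2(table):
    """Pearson chi-square statistic and p-value (1 degree of freedom) for a 2x2 table.
    Rows: DB_current, DB_baseline. Columns: rule matches, rule does not match."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical rule: 30 of 100 current records match, versus 100 of 1000 baseline records.
chi2, p = chi2_pvalue_2x2([[30, 70], [100, 900]])
print(round(chi2, 2), p)

# Same match rate in both data sets: no evidence against independence.
chi2_same, p_same = chi2_pvalue_2x2([[10, 90], [100, 900]])
print(chi2_same, p_same)
```

A small p-value means the rule's match rate today differs from the baseline by more than chance would explain, which is why the rule with the smallest p-value is the one reported.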


Incorporating Spatial and Temporal Relationships

WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?

Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


S Ramaswamy R Rastogi and K Shim

Efficient algorithms for mining outliers from large data setsIn Proceedings of the 2000 ACM SIGMOD international conference on Management of data pages427ndash438 ACM Press 2000

N Satish M Harris and M Garland

Designing efficient sorting algorithms for manycore gpusIn IPDPS pages 1ndash10 2009

X Song M Wu C Jermaine and S Ranka

Conditional anomaly detectionIEEE Trans on Knowl and Data Eng 19631ndash645 May 2007

D A Spielman and N Srivastava

Graph sparsification by effective resistancesIn Proceedings of the 40th annual ACM symposium on Theory of computing STOC rsquo08 pages 563ndash568New York NY USA 2008 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

P Sun S Chawla and B Arunasalam

Mining for outliers in sequential databasesIn In SIAM International Conference on Data Mining 2006

W-K Wong A Moore G Cooper and M Wagner

Bayesian network anomaly pattern detection for disease outbreaksIn T Fawcett and N Mishra editors Proceedings of the Twentieth International Conference on MachineLearning pages 808ndash815 Menlo Park California August 2003 AAAI Press

L Xiong B Poczos and J Schneider

Group anomaly detection using flexible genre modelsIn NIPS 2011

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Acknowledgements

Linsey Pang Tara Babie and Khoa Nguyen (University ofSydney)

Arindam Banerjee (University of Minnesota)

Chawla and Chandola Anomaly Detection

  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemeology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies
Page 63: Anomaly Detection: A Tutorial - webdocs.cs.ualberta.cawebdocs.cs.ualberta.ca/~icdm2011/downloads/ICDM2011_anomaly_detection... · IntroductionData Mining MethodsDensity-based anomalyCommute

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Weighted Graphs

  • Represent data as a weighted graph
    • Communication networks (phone, email, SMS)
    • Provider referral networks
  • Objective: identify anomalous nodes
  • For each node, extract several features based on the properties of the induced sub-graph (egonet) of neighboring nodes [1]
  • Choose features that can highlight anomalous nodes
    1. $N_i$: degree of node $i$
    2. $E_i$: number of edges in the egonet
    3. $W_i$: total weight of the egonet
    4. $\lambda_i$: principal eigenvalue of the weighted adjacency matrix of the egonet
  • Each node is transformed into a point in a multi-dimensional feature space
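The four egonet features can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [1]: the function name `egonet_features`, the `(u, v, weight)` edge format, and the use of shifted power iteration for the principal eigenvalue are all assumptions made for this example.

```python
from collections import defaultdict


def egonet_features(edges, node, iters=200):
    """Return (N_i, E_i, W_i, lambda_i) for the egonet of `node`.

    `edges` is an iterable of (u, v, weight) triples of an undirected graph
    (hypothetical input format; zero-weight edges are treated as absent).
    """
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w

    ego = [node] + list(adj[node])            # the node plus its 1-hop neighbors
    idx = {u: k for k, u in enumerate(ego)}
    size = len(ego)

    # Weighted adjacency matrix of the induced sub-graph (the egonet).
    a = [[0.0] * size for _ in range(size)]
    for u in ego:
        for v, w in adj[u].items():
            if v in idx:
                a[idx[u]][idx[v]] = float(w)

    n_i = len(adj[node])
    e_i = sum(1 for r in range(size) for c in range(r + 1, size) if a[r][c] != 0.0)
    w_i = sum(a[r][c] for r in range(size) for c in range(r + 1, size))

    # Power iteration on (A + I); the shift avoids oscillation on star-like
    # egonets, and the Rayleigh quotient is taken with respect to A itself.
    x = [1.0] * size
    lam = 0.0
    for _ in range(iters):
        y = [x[r] + sum(a[r][c] * x[c] for c in range(size)) for r in range(size)]
        norm = sum(val * val for val in y) ** 0.5
        x = [val / norm for val in y]
        lam = sum(x[r] * a[r][c] * x[c] for r in range(size) for c in range(size))
    return n_i, e_i, w_i, lam
```

Calling `egonet_features(edges, i)` for every node `i` yields one feature vector per node, which is the multi-dimensional representation the slide refers to.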

Identifying Anomalies

  • Traditional anomaly detection (LOF)
    • Can be slow, but can identify any type of anomalous structure
  • Faster method to identify specific types of anomalous structures
    • Identify relevant feature pairs and their power-law relationship
    • E.g., Egonet Density Power Law ($N_i$ vs. $E_i$): detects near-cliques and stars

      $E_i \propto N_i^{\alpha}, \quad 1 \le \alpha \le 2$

  • Anomaly score for node $i$ with respect to a pair of features ($y = C x^{\theta}$):

      $S_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log\left(|y_i - C x_i^{\theta}| + 1\right)$
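A small sketch of the score computation, assuming the power law $y = C x^{\theta}$ is fitted by ordinary least squares in log-log space (the fitting procedure is an assumption of this example, as are the names `fit_power_law` and `oddball_score`). All feature values are assumed to be strictly positive.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log y = log C + theta * log x; returns (C, theta)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    theta = sxy / sxx
    c = math.exp(my - theta * mx)
    return c, theta


def oddball_score(x, y, c, theta):
    """S = max(y, C x^theta) / min(y, C x^theta) * log(|y - C x^theta| + 1)."""
    expected = c * x ** theta
    return max(y, expected) / min(y, expected) * math.log(abs(y - expected) + 1)
```

A node that lies exactly on the fitted line scores 0; the score grows with both the ratio and the absolute distance between the observed and expected feature value.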

Exploiting Graph Structure - Bipartite Graphs

  • Represent data as a bipartite graph
    • Healthcare data (beneficiaries vs. providers)
    • Insider trading (traders vs. stocks)
  • Objective: identify anomalous links
  • Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
    • $\mathrm{RelevanceScore}(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$
  • Use the relevance scores to compute the normality score for a node $t \in V_2$
    • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
    • Compute the $|S_t| \times |S_t|$ similarity matrix using the relevance vectors for $a \in S_t$
    • Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: Bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ connected by edges $E$
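The relevance and normality scores can be illustrated as follows. This is a sketch under stated assumptions, not the method of any cited paper: relevance is approximated by a random walk with restart (the restart probability `restart=0.15` is an arbitrary choice), and the similarity-matrix entry for a pair $(a, b)$ is taken to be the relevance of $b$ with respect to $a$. Both function names are hypothetical.

```python
def relevance_scores(edges, query, restart=0.15, iters=100):
    """Random walk with restart from `query` (a V1 node).

    `edges` is a list of (a, t) pairs with a in V1 and t in V2.
    Returns {V1 node: stationary visiting probability}.
    """
    adj = {}
    for a, t in edges:
        adj.setdefault(("V1", a), []).append(("V2", t))
        adj.setdefault(("V2", t), []).append(("V1", a))
    start = ("V1", query)
    p = {n: 0.0 for n in adj}
    p[start] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, mass in p.items():
            # Spread the non-restarting mass evenly over the neighbors.
            share = (1.0 - restart) * mass / len(adj[n])
            for m in adj[n]:
                nxt[m] += share
        nxt[start] += restart             # restarting mass returns to the query
        p = nxt
    return {name: prob for (side, name), prob in p.items() if side == "V1"}


def normality_score(edges, t):
    """Mean off-diagonal relevance among the V1 neighbors of t (None if fewer than 2)."""
    s_t = list(dict.fromkeys(a for a, tt in edges if tt == t))
    if len(s_t) < 2:
        return None                       # no off-diagonal entries to average
    rel = {a: relevance_scores(edges, a) for a in s_t}
    off_diag = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diag) / len(off_diag)
```

Under these assumptions, a low normality score for `t` suggests that `t` links nodes of $V_1$ that are otherwise weakly related, which makes its links candidates for inspection.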

Detecting Disease Outbreaks

  • Domain question: early detection of disease outbreaks
    • Anthrax attack
  • Data
    • Emergency department (ED) visits
    • Grocery data (example: [14])
    • Clinical visits
    • Weather/climate data
  • Challenges
    • Weak signals in the data
      • ED cases involving cough ⇒ flu or SARS
    • Integration of multiple signals (lag analysis)
    • Account for spatial and temporal correlations

What's Significant About Recent Events (WSARE) [33]

  1. Learn a Bayesian network (BN) from historical data
    • Environmental and response variables
  2. Sample from the BN ($DB_{baseline}$ | current environment)
  3. Compute contingency tables for rules over $DB_{baseline}$ and $DB_{current}$
    • Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions
  4. Find the p-value for each rule using a $\chi^2$-test
    • Null hypothesis: rows and columns of the tables for $DB_{baseline}$ and $DB_{current}$ are independent
  5. Find the rule with the smallest p-value (the most significant rule); repeat Step 2
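The p-value computation for a single rule can be sketched as below, assuming a 2x2 contingency table whose rows are $DB_{current}$ and $DB_{baseline}$ and whose columns count the records that match and do not match the rule. The table layout and the function name `chi_square_p_value` are assumptions of this example; all row and column totals are assumed to be non-zero.

```python
import math


def chi_square_p_value(table):
    """p-value of Pearson's chi-square test on a 2x2 contingency table.

    table = [[a, b], [c, d]] with rows = (current, baseline) and
    columns = (matches rule, does not match rule).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row = (a + b, c + d)
    col = (a + c, b + d)
    stat = 0.0
    for r, obs_row in enumerate(table):
        for k, obs in enumerate(obs_row):
            expected = row[r] * col[k] / n    # count expected under independence
            stat += (obs - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))
```

A small p-value means the rule's match rate in the current data differs from the baseline by more than independence would explain.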

Incorporating Spatial and Temporal Relationships

  • WSARE does not explicitly model spatial and temporal relationships
    • What happened yesterday?
    • What happened in the adjoining neighborhood (yesterday)?
  • Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
    • Add nodes for temporal and spatial dependencies
    • Needs more data to train

Anomaly Detection in Climate and Weather

  • Science questions
    • Identify natural and anthropogenic disasters
    • Identify long-time-scale events: droughts, atmospheric rivers, cold fronts, etc.
  • Data
    • Ground observations
    • Remote sensing data (satellites, airborne)
    • Climate model simulation outputs
    • Multiple variables, spatio-temporal (often with a height dimension as well)
  • Challenges
    • Model spatio-temporal relationships across multiple variables
    • Explain the cause of anomalies
    • Massive data sizes
  • Climate and weather extreme events are well defined
  • Key challenge is to find significant events and explain the cause

Anomalies are Widely Used in Climate

  • Most analysis is done on "anomaly" time series
    • Difference from a "base period" (too simplistic)
    • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
    • Removes seasonality
  • Understand climate and weather phenomena
    • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk

Constructing Anomalies from Raw Data

  • Anomaly time series for a given location $i$:

      $v'_i = v_i - b_i$

    where $b_i$ is the base (reference)
  • How to choose $b_i$?
    • Mean of all data for location $i$
    • Monthly mean values (account for seasonality)
    • Monthly z-score values
    • Median (more robust)
    • Using a shorter "reference period"
      • 30-year moving window
  • Different methods show statistically significant differences [19]
    • What is the right strategy?
    • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
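Two of the base choices listed above (monthly means and monthly z-scores) can be sketched as follows. The function name `monthly_anomalies` and the input convention (a plain list of monthly values that starts in January) are assumptions of this example; the series is assumed to cover every calendar month at least once, and the z-score variant needs a non-zero spread in each month.

```python
from statistics import mean, pstdev


def monthly_anomalies(values, use_zscore=False):
    """Subtract the per-calendar-month mean from a monthly series.

    values[k] is assumed to belong to calendar month k % 12.
    With use_zscore=True the anomaly is also divided by the monthly
    (population) standard deviation.
    """
    by_month = [values[m::12] for m in range(12)]
    base = [mean(v) for v in by_month]        # b_i, one base value per month
    spread = [pstdev(v) for v in by_month]
    out = []
    for k, v in enumerate(values):
        m = k % 12
        anomaly = v - base[m]                 # v'_i = v_i - b_i
        if use_zscore:
            anomaly /= spread[m]
        out.append(anomaly)
    return out
```

Because each month is compared only against the same calendar month in other years, the seasonal cycle is removed from the resulting series.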

Anomaly Detection for Identifying Droughts

  • Science question: identify significant drought patterns using historical observation data, future simulation data, or both
  • Find persistent spatio-temporal anomalies in precipitation data
  • A two-step approach
    1. Find precipitation anomalies using thresholds
    2. Find large connected components across space and time
      • Matlab: bwlabel, bwlabeln
  • Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
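The two-step approach can be sketched in plain Python as a threshold followed by a breadth-first connected-component search over a (time, row, column) grid. This is an illustrative stand-in for a labeling routine such as bwlabeln, not a reproduction of it: the function name `label_components`, the nested-list input, and the choice of face connectivity (6 neighbors) are assumptions of this example.

```python
from collections import deque


def label_components(cube, threshold):
    """Group cells with cube[t][y][x] < threshold into connected components.

    Cells are neighbors if they differ by one step in time, row, or column.
    Returns a list of sets of (t, y, x) cells, largest component first.
    """
    nt, ny, nx = len(cube), len(cube[0]), len(cube[0][0])
    # Step 1: threshold the precipitation anomalies.
    dry = {(t, y, x)
           for t in range(nt) for y in range(ny) for x in range(nx)
           if cube[t][y][x] < threshold}
    steps = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    # Step 2: breadth-first search over space and time.
    seen = set()
    components = []
    for cell in sorted(dry):
        if cell in seen:
            continue
        seen.add(cell)
        queue = deque([cell])
        comp = set()
        while queue:
            t, y, x = queue.popleft()
            comp.add((t, y, x))
            for dt, dy, dx in steps:
                nb = (t + dt, y + dy, x + dx)
                if nb in dry and nb not in seen:
                    seen.add(nb)
                    queue.append(nb)
        components.append(comp)
    components.sort(key=len, reverse=True)
    return components
```

Large components that span many time steps correspond to persistent anomalies; small isolated ones can be discarded with a size cutoff.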

More Climate Extreme Events using Anomaly Detection

  • Atmospheric rivers [8]
    • Water vapor content
    • Anomalies using a threshold
    • Connected components
    • Example src: http://newscenter.lbl.gov
  • Cold fronts [23]
    • Surface winds and potential temperature fields
    • Methodology
      1. Compute features for every grid
      2. Cluster grids into K clusters
      3. Label clusters as anomalous or not using thresholds
      4. Filter out false positives using domain knowledge

Validation is Key

  • How useful are the anomalies from the domain perspective?
  • Common pitfalls
    • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
    • Anomalies are not actionable
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
    • Anomalies lost among false positives
  • Solution
    • Good validation data during design
    • Clear definition of a domain anomaly and its distinction from other potential competitors

References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.
[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.
[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.
[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.
[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.
[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.
[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.
[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.
[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.
[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.
[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.
[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.
[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.
[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.
[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.
[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.
[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.
[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.
[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.
[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.
[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.
[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.
[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.
[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.
[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.
[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.
[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.
[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.
[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.
[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.
[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.
[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.
[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

  • Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)
  • Arindam Banerjee (University of Minnesota)

  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemiology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies
Page 64: Anomaly Detection: A Tutorial - webdocs.cs.ualberta.cawebdocs.cs.ualberta.ca/~icdm2011/downloads/ICDM2011_anomaly_detection... · IntroductionData Mining MethodsDensity-based anomalyCommute

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Identifying Anomalies

Traditional anomaly detection (lof) Can be slow but can identify any type of anomalous structure

Faster method to identify specific types of anomalousstructures

Identify relevant feature pairs and power law relationship

Eg Egonet Density Power Law Ni vs Ei - detect nearcliques and stars

Ei prop Nαi 1 le α le 2

Anomaly score for node i wrt a pair of features (y = Cxθ)

Si =max(yi Cx

θi )

min(yi Cxθi )lowast log (|yi minus Cxθi |+ 1)

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Exploiting Graph Structure - Bipartite Graphs

Represent data as a bipartite graph Healthcare Data (Beneficiaries vs Providers) Insider trading (Traders vs Stocks)

Objective Identify anomalous links

Given a query node a isin V1 find the ldquorelevancerdquoof all other nodes in V1 to a

RelevanceScore(a b) prop Number of times aldquorandom walkrdquo from a reaches b

Use the relevance scores to compute thenormality scores for a node t isin V2

Find set St = a|〈a t〉 isin E Compute |St| times |St| similarity matrix using

relevance vectors for a isin St Normality Score = mean of non-diagonal entries

of similarity matrix

V1 E V2a1

a2

a3

an

t1

t2

t3

tn

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Detecting Disease Outbreaks

Domain Question Early detection of disease

outbreaks Anthrax attack

Data Emergency department visits Grocery data (Example [14]) Clinical visits Weatherclimate data

Challenges Weak signals in the data

ED cases involving cough rArrFlu or SARS

Integration of multiple signals(lag analysis)

Account for spatial and temporalcorrelations

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Whatrsquos Significant About Recent Events (WSARE) [33]

1 Learn Bayesian network from historical data Environmental and response variables

2 Sample from the BN (DBbaseline|Current Environment)3 Compute contingency table for rules for DBbaseline andDBcurrent

Rules are single assignment rules (Xi = Y ji ) or conjunctions4 Find p-value for rules using χ2-test

Null Hypothesis Rows and columns of tables for DBbaselineand DBcurrent are independent

5 Find rule with largest p-value Repeat Step 2

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model the spatial and temporal relationships
  • What happened yesterday?
  • What happened in the adjoining neighborhood (yesterday)?
• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
  • Add nodes for temporal and spatial dependencies
  • Need more data to train


Anomaly Detection in Climate and Weather

• Science questions
  • Identify natural and anthropogenic disasters
  • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
  • Ground observations
  • Remote sensing data (satellites, air-borne)
  • Climate model simulation outputs
  • Multiple variables, spatio-temporal (often has a height dimension as well)
• Challenges
  • Model spatio-temporal relationships across multiple variables
  • Explain the cause of anomalies
  • Massive data sizes
• Climate and weather extreme events are well defined
• Key challenge is to find significant events and explain the cause


Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
  • Difference from a "base period" (too simplistic)
  • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  • Removes seasonality
• Understand climate and weather phenomena
  • Southern Oscillation Index (SOI)
    • Difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

• Anomaly time series for a given location $i$:

  $v'_i = v_i - b_i$

  where $b_i$ is the base (reference)
• How to choose $b_i$?
  • Mean of all data for location $i$
  • Monthly mean values (account for seasonality)
  • Monthly z-score values
  • Median (more robust)
  • Using a shorter "reference period"
    • 30 year moving window
• Different methods show statistically significant differences [19]
• What is the right strategy?
  • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
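A few of the base choices listed above can be illustrated for a single location. The sketch below is only one possible reading: it assumes the series comes as parallel lists of values and calendar months, and the function name and parameters are hypothetical. It covers the monthly mean, the monthly median, and the monthly z-score; the moving reference window and the weighted combination are not shown.

```python
from statistics import mean, median, pstdev

def monthly_anomalies(values, months, base="mean", zscore=False):
    """Return v'_i = v_i - b_i, where b_i is the mean (or median) of all
    observations that fall in the same calendar month as v_i.
    With zscore=True the difference is also divided by the monthly
    (population) standard deviation."""
    by_month = {}
    for v, m in zip(values, months):
        by_month.setdefault(m, []).append(v)
    centre = {m: (median(vs) if base == "median" else mean(vs))
              for m, vs in by_month.items()}
    spread = {m: pstdev(vs) for m, vs in by_month.items()}
    anomalies = []
    for v, m in zip(values, months):
        a = v - centre[m]
        if zscore:
            # A constant month has zero spread; report no anomaly for it.
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies

# Two "years" of a three-month toy record for one location.
months = [1, 2, 3, 1, 2, 3]
values = [10, 20, 30, 12, 20, 33]
raw = monthly_anomalies(values, months)
z = monthly_anomalies(values, months, zscore=True)
```

Subtracting a per-month base removes the seasonal cycle, which is why the strong month-to-month rise in the toy record does not appear in the anomalies.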


Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatiotemporal anomalies in precipitation data
• A two-step approach
  1. Find precipitation anomalies using thresholds
  2. Find large connected components across space and time
    • Matlab: bwlabel, bwlabeln
• Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy Dr. Arindam Banerjee
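The two-step approach can be sketched in plain Python as a stand-in for the Matlab labeling functions named on the slide. The sketch assumes the anomalies are stored as a nested list indexed `[time][row][column]`, treats a cell as dry when its anomaly is below a threshold, and uses 6-connectivity (face neighbours in space and time) as a simplifying choice. The function names, the threshold, and the minimum size are hypothetical.

```python
from collections import deque

def label_components(mask):
    """Group True cells of mask[t][y][x] into connected components,
    where cells are connected if they differ by one step in t, y, or x."""
    T, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    seen = set()
    components = []
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not mask[t][y][x] or (t, y, x) in seen:
                    continue
                # Breadth-first search from an unvisited anomalous cell.
                queue = deque([(t, y, x)])
                seen.add((t, y, x))
                component = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    component.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                                and mask[nt][ny][nx] and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                components.append(component)
    return components

def find_droughts(anomaly, threshold, min_size):
    """Step 1: threshold the anomalies. Step 2: keep large components."""
    mask = [[[v < threshold for v in row] for row in grid] for grid in anomaly]
    return [c for c in label_components(mask) if len(c) >= min_size]

# Three time steps of a 2x3 grid of precipitation anomalies.
anomaly = [
    [[-2, -2, 0], [0, 0, 0]],
    [[-2, 0, 0], [0, 0, -2]],
    [[-2, 0, 0], [0, 0, 0]],
]
droughts = find_droughts(anomaly, threshold=-1, min_size=3)
```

The isolated dry cell at the second time step is dropped by the size filter, while the cell that stays dry across all three time steps anchors the one persistent component.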


More Climate Extreme Events using Anomaly Detection

• Atmospheric Rivers [8]
  • Water vapor content
  • Anomalies using a threshold
  • Connected components
  • Example (Src: http://newscenter.lbl.gov)
• Cold Fronts [23]
  • Surface winds and potential temperature fields
  • Methodology
    1. Compute features for every grid
    2. Cluster grids into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge
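The four-step methodology for cold fronts can be sketched as follows. The slide does not say which clustering algorithm or features are used, so this sketch substitutes plain k-means (Lloyd's algorithm) and two made-up features per grid cell; the threshold test on the cluster centre and the `keep` filter standing in for domain knowledge are likewise assumptions, not the method of [23].

```python
import math

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm. The initial centres are simply the first
    k points, which is an arbitrary choice made to keep the sketch short."""
    centres = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            assign[i] = min(range(k), key=lambda c: math.dist(p, centres[c]))
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centres

def detect_fronts(features, k, threshold, keep=lambda cell: True):
    """features maps a grid cell to a feature tuple (step 1, assumed done).
    Step 2 clusters the cells, step 3 marks a cluster anomalous when the
    first feature of its centre exceeds the threshold, and step 4 applies
    a caller-supplied filter."""
    cells = sorted(features)
    assign, centres = kmeans([features[c] for c in cells], k)
    anomalous = {c for c in range(k) if centres[c][0] > threshold}
    return [cell for cell, a in zip(cells, assign) if a in anomalous and keep(cell)]

# Hypothetical features per cell: (temperature gradient, wind shift).
features = {
    (0, 0): (0.1, 0.0), (0, 1): (0.2, 0.1), (0, 2): (0.1, 0.2),
    (1, 0): (5.0, 3.0), (1, 1): (5.2, 3.1), (1, 2): (4.9, 2.9),
}
fronts = detect_fronts(features, k=2, threshold=1.0)
filtered = detect_fronts(features, k=2, threshold=1.0,
                         keep=lambda cell: cell != (1, 2))
```

Labeling whole clusters rather than single cells means one threshold decision covers every cell that looks alike, which leaves the final filter to deal with the remaining false positives.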


Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
  • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
  • Anomalies are not actionable
    • Not identified in a timely fashion
    • Resolution is not fine enough
    • Cause not explained
  • Anomalies lost among false positives
• Solution
  • Good validation data during design
  • Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kroger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

• Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)
• Arindam Banerjee (University of Minnesota)


  • Introduction
    • Outline
  • Data Mining Methods
  • Density-based anomaly
  • Commute Time Approach
  • Addressing Scalability
  • Anomalies in Complex Data
    • Categorical Data
    • Discrete Sequences and Time Series
    • Contextual Anomalies
    • Collective Anomalies
  • Evaluating Anomaly Detection Methods
  • Applications
    • Anomaly Detection in Computer Networks
      • Problem
      • Solutions
    • Fraud Detection
    • Epidemiology
    • Anomaly Detection in Climate and Weather
      • Anomalies in Climate
      • Validation of Anomalies

Exploiting Graph Structure - Bipartite Graphs

• Represent data as a bipartite graph
  • Healthcare data (beneficiaries vs. providers)
  • Insider trading (traders vs. stocks)
• Objective: identify anomalous links
• Given a query node $a \in V_1$, find the "relevance" of all other nodes in $V_1$ to $a$
  • $\mathrm{RelevanceScore}(a, b) \propto$ number of times a "random walk" from $a$ reaches $b$
• Use the relevance scores to compute the normality score for a node $t \in V_2$
  • Find the set $S_t = \{a \mid \langle a, t \rangle \in E\}$
  • Compute the $|S_t| \times |S_t|$ similarity matrix using relevance vectors for $a \in S_t$
  • Normality score = mean of the non-diagonal entries of the similarity matrix

Figure: Bipartite graph with node sets $V_1 = \{a_1, a_2, a_3, \ldots, a_n\}$ and $V_2 = \{t_1, t_2, t_3, \ldots, t_n\}$ joined by edges $E$
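The relevance and normality scores described above can be sketched with a small power iteration. Two details are assumptions rather than statements from the slide: the random walk is given a restart probability so that the scores converge, and the similarity between two nodes $a, b \in S_t$ is taken to be the relevance of $b$ as seen from $a$. Function and parameter names are hypothetical.

```python
def relevance(adj, query, restart=0.15, iters=100):
    """Random walk with restart from `query` on an undirected graph
    given as {node: [neighbours]}. Returns a score per node that
    approximates how often the walk visits it."""
    score = {n: 0.0 for n in adj}
    score[query] = 1.0
    for _ in range(iters):
        nxt = {n: 0.0 for n in adj}
        for n, neighbours in adj.items():
            if neighbours:
                share = (1.0 - restart) * score[n] / len(neighbours)
                for m in neighbours:
                    nxt[m] += share
        nxt[query] += restart  # with probability `restart`, jump back to the query
        score = nxt
    return score

def normality(edges, t):
    """Mean pairwise relevance among the V1 neighbours of t (the set S_t).
    `edges` is a list of (a, t) pairs with a in V1 and t in V2."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    s_t = sorted(adj[t])
    if len(s_t) < 2:
        return None  # no off-diagonal entries to average
    rel = {a: relevance(adj, a) for a in s_t}
    off_diagonal = [rel[a][b] for a in s_t for b in s_t if a != b]
    return sum(off_diagonal) / len(off_diagonal)

# Two tight groups, plus t5 which links one member of each group.
edges = [(a, t) for a in ("a1", "a2", "a3") for t in ("t1", "t2")]
edges += [(a, t) for a in ("a4", "a5", "a6") for t in ("t3", "t4")]
edges += [("a1", "t5"), ("a4", "t5")]
score_t1 = normality(edges, "t1")
score_t5 = normality(edges, "t5")
```

In this toy graph `t5` connects two nodes that otherwise share no neighbours, so its normality score comes out lower than that of `t1`, which is the kind of link the slide's objective is after.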


P Sun S Chawla and B Arunasalam

Mining for outliers in sequential databasesIn In SIAM International Conference on Data Mining 2006

W-K Wong A Moore G Cooper and M Wagner

Bayesian network anomaly pattern detection for disease outbreaksIn T Fawcett and N Mishra editors Proceedings of the Twentieth International Conference on MachineLearning pages 808ndash815 Menlo Park California August 2003 AAAI Press

L Xiong B Poczos and J Schneider

Group anomaly detection using flexible genre modelsIn NIPS 2011

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies



Detecting Disease Outbreaks

• Domain question: early detection of disease outbreaks (e.g., an anthrax attack)
• Data
    • Emergency department (ED) visits
    • Grocery data (example: [14])
    • Clinical visits
    • Weather/climate data
• Challenges
    • Weak signals in the data: ED cases involving cough ⇒ flu or SARS?
    • Integration of multiple signals (lag analysis)
    • Accounting for spatial and temporal correlations


What's Significant About Recent Events (WSARE) [33]

1. Learn a Bayesian network (BN) from historical data, over environmental and response variables.
2. Sample DB_baseline from the BN, conditioned on the current environment.
3. Compute contingency tables for rules over DB_baseline and DB_current.
   Rules are single-assignment rules ($X_i = Y_i^j$) or conjunctions of them.
4. Find the p-value of each rule using a χ² test.
   Null hypothesis: the rows and columns of the table for DB_baseline and DB_current are independent.
5. Find the rule with the smallest p-value (the strongest evidence against independence). Repeat from Step 2.
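
The rule-scoring part of the procedure (Steps 3 to 5) can be illustrated with a minimal Python sketch. It is not the WSARE implementation from [33]: it assumes DB_baseline has already been sampled, handles only single-assignment rules, applies an uncorrected Pearson χ² test on a 2x2 table, and treats the rule with the smallest p-value as the most significant one. The function names, the record layout (one dict per record), and the toy data are all hypothetical.

```python
import math


def chi2_p_value_2x2(a, b, c, d):
    """Pearson chi-square p-value (1 degree of freedom, no continuity
    correction) for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        return 1.0  # degenerate table: no evidence against independence
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


def rule_p_value(rule, db_baseline, db_current):
    """Score a single-assignment rule given as (attribute, value)."""
    attribute, value = rule
    cur_match = sum(1 for rec in db_current if rec[attribute] == value)
    base_match = sum(1 for rec in db_baseline if rec[attribute] == value)
    return chi2_p_value_2x2(
        cur_match, len(db_current) - cur_match,
        base_match, len(db_baseline) - base_match,
    )


def most_significant_rule(rules, db_baseline, db_current):
    """Return the rule with the smallest p-value."""
    return min(rules, key=lambda r: rule_p_value(r, db_baseline, db_current))


# Hypothetical toy data: 10% of baseline records report cough, 40% today.
db_baseline = [{"symptom": "cough" if i < 10 else "other",
                "sex": "F" if i % 2 else "M"} for i in range(100)]
db_current = [{"symptom": "cough" if i < 40 else "other",
               "sex": "F" if i % 2 else "M"} for i in range(100)]
rules = [("symptom", "cough"), ("sex", "F")]
best = most_significant_rule(rules, db_baseline, db_current)
```

In this toy setting the cough rule is selected, because the proportion of cough cases differs sharply between the two databases while the sex ratio is identical. A fuller treatment would also score conjunctions of rules and correct for the number of rules tested.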


Incorporating Spatial and Temporal Relationships

• WSARE does not explicitly model spatial and temporal relationships
    • What happened yesterday?
    • What happened in the adjoining neighborhood (yesterday)?
• Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
    • Adds nodes for temporal and spatial dependencies
    • Needs more data to train


Anomaly Detection in Climate and Weather

• Science questions
    • Identify natural and anthropogenic disasters
    • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
• Data
    • Ground observations
    • Remote sensing data (satellites, air-borne)
    • Climate model simulation outputs
    • Multiple variables, spatio-temporal (often with a height dimension as well)
• Challenges
    • Model spatio-temporal relationships across multiple variables
    • Explain the cause of anomalies
    • Massive data sizes

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain their cause.


Anomalies are Widely Used in Climate

• Most analysis is done on "anomaly" time series
    • Difference from a "base period" (too simplistic)
    • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
    • Removes seasonality
• Used to understand climate and weather phenomena
    • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk


Constructing Anomalies from Raw Data

• Anomaly time series for a given location i:

    $v'_i = v_i - b_i$

  where $b_i$ is the base (reference).
• How to choose $b_i$?
    • Mean of all data for location i
    • Monthly mean values (account for seasonality)
    • Monthly z-score values
    • Median (more robust)
    • Using a shorter "reference period"
    • 30-year moving window
• Different methods show statistically significant differences [19]
• What is the right strategy?
    • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
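
Several of the baseline choices listed above can be compared with a small Python sketch. It assumes a single location, a monthly series that starts in January, and a base computed from all values of the same calendar month; the function name `monthly_anomalies` and its parameters are hypothetical, and the reference-period, moving-window, and weighted strategies are not covered.

```python
from statistics import mean, median, pstdev


def monthly_anomalies(values, base=mean, standardize=False):
    """Return v'_t = v_t - b for a monthly series that starts in January.

    `base` computes the reference b from all values of the same calendar
    month (mean by default, median for a more robust baseline). With
    standardize=True the anomaly is divided by that month's population
    standard deviation, giving a monthly z-score.
    """
    by_month = [values[m::12] for m in range(12)]
    baseline = [base(v) if v else 0.0 for v in by_month]
    spread = [pstdev(v) if len(v) > 1 else 0.0 for v in by_month]
    anomalies = []
    for t, v in enumerate(values):
        m = t % 12
        a = v - baseline[m]
        if standardize:
            # A month with no variation carries no z-score information.
            a = a / spread[m] if spread[m] > 0 else 0.0
        anomalies.append(a)
    return anomalies


# Hypothetical series: three identical seasonal cycles plus one spike
# in March of the second year (index 14).
series = [float(m) for m in range(12)] * 3
series[14] += 6.0

mean_based = monthly_anomalies(series)
median_based = monthly_anomalies(series, base=median)
z_scores = monthly_anomalies(series, standardize=True)
```

With the mean as base, the spike leaks into the baseline and the other two March values appear as negative anomalies; with the median, only the spike itself is non-zero. This is one small illustration of why the choice of $b_i$ changes the resulting anomaly series.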


Anomaly Detection for Identifying Droughts

• Science question: identify significant drought patterns using historical observation data, future simulation data, or both
• Find persistent spatiotemporal anomalies in precipitation data
• A two-step approach
    1. Find precipitation anomalies using thresholds
    2. Find large connected components across space and time (Matlab: bwlabel, bwlabeln)
• Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi), courtesy of Dr. Arindam Banerjee
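
The two-step approach can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code behind the video: the anomalies are assumed to sit on a regular (time, latitude, longitude) grid stored as nested lists, a cell counts as dry when its anomaly falls below a fixed threshold, and cells are linked by 6-connectivity (shared faces only), which need not match the connectivity used with bwlabeln. The function name, parameters, and toy data are hypothetical. In practice, a library routine such as `scipy.ndimage.label` could replace the hand-written search.

```python
from collections import deque


def drought_components(precip_anomaly, threshold, min_cells):
    """precip_anomaly[t][y][x]: precipitation anomaly on a (time, lat, lon) grid.

    Step 1: flag cells whose anomaly is below `threshold` (unusually dry).
    Step 2: group flagged cells into 6-connected components across space
    and time, keeping components with at least `min_cells` cells.
    """
    dry = {
        (t, y, x)
        for t, grid in enumerate(precip_anomaly)
        for y, row in enumerate(grid)
        for x, value in enumerate(row)
        if value < threshold
    }
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                  (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    components, seen = [], set()
    for start in sorted(dry):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search over flagged neighbours
            t, y, x = queue.popleft()
            component.append((t, y, x))
            for dt, dy, dx in neighbours:
                cell = (t + dt, y + dy, x + dx)
                if cell in dry and cell not in seen:
                    seen.add(cell)
                    queue.append(cell)
        if len(component) >= min_cells:
            components.append(sorted(component))
    return components


# Hypothetical toy grid: 3 time steps, 2 x 3 cells each.
toy = [
    [[-2.0, 0.1, 0.0], [0.2, 0.0, -1.5]],   # t = 0
    [[-1.8, 0.0, 0.1], [0.1, 0.3, 0.2]],    # t = 1
    [[-1.6, -1.7, 0.0], [0.0, 0.1, 0.0]],   # t = 2
]
events = drought_components(toy, threshold=-1.0, min_cells=3)
```

Here the dry cell that persists at one location over all three time steps, together with its dry neighbour at the last step, forms one component of four cells; the isolated dry cell at t = 0 is discarded by the size filter, which is what makes the surviving anomalies "persistent".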


More Climate Extreme Events using Anomaly Detection

• Atmospheric Rivers [8]
    • Water vapor content
    • Anomalies using a threshold
    • Connected components
    • Example src: http://newscenter.lbl.gov
• Cold Fronts [23]
    • Surface winds and potential temperature fields
• Methodology
    1. Compute features for every grid cell
    2. Cluster grid cells into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge
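
Steps 2 and 3 of the methodology above can be sketched as follows. The slide does not name a clustering algorithm, so plain k-means (Lloyd's algorithm) is an assumption here, as are the feature layout (the first feature is taken to be a temperature-gradient magnitude), the threshold rule on the cluster centroid, and all function and parameter names. Step 1 (feature computation) and Step 4 (domain-knowledge filtering) are left out.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; initial centroids are the first k points."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def flag_anomalous_cells(features, k, gradient_threshold):
    """Cluster grid-cell feature vectors, then mark every cell whose
    cluster centroid exceeds the threshold on the first feature."""
    assignment, centroids = kmeans(features, k)
    anomalous = {c for c in range(k) if centroids[c][0] > gradient_threshold}
    return [a in anomalous for a in assignment]


# Hypothetical features per grid cell: (temperature gradient, wind shift).
features = [(0.1, 1.0), (5.0, 1.2), (0.2, 0.9), (5.2, 1.1), (0.15, 1.0)]
flags = flag_anomalous_cells(features, k=2, gradient_threshold=2.0)
```

Labelling whole clusters rather than individual cells means one threshold decision per cluster, which is easier for a domain expert to review before the false-positive filtering of Step 4.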


Validation is Key

• How useful are the anomalies from the domain perspective?
• Common pitfalls
    • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
    • Anomalies are not actionable
        • Not identified in a timely fashion
        • Resolution is not fine enough
        • Cause not explained
    • Anomalies lost among false positives
• Solution
    • Good validation data during design
    • Clear definition of a domain anomaly and its distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: Theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Póczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


V Chandola S Boriah and V Kumar

A framework for exploring categorical dataIn Proceedings of the ninth SIAM International Conference on Data Mining 2009

V Chandola V Mithal and V Kumar

A comparative evaluation of anomaly detection techniques for sequence dataIn Proceedings of International Conference on Data Mining 2008

V Chandola V Mithal and V Kumar

Understanding anomaly detection techniques for sequence dataTechnical Report 09-001 University of Minnesota Computer Science Department January 2009

T Fawcett and F Provost

Activity monitoring noticing interesting changes in behaviorIn Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and DataMining pages 53ndash62 ACM Press 1999

A Goldenberg G Shmueli R A Caruana and S E Fienberg

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication salesProceedings of the National Academy of Sciences of the United States of America 99(8)pp 5237ndash52402002

D Hawkins

Identification of outliersMonographs on Applied Probability and Statistics May 1980

S A Hofmeyr S Forrest and A Somayaji

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Intrusion detection using sequences of system callsJournal of Computer Security 6(3)151ndash180 1998

X Jiang and G F Cooper

A bayesian spatio-temporal method for disease outbreak detectionJAMIA pages 462ndash471 2010

J S Kaustav Das

Detecting anomalous records in categorical datasetsIn Proc of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data miningAug 2007

J Kawale S Chatterjee A Kumar S Liess M Steinbach and V Kumar

Anomaly construction in climate data Issues and challengesIn Proceedings of NASA Conference on Intelligent Data Understanding 2011

E M Knorr and R T Ng

Algorithms for mining distance-based outliers in large datasetsIn Proceedings of the 24rd International Conference on Very Large Data Bases pages 392ndash403 MorganKaufmann Publishers Inc 1998

H-P Kriegel P Kroger E Schubert and A Zimek

Interpreting and unifying outlier scoresIn SDM pages 13ndash24 2011

A Lakhina M Crovella and C Diot

Diagnosing network-wide traffic anomaliesIn Proceedings of ACM SIGCOMM pages 219ndash230 2004

X Li R Ramachandran S Graves S Movva B Akkiraju D Emmitt S Greco R Atlas J Terry and

J-C JusemAutomated detection of frontal systems from numerical model-generated dataIn Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in datamining KDD rsquo05 pages 782ndash787 New York NY USA 2005 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

M V Mahoney and P K Chan

Learning rules for anomaly detection of hostile network trafficIn ICDM pages 601ndash604 2003

S V Mane

False negative estimation theory techniques and applicationsProQuest UMI Dissertation Publishing 2011

A Moore and M S Lee

Cached sufficient statistics for efficient machine learning with large datasetsJ Artif Int Res 867ndash91 March 1998

Y Pei and O Zaane

A synthetic data generator for clustering and outlier analysisTechnical report University of Alberta 2006

S Ramaswamy R Rastogi and K Shim

Efficient algorithms for mining outliers from large data setsIn Proceedings of the 2000 ACM SIGMOD international conference on Management of data pages427ndash438 ACM Press 2000

N Satish M Harris and M Garland

Designing efficient sorting algorithms for manycore gpusIn IPDPS pages 1ndash10 2009

X Song M Wu C Jermaine and S Ranka

Conditional anomaly detectionIEEE Trans on Knowl and Data Eng 19631ndash645 May 2007

D A Spielman and N Srivastava

Graph sparsification by effective resistancesIn Proceedings of the 40th annual ACM symposium on Theory of computing STOC rsquo08 pages 563ndash568New York NY USA 2008 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

P Sun S Chawla and B Arunasalam

Mining for outliers in sequential databasesIn In SIAM International Conference on Data Mining 2006

W-K Wong A Moore G Cooper and M Wagner

Bayesian network anomaly pattern detection for disease outbreaksIn T Fawcett and N Mishra editors Proceedings of the Twentieth International Conference on MachineLearning pages 808ndash815 Menlo Park California August 2003 AAAI Press

L Xiong B Poczos and J Schneider

Group anomaly detection using flexible genre modelsIn NIPS 2011

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Acknowledgements

Linsey Pang Tara Babie and Khoa Nguyen (University ofSydney)

Arindam Banerjee (University of Minnesota)

Chawla and Chandola Anomaly Detection


Incorporating Spatial and Temporal Relationships

  • WSARE does not explicitly model the spatial and temporal relationships:
    • What happened yesterday?
    • What happened in the adjoining neighborhood (yesterday)?
  • Bayesian Network Spatio-Temporal (BNST) modeling framework [17]
    • Add nodes for temporal and spatial dependencies
    • Need more data to train


Anomaly Detection in Climate and Weather

  • Science questions
    • Identify natural and anthropogenic disasters
    • Identify long time scale events: droughts, atmospheric rivers, cold fronts, etc.
  • Data
    • Ground observations
    • Remote sensing data (satellites, air-borne)
    • Climate model simulation outputs
    • Multiple variables, spatio-temporal (often has a height dimension as well)
  • Challenges
    • Model spatio-temporal relationships across multiple variables
    • Explain the cause of anomalies
    • Massive data sizes

Climate and weather extreme events are well defined.

The key challenge is to find significant events and explain the cause.


Anomalies are Widely Used in Climate

  • Most analysis is done on "anomaly" time series
    • Difference from a "base period" (too simplistic)
    • Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
    • Removes seasonality
  • Understand climate and weather phenomena
    • Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Src: www.metoffice.gov.uk
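The Southern Oscillation Index item above describes a difference of two anomaly series. A minimal sketch of that idea follows. The function names and the pressure values are made up for illustration, and the base period defaults to the whole series. Operational SOI definitions typically also standardize the result, which is omitted here.

```python
from statistics import mean

def anomaly(series, base_period):
    """Subtract the mean over base_period (a slice) from every value."""
    base = mean(series[base_period])
    return [v - base for v in series]

def pressure_difference_index(tahiti_slp, darwin_slp, base_period=slice(None)):
    """Tahiti SLP anomaly minus Darwin SLP anomaly, one value per time step."""
    t = anomaly(tahiti_slp, base_period)
    d = anomaly(darwin_slp, base_period)
    return [a - b for a, b in zip(t, d)]

# Made-up sea level pressures in hPa (illustrative values only).
tahiti = [1012.0, 1013.0, 1011.0, 1014.0]
darwin = [1010.0, 1009.0, 1011.0, 1008.0]
index = pressure_difference_index(tahiti, darwin)
```

Negative values mark time steps where pressure at Tahiti is low relative to Darwin, measured against each station's own base mean.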


Constructing Anomalies from Raw Data

  • Anomaly time series for a given location i:

    $v'_i = v_i - b_i$

    where $b_i$ is the base (reference).
  • How to choose $b_i$?
    • Mean of all data for location i
    • Monthly mean values (account for seasonality)
    • Monthly z-score values
    • Median (more robust)
    • Using a shorter "reference period"
      • 30 year moving window
  • Different methods show statistically significant differences [19]
  • What is the right strategy?
    • Weighted mean of different strategies (pick weights using Monte Carlo sampling)
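Three of the base choices listed above (monthly mean, monthly median, monthly z-score) can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name `monthly_anomalies`, the assumption that the series starts in January, and the synthetic temperatures are all hypothetical.

```python
from statistics import mean, median, pstdev

def monthly_anomalies(values, base="mean"):
    """Return v'_i = v_i - b_i for a monthly series that starts in January.

    base="mean"   : b_i is the mean of the same calendar month
    base="median" : b_i is the median of the same calendar month
    base="zscore" : monthly-mean anomaly divided by that month's standard deviation
    """
    # Group the observations by calendar month (index 0 = January).
    by_month = [values[m::12] for m in range(12)]
    anomalies = []
    for t, v in enumerate(values):
        group = by_month[t % 12]
        if base == "median":
            anomalies.append(v - median(group))
        elif base == "mean":
            anomalies.append(v - mean(group))
        elif base == "zscore":
            spread = pstdev(group)
            # A constant month has no spread; report 0 rather than divide by zero.
            anomalies.append((v - mean(group)) / spread if spread else 0.0)
        else:
            raise ValueError(f"unknown base: {base}")
    return anomalies

# Two years of synthetic monthly temperatures: a seasonal cycle plus a
# 1-degree warming in the second year.
seasonal = [0, 1, 3, 6, 9, 12, 14, 13, 10, 6, 3, 1]
series = [float(s) for s in seasonal] + [s + 1.0 for s in seasonal]
anoms = monthly_anomalies(series, base="mean")
```

With the monthly base the seasonal cycle disappears and only the year-to-year shift remains, which is the "removes seasonality" property the slides attribute to anomaly series.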


Anomaly Detection for Identifying Droughts

  • Science question: identify significant drought patterns using historical observation data, future simulation data, or both
  • Find persistent spatiotemporal anomalies in precipitation data
  • A two-step approach:
    1. Find precipitation anomalies using thresholds
    2. Find large connected components across space and time
       • Matlab: bwlabel, bwlabeln
  • Follow-up science question: explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
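The two-step drought approach above (threshold, then connected components across space and time) can be sketched in plain Python as a stand-in for Matlab's bwlabeln. The function names, the 6-connectivity neighbourhood, the threshold, and the tiny data cube are assumptions chosen for illustration.

```python
from collections import deque

def label_components(mask):
    """Label 6-connected True cells in a 3-D mask indexed as mask[t][y][x].

    Returns a dict mapping component id -> list of (t, y, x) cells.
    """
    T, Y, X = len(mask), len(mask[0]), len(mask[0][0])
    seen = set()
    components = {}
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for t in range(T):
        for y in range(Y):
            for x in range(X):
                if not mask[t][y][x] or (t, y, x) in seen:
                    continue
                # Breadth-first search collects one connected component.
                queue = deque([(t, y, x)])
                seen.add((t, y, x))
                cells = []
                while queue:
                    ct, cy, cx = queue.popleft()
                    cells.append((ct, cy, cx))
                    for dt, dy, dx in steps:
                        nt, ny, nx = ct + dt, cy + dy, cx + dx
                        if (0 <= nt < T and 0 <= ny < Y and 0 <= nx < X
                                and mask[nt][ny][nx] and (nt, ny, nx) not in seen):
                            seen.add((nt, ny, nx))
                            queue.append((nt, ny, nx))
                components[len(components) + 1] = cells
    return components

def find_droughts(precip_anomaly, threshold, min_cells):
    # Step 1: flag cells whose precipitation anomaly is below the threshold.
    mask = [[[v < threshold for v in row] for row in grid] for grid in precip_anomaly]
    # Step 2: keep only large connected components across space and time.
    return [c for c in label_components(mask).values() if len(c) >= min_cells]

# Tiny synthetic cube: 3 time steps on a 2x3 grid (values are made up).
cube = [
    [[-2.0, 0.1, 0.0], [0.2, 0.0, -1.5]],
    [[-2.1, -1.8, 0.0], [0.1, 0.0, 0.3]],
    [[-1.9, -2.2, 0.0], [0.0, 0.2, 0.1]],
]
droughts = find_droughts(cube, threshold=-1.0, min_cells=3)
```

The isolated dry cell at the first time step is discarded because it does not persist, while the five-cell region spanning all three time steps is kept as one drought.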


More Climate Extreme Events using Anomaly Detection

  • Atmospheric Rivers [8]
    • Water vapor content
    • Anomalies using a threshold
    • Connected components
    • Example src: http://newscenter.lbl.gov
  • Cold Fronts [23]
    • Surface winds and potential temperature fields
  • Methodology
    1. Compute features for every grid cell
    2. Cluster grid cells into K clusters
    3. Label clusters as anomalous or not using thresholds
    4. Filter out false positives using domain knowledge
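The four methodology steps above can be sketched with a small k-means. This is not the method of [23]. Everything here is an assumption for illustration: features are taken as precomputed per grid cell (step 1) as a hypothetical pair (temperature gradient, wind shift), k-means is seeded with the first K points for determinism, and the domain filter is a caller-supplied rule.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means on tuples; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the mean of its members (keep it if empty).
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids

def detect_fronts(features, k, gradient_threshold, is_plausible):
    """features maps grid cell -> (temperature gradient, wind shift)."""
    cells = list(features)
    # Step 2: cluster the grid cells into k clusters.
    assignment, centroids = kmeans([features[c] for c in cells], k)
    # Step 3: a cluster is anomalous if its mean temperature gradient is large.
    anomalous = {c for c in range(k) if centroids[c][0] > gradient_threshold}
    candidates = [cell for cell, a in zip(cells, assignment) if a in anomalous]
    # Step 4: drop candidates that the domain rule rejects.
    return [cell for cell in candidates if is_plausible(cell)]

# Step 1 stand-in: made-up features for a 2x3 grid.
features = {
    (0, 0): (0.1, 0.2), (0, 1): (5.0, 4.0), (0, 2): (0.2, 0.1),
    (1, 0): (5.2, 4.1), (1, 1): (0.0, 0.3), (1, 2): (4.9, 3.8),
}
known_artifacts = {(1, 2)}  # hypothetical cells ruled out by a domain expert
fronts = detect_fronts(features, k=2, gradient_threshold=2.0,
                       is_plausible=lambda cell: cell not in known_artifacts)
```

Thresholding cluster centroids rather than individual cells means one noisy cell cannot be flagged by itself, and the final filter is where domain knowledge removes the remaining false positives.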


Validation is Key

  • How useful are the anomalies from the domain perspective?
  • Common pitfalls
    • Anomalies are algorithmically correct but are not relevant (bad data, noise, simplistic)
    • Anomalies are not actionable
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
    • Anomalies lost among false positives
  • Solution
    • Good validation data during design
    • Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting anomalies in weighted graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29–38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243–254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8):5237–5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

[18] K. Das and J. Schneider. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)

Chawla and Chandola Anomaly Detection

  • Introduction
    • Outline
      • Data Mining Methods
      • Density-based anomaly
      • Commute Time Approach
      • Addressing Scalability
      • Anomalies in Complex Data
        • Categorical Data
        • Discrete Sequences and Time Series
        • Contextual Anomalies
        • Collective Anomalies
          • Evaluating Anomaly Detection Methods
          • Applications
            • Anomaly Detection in Computer Networks
              • Problem
              • solutions
                • Fraud Detection
                • Epidemeology
                • Anomaly Detection in Climate and Weather
                • Anomalies in Climate
                • Validation of Anomalies
Page 69: Anomaly Detection: A Tutorial - webdocs.cs.ualberta.cawebdocs.cs.ualberta.ca/~icdm2011/downloads/ICDM2011_anomaly_detection... · IntroductionData Mining MethodsDensity-based anomalyCommute

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomaly Detection in Climate and Weather

Science Questions Identify natural and anthropogenic disasters Identify long time scale events - droughts atmospheric rivers

cold fronts etc

Data Ground observations Remote sensing data (satellites

air-borne) Climate model simulation outputs Multiple variables spatio-temporal (often has height dimension

as well)

Challenges Model spatio-temporal relationships across multiple variables Explain the cause of anomalies Massive data sizes

Climate and weather extreme events are well defined

Key challenge is to find significant events and explain the cause

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomalies are Widely Used in Climate

Most analysis done on ldquoanomalyrdquo time series

Difference from a ldquobase periodrdquo (Toosimplistic)

Brings spatial smoothness (eg a mountaintop and nearby valley can have very differenttemperatures) Removes seasonality

Understand climate and weatherphenomenon

Southern Oscillation Index (SOI) Difference between Sea Level Pressure

(SLP) anomalies for Tahiti and DarwinAustralia

Figure Global AverageTemperature Anomaly(1975 - 2007) Srcwwwmetofficegovuk

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Constructing Anomalies from Raw Data

Anomaly time series for a given location i

vprimei = vi minus bi

where bi is the base (reference)

How to choose bi Mean of all data for location i Monthly mean values (account for seasonality) Monthly z-score values Median (more robust) Using a shorter ldquoreference periodrdquo

30 year moving window

Different methods show statistically significant differences[19] What is the right strategy Weighted mean of different strategies (Pick weights using

Monte Carlo sampling)

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomaly Detection for Identifying Droughts

Science question Identify significant drought patterns usinghistorical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1 Find precipitationanomalies usingthresholds

2 Find large connectedcomponents across spaceand time

Matlab - bwlabelbwlabeln

Followup Science questionExplain cause

(drought190120061avi)

Figure Video courtsey Dr ArindamBanerjee

Chawla and Chandola Anomaly Detection

drought_1901_2006_1avi
Media File (videoavi)

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water Vapor Content

Anomalies using a threshold

Connected components Example Src -

http newscenter lbl gov

Cold Fronts [23]

Surface winds and Potential temperature fields

Methodology

1 Compute features for every grid2 Cluster grids into K clusters3 Label clusters as anomalous or not using thresholds4 Filter out false positives using domain knowledge

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:

  Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
  Anomalies are not actionable
    Not identified in a timely fashion
    Resolution is not fine enough
    Cause not explained
  Anomalies lost among false positives

Solution:

  Good validation data during design
  Clear definition of a domain anomaly and distinction from other potential competitors


References

[1] L. Akoglu, M. McGlohon, and C. Faloutsos. OddBall: Spotting Anomalies in Weighted Graphs. In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), June 2010.

[2] A. Asuncion and D. J. Newman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, 2007.

[3] S. D. Bay and M. Schwabacher. Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 29-38. ACM Press, 2003.

[4] R. Bolton and D. Hand. Unsupervised profiling methods for fraud detection. In Credit Scoring and Credit Control VII, 1999.

[5] S. Boriah, V. Chandola, and V. Kumar. Similarity measures for categorical data: A comparative evaluation. In SDM, pages 243-254, 2008.

[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander. LOF: identifying density-based local outliers. In Proceedings of 2000 ACM SIGMOD International Conference on Management of Data, pages 93-104. ACM Press, 2000.

[7] S. Budalakoti, A. Srivastava, and M. Otey. Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airline safety. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics, 37(6), 2007.

[8] S. Byna, Prabhat, M. Wehner, and K. Wu. Detecting atmospheric rivers in large climate datasets. In 2nd International Workshop on Petascale Data Analytics: Challenges and Opportunities (PDAC-11), 2011.

[9] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

[10] V. Chandola, S. Boriah, and V. Kumar. A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

[11] V. Chandola, V. Mithal, and V. Kumar. A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

[12] V. Chandola, V. Mithal, and V. Kumar. Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

[13] T. Fawcett and F. Provost. Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53-62. ACM Press, 1999.

[14] A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg. Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237-5240, 2002.

[15] D. Hawkins. Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

[16] S. A. Hofmeyr, S. Forrest, and A. Somayaji. Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151-180, 1998.

[17] X. Jiang and G. F. Cooper. A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462-471, 2010.

[18] J. S. Kaustav Das. Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

[19] J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar. Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

[20] E. M. Knorr and R. T. Ng. Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392-403. Morgan Kaufmann Publishers Inc., 1998.

[21] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek. Interpreting and unifying outlier scores. In SDM, pages 13-24, 2011.

[22] A. Lakhina, M. Crovella, and C. Diot. Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219-230, 2004.

[23] X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem. Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782-787, New York, NY, USA, 2005. ACM.

[24] M. V. Mahoney and P. K. Chan. Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601-604, 2003.

[25] S. V. Mane. False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

[26] A. Moore and M. S. Lee. Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67-91, March 1998.

[27] Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

[28] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427-438. ACM Press, 2000.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1-10, 2009.

[30] X. Song, M. Wu, C. Jermaine, and S. Ranka. Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631-645, May 2007.

[31] D. A. Spielman and N. Srivastava. Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563-568, New York, NY, USA, 2008. ACM.

[32] P. Sun, S. Chawla, and B. Arunasalam. Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

[33] W.-K. Wong, A. Moore, G. Cooper, and M. Wagner. Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808-815, Menlo Park, California, August 2003. AAAI Press.

[34] L. Xiong, B. Poczos, and J. Schneider. Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Anomalies are Widely Used in Climate

Most analysis is done on "anomaly" time series

  Difference from a "base period" (too simplistic)
  Brings spatial smoothness (e.g., a mountain top and a nearby valley can have very different temperatures)
  Removes seasonality

Understand climate and weather phenomena

  Southern Oscillation Index (SOI): difference between Sea Level Pressure (SLP) anomalies for Tahiti and Darwin, Australia

Figure: Global Average Temperature Anomaly (1975 - 2007). Source: www.metoffice.gov.uk
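As a small illustration of the SOI description above, the sketch below subtracts each station's base-period mean from its pressure series and then takes the Tahiti minus Darwin difference. It follows only the simplified definition given on the slide; operational SOI products also standardize the series, which is not shown. The function names, the base_period argument and the pressure values are made up for illustration, and the four readings are assumed to come from the same calendar month in four different years.

```python
from statistics import mean


def anomalies(series, base_period=slice(None)):
    """Subtract the mean over base_period (a slice) from every value."""
    base = mean(series[base_period])
    return [v - base for v in series]


def soi_difference(tahiti_slp, darwin_slp, base_period=slice(None)):
    """Difference between the Tahiti and Darwin SLP anomaly series."""
    tahiti_anom = anomalies(tahiti_slp, base_period)
    darwin_anom = anomalies(darwin_slp, base_period)
    return [t - d for t, d in zip(tahiti_anom, darwin_anom)]


# Made-up sea level pressure readings in hPa.
tahiti = [1012.0, 1014.0, 1010.0, 1012.0]
darwin = [1009.0, 1008.0, 1011.0, 1008.0]
index = soi_difference(tahiti, darwin)
```

Because each station is compared with its own base, the index reflects the pressure see-saw between the two locations rather than their absolute pressure difference.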


Constructing Anomalies from Raw Data

Anomaly time series for a given location i:

  v'_i = v_i - b_i

where b_i is the base (reference)

How to choose b_i?

  Mean of all data for location i
  Monthly mean values (account for seasonality)
  Monthly z-score values
  Median (more robust)
  Using a shorter "reference period"
    30 year moving window

Different methods show statistically significant differences [19]

What is the right strategy?

  Weighted mean of different strategies (pick weights using Monte Carlo sampling)
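Three of the base choices listed above (monthly mean, monthly median, monthly z-score) can be sketched as follows. This is a minimal illustration, not the procedure evaluated in [19]: the function names and the toy series are invented, the series is assumed to hold monthly values for one location starting in January and covering at least one full year, and the population standard deviation is an arbitrary choice for the z-score.

```python
from statistics import mean, median, pstdev


def monthly_anomalies(values, base=mean):
    """Return v'_i = v_i - b_i, where b_i is the base of v_i's calendar month.

    values: monthly observations for one location, starting in January.
    base:   function reducing one month's values to a base (mean or median).
    """
    by_month = [values[m::12] for m in range(12)]
    bases = [base(month_values) for month_values in by_month]
    return [v - bases[i % 12] for i, v in enumerate(values)]


def monthly_zscores(values):
    """Monthly-mean anomaly divided by that month's standard deviation."""
    by_month = [values[m::12] for m in range(12)]
    means = [mean(month_values) for month_values in by_month]
    stds = [pstdev(month_values) for month_values in by_month]
    return [(v - means[i % 12]) / stds[i % 12] if stds[i % 12] > 0 else 0.0
            for i, v in enumerate(values)]


# Toy series (made up): a seasonal cycle 1..12, then the same cycle 2 units higher.
year1 = [float(m) for m in range(1, 13)]
series = year1 + [v + 2.0 for v in year1]

anoms = monthly_anomalies(series)               # monthly mean as base
robust = monthly_anomalies(series, base=median)  # monthly median as base
zscores = monthly_zscores(series)
```

In the toy series the seasonal cycle disappears from the anomalies, leaving only the shift between the two years. Using the mean of all data as the base instead would leave the seasonal cycle in place, which is one way the choice of b_i changes what a later detector sees.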



Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Constructing Anomalies from Raw Data

Anomaly time series for a given location i

vprimei = vi minus bi

where bi is the base (reference)

How to choose bi Mean of all data for location i Monthly mean values (account for seasonality) Monthly z-score values Median (more robust) Using a shorter ldquoreference periodrdquo

30 year moving window

Different methods show statistically significant differences[19] What is the right strategy Weighted mean of different strategies (Pick weights using

Monte Carlo sampling)

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Anomaly Detection for Identifying Droughts

Science question Identify significant drought patterns usinghistorical observation data or future simulation data or both

Find persistent spatiotemporal anomalies in precipitation data

A two step approach

1 Find precipitationanomalies usingthresholds

2 Find large connectedcomponents across spaceand time

Matlab - bwlabelbwlabeln

Followup Science questionExplain cause

(drought190120061avi)

Figure Video courtsey Dr ArindamBanerjee

Chawla and Chandola Anomaly Detection

drought_1901_2006_1avi
Media File (videoavi)

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water Vapor Content

Anomalies using a threshold

Connected components Example Src -

http newscenter lbl gov

Cold Fronts [23]

Surface winds and Potential temperature fields

Methodology

1 Compute features for every grid2 Cluster grids into K clusters3 Label clusters as anomalous or not using thresholds4 Filter out false positives using domain knowledge

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Validation is Key

How useful are the anomalies from the domain perspective

Common pitfalls Anomalies are algorithmically correct but are not relevant (bad

data noise simplistic) Anomalies are not actionable

Not identified in timely fashion Resolution is not fine enough Cause not explained

Anomalies lost among false positives

Solution Good validation data during design Clear definition of a domain anomaly and distinction from

other potential competitors

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

L Akoglu M McGlohon and C Faloutsos

OddBall Spotting Anomalies in Weighted GraphsIn In Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) June 2010

A Asuncion and D J Newman

UCI machine learning repository[httparchiveicsucieduml] Irvine CA University of California 2007

S D Bay and M Schwabacher

Mining distance-based outliers in near linear time with randomization and a simple pruning ruleIn Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and datamining pages 29ndash38 ACM Press 2003

R Bolton and D Hand

Unsupervised profiling methods for fraud detectionIn Credit Scoring and Credit Control VII 1999

S Boriah V Chandola and V Kumar

Similarity measures for categorical data A comparative evaluationIn SDM pages 243ndash254 2008

M M Breunig H-P Kriegel R T Ng and J Sander

Lof identifying density-based local outliersIn Proceedings of 2000 ACM SIGMOD International Conference on Management of Data pages 93ndash104ACM Press 2000

S Budalakoti A Srivastava and M Otey

Anomaly detection and diagnosis algorithms for discrete symbol sequences with applications to airlinesafetyProceedings of the IEEE International Conference on Systems Man and Cybernetics 37(6) 2007

S Byna Prabhat M Wehner and K Wu

Detecting atmospheric rivers in large climate datasets

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

In 2nd International Workshop on Petascale Data Analytics Challenges and Opportunities (PDAC-11)2011

V. Chandola, A. Banerjee, and V. Kumar.

Anomaly detection for discrete sequences: A survey. IEEE Transactions on Knowledge and Data Engineering, 99(PrePrints), 2010.

V. Chandola, S. Boriah, and V. Kumar.

A framework for exploring categorical data. In Proceedings of the ninth SIAM International Conference on Data Mining, 2009.

V. Chandola, V. Mithal, and V. Kumar.

A comparative evaluation of anomaly detection techniques for sequence data. In Proceedings of International Conference on Data Mining, 2008.

V. Chandola, V. Mithal, and V. Kumar.

Understanding anomaly detection techniques for sequence data. Technical Report 09-001, University of Minnesota, Computer Science Department, January 2009.

T. Fawcett and F. Provost.

Activity monitoring: noticing interesting changes in behavior. In Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 53–62. ACM Press, 1999.

A. Goldenberg, G. Shmueli, R. A. Caruana, and S. E. Fienberg.

Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences of the United States of America, 99(8), pp. 5237–5240, 2002.

D. Hawkins.

Identification of outliers. Monographs on Applied Probability and Statistics, May 1980.

S. A. Hofmeyr, S. Forrest, and A. Somayaji.

Intrusion detection using sequences of system calls. Journal of Computer Security, 6(3):151–180, 1998.

X. Jiang and G. F. Cooper.

A Bayesian spatio-temporal method for disease outbreak detection. JAMIA, pages 462–471, 2010.

K. Das and J. Schneider.

Detecting anomalous records in categorical datasets. In Proc. of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data mining, Aug. 2007.

J. Kawale, S. Chatterjee, A. Kumar, S. Liess, M. Steinbach, and V. Kumar.

Anomaly construction in climate data: Issues and challenges. In Proceedings of NASA Conference on Intelligent Data Understanding, 2011.

E. M. Knorr and R. T. Ng.

Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 392–403. Morgan Kaufmann Publishers Inc., 1998.

H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek.

Interpreting and unifying outlier scores. In SDM, pages 13–24, 2011.

A. Lakhina, M. Crovella, and C. Diot.

Diagnosing network-wide traffic anomalies. In Proceedings of ACM SIGCOMM, pages 219–230, 2004.

X. Li, R. Ramachandran, S. Graves, S. Movva, B. Akkiraju, D. Emmitt, S. Greco, R. Atlas, J. Terry, and J.-C. Jusem.

Automated detection of frontal systems from numerical model-generated data. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, KDD '05, pages 782–787, New York, NY, USA, 2005. ACM.


M. V. Mahoney and P. K. Chan.

Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

S. V. Mane.

False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

A. Moore and M. S. Lee.

Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

Y. Pei and O. Zaïane.

A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

S. Ramaswamy, R. Rastogi, and K. Shim.

Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

N. Satish, M. Harris, and M. Garland.

Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

X. Song, M. Wu, C. Jermaine, and S. Ranka.

Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

D. A. Spielman and N. Srivastava.

Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.


P. Sun, S. Chawla, and B. Arunasalam.

Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner.

Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Póczos, and J. Schneider.

Group anomaly detection using flexible genre models. In NIPS, 2011.


Acknowledgements

Linsey Pang, Tara Babie and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)


Anomaly Detection for Identifying Droughts

Science question: Identify significant drought patterns using historical observation data, future simulation data, or both

Find persistent spatiotemporal anomalies in precipitation data

A two-step approach:

1. Find precipitation anomalies using thresholds

2. Find large connected components across space and time

Matlab: bwlabel / bwlabeln

Follow-up science question: Explain the cause

Figure: Video (drought_1901_2006_1.avi) courtesy of Dr. Arindam Banerjee
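
The two-step approach on this slide can be sketched in a few lines. The following is a minimal pure-Python illustration, not the code used to produce the video. It assumes precipitation is stored as a nested list indexed by (time, row, column), that an anomaly is simply a value below a fixed `threshold`, and that cells are neighbours when they share a face in space or time (6-connectivity). The names `threshold_anomalies`, `connected_components`, `find_droughts` and `min_size` are hypothetical; the labelling step stands in for what the slide does with Matlab's bwlabeln.

```python
from collections import deque

def threshold_anomalies(precip, threshold):
    """Step 1: mark every cell whose precipitation is below `threshold`."""
    return [[[value < threshold for value in row] for row in grid]
            for grid in precip]

def connected_components(mask):
    """Step 2: label face-connected components in a (time, row, col) mask."""
    T, R, C = len(mask), len(mask[0]), len(mask[0][0])
    labels = [[[0] * C for _ in range(R)] for _ in range(T)]
    sizes = {}
    next_label = 0
    # Neighbours differ by one step along exactly one axis (assumed connectivity).
    offsets = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for t in range(T):
        for r in range(R):
            for c in range(C):
                if not mask[t][r][c] or labels[t][r][c]:
                    continue
                # Start a new component and flood-fill it breadth-first.
                next_label += 1
                labels[t][r][c] = next_label
                queue = deque([(t, r, c)])
                size = 0
                while queue:
                    ct, cr, cc = queue.popleft()
                    size += 1
                    for dt, dr, dc in offsets:
                        nt, nr, nc = ct + dt, cr + dr, cc + dc
                        if (0 <= nt < T and 0 <= nr < R and 0 <= nc < C
                                and mask[nt][nr][nc] and not labels[nt][nr][nc]):
                            labels[nt][nr][nc] = next_label
                            queue.append((nt, nr, nc))
                sizes[next_label] = size
    return labels, sizes

def find_droughts(precip, threshold, min_size):
    """Keep only components with at least `min_size` cells (the 'large' ones)."""
    mask = threshold_anomalies(precip, threshold)
    _, sizes = connected_components(mask)
    return {label: size for label, size in sizes.items() if size >= min_size}

# Toy data: three time steps of a 3x3 grid (values are made up).
precip = [
    [[9, 9, 9], [9, 1, 9], [9, 9, 9]],
    [[9, 9, 9], [9, 1, 1], [9, 9, 9]],
    [[1, 9, 9], [9, 9, 1], [9, 9, 9]],
]
print(find_droughts(precip, threshold=2, min_size=3))  # {1: 4}
```

The isolated dry cell in the last time step forms its own component of size 1 and is discarded, while the dry patch that persists across all three time steps is reported. In practice the threshold would more likely be defined relative to a local climatology than as one global constant; that choice is outside what the slide specifies.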


More Climate Extreme Events using Anomaly Detection

Atmospheric Rivers [8]

Water vapor content

Anomalies using a threshold

Connected components

Example source: http://newscenter.lbl.gov

Cold Fronts [23]

Surface winds and potential temperature fields

Methodology:

1. Compute features for every grid

2. Cluster grids into K clusters

3. Label clusters as anomalous or not using thresholds

4. Filter out false positives using domain knowledge
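
The four-step cold-front methodology can be illustrated with a small sketch. This is not the method of [23]. It assumes the per-grid features are already computed as numeric tuples, uses a plain Lloyd's k-means seeded with the first K points as the clustering step, flags a cluster when the first coordinate of its centroid exceeds a threshold, and takes the domain-knowledge filter as a caller-supplied predicate. The names `kmeans`, `detect_front_cells`, `gradient_threshold` and `is_plausible` are hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])),
            )
        # Move each centroid to the mean of its current members.
        for j in range(k):
            members = [points[i] for i in range(len(points)) if assignment[i] == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

def detect_front_cells(features, k, gradient_threshold, is_plausible):
    """Steps 2-4: cluster, flag clusters by threshold, filter false positives."""
    assignment, centroids = kmeans(features, k)
    # Step 3: a cluster is anomalous if its centroid's first feature is large.
    anomalous = {j for j, c in enumerate(centroids) if c[0] > gradient_threshold}
    candidates = [i for i, j in enumerate(assignment) if j in anomalous]
    # Step 4: drop candidates that the domain rule rejects.
    return [i for i in candidates if is_plausible(i)]

# Step 1 (assumed done): made-up features per grid, e.g. (gradient, wind shift).
features = [(0.1, 0.2), (5.0, 4.8), (0.2, 0.1), (5.2, 5.1), (0.0, 0.3), (4.9, 5.0)]
# Hypothetical domain rule: grid 5 is known to be unreliable and is excluded.
print(detect_front_cells(features, k=2, gradient_threshold=2.0,
                         is_plausible=lambda i: i != 5))  # [1, 3]
```

Thresholding cluster centroids rather than individual grids means a single decision covers every member of a cluster, which is why the final filtering step is needed to remove members that domain knowledge rules out.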


Validation is Key

How useful are the anomalies from the domain perspective?

Common pitfalls:

Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)

Anomalies are not actionable: not identified in a timely fashion; resolution is not fine enough; cause not explained

Anomalies are lost among false positives

Solution:

Good validation data during design

Clear definition of a domain anomaly and its distinction from other potential competitors
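
One way to make the "lost among false positives" pitfall measurable is to score detections against validation data. The sketch below assumes that detected anomalies and domain-validated events can be matched by a shared identifier; the function name `precision_recall` and the event identifiers are hypothetical.

```python
def precision_recall(detected, validated):
    """Compare detected anomalies with domain-validated events."""
    detected, validated = set(detected), set(validated)
    true_positives = len(detected & validated)
    # Precision: fraction of detections that the domain experts confirm.
    precision = true_positives / len(detected) if detected else 0.0
    # Recall: fraction of confirmed events that the detector found.
    recall = true_positives / len(validated) if validated else 0.0
    return precision, recall

# Made-up identifiers: four detections, three validated events, two in common.
precision, recall = precision_recall(["e1", "e2", "e3", "e4"], ["e1", "e2", "e5"])
print(precision, recall)
```

Low precision corresponds to anomalies being buried among false positives; low recall corresponds to relevant domain events that the detector misses.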



Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Validation is Key

  • How useful are the anomalies from the domain perspective?
  • Common pitfalls:
    • Anomalies are algorithmically correct but not relevant (bad data, noise, simplistic)
    • Anomalies are not actionable:
      • Not identified in a timely fashion
      • Resolution is not fine enough
      • Cause not explained
    • Anomalies are lost among false positives
  • Solution:
    • Good validation data during design
    • Clear definition of a domain anomaly and its distinction from other potential competitors
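
One way to check whether true anomalies are lost among false positives is to measure precision over the top-ranked records once labeled validation data exists. The sketch below is illustrative only: the function name `precision_at_k`, the scores, and the labels are assumptions made for this example and do not come from the tutorial.

```python
def precision_at_k(scores, labels, k):
    """Fraction of the k highest-scoring records that are true domain anomalies.

    scores: anomaly scores, higher = more anomalous (hypothetical detector output).
    labels: 1 if a domain expert confirmed the record as an anomaly, else 0.
    """
    if k <= 0 or k > len(scores):
        raise ValueError("k must be between 1 and the number of records")
    # Rank record indices by descending anomaly score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top_k = ranked[:k]
    return sum(labels[i] for i in top_k) / k


# Made-up validation set: 8 records, 2 of them confirmed anomalies.
scores = [0.91, 0.15, 0.78, 0.05, 0.66, 0.30, 0.88, 0.12]
labels = [1, 0, 0, 0, 1, 0, 0, 0]

p3 = precision_at_k(scores, labels, 3)
# Number of flagged records among the top 3 that are not real anomalies.
false_positives = 3 - round(p3 * 3)
```

A low precision at small k suggests that analysts reviewing the top of the ranking would mostly see false positives, which is the situation the slide warns about.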

Page 77: Anomaly Detection: A Tutorial - webdocs.cs.ualberta.cawebdocs.cs.ualberta.ca/~icdm2011/downloads/ICDM2011_anomaly_detection... · IntroductionData Mining MethodsDensity-based anomalyCommute

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

Intrusion detection using sequences of system callsJournal of Computer Security 6(3)151ndash180 1998

X Jiang and G F Cooper

A bayesian spatio-temporal method for disease outbreak detectionJAMIA pages 462ndash471 2010

J S Kaustav Das

Detecting anomalous records in categorical datasetsIn Proc of the thirteenth ACM SIGKDD international conference on Knowledge discovery and data miningAug 2007

J Kawale S Chatterjee A Kumar S Liess M Steinbach and V Kumar

Anomaly construction in climate data Issues and challengesIn Proceedings of NASA Conference on Intelligent Data Understanding 2011

E M Knorr and R T Ng

Algorithms for mining distance-based outliers in large datasetsIn Proceedings of the 24rd International Conference on Very Large Data Bases pages 392ndash403 MorganKaufmann Publishers Inc 1998

H-P Kriegel P Kroger E Schubert and A Zimek

Interpreting and unifying outlier scoresIn SDM pages 13ndash24 2011

A Lakhina M Crovella and C Diot

Diagnosing network-wide traffic anomaliesIn Proceedings of ACM SIGCOMM pages 219ndash230 2004

X Li R Ramachandran S Graves S Movva B Akkiraju D Emmitt S Greco R Atlas J Terry and

J-C JusemAutomated detection of frontal systems from numerical model-generated dataIn Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in datamining KDD rsquo05 pages 782ndash787 New York NY USA 2005 ACM

Chawla and Chandola Anomaly Detection

Introduction Data Mining Methods Density-based anomaly Commute Time Approach Addressing Scalability Anomalies in Complex Data Evaluating Anomaly Detection Methods Applications Problem solutionsPCA and Subspace Method Subspace Method Algorithm Subspace Analysis Results Discussion Typical Characteristics of Anomaly Limitation of Volume Entropy Based versus Volume Based Fraud Detection Epidemeology Anomaly Detection in Climate and Weather Anomalies in Climate Validation of Anomalies

M. V. Mahoney and P. K. Chan.

Learning rules for anomaly detection of hostile network traffic. In ICDM, pages 601–604, 2003.

S. V. Mane.

False negative estimation: theory, techniques and applications. ProQuest, UMI Dissertation Publishing, 2011.

A. Moore and M. S. Lee.

Cached sufficient statistics for efficient machine learning with large datasets. J. Artif. Int. Res., 8:67–91, March 1998.

Y. Pei and O. Zaïane.

A synthetic data generator for clustering and outlier analysis. Technical report, University of Alberta, 2006.

S. Ramaswamy, R. Rastogi, and K. Shim.

Efficient algorithms for mining outliers from large data sets. In Proceedings of the 2000 ACM SIGMOD international conference on Management of data, pages 427–438. ACM Press, 2000.

N. Satish, M. Harris, and M. Garland.

Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1–10, 2009.

X. Song, M. Wu, C. Jermaine, and S. Ranka.

Conditional anomaly detection. IEEE Trans. on Knowl. and Data Eng., 19:631–645, May 2007.

D. A. Spielman and N. Srivastava.

Graph sparsification by effective resistances. In Proceedings of the 40th annual ACM symposium on Theory of computing, STOC '08, pages 563–568, New York, NY, USA, 2008. ACM.

P. Sun, S. Chawla, and B. Arunasalam.

Mining for outliers in sequential databases. In SIAM International Conference on Data Mining, 2006.

W.-K. Wong, A. Moore, G. Cooper, and M. Wagner.

Bayesian network anomaly pattern detection for disease outbreaks. In T. Fawcett and N. Mishra, editors, Proceedings of the Twentieth International Conference on Machine Learning, pages 808–815, Menlo Park, California, August 2003. AAAI Press.

L. Xiong, B. Póczos, and J. Schneider.

Group anomaly detection using flexible genre models. In NIPS, 2011.

Acknowledgements

Linsey Pang, Tara Babie, and Khoa Nguyen (University of Sydney)

Arindam Banerjee (University of Minnesota)
