Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN
Eng. Ştefan-Iulian Handra, Prof. Dr. Eng. Horia Ciocârlie
SACI, May 2011
Contents
1. Introduction
2. Anomaly detection classical approaches
3. Filtering-and-refinement
4. Hybrid method
5. Experimental results
6. Conclusions and Further Development
7. Bibliography
1. Introduction
Anomaly detection: the process of finding individual objects that are different from the normal objects.
Applications: safety-critical systems, insurance, health, electronic and bank fraud detection, military surveillance of enemy activities, data mining.
2. Classical techniques
The Nearest Neighbor approach:
- calculates, for every instance in the data set, the distance to its k-th nearest neighbor
- sparse instances are considered anomalies, dense instances are considered normal
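The k-th-neighbor distance score above can be sketched as follows. This is an illustrative Java sketch, not the presentation's code; the class and method names are hypothetical:

```java
import java.util.Arrays;

public class KnnScore {
    // Euclidean distance between two feature vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Distance from data[idx] to its k-th nearest neighbor (k >= 1).
    // A large value means the instance sits in a sparse region,
    // which this approach treats as anomalous.
    static double kthNeighborDistance(double[][] data, int idx, int k) {
        double[] dists = new double[data.length - 1];
        int j = 0;
        for (int i = 0; i < data.length; i++) {
            if (i != idx) dists[j++] = euclidean(data[idx], data[i]);
        }
        Arrays.sort(dists);
        return dists[k - 1];
    }
}
```

Ranking all instances by this score and flagging the top few gives the basic nearest-neighbor detector described above.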
The Density-based Local Outliers approach:
- assigns a local outlier factor that describes the degree to which an instance is an outlier relative to its local neighborhood
- the average density around the instance is compared with the average density around its nearest neighbors
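The density comparison can be illustrated with a deliberately simplified sketch (this is the density-ratio idea only, not the full LOF formula with reachability distances; all names are hypothetical):

```java
import java.util.Arrays;

public class LocalOutlier {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Indices of the k nearest neighbors of data[idx] (excluding idx).
    static int[] neighbors(double[][] data, int idx, int k) {
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(dist(data[idx], data[a]),
                                                    dist(data[idx], data[b])));
        int[] nn = new int[k];
        for (int i = 0, j = 0; j < k; i++) {
            if (order[i] != idx) nn[j++] = order[i];
        }
        return nn;
    }

    // Local density: inverse of the mean distance to the k nearest neighbors.
    static double density(double[][] data, int idx, int k) {
        double sum = 0;
        for (int n : neighbors(data, idx, k)) sum += dist(data[idx], data[n]);
        return k / sum;
    }

    // Ratio of the neighbors' average density to the point's own density:
    // values near 1 mean "as dense as the neighborhood" (normal),
    // values well above 1 mean "much sparser" (outlier).
    static double outlierFactor(double[][] data, int idx, int k) {
        double avg = 0;
        for (int n : neighbors(data, idx, k)) avg += density(data, n, k);
        return (avg / k) / density(data, idx, k);
    }
}
```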
The DBSCAN algorithm:
- a well-known clustering algorithm
- based on the density-reachability and density-connectivity concepts
- does not assign every entry to a cluster; unassigned points are treated as noise
- weaknesses: lacks scalability and fast response capabilities
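A minimal DBSCAN sketch in Java makes the density concepts concrete: points with at least minPts neighbors within eps are cores, clusters grow from cores via density-reachability, and points reachable from no core stay labelled as noise. This is an illustrative implementation, not the WEKA one used later:

```java
import java.util.ArrayList;
import java.util.List;

public class Dbscan {
    static final int NOISE = -1, UNVISITED = 0;

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // All point indices within eps of data[p] (including p itself).
    static List<Integer> regionQuery(double[][] data, int p, double eps) {
        List<Integer> res = new ArrayList<>();
        for (int i = 0; i < data.length; i++)
            if (dist(data[p], data[i]) <= eps) res.add(i);
        return res;
    }

    // Returns a cluster id per point (1, 2, ...) or NOISE (-1).
    static int[] cluster(double[][] data, double eps, int minPts) {
        int[] label = new int[data.length]; // all UNVISITED (0)
        int c = 0;
        for (int p = 0; p < data.length; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> seeds = regionQuery(data, p, eps);
            if (seeds.size() < minPts) { label[p] = NOISE; continue; } // not a core
            c++;
            label[p] = c;
            for (int i = 0; i < seeds.size(); i++) {
                int q = seeds.get(i);
                if (label[q] == NOISE) label[q] = c;   // noise becomes a border point
                if (label[q] != UNVISITED) continue;
                label[q] = c;
                List<Integer> qSeeds = regionQuery(data, q, eps);
                if (qSeeds.size() >= minPts) seeds.addAll(qSeeds); // q is also a core
            }
        }
        return label;
    }
}
```

The points left with the NOISE label are exactly the "entries not assigned to a cluster" that make DBSCAN usable for anomaly detection.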
The Random Forest approach:
- an ensemble of individual tree predictors
- each tree depends on the values of a random vector sampled independently, with the same distribution for all trees in the forest
- advantage: discovers patterns that the Euclidean distance does not
- weaknesses: requires labeled data and is slow to compute
3. Filtering-and-refinement
- classical methods focus on the normal instances when detecting anomalies
- F&R introduces a change of paradigm: it focuses on the anomalies rather than on the normal instances
- a two-stage approach
- Filtering stage: removes the majority of normal instances
- Refinement stage: examines the remaining data with different density-based measures
Advantages:
- saves most of the processing time, since only the data remaining after filtering is analyzed in the second stage
- flexible: can be combined with different density-based algorithms
Disadvantage: not yet widely tested in practice
4. Hybrid method
- a combination of Filtering-and-refinement and DBSCAN
- filtering stage: uses the average value
- refinement stage: uses DBSCAN
- JAVA routines implement the filtering stage
- WEKA performs the refinement stage
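A possible shape for the average-value filtering stage, assuming "average value" means the per-feature mean of the data set (an assumption; the slides do not spell this out): score each instance by its distance from the mean and pass only the most deviant fraction on to the DBSCAN refinement. All names and the keepRatio parameter are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AverageFilter {
    // Keep the ceil(keepRatio * n) instances farthest from the
    // per-feature mean; these are the anomaly candidates that the
    // refinement stage (DBSCAN) would then examine.
    static List<double[]> filter(double[][] data, double keepRatio) {
        int n = data.length, d = data[0].length;
        double[] mean = new double[d];
        for (double[] row : data)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            double s = 0;
            for (int j = 0; j < d; j++)
                s += (data[i][j] - mean[j]) * (data[i][j] - mean[j]);
            score[i] = Math.sqrt(s); // distance from the average instance
        }
        double[] sorted = score.clone();
        Arrays.sort(sorted);
        int keep = (int) Math.ceil(keepRatio * n);
        double cutoff = sorted[n - keep]; // keep scores >= cutoff
        List<double[]> candidates = new ArrayList<>();
        for (int i = 0; i < n; i++)
            if (score[i] >= cutoff) candidates.add(data[i]);
        return candidates;
    }
}
```

Setting keepRatio to ~0.15 or ~0.35 would correspond to removing ~85% (F&R1) or ~65% (F&R2) of the instances before refinement.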
Two separate implementations:
- F&R1: the filtering stage removes the largest possible percentage of normal instances (~85%)
- F&R2: the filtering stage removes a substantial percentage of normal instances (~65%)
- anomalies are generated automatically
- the data set is modeled in JAVA so that anomalies can be distinguished from normal instances
- 3 separate runs are compared (F&R1, F&R2, normal)
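One way to model such labelled synthetic anomalies, sketched here purely as an illustration (not the authors' generator; all names are hypothetical): append perturbed copies of random normal instances, pushed far outside the observed feature range, and record their indices so that detected anomalies can be validated afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class AnomalyInjector {
    // Appends 'count' heavily perturbed copies of random rows to the
    // data set and returns the indices of the injected anomalies.
    static List<Integer> inject(List<double[]> data, int count, long seed) {
        Random rnd = new Random(seed);
        List<Integer> anomalyIdx = new ArrayList<>();
        int normal = data.size();
        for (int a = 0; a < count; a++) {
            double[] base = data.get(rnd.nextInt(normal)).clone();
            for (int j = 0; j < base.length; j++)
                base[j] += (rnd.nextBoolean() ? 1 : -1) * (100 + rnd.nextDouble() * 100);
            anomalyIdx.add(data.size()); // index the new row will occupy
            data.add(base);
        }
        return anomalyIdx;
    }
}
```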
5. Experimental results
5.1. Data sets used
- 24 variations of data sets, each containing over 20,000 entries
- each data set consists of one letter column followed by 16 numeric feature columns describing that letter
- for each run, the generated anomalies are also stored in separate data sets to validate the anomaly detection
5.2. Results
[Chart: number of anomalies vs. anomalies discovered, per run (y-axis 0 to 120)]
[Chart: anomalies-discovered ratio (0 to 1) per run, for F&R1, F&R2 and the normal approach with 50, 100, 500 and 1000 injected anomalies]
- for F&R1 and F&R2, the most costly execution of the filtering stage took ~10 s
Approach   Best time (s)   Worst time (s)
F&R1       3               29
F&R2       8               156
Normal     908             1070
6. Conclusions and Further Development
6.1. Conclusions
- both F&R approaches are more accurate than the classical approach
- the F&R approach can also be applied to other clustering algorithms that do not assign every instance with unusual properties to a cluster
- overall, an enormous speed gain compared to the classical methods
- saves disk space and processing resources
- the hybrid method spends the majority of its time processing anomalies rather than normal instances
6.2. Further Development
- adapt the algorithm to different domains
- use the "filtered out" instances to train parallel neural networks
- experiment with a hybrid of the Random Forest predictor and the F&R approach
7. Bibliography
- Xiao Yu, Lu An Tang, Jiawei Han. "Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection." In Ninth IEEE International Conference on Data Mining, 2009.
- Liu F.T., Ting K.M., Zhou Z. "Isolation Forest." In ICDM'08, 2008.
- Shi T., Horvath S. "Unsupervised Learning with Random Forest Predictors." In J. Computational and Graphical Statistics, 2006.
- Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop, Junxin Zhang. "Real Time Data Mining-based Intrusion Detection." North Carolina State University at Raleigh, Department of Computer Science, Jan 2008.
SACI
Thank you for your attention!
May 2011