Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN
Eng. Ştefan-Iulian Handra, Prof. Dr. Eng. Horia Ciocârlie
SACI, May 2011
Contents
1. Introduction
2. Anomaly detection classical approaches
3. Filtering-and-refinement
4. Hybrid method
5. Experimental results
6. Conclusions and Further Development
7. Bibliography
1. Introduction
Anomaly detection: the process of finding individual objects that are different from the normal objects.
Applications: safety-critical systems, insurance, health, electronic and bank fraud detection, military surveillance of enemy activities, data mining.
2. Classical techniques
The Nearest Neighbor approach:
- calculates, for every instance in the data set, the distance to its k-th nearest neighbor
- sparse instances are considered anomalies, dense instances are considered normal
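The k-th-neighbor distance score above can be sketched as follows. This is an illustrative Java sketch, not the presentation's code; the class and method names are hypothetical:

```java
import java.util.Arrays;

public class KnnScore {
    // Euclidean distance between two feature vectors of equal length.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Distance from data[idx] to its k-th nearest neighbor (k >= 1).
    // A large value means the instance sits in a sparse region,
    // which this approach treats as anomalous.
    static double kthNeighborDistance(double[][] data, int idx, int k) {
        double[] dists = new double[data.length - 1];
        int j = 0;
        for (int i = 0; i < data.length; i++) {
            if (i != idx) dists[j++] = euclidean(data[idx], data[i]);
        }
        Arrays.sort(dists);
        return dists[k - 1];
    }
}
```

Ranking all instances by this score and flagging the top few gives the basic nearest-neighbor detector described above.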
The Density-based Local Outliers approach:
- assigns a local outlier factor that describes the degree to which an instance is an outlier relative to its local neighborhood
- the average density around the instance is compared with the average density around its nearest neighbors
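The density comparison can be illustrated with a deliberately simplified sketch (this is the density-ratio idea only, not the full LOF formula with reachability distances; all names are hypothetical):

```java
import java.util.Arrays;

public class LocalOutlier {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Indices of the k nearest neighbors of data[idx] (excluding idx).
    static int[] neighbors(double[][] data, int idx, int k) {
        Integer[] order = new Integer[data.length];
        for (int i = 0; i < data.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(dist(data[idx], data[a]),
                                                    dist(data[idx], data[b])));
        int[] nn = new int[k];
        for (int i = 0, j = 0; j < k; i++) {
            if (order[i] != idx) nn[j++] = order[i];
        }
        return nn;
    }

    // Local density: inverse of the mean distance to the k nearest neighbors.
    static double density(double[][] data, int idx, int k) {
        double sum = 0;
        for (int n : neighbors(data, idx, k)) sum += dist(data[idx], data[n]);
        return k / sum;
    }

    // Ratio of the neighbors' average density to the point's own density:
    // values near 1 mean "as dense as the neighborhood" (normal),
    // values well above 1 mean "much sparser" (outlier).
    static double outlierFactor(double[][] data, int idx, int k) {
        double avg = 0;
        for (int n : neighbors(data, idx, k)) avg += density(data, n, k);
        return (avg / k) / density(data, idx, k);
    }
}
```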
The DBSCAN algorithm:
- a well-known clustering algorithm
- based on the density-reachability and density-connectivity concepts
- does not assign every entry to a cluster; unassigned points are treated as noise
- weaknesses: lacks scalability and fast response capabilities
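A minimal DBSCAN sketch in Java makes the density concepts concrete: points with at least minPts neighbors within eps are cores, clusters grow from cores via density-reachability, and points reachable from no core stay labelled as noise. This is an illustrative implementation, not the WEKA one used later:

```java
import java.util.ArrayList;
import java.util.List;

public class Dbscan {
    static final int NOISE = -1, UNVISITED = 0;

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // All point indices within eps of data[p] (including p itself).
    static List<Integer> regionQuery(double[][] data, int p, double eps) {
        List<Integer> res = new ArrayList<>();
        for (int i = 0; i < data.length; i++)
            if (dist(data[p], data[i]) <= eps) res.add(i);
        return res;
    }

    // Returns a cluster id per point (1, 2, ...) or NOISE (-1).
    static int[] cluster(double[][] data, double eps, int minPts) {
        int[] label = new int[data.length]; // all UNVISITED (0)
        int c = 0;
        for (int p = 0; p < data.length; p++) {
            if (label[p] != UNVISITED) continue;
            List<Integer> seeds = regionQuery(data, p, eps);
            if (seeds.size() < minPts) { label[p] = NOISE; continue; } // not a core
            c++;
            label[p] = c;
            for (int i = 0; i < seeds.size(); i++) {
                int q = seeds.get(i);
                if (label[q] == NOISE) label[q] = c;   // noise becomes a border point
                if (label[q] != UNVISITED) continue;
                label[q] = c;
                List<Integer> qSeeds = regionQuery(data, q, eps);
                if (qSeeds.size() >= minPts) seeds.addAll(qSeeds); // q is also a core
            }
        }
        return label;
    }
}
```

The points left with the NOISE label are exactly the "entries not assigned to a cluster" that make DBSCAN usable for anomaly detection.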
The Random Forest approach:
- an ensemble of individual tree predictors
- each tree depends on the values of a random vector sampled independently, with the same distribution for all trees in the forest
- advantage: discovers patterns that the Euclidean distance does not
- weaknesses: requires labeled data and is slow to compute
3. Filtering-and-refinement
- classical methods focus on the normal instances when detecting anomalies
- F&R introduces a change of paradigm: it focuses on the anomalies rather than on the normal instances
- a two-stage approach
- Filtering stage: removes the majority of normal instances
- Refinement stage: examines the remaining data with different density-based measures
Advantages:
- saves most of the processing time, since only the data remaining after filtering is analyzed in the second stage
- flexible: can be combined with different density-based algorithms
Disadvantage: not yet widely tested in practice
4. Hybrid method
- a combination of Filtering-and-refinement and DBSCAN
- filtering stage: uses the average value
- refinement stage: uses DBSCAN
- JAVA routines implement the filtering stage
- WEKA performs the refinement stage
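A possible shape for the average-value filtering stage, assuming "average value" means the per-feature mean of the data set (an assumption; the slides do not spell this out): score each instance by its distance from the mean and pass only the most deviant fraction on to the DBSCAN refinement. All names and the keepRatio parameter are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AverageFilter {
    // Keep the ceil(keepRatio * n) instances farthest from the
    // per-feature mean; these are the anomaly candidates that the
    // refinement stage (DBSCAN) would then examine.
    static List<double[]> filter(double[][] data, double keepRatio) {
        int n = data.length, d = data[0].length;
        double[] mean = new double[d];
        for (double[] row : data)
            for (int j = 0; j < d; j++) mean[j] += row[j] / n;
        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            double s = 0;
            for (int j = 0; j < d; j++)
                s += (data[i][j] - mean[j]) * (data[i][j] - mean[j]);
            score[i] = Math.sqrt(s); // distance from the average instance
        }
        double[] sorted = score.clone();
        Arrays.sort(sorted);
        int keep = (int) Math.ceil(keepRatio * n);
        double cutoff = sorted[n - keep]; // keep scores >= cutoff
        List<double[]> candidates = new ArrayList<>();
        for (int i = 0; i < n; i++)
            if (score[i] >= cutoff) candidates.add(data[i]);
        return candidates;
    }
}
```

Setting keepRatio to ~0.15 or ~0.35 would correspond to removing ~85% (F&R1) or ~65% (F&R2) of the instances before refinement.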
Two separate implementations:
- F&R1: the filtering stage removes the largest possible percentage of normal instances (~85%)
- F&R2: the filtering stage removes a substantial percentage of normal instances (~65%)
- anomalies are generated automatically
- the data set is modeled in JAVA so that anomalies can be distinguished from normal instances
- 3 separate runs are compared (F&R1, F&R2, normal)
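One way to model such labelled synthetic anomalies, sketched here purely as an illustration (not the authors' generator; all names are hypothetical): append perturbed copies of random normal instances, pushed far outside the observed feature range, and record their indices so that detected anomalies can be validated afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class AnomalyInjector {
    // Appends 'count' heavily perturbed copies of random rows to the
    // data set and returns the indices of the injected anomalies.
    static List<Integer> inject(List<double[]> data, int count, long seed) {
        Random rnd = new Random(seed);
        List<Integer> anomalyIdx = new ArrayList<>();
        int normal = data.size();
        for (int a = 0; a < count; a++) {
            double[] base = data.get(rnd.nextInt(normal)).clone();
            for (int j = 0; j < base.length; j++)
                base[j] += (rnd.nextBoolean() ? 1 : -1) * (100 + rnd.nextDouble() * 100);
            anomalyIdx.add(data.size()); // index the new row will occupy
            data.add(base);
        }
        return anomalyIdx;
    }
}
```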
5. Experimental results
5.1. Data sets used
- 24 variations of data sets, each containing over 20,000 entries
- each data set consists of one letter column followed by 16 numeric feature columns describing that letter
- for each run, the generated anomalies are also stored in separate data sets to validate the anomaly detection
5.2. Results
[Chart: number of anomalies vs. anomalies discovered, per run (y-axis 0 to 120)]
[Chart: anomalies-discovered ratio (0 to 1) per run, for F&R1, F&R2 and the normal approach with 50, 100, 500 and 1000 injected anomalies]
- for F&R1 and F&R2, the most costly execution of the filtering stage took ~10 s
Approach   Best time (s)   Worst time (s)
F&R1       3               29
F&R2       8               156
Normal     908             1070
6. Conclusions and Further Development
6.1. Conclusions
- both F&R approaches are more accurate than the classical approach
- the F&R approach can also be applied to other clustering algorithms that do not assign every instance with unusual properties to a cluster
- overall, an enormous speed gain compared to the classical methods
- saves disk space and processing resources
- the hybrid method spends the majority of its time processing anomalies rather than normal instances
6.2. Further Development
- adapt the algorithm to different domains
- use the "filtered out" instances to train parallel neural networks
- experiment with a hybrid of the Random Forest predictor and the F&R approach
7. Bibliography
- Xiao Yu, Lu An Tang, Jiawei Han. "Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection." In Ninth IEEE International Conference on Data Mining, 2009.
- Liu F.T., Ting K.M., Zhou Z. "Isolation Forest." In ICDM'08, 2008.
- Shi T., Horvath S. "Unsupervised Learning with Random Forest Predictors." In J. Computational and Graphical Statistics, 2006.
- Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop, Junxin Zhang. "Real Time Data Mining-based Intrusion Detection." North Carolina State University at Raleigh, Department of Computer Science, Jan 2008.
SACI
Thank you for your attention!
May 2011