ANOMALY DETECTION ON MACHINE LOG
Data Mining
Prof. Sunnie S Chung
Ankur Pandit | 2619650
Raw Data:
- NASA HTTP access logs: two months of all HTTP requests to the NASA Kennedy Space Center WWW server in Florida.
Format:
The logs are an ASCII file with one line per request, with the following columns:
1. Host making the request: a hostname when possible, otherwise the Internet address if the name could not be looked up.
2. Timestamp, described as "DAY MON DD HH:MM:SS YYYY", where DAY is the day of the week, MON is the name of the month, DD is the day of the month, HH:MM:SS is the time of day using a 24-hour clock, and YYYY is the year. The timezone is -0400. Note that the sample lines below use the bracketed form [DD/MON/YYYY:HH:MM:SS -0400].
3. Request, given in quotes.
4. HTTP reply code.
5. Bytes in the reply.
199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245
unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
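The column layout above can be sketched as a small parser. This is a hypothetical helper, not the converter used in this project: the class name AccessLogParser, the regular expression, and the handling of a "-" byte count are all assumptions made for illustration, and the timestamp pattern follows the bracketed form seen in the sample lines.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits one access-log line into typed fields.
final class AccessLogParser {

    // host, two unused "-" fields, [timestamp], "request", reply code, bytes
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"(.*)\" (\\d{3}) (\\d+|-)$");

    // Matches the bracketed form in the samples, e.g. 01/Jul/1995:00:00:01 -0400
    private static final DateTimeFormatter STAMP =
            DateTimeFormatter.ofPattern("dd/MMM/uuuu:HH:mm:ss Z", Locale.ENGLISH);

    static final class Entry {
        final String host;
        final OffsetDateTime time;
        final String request;
        final int status;
        final long bytes;

        Entry(String host, OffsetDateTime time, String request, int status, long bytes) {
            this.host = host;
            this.time = time;
            this.request = request;
            this.status = status;
            this.bytes = bytes;
        }
    }

    // Returns null when the line does not match the expected layout.
    static Entry parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        // Assumption: a "-" byte count means no size was recorded; store it as 0.
        long bytes = "-".equals(m.group(5)) ? 0L : Long.parseLong(m.group(5));
        return new Entry(m.group(1), OffsetDateTime.parse(m.group(2), STAMP),
                m.group(3), Integer.parseInt(m.group(4)), bytes);
    }

    public static void main(String[] args) {
        Entry e = parse("199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "
                + "\"GET /history/apollo/ HTTP/1.0\" 200 6245");
        System.out.println(e.host + " " + e.time + " " + e.status + " " + e.bytes);
    }
}
```

Keeping the fields typed (timestamp, reply code, byte count) makes it possible to relate a suspicious byte count back to the host and time of the request later on.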
Total Number of Records: 1.8 Million
Data Cleaning:
- For convenience, the space-separated logs were converted into a CSV file.
- A simple Java program was used for the conversion (link in the References section).
- The program removed these special characters:
  o double quotes (")
  o comma (,)
  o square brackets ([])
- Example rows after conversion:
  199.72.81.55,-,-,01/Jul/1995:00:00:01,-0400,GET,/history/apollo/,HTTP/1.0,200,6245
  unicomp6.unicomp.net,-,-,01/Jul/1995:00:00:06,-0400,GET,/shuttle/countdown/,HTTP/1.0,200,3985
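The cleaning step can be approximated in a few lines. This is a minimal sketch under the rules listed above (strip quotes, commas and square brackets, then turn spaces into commas); the class LogToCsv and the method toCsvRow are hypothetical names, and the actual converter is the one linked in the References.

```java
// Hypothetical re-creation of the cleaning step, not the project's converter.
final class LogToCsv {

    // Strips double quotes, commas and square brackets, then joins the
    // whitespace-separated tokens with commas.
    static String toCsvRow(String logLine) {
        String cleaned = logLine.replaceAll("[\"\\[\\],]", "");
        return String.join(",", cleaned.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        System.out.println(toCsvRow("unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "
                + "\"GET /shuttle/countdown/ HTTP/1.0\" 200 3985"));
        // prints: unicomp6.unicomp.net,-,-,01/Jul/1995:00:00:06,-0400,GET,/shuttle/countdown/,HTTP/1.0,200,3985
    }
}
```

Because every space becomes a column separator, a request whose quoted part has more or fewer than three tokens would produce a row with a different number of columns; this sketch does not guard against that.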
Importing data in R:
- Set the working directory first using the setwd() command.
- Import the CSV data using read.csv().
- Set header = TRUE so that columns can be accessed by name (this assumes the first row of the CSV holds the column names).
Outlier Detection:
- Once the data is imported, we can start detecting outliers.
- Cluster plot for the entire imported data:
- clusplot(data, data$col10, color=TRUE, shade=TRUE, labels=2, lines=0)
- Cluster plot for sample data containing only two columns: IP address and number of bytes received.
- These graphs show that some outliers are present, but not exactly which points they are, so dedicated algorithms must be applied to find the outliers.
Grubbs test:
- grubbs.test checks whether the sample data set contains one outlier.
- The test calculates an outlier score G (the suspected outlier minus the mean, divided by the standard deviation) and compares it to the appropriate critical value.
- Usage: grubbs.test(<data_set_name>)
- Expects a numeric vector as input.
- To test the highest and lowest values together, as two outliers on opposite tails:
- Usage: grubbs.test(<data_set_name>, type=11)
- A third variant (type=20, two outliers on the same tail) is also available, but it can be used only when the data set is small (3 to 30 rows).
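To make the score G concrete, here is a minimal sketch that computes it by hand. It assumes the sample standard deviation (dividing by n - 1) and covers only the one-outlier statistic; the comparison against a critical value is left to grubbs.test. The class name GrubbsStatistic is hypothetical.

```java
// Hypothetical helper: computes the Grubbs score G for a numeric sample.
final class GrubbsStatistic {

    static double mean(double[] x) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        return sum / x.length;
    }

    // Sample standard deviation (divides by n - 1); an assumption of this sketch.
    static double sampleSd(double[] x) {
        double m = mean(x);
        double ss = 0.0;
        for (double v : x) {
            ss += (v - m) * (v - m);
        }
        return Math.sqrt(ss / (x.length - 1));
    }

    // G = max |x_i - mean| / sd, i.e. the score of the most extreme value.
    static double g(double[] x) {
        double m = mean(x);
        double maxDev = 0.0;
        for (double v : x) {
            maxDev = Math.max(maxDev, Math.abs(v - m));
        }
        return maxDev / sampleSd(x);
    }

    public static void main(String[] args) {
        double[] bytes = {1, 2, 3, 4, 100};
        System.out.println("G = " + g(bytes));
    }
}
```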
Chi Square Test:
- chisq.out.test performs a simple test for one outlier, based on the chi-squared distribution of squared differences between the data and the sample mean.
- Usage: chisq.out.test(<data_set_name>) tests the value farthest from the sample mean (the highest value for this data).
- Usage: chisq.out.test(<data_set_name>, opposite=TRUE) tests the value at the opposite extreme (the lowest value for this data).
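As a formula, the score behind this test can be written as follows. The symbols are our own notation, not the package's: x_out is the suspected value, x-bar the sample mean, and sigma squared the variance, assumed here to be either known or estimated from the sample.

```latex
\chi^2 = \frac{(x_{\mathrm{out}} - \bar{x})^2}{\sigma^2}
```

If the data are normal and the variance is known, a single squared standardized deviation like this follows a chi-squared distribution with one degree of freedom, which is what the score is compared against.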
Outlier Test:
- outlier() finds the value with the largest difference from the sample mean, which may be an outlier.
- Usage: outlier(<data_set_name>) returns the value farthest from the mean (the highest value for this data).
- Usage: outlier(<data_set_name>, opposite=TRUE) returns the value at the opposite extreme (the lowest value for this data).
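The selection rule is simple enough to sketch directly. This is an illustration of the idea only, with a hypothetical class name and signature: it picks whichever extreme (minimum or maximum) lies farther from the mean, and the opposite flag flips that choice.

```java
// Hypothetical helper: returns the extreme value farthest from the mean.
final class MostDistantValue {

    static double outlier(double[] x, boolean opposite) {
        double sum = 0.0;
        double min = x[0];
        double max = x[0];
        for (double v : x) {
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double mean = sum / x.length;
        boolean minIsFarther = (max - mean) < (mean - min);
        // XOR flips the choice when the caller asks for the opposite extreme.
        return (minIsFarther ^ opposite) ? min : max;
    }

    public static void main(String[] args) {
        double[] bytes = {1, 2, 3, 4, 100};
        System.out.println(outlier(bytes, false)); // prints 100.0
        System.out.println(outlier(bytes, true));  // prints 1.0
    }
}
```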
Limitations:
- Does not work well with complex data sets (more than two columns).
- Other information about a flagged request is lost, such as the requester's IP, the resource accessed, and the date and time of the request.
- Problems with large data sets.
- Simply calling a packaged algorithm teaches us nothing about how it works and gives us less control over the output.
Using Custom Java Program:
- Uses the z-score to detect outliers.
- Computes the difference between each value and the mean of the data set.
- Compares that difference with the standard deviation to find the outliers.
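The three steps above can be sketched as follows. This is not the project's program (that one is linked in the References); the class name ZScoreOutliers, the use of the sample standard deviation, and the caller-supplied threshold are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical z-score detector: flags rows whose value lies more than
// `threshold` standard deviations from the mean.
final class ZScoreOutliers {

    // Returns the indices (row numbers) of the flagged values.
    static List<Integer> findOutliers(double[] x, double threshold) {
        double sum = 0.0;
        for (double v : x) {
            sum += v;
        }
        double mean = sum / x.length;

        double ss = 0.0;
        for (double v : x) {
            ss += (v - mean) * (v - mean);
        }
        // Assumption: sample standard deviation (divides by n - 1).
        double sd = Math.sqrt(ss / (x.length - 1));

        List<Integer> flagged = new ArrayList<>();
        if (sd == 0.0) {
            return flagged; // all values identical, nothing to flag
        }
        for (int i = 0; i < x.length; i++) {
            if (Math.abs(x[i] - mean) / sd > threshold) {
                flagged.add(i);
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        double[] bytes = {1, 2, 3, 4, 100};
        System.out.println(findOutliers(bytes, 1.5)); // prints [4]
    }
}
```

Returning row indices rather than values keeps the link back to the full record, so the host, resource, and timestamp of a flagged request can still be looked up. Note that in a sample of n values no z-score computed this way can exceed (n - 1) / sqrt(n), so a common threshold such as 3 only becomes reachable with larger samples.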
Output of Program:
Lessons Learned:
- Data mining pipeline: data gathering, preprocessing and analysis.
- Various outlier detection techniques and algorithms.
- Using R for outlier detection.
- Implementing an outlier detection algorithm.
Thank you
References:
1. http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
2. https://github.com/Ankur-Pandit/CSVConverter
3. https://cran.r-project.org/web/packages/outliers/outliers.pdf
4. https://www.siam.org/meetings/sdm10/tutorial3.pdf
5. https://github.com/Ankur-Pandit/OutlierDetection