Anomaly Detection Systems
2/86
Contents
• Statistical methods
  – parametric
  – non-parametric (clustering)
• Systems with learning
3/86
Anomaly detection
• Establishes profiles of normal user/network behaviour.
• Compares actual behaviour to those profiles.
• Alerts if deviations from the normal are detected.
4/86
Anomaly detection
• Profiles are defined as sets of metrics: measures of particular aspects of user/network behaviour.
• Each metric is associated with a threshold or a permitted range of values.
5/86
Anomaly detection
• Anomaly detection depends on the assumption that users/networks exhibit predictable, consistent patterns of system usage.
• Adaptations to changes in behaviour over time are possible.
• The problem with anomaly detection
  – No set of metrics is rich enough to express all anomalous behaviour.
6/86
Statistical methods
• Statistical methods of anomaly detection are categorized as
  – Parametric methods
    • Assumptions are made about the underlying distribution of the data being analyzed.
  – Non-parametric methods
    • Involve non-parametric data classification techniques (cluster analysis).
7/86
Parametric methods
• The usual assumption is that the distributions of usage patterns are Gaussian:
$$ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - x_0)^2}{2\sigma^2}} $$
  where $x_0$ is the mean and $\sigma$ is the standard deviation.
8/86
Parametric methods
• Denning's model (the IDES intrusion-detection model).
  – Four statistical models may be included in the system:
    • Operational model
    • Mean and standard deviation model
    • Multivariate model
    • Markov process model.
  – Each model is suitable for a particular type of system metric.
9/86
Parametric methods
• Operational model
  – This model applies to metrics such as event counters for the number of password failures in a particular time interval.
  – The model compares the metric to a set threshold, triggering an anomaly when the metric exceeds the threshold value.
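As an illustration of the operational model, the sketch below counts events (for example, password failures) inside a sliding time window and raises a flag once the count exceeds a fixed threshold. The class name `OperationalModel`, the window length, and the threshold value are assumptions made for this example; the slides do not prescribe them.

```python
from collections import deque


class OperationalModel:
    """Flag a metric when the event count in a sliding window exceeds a threshold."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold            # hypothetical permitted count
        self.window_seconds = window_seconds  # hypothetical window length
        self.events = deque()                 # timestamps of recent events

    def observe(self, timestamp):
        """Record one event; return True if the windowed count exceeds the threshold."""
        self.events.append(timestamp)
        # Drop events that have fallen out of the time window.
        while self.events and timestamp - self.events[0] >= self.window_seconds:
            self.events.popleft()
        return len(self.events) > self.threshold


# Four failures within one minute trip the alarm; a later isolated failure does not.
model = OperationalModel(threshold=3, window_seconds=60)
alerts = [model.observe(t) for t in (0, 10, 20, 30, 200)]
```

The choice of threshold is the hard part in practice, a point the slides return to when listing the disadvantages of the parametric approach.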
10/86
Parametric methods
• Mean and standard deviation model (1)
  – A classical mean and standard deviation characterization of data.
  – The assumption is that all the analyzer knows about the system behaviour metrics is their means and standard deviations.
11/86
Parametric methods
• Mean and standard deviation model (2)
  – A new behaviour observation is defined to be abnormal if it falls outside a confidence interval.
  – This confidence interval is defined as ±d standard deviations from the mean for some parameter d (usually d = 3).
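A minimal sketch of this test follows, assuming the profile is simply a list of past observations of one metric. The function name `is_abnormal` and the sample values are made up for the example; the sample standard deviation is used here, which is one possible choice.

```python
import statistics


def is_abnormal(history, observation, d=3.0):
    """Return True if the observation lies more than d standard deviations from the mean."""
    mean = statistics.fmean(history)
    std = statistics.stdev(history)  # sample standard deviation (an assumption)
    return abs(observation - mean) > d * std


# Hypothetical CPU seconds used in past sessions of one user.
cpu_seconds = [10, 12, 11, 9, 10, 13, 11, 12]
typical = is_abnormal(cpu_seconds, 12)   # within mean ± 3 standard deviations
outlier = is_abnormal(cpu_seconds, 20)   # far outside the interval
```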
12/86
Parametric methods
• Mean and standard deviation model (3)
  – This characterization is applicable to event counters, interval timers, and resource measures (memory, CPU, etc.).
  – It is possible to assign weights to these computations, such that, for example, more recent data are assigned greater weights.
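One common way to give recent data greater weight is exponential decay. The slides do not say which weighting scheme is meant, so the sketch below is only one possible reading; the class name `DecayingProfile` and the smoothing factor `alpha` are assumptions.

```python
class DecayingProfile:
    """Exponentially weighted mean and variance: recent observations count more."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha  # hypothetical smoothing factor, 0 < alpha <= 1
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Fold one new observation into the running mean and variance."""
        if self.mean is None:
            self.mean = float(x)
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)


profile = DecayingProfile(alpha=0.5)
for value in (0, 10):
    profile.update(value)
```

A larger `alpha` makes the profile follow behaviour changes faster, at the cost of letting a slow, deliberate drift redefine what counts as normal.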
13/86
Parametric methods
• Multivariate model (1)
  – This is an extension of the mean and standard deviation model.
  – It is based on performing correlations among two or more metrics.
  – Instead of basing the detection of an anomaly strictly on one measure, one might base it on the correlation of that measure with another measure.
14/86
Parametric methods
• Multivariate model (2)
  – Example:
    • Instead of detecting an anomaly based solely on the observed length of a session, one might base it on the correlation of the length of the session with the number of CPU cycles utilized.
15/86
Parametric methods
• Markov process model (1)
  – Under this model, the system considers each different type of audit event as a state variable and uses a state transition matrix to characterize the transition frequencies between states (not the frequencies of the individual states/audit records).
16/86
Parametric methods
• Markov process model (2)
  – A new observation is defined as anomalous if its probability, as determined by the previous state and the corresponding value in the state transition matrix, is too low/high.
  – This allows the system to detect unusual command or event sequences, not just single events.
  – This introduces the concept of performing stateful analysis of event sequences (frequent episodes, etc.).
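The Markov process model can be sketched as follows: transition frequencies are estimated from a training sequence of audit event types, and a new transition is flagged when its estimated probability is below a cut-off. The event names, the function names, and the cut-off `min_prob` are hypothetical, and only the "too low" case is shown.

```python
from collections import defaultdict


def train_transitions(events):
    """Estimate transition probabilities between consecutive audit event types."""
    counts = defaultdict(lambda: defaultdict(int))
    for prev, cur in zip(events, events[1:]):
        counts[prev][cur] += 1
    probs = {}
    for prev, row in counts.items():
        total = sum(row.values())
        probs[prev] = {cur: n / total for cur, n in row.items()}
    return probs


def is_anomalous(probs, prev, cur, min_prob=0.05):
    """Flag a transition whose estimated probability is below min_prob."""
    return probs.get(prev, {}).get(cur, 0.0) < min_prob


training = ["login", "read", "read", "write", "logout",
            "login", "read", "write", "logout"]
matrix = train_transitions(training)

seen = is_anomalous(matrix, "read", "write")     # frequent transition
unseen = is_anomalous(matrix, "login", "write")  # never observed in training
```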
17/86
Parametric methods
• Example - NIDES (Next-generation Intrusion Detection Expert System) (1)
  – Developed by SRI (Stanford Research Institute) in the 1990s.
  – Measures various activity levels.
  – Combines these into a single “normality” measure and checks it against a threshold.
  – If the measure is above the threshold, the activity is considered abnormal.
18/86
Parametric methods
• Example - NIDES (2)
  – NIDES measures (1)
    • Intensity measures
      – An example would be the number of audit records (log entries) generated within a set time interval.
      – Several different time intervals are used in order to track short-, medium-, and long-term behaviour.
19/86
Parametric methods
• Example - NIDES (3)
  – NIDES measures (2)
    • Distribution measures
      – The overall distribution of the various audit records (log file entries) is tracked via histograms.
      – A difference measure is defined to determine how close a given short-term histogram is to “normal” behaviour.
20/86
Parametric methods
• Example - NIDES (4)
  – NIDES measures (3)
    • Categorical data
      – The names of files accessed or the names of remote computers accessed are examples of the categorical data used.
21/86
Parametric methods
• Example - NIDES (5)
  – NIDES measures (4)
    • Counting measures
      – These are numerical values that measure parameters such as the number of seconds of CPU time used.
      – They are generally taken over a fixed amount of time or over a specific event, such as a single login.
      – Thus, they are similar in character to intensity measures, although they measure a different kind of activity.
22/86
Parametric methods
• Example - NIDES (6)
  – The different measurements each define a statistic $S_j$.
  – These measurements are assumed (designed) to be appropriate (this includes normalization), and are combined to produce a $\chi^2$-like statistic:
$$ T^2 = \frac{1}{n} \sum_{j=1}^{n} S_j^2 $$
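Reading the statistic as the mean of the squared, already normalized measures $S_j$, it can be computed in a few lines. The score values and the threshold below are invented for illustration; they are not NIDES data.

```python
def t_squared(scores):
    """Mean of the squared normalized measures S_j (the chi-square-like T^2)."""
    return sum(s * s for s in scores) / len(scores)


THRESHOLD = 1.0  # hypothetical, would be set empirically

normal = t_squared([0.5, -0.3, 0.8, 0.1])      # all measures close to the profile
suspicious = t_squared([0.5, -0.3, 4.0, 0.1])  # one measure deviates strongly

normal_alert = normal > THRESHOLD
suspicious_alert = suspicious > THRESHOLD
```

Because the squares are averaged, a single strongly deviating measure is diluted by the others, which relates to the masking problem mentioned for single overall scores.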
23/86
Parametric methods
• Example - NIDES (7)
  – A more complicated measure would include the correlation between the events (as was done with IDES):
$$ IS = (S_1, \ldots, S_n) \, C^{-1} \, (S_1, \ldots, S_n)^T $$
  – Here, C is the correlation matrix between $S_i$ and $S_j$ for all i and j. IS is called the IDES score.
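As a worked illustration for n = 2, write $\rho$ for the correlation between $S_1$ and $S_2$ (the symbol $\rho$ is introduced here and does not appear in the slides). Then

$$ C = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad C^{-1} = \frac{1}{1-\rho^2} \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}, \qquad IS = \frac{S_1^2 - 2\rho S_1 S_2 + S_2^2}{1 - \rho^2}. $$

For $\rho = 0$ this reduces to $S_1^2 + S_2^2$, the plain sum of squares. For strongly correlated measures, two deviations in the same direction are penalized less than two deviations in opposite directions, which is the point of including the correlation.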
24/86
Parametric methods
• Example - NIDES (8)
  – NIDES compares recent activity with past activity, using a methodology that amounts to a sliding window on the past.
  – Thus it is designed to detect changes in activity and to adapt to new activity levels.
25/86
Parametric methods
• Example - NIDES (9)
  – NIDES intensity measures are, for example, counts of audit records per time unit.
  – This provides an overall activity level for the system.
  – These are updated continuously rather than recomputed at each time interval.
26/86
Parametric methods
• Example - NIDES (10)
  – Possible elements that can be monitored:
    • Average system load.
    • Number of active processes.
    • Number of e-mails received.
    • Different types of audit records (can be tracked separately).
27/86
Parametric methods
• Example - NIDES (11)
  – The obvious extension of the intensity measures idea is to track the different types of audit records.
  – This leads to a distribution (histogram) for the audit records.
28/86
Parametric methods
• Example - NIDES (12)
  – Similarly, one could track the sizes of e-mail messages received, or the types of files accessed.
  – These can be updated continuously.
  – Distributions are then compared by means of a squared error metric.
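A rough sketch of such a comparison is given below: a long-term and a short-term histogram over audit record types are built as relative frequencies and compared with a sum of squared differences. The record types and counts are invented, and the plain squared error used here is a simplification of whatever exact difference measure NIDES applies.

```python
from collections import Counter


def histogram(events, bins):
    """Relative frequency of each bin in a non-empty list of events."""
    counts = Counter(events)
    total = len(events)
    return [counts[b] / total for b in bins]


def squared_error(p, q):
    """Sum of squared differences between two histograms over the same bins."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


bins = ["login", "file_read", "file_write", "exec"]  # hypothetical record types
long_term = histogram(
    ["login"] * 10 + ["file_read"] * 60 + ["file_write"] * 20 + ["exec"] * 10, bins)
short_term = histogram(
    ["login"] * 1 + ["file_read"] * 2 + ["file_write"] * 1 + ["exec"] * 6, bins)

distance = squared_error(long_term, short_term)
```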
29/86
Parametric methods
• Example - NIDES (13)
  – Categorical measures can be, for example, the names of files accessed.
  – They are treated just like distributional measures.
  – Here each bin corresponds to a category, while with distributional measures a bin can correspond to a range of values.
  – The updates are still performed continuously.
30/86
Parametric methods
• Example - NIDES (14)
  – All the measures are combined in the $T^2$ statistic.
  – The value is compared with a threshold to determine if the activity is “abnormal”.
  – The threshold is usually set empirically, based on the observed network behaviour in some period of time.
31/86
Parametric methods
• Example - NIDES (15)
  – NIDES produces a single, overall measure of “normality”; upon an alert, the components that make up the statistic can be investigated further.
  – The problem with this is that an unusually low value for one statistic can mask a high one for another, so multifaceted measures are more useful.
32/86
Parametric methods
• Advantages of parametric approach (1)
  – Statistical anomaly detection using the parametric approach can reveal interesting, sometimes suspicious, activities that could lead to the discovery of security breaches.
  – Parametric statistical systems do not require the constant updates and maintenance that misuse detection systems do.
33/86
Parametric methods
• Advantages of parametric approach (2)
  – However, metrics must be well chosen, adequate for good discrimination, and well-adapted to changes in behaviour (that is, changes in behaviour must produce a consistent, noticeable change in the corresponding metrics).
34/86
Parametric methods
• Disadvantages of parametric approach (1)
  – Batch mode processing of audit records, which eliminates the capability to perform automated responses to block damage.
  – The memory and processing loads involved in using and maintaining the user/network profile knowledge base usually cause the system to lag behind audit record generation.
35/86
Parametric methods
• Disadvantages of parametric approach (2)
  – The nature of statistical analysis reduces the capability of taking into account the sequential relationships between events.
  – The exact order of the occurrence of events is not provided as an attribute in most of these systems.
36/86
Parametric methods
• Disadvantages of parametric approach (3)
  – Since many anomalies indicating attack depend on such sequential event relationships, this situation represents a serious limitation to the approach.
  – In cases when quantitative methods (Denning's operational model) are utilized, it is also difficult to select appropriate values for thresholds and ranges.
37/86
Parametric methods
• Disadvantages of parametric approach (4)
  – The false positive rates associated with statistical analysis systems are high, which sometimes leads to users ignoring or disabling the systems.
  – The false negative rates are also difficult to reduce in these systems.
38/86
Non-parametric methods
• One of the problems of parametric methods is that error rates are high when the assumptions about the distribution are incorrect.
• When researchers began collecting information about system usage patterns that included attributes such as system resource usage, the distributions were discovered not to be Gaussian.
39/86
Non-parametric methods
• Building the Gaussian distribution assumption into the measures therefore led to high error rates.
• A way of overcoming these problems is to utilize non-parametric techniques for performing anomaly detection.
40/86
Non-parametric methods
• Non-parametric approach
  – provides the capability of analyzing users with less predictable usage patterns
  – allows the system to take into account system measures that are not easily analyzed by parametric schemes.
41/86
Non-parametric methods
• The non-parametric approach involves non-parametric data classification techniques, specifically cluster analysis.
• In cluster analysis, large quantities of historical data are collected (a sample set) and organized into clusters according to some evaluation criteria.
42/86
Non-parametric methods
• Pre-processing is performed in which features associated with a particular event stream (often mapped to a specific user) are converted into a vector representation (for example, $X_i = [f_1, f_2, \ldots, f_n]$ in an n-dimensional space).
43/86
Non-parametric methods
• A clustering algorithm is used to group vectors into classes by behaviours
  – members of each class are as close as possible to each other
  – different classes are as far apart as possible.
44/86
Non-parametric methods
• In non-parametric statistical anomaly detection, the premise is that activity data, as expressed in terms of the features, fall into two distinct clusters:
  – a cluster indicating anomalous activity
  – a cluster indicating normal activity.
45/86
Non-parametric methods
• Clustering algorithms
  – algorithms that use simple distance measures to determine whether an object falls into a cluster
  – concept-based algorithms (more complex)
    • an object is “scored” according to a set of conditions and that score is used to determine membership in a particular cluster.
46/86
Non-parametric methods
• The advantages of non-parametric approaches include the capability of performing reliable reduction of event data (in the transformation of raw event data to vectors).
• This reduction may reach as much as two orders of magnitude compared to the classical approach that does not use vectors.
47/86
Non-parametric methods
• Other benefits are improvement in the speed of detection and improvement in accuracy over parametric statistical analysis.
• Disadvantages involve concerns that expanding features beyond resource usage would reduce the efficiency and the accuracy of the analysis.
48/86
Clustering in anomaly detection
• Formal definition:
  – Let P be a set of vectors, whose cardinality is m, and whose elements are $p_1, \ldots, p_m$, of dimensions $n_1, \ldots, n_m$, respectively.
  – The task: partition the set P, optimizing a partition criterion, into k subsets $P_1, \ldots, P_k$, such that the following holds:
$$ P = P_1 \cup P_2 \cup \cdots \cup P_k, \qquad P_i \cap P_j = \emptyset, \quad i, j \in \{1, 2, \ldots, k\}, \ i \neq j $$
49/86
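Clustering in anomaly detection
[Figure: block diagram of an anomaly detection system with the components Incoming traffic/logs, Data pre-processor, Activity data, Detection model(s), Detection algorithm, Alerts, Decision criteria, Alert filter, and Action/Report; a “Clustering!” callout marks where clustering is applied.]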
50/86
Clustering in anomaly detection
• Why should we do clustering instead of learning?
  – Labelling a large set of samples is often costly.
  – Very large data sets: train the system with a large amount of unlabelled data and then label with supervision, i.e. learning.
  – Track slow changes of patterns in time without supervision, which improves performance.
  – Smart feature extraction.
  – Initial exploratory data analysis.
51/86
Clustering in anomaly detection
• Appropriate cluster analysis algorithms (1)
  – Two main classes of clustering algorithms
    • Hierarchical
    • Non-hierarchical (partitional)
  – Hierarchical
    • Less efficient
    • More biased results in general
  – Non-hierarchical
    • Results often depend on the initial partition.
52/86
Clustering in anomaly detection
• Appropriate cluster analysis algorithms (2)
  – A trade-off between correctness and efficiency of the cluster analysis (CA) algorithm must be found in order to achieve real-time operation of an IDS.
  – The K-means algorithm could be a good candidate for implementation in an IDS.
53/86
Clustering in anomaly detection
• Appropriate cluster analysis algorithms (3)
  – An outline of the K-means algorithm
    1. Initialization: Randomly choose K vectors from the data set and make them initial cluster centers.
    2. Assignment: Assign each vector to its closest center.
    3. Updating: Replace each center with the mean of its members.
    4. Iteration: Repeat steps 2 and 3 until there is no more updating.
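The four steps of the outline can be sketched in plain Python as follows. The function names, the fixed random seed, the iteration cap, and the toy data are assumptions for the example; a production IDS would more likely rely on an optimized library.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(vectors, k, seed=0, max_iter=100):
    """Plain K-means; returns (centers, cluster index of each vector)."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)              # 1. Initialization
    assignment = None
    for _ in range(max_iter):
        new_assignment = [                        # 2. Assignment
            min(range(k), key=lambda c: squared_distance(v, centers[c]))
            for v in vectors
        ]
        if new_assignment == assignment:          # 4. Stop when nothing changes
            break
        assignment = new_assignment
        for c in range(k):                        # 3. Updating
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old center if a cluster became empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Two well-separated groups of hypothetical two-dimensional feature vectors.
data = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (9.0, 9.0), (9.2, 8.8), (8.9, 9.1)]
centers, labels = k_means(data, k=2)
```

Each pass over the data costs time proportional to the number of vectors times K, which is why the method scales to large audit data sets.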
54/86
Clustering in anomaly detection
• K-means algorithm
  – A local optimization algorithm (hill climbing).
  – Clustering depends on the initial centers, but this can be overcome in several ways.
  – Time complexity is linear in the number of input vectors, a major advantage over e.g. hierarchical methods.
55/86
Clustering in anomaly detection
• Problems to solve
  – Determine the number of clusters
  – Determine the appropriate distance measure
56/86
Clustering in anomaly detection
• Determine the number of clusters
  – 2 clusters if we want only to tell “abnormal” from “normal” behaviour.
  – More complex clustering evaluation algorithms should be used to detect the number of clusters at which the most compact and separated clusters are obtained.
  – Use hierarchical clustering + clustering evaluation algorithms (inefficient).
57/86
Clustering in anomaly detection
• Determine the appropriate distance measure (1)
  – It must be a metric:
    • $\forall a, b: \ d(a, b) \geq 0$
    • $\forall a, b: \ d(a, b) = 0 \Leftrightarrow a = b$
    • $\forall a, b: \ d(a, b) = d(b, a)$
    • $\forall a, b, c: \ d(a, c) \leq d(a, b) + d(b, c)$, i.e. the triangle inequality must hold.
58/86
Clustering in anomaly detection
• Determine the appropriate distance measure (2)
  – Typical metrics:
    • For equal length input vectors: the Minkowski metric.
    • For unequal length input vectors: the edit distance (which is also a metric).
59/86
Clustering in anomaly detection
• The Minkowski metric
$$ d(X, Y) = \left( \sum_{i=1}^{n} |x_i - y_i|^q \right)^{1/q} $$
• q=1, Manhattan (city block) distance
• q=2, Euclidean distance
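A direct transcription of the Minkowski metric, as a small sketch (the function name and the sample vectors are chosen for the example):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have equal length")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)


manhattan = minkowski((0, 0), (3, 4), q=1)  # city block distance
euclidean = minkowski((0, 0), (3, 4), q=2)  # straight-line distance
```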
60/86
Clustering in anomaly detection
• Edit distance
  – Elementary edit operations
    • Deletions
    • Insertions
    • Substitutions
  – Minimum number of elementary edit operations needed to transform one vector into another.
  – Computed recursively, by filling the matrix of partial edit distances (the edit distance matrix).
  – The definition can include constraints.
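The unconstrained edit distance with unit costs can be computed by filling the matrix of partial distances row by row, as sketched below. The event sequences are invented; constrained variants mentioned on the slide are not covered.

```python
def edit_distance(a, b):
    """Minimum number of deletions, insertions and substitutions turning a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i leading elements of a
    for j in range(cols):
        dist[0][j] = j  # insert all j leading elements of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution or match
            )
    return dist[-1][-1]


# Works on any sequences, e.g. event streams of unequal length.
session_a = ["open", "read", "write", "close"]
session_b = ["open", "write", "close"]
events_apart = edit_distance(session_a, session_b)
```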
61/86
Clustering in anomaly detection
• Labelling clusters
  – A way to determine which cluster contains normal instances and which contains attacks.
  – 1st assumption
    • Associate the label “normal” with the cluster of the greatest cardinality.
    • Fails with massive attacks, for example the SYN-flood attack.
    • Fails with KDD Cup data without filtering out the attacks.
62/86
Clustering in anomaly detection
• To label properly, we need to explore the structure of the clusters.
• The clustering quality criteria are used, combined with some characteristics of the clusters:
  – Silhouette index
  – Davies - Bouldin index
  – Dunn’s index
  – Clusters’ diameters
63/86
Clustering in anomaly detection
• Intra-cluster distance
  – The measure of compactness of a cluster (complete diameter, average diameter, centroid diameter, ...).
• Inter-cluster distance
  – The measure of separation between clusters (single linkage, complete linkage, average linkage, centroid linkage, ...).
64/86
Example – Davies - Bouldin
• Data set for clustering: $X = \{X_1, \ldots, X_N\}$
• Clustering into L clusters: $C = \{C_1, \ldots, C_L\}$
• Distance between the vectors $X_k$ and $X_l$: $d(X_k, X_l)$
65/86
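Example – Davies - Bouldin
• Davies - Bouldin index:
$$ DB(C) = \frac{1}{L} \sum_{i=1}^{L} \max_{i \neq j} \left\{ \frac{\Delta(C_i) + \Delta(C_j)}{\delta(C_i, C_j)} \right\} $$
• Inter-cluster distance: $\delta(C_i, C_j)$
• Intra-cluster distance: $\Delta(C_i)$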
66/86
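Example – Davies - Bouldin
• Intra-cluster distance – Centroid diameter
$$ \Delta(C_i) = 2 \left( \frac{\sum_{X_k \in C_i} d(X_k, s_{C_i})}{|C_i|} \right), \qquad s_{C_i} = \frac{1}{|C_i|} \sum_{X_k \in C_i} X_k $$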
67/86
Example – Davies - Bouldin
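• Inter-cluster distance – Centroid linkage
$$ \delta(C_i, C_j) = d(s_{C_i}, s_{C_j}), \qquad s_{C_i} = \frac{1}{|C_i|} \sum_{X_k \in C_i} X_k, \qquad s_{C_j} = \frac{1}{|C_j|} \sum_{X_k \in C_j} X_k $$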
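The Davies - Bouldin index with centroid diameter as the intra-cluster distance and centroid linkage as the inter-cluster distance can be sketched as below. The sketch assumes Euclidean distance for d, non-empty clusters, and distinct centroids; the function names and the toy clusters are made up for the example.

```python
import math


def centroid(cluster):
    """Component-wise mean of the vectors in a non-empty cluster."""
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))


def centroid_diameter(cluster):
    """Intra-cluster distance: twice the mean distance of members to the centroid."""
    s = centroid(cluster)
    return 2.0 * sum(math.dist(x, s) for x in cluster) / len(cluster)


def centroid_linkage(ci, cj):
    """Inter-cluster distance: distance between the two centroids."""
    return math.dist(centroid(ci), centroid(cj))


def davies_bouldin(clusters):
    """Average, over clusters, of the worst (diameter sum) / (linkage) ratio."""
    total = 0.0
    for i, ci in enumerate(clusters):
        total += max(
            (centroid_diameter(ci) + centroid_diameter(cj)) / centroid_linkage(ci, cj)
            for j, cj in enumerate(clusters) if j != i
        )
    return total / len(clusters)


c1 = [(0.0, 0.0), (0.0, 2.0)]
c2 = [(10.0, 0.0), (10.0, 2.0)]
db = davies_bouldin([c1, c2])
```

Small values indicate compact, well-separated clusters; the index shrinks as diameters shrink or as centroids move apart.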
68/86
Clustering in anomaly detection
• The clusters labelling algorithm (1)
  – Uses a combination of the Davies - Bouldin index of the clustering and the centroid diameters of the clusters.
  – Two clusters: “normal” and “abnormal”.
  – Main idea (1)
    • Attack vectors are often mutually very similar, if not identical.
    • Consequently, the attack cluster in the case of a massive attack is very compact.
69/86
Clustering in anomaly detection
• The clusters labelling algorithm (2)
  – Main idea (2)
    • The Davies - Bouldin index of such a clustering is either zero (non-attack cluster is empty) or very close to zero.
    • The expected value of the centroid diameter of the attack cluster is smaller than that of the non-attack cluster.
70/86
Clustering in anomaly detection
• The clusters labelling algorithm (3)
  – Main idea (3)
    • A small value of the Davies - Bouldin index indicates the existence of a massive attack.
    • A small value of the centroid diameter indicates the attack cluster.
71/86
Clustering in anomaly detection
• The clusters labelling algorithm (4)
  – Main idea (4)
    • A higher value of the Davies - Bouldin index indicates that no massive attack is taking place.
    • Then the attack cluster is expected to be less compact than the non-attack cluster, i.e. its centroid diameter is greater than that of the non-attack cluster (because non-massive attack vectors are very different in general).
    • In this case, even the cluster cardinality can be used for proper labelling.
72/86
Clustering in anomaly detection
• The clusters labelling algorithm (5)
  – Input:
    • A clustering C of N vectors into 2 clusters, C1 and C2; C1 is the “non-attack” cluster, labelled with “1”.
    • The Davies - Bouldin index threshold, ΔDB.
    • The centroid diameters difference thresholds, ΔCD1 and ΔCD2.
  – Output:
    • The input clustering, relabelled if the relabelling conditions are met.
73/86
Clustering in anomaly detection
• The clusters labelling algorithm (6)
    db = DaviesBouldinIndex(C);
    cd1 = CentroidDiameter(C1);
    cd2 = CentroidDiameter(C2);
    if (db == 0) && (cd2 == 0)
        Relabel(C);
    else if (db > DeltaDB) && (cd1 > (cd2 + DeltaCD1))
        Relabel(C);
    else if (db < DeltaDB) && ((cd1 + DeltaCD2) < cd2)
        Relabel(C);
74/86
Example – KDD cup data base
• Sample size: N=1000
• Number of clusters: 2

| Record No. | DB | CD1 | CD2 | Intrusion | Good labelling (K-means) | Relabel |
|---|---|---|---|---|---|---|
| 0-1000 | 1.13 | 32759.24 | 7108.57 | N | N | 2 |
| 5000-6000 | 0.96 | 4344.63 | 14158.54 | N | Y | 0 |
| 7000-8000 | 0.14 | 69.6 | 7096.34 | Y-376 | N | 3 |
| 8000-9000 | 0 | 25.19 | 0 | Y-1000 | N | 1 |
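The relabelling rule from the pseudocode can be replayed on the DB, CD1 and CD2 values of the table. The threshold values below are hypothetical, since the slides do not state the ones actually used, and reading the Relabel column as "which of the three conditions fired (0 = none)" is an interpretation rather than something the slide states.

```python
def relabel_condition(db, cd1, cd2, delta_db, delta_cd1, delta_cd2):
    """Return which relabelling condition fires (1, 2 or 3), or 0 if none does."""
    if db == 0 and cd2 == 0:
        return 1
    if db > delta_db and cd1 > cd2 + delta_cd1:
        return 2
    if db < delta_db and cd1 + delta_cd2 < cd2:
        return 3
    return 0


# Hypothetical thresholds, chosen only for illustration.
DELTA_DB, DELTA_CD1, DELTA_CD2 = 0.5, 1000.0, 1000.0

# (DB, CD1, CD2) for the four record ranges of the table.
rows = [
    (1.13, 32759.24, 7108.57),
    (0.96, 4344.63, 14158.54),
    (0.14, 69.6, 7096.34),
    (0.0, 25.19, 0.0),
]
fired = [relabel_condition(db, cd1, cd2, DELTA_DB, DELTA_CD1, DELTA_CD2)
         for db, cd1, cd2 in rows]
```

With these assumed thresholds the fired conditions come out as 2, 0, 3, 1, which matches the Relabel column under the interpretation above.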
75/86
Systems with learning
• Two phases of system operation:
  – The learning phase, in which the system is taught what normal behaviour is.
  – The recognition phase, in which the system classifies the input vectors according to the knowledge acquired in the learning process.
  – These systems also include a conversion of raw data into feature vectors.
76/86
Systems with learning
• Example: Neural networks (1)
  – Neural networks use adaptive learning techniques to characterize anomalous behaviour.
  – This analysis technique operates on historical sets of training data, which are presumably cleansed of any data indicating intrusions or other undesirable user behaviour.
77/86
Systems with learning
• Example: Neural networks (2)
  – Neural networks consist of numerous simple processing elements called neurons that interact by using weighted connections.
  – The knowledge of a neural network is encoded in the structure of the net in terms of connections between units and their weights.
  – The actual learning process takes place by changing weights and adding or removing connections.
78/86
Systems with learning
[Figure: A neural network]
79/86
Systems with learning
• Example: Neural networks (3)
  – Neural network processing involves two stages.
    • In the first stage, the network is populated by a training set of historical or other sample data that represent user behaviour.
    • In the second stage, the network accepts event data and compares them to historical behaviour references, determining similarities and differences.
80/86
Systems with learning
• Example: Neural networks (4)
  – The network indicates that an event is abnormal by changing the state of the units, changing the weights of connections, adding connections, or removing them.
  – The network also modifies its definition of what constitutes a normal event by performing stepwise corrections.
81/86
Systems with learning
• Example: Neural networks (5)
  – Neural networks make no prior assumptions about the expected statistical distribution of metrics.
  – As a consequence, this method retains some of the advantages over classical statistical analysis that are associated with non-parametric statistical techniques.
82/86
Systems with learning
• Example: Neural networks (6)
  – Among the problems associated with utilizing neural networks for intrusion detection is a tendency to form unstable configurations in which the network fails to learn certain things for no apparent reason.
83/86
Systems with learning
• Example: Neural networks (7)
  – The major drawback to utilizing neural networks for intrusion detection is that neural networks do not provide any explanation of the anomalies they find.
84/86
Systems with learning
• Example: Neural networks (8)
  – This lack of explanation prevents users from establishing accountability or otherwise addressing the roots of the security problems that allowed the detected intrusion.
  – This makes neural networks poorly suited to the needs of security managers.
85/86
Systems with learning
• General problems related to all systems with learning (1)
  – The problem with all learning-based approaches is that the effectiveness of the approach depends on the quality of the training data.
  – In learning-based systems, the training data must reflect normal activity for the users of the system.
86/86
Systems with learning
• General problems related to all systems with learning (2)
  – The training data may not be comprehensive enough to reflect all possible normal user behaviour patterns.
  – This weakness produces a large false positive error rate.
  – The error rate is high because an event that does not completely match the learned knowledge often, although not always, generates a false alarm.