Machine Learning applied to Security
Steve Poulson 25th Feb 2010
Security Threats – Drivers and Trends
[Figure: Complexity of Threat. Source: ScanSafe Threat Center]
• Time spent online: a larger user base to steal from
• Value of information transmitted: online banking, personal data theft
• Communication is fragmenting: alternative platforms are growing rapidly beyond email
  – Webmail, IM, RSS, wikis, VoIP
• New platforms are less mature and less well protected, so more vulnerable to attack
  – The more popular the application, the more attention it draws
• Hackers and cyber criminals are working faster
  – AV signatures are failing
  – Threats are becoming more complex
  – Attackers are looking for new vectors to exploit
“Email security is mature and successful; threats are migrating from inbox to browser”
Zero-Day Threats
Social Engineering
Email Viruses
Hybrid Worms
Spam
OS Vulnerabilities
Mobile Attacks
Identity Theft
Phishing
IM Threats
Web Viruses
DDoS Attacks
Spyware/Adware
Web Security: Trusted Sites Under Attack
Worldwide open web
Dynamic and dangerous internet
Over 127m active websites (Netcraft survey)
Graphics
Webmail
New Web Pages
Blogs
Ad Links
Links
Comments
Banner Ads
Backdoors
Rootkits
Trojan Horses
Keyloggers
Worms
Samsung site hijacked as malware host
Web Security: Risks of Unfiltered Content
Up to 40% of time spent online is non-business-related (IDC)
• Productivity
• Bandwidth congestion
• Legal liability threat
37% of users visited an X-rated web site from work (Gartner)
Web Filtering Blocks by Category (%)
The Facebook Effect
“32% of our customers now blocking social
networking sites, up from 18% last year”
Effortless Management
Manage Granular Policies
• Directory and custom grouping
• Web usage quotas
• Schedules
• 50+ URL categories
• 60+ content types
• Custom block/allow lists
• Email and browser alerts
Generate Reports
• Summary
• Scheduled
• Forensic audit
• Blocked/allowed traffic
Ease of Management + Unrivalled SLAs
Ease and speed of deployment
Management portal
Reporting across multiple locations
Database management built in
Dedicated expert 24x7x365 support
Zero maintenance
Automated continuous updates
No patching
1. Availability
Time our service is available to scan traffic
99.999% guaranteed availability
2. Latency
Additional load time attributable to services
Evaluated by 3rd party analysis
3. False Positives
Pages that were blocked but should not have been
False positive rate < 0.0004%
4. False Negatives
Pages that were not blocked but should have been
False negative rate < 0.0004%
The most comprehensive Service Level Agreements available for Web security
Proactive Security
Acceptable
Uncategorized
Prohibited
Malicious
1 in 5 searches yield Malware or Inappropriate Content
Over 90% of new sites are visited as the result of an Internet search
Trojan-Download.Win32.IstBar.jl Case Study
• Provides protection in the ‘zero-hour’
• Proactive threat detection
• The most effective scanner: sits at the heart of all web traffic and analyses the largest amount of web traffic
• Generates the most accurate heuristics in the fastest time
Outbreak Intelligence SearchAhead
Outbreak Intelligence
• Users are protected by several anti-virus engines at once
• However this is not sufficient due to the variety of exploits, and their ability to disguise themselves (polymorphism)
• Outbreak Intelligence harnesses machine-learning techniques and ScanSafe’s dataset to develop novel techniques to detect zero-hour attacks
• Uses advanced techniques such as code emulation
• However we must always meet our maximum false-positive rate of 1/250,000
  – Just 0.000004 (0.0004%)!
• Solutions must scale to millions of requests in real-time
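To make the false-positive constraint concrete, the following minimal sketch turns the 1/250,000 rate into a daily budget. It assumes the volume of 250 million requests per day quoted elsewhere in the talk; the function name `fp_budget` is illustrative and not part of any ScanSafe system.

```python
from fractions import Fraction

MAX_FP_RATE = Fraction(1, 250_000)   # maximum false-positive rate from the slide
REQUESTS_PER_DAY = 250_000_000       # daily volume quoted elsewhere in the talk


def fp_budget(requests: int, rate: Fraction = MAX_FP_RATE) -> int:
    """Largest whole number of false positives allowed for `requests`."""
    return int(requests * rate)


print(float(MAX_FP_RATE))            # 4e-06, i.e. 0.0004%
print(fp_budget(REQUESTS_PER_DAY))   # 1000
```

In other words, even at this very low rate the whole platform can tolerate only about a thousand wrongly blocked pages per day.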
Industrial Development Constraints
• Getting FP / FN right – customer expectation
• Deadlines :(
• Solutions must scale to 250 million requests per day (and growing)
  – Involves lots of approximations
  – Lookup tables in place of actual functions
  – Fast data structures
  – Constrains the choice of algorithms
    • E.g. neural nets or naïve Bayes instead of SVMs
Industrial Development Constraints
• Dataset is continually changing
  – As the nature of interests across the web
  – And the vectors targeted by attackers
  – Constantly change
• E.g. the latest QuickTime vulnerability targets in-request headers, bypassing virus detectors entirely
• Hence a preference for online models which can be continually updated, rather than those which have to be trained in batch.
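As an illustration of what "online" means here, the sketch below shows a Bernoulli naïve Bayes classifier whose counts are updated one sample at a time, so no batch retraining is needed. It is a generic textbook construction, not the production model; the class name, the feature names and the smoothing constant `alpha` are all assumptions.

```python
import math
from collections import defaultdict


class OnlineNaiveBayes:
    """Bernoulli naive Bayes with incrementally updated counts (illustrative)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                      # Laplace smoothing (assumed value)
        self.class_counts = defaultdict(int)    # samples seen per class
        self.feature_counts = defaultdict(lambda: defaultdict(int))
        self.vocab = set()

    def update(self, features, label):
        """Fold a single labelled sample into the model."""
        self.class_counts[label] += 1
        for f in set(features):
            self.feature_counts[label][f] += 1
            self.vocab.add(f)

    def log_scores(self, features):
        """Unnormalised log posterior for every class seen so far."""
        total = sum(self.class_counts.values())
        present = set(features)
        scores = {}
        for c, n_c in self.class_counts.items():
            s = math.log(n_c / total)
            for f in self.vocab:
                p = (self.feature_counts[c][f] + self.alpha) / (n_c + 2 * self.alpha)
                s += math.log(p if f in present else 1.0 - p)
            scores[c] = s
        return scores

    def predict(self, features):
        scores = self.log_scores(features)
        return max(scores, key=scores.get)


model = OnlineNaiveBayes()
model.update(["eval", "unescape"], "malicious")
model.update(["document", "getElementById"], "benign")
print(model.predict(["eval"]))  # malicious
```

Each call to `update` costs time proportional to the number of features in the sample, which is what makes this family of models attractive when the data distribution keeps shifting.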
Dataset
• Scan approx. 250 million web-requests every day
• From 45 different countries
• All traffic is logged for several months
• We can also archive traffic as it travels through our servers
  – Which means we can replay hacks several days after the event to investigate them
Techniques Employed
• Supervised Learning
  – Support Vector Machines for classification and anomaly detection
  – Some use of Neural Networks
  – Various probabilistic models such as Naïve Bayes variations
• Unsupervised Learning
  – HMMs and more complex variations thereof
  – Various clustering algorithms, MoG, KNNs
  – Dimensionality Reduction Algorithms (KPCA)
• Other
  – AdaBoost, mixtures of experts
• Disclaimer
  – Not all are used in end products, and unfortunately we cannot say which techniques are used in which applications.
Applications of Machine Learning
• Inappropriate Web Content
• Drive-by attacks (first step in an attack)
  – Malicious JavaScript and other scripts
  – Malicious Non-Executable Files
• Actual attacks
  – Malicious Executable Files
• Phishing
  – Use third-party databases
  – Use models that generate a probability of a phishing attack based on URL, request and time
• Reputation
  – Use the history of blocks for a URL, the probability of it being a phished URL, and other information to derive a prior probability of it hosting malware. This prior governs the decision model that generates actions from the results of the other classifiers.
Inappropriate Content
• Basically just document classification
• Want to stop bad sites by content – porn, hate, ...
• Good classifier: naïve Bayes – Multinomial, Bernoulli → Multinomial mixture model
• These have problems; in practice add IR techniques such as TF-IDF
• SVM approaches are better.
• Also topic-based – LSA / LDA
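The TF-IDF weighting mentioned above can be sketched in a few lines. This uses one common textbook variant (term frequency normalised by document length, times the log of the inverse document frequency); the talk does not say which variant is used, and the toy documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenised document."""
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weighted


docs = [
    ["free", "casino", "bonus"],
    ["casino", "news"],
    ["weather", "news"],
]
weights = tf_idf(docs)
# "free" occurs in one document only, so it outweighs the commoner "casino".
print(weights[0]["free"] > weights[0]["casino"])  # True
```

Terms that appear in every document receive weight zero, which is the point of the technique: they carry no information for telling categories apart.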
Malicious JavaScript
• Normal document classification works on the presence of “words” in files
• It’s also possible to encapsulate other information in models
  – E.g. Naïve Bayes classifiers for email use pseudo-words like “sender-tld:info”, “sender-tld:com” and “address-known:false”, “address-known:true” to improve accuracy
• We use similar methods with JavaScript
• We extract words (though not all words)
• And other features of interest
• And feed these to a model
Malicious JavaScript
• Complications arise due to the extreme use of obfuscation techniques by attackers
  – And also legitimate vendors (e.g. Google)
  – And by large Web 2.0 libraries
v46f658f5e2260(v46f658f5e3226){ function v46f658f5e4207 () {return 16;} return(parseInt(v46f658f5e3226,v46f658f5e4207()));}function v46f658f5e61f4(v46f658f5e7174){ function v46f658f5ea0cd () {return 2;} var v46f658f5e813e=\'\';for(v46f658f5e9105=0; v46f658f5e9105<v46f658f5e7174.length; v46f658f5e9105+=v46f658f5ea0cd()){ v46f658f5e813e+=(String.fromCharCode(v46f658f5e2260(v46f658f5e7174.substr(v46f658f5e9105, v46f658f5ea0cd()))));}return v46f658f5e813e;} document.write(v46f658f5e61f4(\'3C5343524950543E77696E646F772E7374617475733D2\'));
• The above is JavaScript, but where are the features?
  – An exercise for the reader!
function startAudioFile(){try {var mmed = document.createElement("object");mmed.setAttribute("classid", "clsid:77829F14-D911-40FF-A2F0-D11DB8D6D0BC");
var mms='';for(var i=0;i<4120;i+"\x0c\x0c\x0c\x0c""\x0c\x0c\x0c\x0c\x0c\x0c\x0c\x0c") { "\x0c\x0c\x0c\x0c"= "A"; }
setSpId(3);"\x0c\x0c\x0c\x0c"="\x0c\x0c\x0c\x0c";mmed.SetFormatLikeSample(mms);} catch(e) { }};
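Returning to the first obfuscated sample: its helper functions read the string argument two characters at a time, parse each pair as a base-16 number and turn it into a character before passing the result to document.write. The short sketch below does the same decoding offline. The function name is illustrative, and since the string on the slide ends in an unpaired digit, the sketch simply ignores that digit.

```python
def decode_hex_payload(payload: str) -> str:
    """Decode a string of two-digit hex character codes, as the sample does."""
    # Step through complete pairs only; a dangling final digit is dropped.
    pairs = [payload[i:i + 2] for i in range(0, len(payload) - 1, 2)]
    return "".join(chr(int(p, 16)) for p in pairs)


sample = "3C5343524950543E77696E646F772E7374617475733D2"
print(decode_hex_payload(sample))  # <SCRIPT>window.status=
```

The decoded text shows why static tokenising alone struggles: the interesting script only exists after the first layer has been executed or emulated.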
Generating features (canonicalizing)
Tokenize and count frequencies to construct a |V|-dimensional vector

Token            Value
foo              0
alert            0
createElement    1
document         1
mmed             1
var              1
try              1
startAudioFile   1
function         1
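A minimal sketch of this tokenise-and-count step, using the nine tokens shown on the slide as the vocabulary V. The regular expression is a simplifying assumption (it treats any identifier-like run as a token, including text inside string literals) and is not the canonicaliser used in production.

```python
import re
from collections import Counter

# Vocabulary taken from the slide; a real |V| would be far larger.
VOCAB = ["foo", "alert", "createElement", "document", "mmed",
         "var", "try", "startAudioFile", "function"]


def tokenize(source: str):
    """Split JavaScript source into identifier-like tokens (simplified)."""
    return re.findall(r"[A-Za-z_$][A-Za-z0-9_$]*", source)


def count_vector(source: str, vocab=VOCAB):
    """Frequency of each vocabulary token in `source`."""
    counts = Counter(tokenize(source))
    return [counts[t] for t in vocab]


def presence_vector(source: str, vocab=VOCAB):
    """Binary variant: 1 if the token occurs at all, else 0."""
    return [1 if c else 0 for c in count_vector(source, vocab)]


snippet = 'function startAudioFile(){try {var mmed = document.createElement("object");'
print(presence_vector(snippet))  # [0, 0, 1, 1, 1, 1, 1, 1, 1]
```

Either vector can be fed directly to a naïve Bayes or SVM classifier; the binary form matches the 0/1 values shown on the slide.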
Case Study: India Times Malware Cocktail
25 Oct 07 first malware detected for ScanSafe customers
STAT team investigating OI (Outbreak Intelligence) blocks from certain pages on the India Times website, ranked 483rd by traffic
Impacted pages contain script pointing to remote site containing more iframes pointing to two further sites.
One iframe points to an encrypted script which exploits multiple vulnerabilities.
Successful exploit results in massive download of malware and assorted other files - over 434 files.
Installed malware includes cocktail of downloader/dropper Trojans, malicious binaries, scripts, cookies, other non-binaries
STAT tested binaries through VirusTotal and overall detection among signature-based AV vendors is low.
India Times notified immediately by STAT to prevent further infection - ScanSafe customers continued to be protected
“This starts an automatic chain of exploit, and all of it was invisible to the user”
Mary Landesman, senior researcher
Malicious Non-Executable Files
• Almost no-one opens executables from odd sources any more
• So instead people use drive-by attacks
  – They serve a normal file (JavaScript, JPEG, QuickTime movie, animated cursor)
  – Which is crafted to exploit a vulnerability in a viewer (Internet Explorer, QuickTime, a system library that a viewer depends on)
  – Which causes code embedded within the file to be executed
  – Which then downloads
    • The actual executable
    • Or another program to download the main payload.
Malicious Non-Executable Files
• We’ve already covered JavaScript
• But there are a lot of file formats out there
• It’s not feasible to figure out the formats for all these files ourselves
  – So we have to write an application that can learn a file format
• In the case of zero-day attacks, we have no data to compare against
  – So we can’t just create and train a simple binary classifier
Malicious Non-Executable Files
• We’ll deal with the second element first
• If we can’t train a binary classifier
  – We have to train a unary classifier
• Basically this is anomaly detection
  – Already used in business to help detect fraud
  – Typically define (sometimes implicitly) a probability distribution over all possible data
    • And so generate a probability of a particular datum being “normal”
    • Use some decision function based on this probability to decide whether or not to block
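The unary-classifier idea can be sketched with the simplest possible density model: a smoothed byte-frequency distribution learned from normal files only, plus a threshold on the average log-probability. This is a deliberately crude stand-in for the models discussed later in the talk; the class name, the smoothing constant and the threshold value are all assumptions.

```python
import math
from collections import Counter


class ByteFrequencyModel:
    """Smoothed distribution over byte values, trained on normal data only."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha          # Laplace smoothing (assumed value)
        self.counts = Counter()
        self.total = 0

    def fit(self, samples):
        for data in samples:
            self.counts.update(data)    # iterating bytes yields ints 0..255
            self.total += len(data)
        return self

    def avg_log_prob(self, data: bytes) -> float:
        """Mean log-probability per byte; higher means more 'normal'."""
        denom = self.total + 256 * self.alpha
        lp = sum(math.log((self.counts[b] + self.alpha) / denom) for b in data)
        return lp / max(len(data), 1)


def should_block(model: ByteFrequencyModel, data: bytes, threshold: float) -> bool:
    """Decision function: block when the datum looks too improbable."""
    return model.avg_log_prob(data) < threshold


normal = [b"GIF89a\x00\x10\x00\x10"] * 20
model = ByteFrequencyModel().fit(normal)
print(should_block(model, b"GIF89a\x00\x10\x00\x10", threshold=-4.0))  # False
print(should_block(model, b"\x0c" * 64, threshold=-4.0))               # True
```

In practice the threshold would be tuned on held-out normal traffic so that the false-positive rate stays within the agreed limit.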
Malicious Non-Executable Files
• However we also have to automatically extract features from the file
  – Could use kernel methods (1-class SVM, bounding hypersphere)
  – But developing a kernel to capture the latent structure is not easy
  – And may be expensive to execute
• Could use probabilistic methods
  – HMMs are good for sequences
    • But poor at capturing long-range correlations
  – Algorithms exist for capturing grammars probabilistically
    • But are difficult to implement
    • And may also be expensive in terms of runtime.
• Another exercise for the reader!
Malicious Executable Files
• The final stage of an attack is downloading an executable
• Typically blocked using signatures
  – Effectively quite advanced regular expressions
• Virus writers now release several variations of their virus over its life-time
• And release viruses that change themselves as they propagate
Malicious Executable Files
• This all makes signature-based approaches increasingly infeasible
  – F-Secure now checks every file against 500,000 signatures
  – McAfee now checks every file against 360,000 signatures
• The rate at which variants of viruses come out is growing rapidly
  – The Storm worm launched separate Christmas and New Year versions of its attack within days of each other
• Vendors are struggling to develop techniques to detect variants using their existing technologies
  – But continuing to add separate signatures for each new variant is not feasible
Malicious Executable Files
• We seek to investigate machine learning techniques to look into this.
• Several approaches have been used in the past
  – Typically binary classifiers using existing virus samples
  – Techniques include decision trees, self-organising maps, naïve Bayes classifiers, neural networks, SVMs and others
  – Features are usually library includes, strings, or hex-sequences selected using information-theoretic techniques (e.g. information gain)
  – Some break the executable into a graph, where nodes correspond to blocks of code (most of which are identical between variations) and perform analysis on the graphs to determine similarity.
Malicious Executable Files
Windows Portable Executable (PE) is a rich format that starts with the magic number ‘MZ’, so it is easy to detect. This means we can quickly extract features without resorting to disassembly or flow graph construction. Some notable features:
• 60% of recent malware is obfuscated. We determined that if an executable is obfuscated, there is a greater than 95% probability that it is malware.
• An executable consists of sections, such as header, text, code and so on. There are generally fewer sections in malicious files than in non-malicious ones. In our analysis, more than 70% of the malware samples consisted of two or three sections, while more than 70% of non-malicious files consisted of four or five sections.
• Another notable feature relates to peculiarities in the executable structure – for example, some sections in the executable may not be aligned properly. In our analysis, more than 78% of malware revealed an anomaly in the executable structure, while only 5% of non-malicious samples had an anomaly in their structure. If an anomaly exists, there is a more than 93% chance that the sample is malicious.
• As part of our investigations we also calculated statistics relating to the importing of DLL files. For example, if an executable imports system32.dll, then the sample has a more than 77% chance of being malware and if it imports kernel32.dll, then the sample has a more than 67% chance of being malware.
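To illustrate how cheaply such structural features can be read, the sketch below checks the ‘MZ’ magic number and extracts the section count directly from the headers, following the published PE layout (the offset of the PE header is stored at 0x3C, and NumberOfSections is the second 16-bit field of the COFF header). The function name is illustrative, and the test image is a synthetic stub rather than a real executable.

```python
import struct


def pe_section_count(data: bytes):
    """Return NumberOfSections for a PE image, or None if it is not one."""
    if len(data) < 0x40 or data[:2] != b"MZ":
        return None
    # e_lfanew: little-endian offset of the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        return None
    if len(data) < pe_offset + 8:
        return None
    # COFF header: Machine (2 bytes), then NumberOfSections (2 bytes).
    (n_sections,) = struct.unpack_from("<H", data, pe_offset + 6)
    return n_sections


# Build a minimal synthetic header with three sections for demonstration.
stub = bytearray(0x80)
stub[0:2] = b"MZ"
struct.pack_into("<I", stub, 0x3C, 0x40)
stub[0x40:0x44] = b"PE\x00\x00"
struct.pack_into("<H", stub, 0x46, 3)
print(pe_section_count(bytes(stub)))  # 3
```

A feature such as "has two or three sections" can then be derived from this count and passed to the classifier alongside the other indicators listed above.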
Malicious Executable Files
As discussed, there are many classification algorithms at our disposal. Currently we are using the naive Bayes classification algorithm as it is both accurate and simple to implement. The simplified algorithm (assuming that there are only two classes: malware and non-malware) is given in Equation (1).
Where x = [x1, x2, ..., xn] is an array of selected features from an executable, P(c|x) is the a posteriori probability that the executable with feature set x is in class c, and P(x|c) is the probability of x occurring in class c.
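For reference, the standard naive Bayes rule, which Equation (1) presumably resembles, can be written with the symbols defined above as follows. The factorisation over individual features and the two-class expansion of P(x) are the usual textbook assumptions, not details confirmed by the slide.

```latex
\[
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)},
\qquad
P(x \mid c) = \prod_{i=1}^{n} P(x_i \mid c),
\]
\[
P(x) = \sum_{c' \in \{\text{malware},\ \text{non-malware}\}} P(x \mid c')\,P(c').
\]
```

An executable is labelled malware when P(malware | x) exceeds P(non-malware | x), or a stricter threshold if false positives must be kept down.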
Malicious Executable Files
We used one group of non-malware and 28 released malware groups that had been detected by our analysis team in recent months. Each group contained around 150 to 300 samples. We plotted the results of our experiment. A smooth, dashed curve shows the recognition
We are consistently getting more than 90% detection accuracy on malware. The FPR of our system is around 10%, and we are trying to reduce this by extracting new features and by developing a new feature selection algorithm.
Control flow Graph
Can be matched by a graph edit distance and nearest neighbour classifier – slow :(
Malicious Executable Files
Much like early attempts to classify email using naïve Bayes
  – Which concentrated only on text
  – Until someone thought to use the entire context of the email, such as when it was sent, from whom, the domain and TLD of the email address etc.
  – Which brings us to
Website Reputation Classifier
• Gather information from context
  – Time of request
  – TLD, domain of server
  – Type of URL (IP address, domain name)
  – Geographic location of server
  – Details of request (drive-bys may not simulate a browser)
  – Details of response (server may be misconfigured)
  – And any other information
• And use it to alter the prior probability of malware from the default 0.5
• Which may help control the FP rate.
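One way to picture "altering the prior" is the odds form of Bayes' rule: a classifier output computed under the default 0.5 prior is converted back to a likelihood ratio and then re-weighted by a reputation-derived prior. The sketch below shows that arithmetic only; how the reputation prior itself is estimated is not described in the talk, and the function name and example values are assumptions.

```python
def posterior_with_prior(p_default: float, prior: float) -> float:
    """Re-weight a malware probability that was computed under a 0.5 prior.

    Both arguments must lie strictly between 0 and 1.
    """
    # Under a 0.5 prior the prior odds are 1, so the posterior odds
    # equal the likelihood ratio.
    likelihood_ratio = p_default / (1.0 - p_default)
    odds = likelihood_ratio * prior / (1.0 - prior)
    return odds / (1.0 + odds)


# A content classifier says 0.9, but the site has an excellent reputation.
print(round(posterior_with_prior(0.9, 0.01), 4))  # 0.0833
# The same score on a site with a poor reputation.
print(round(posterior_with_prior(0.9, 0.60), 4))  # 0.931
```

The first case shows how a strong reputation can suppress a borderline detection, which is the mechanism by which the prior helps to control the false-positive rate.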
Any Questions?
Problem Overview
• Attackers no longer rely on users launching executables
• Rely on drive-by download techniques to launch an executable without user involvement
• Examples include
  – JavaScript exploiting browser vulnerabilities to launch remote executables
  – Website content (ANIs, WMFs, etc.) exploiting browser and / or operating system vulnerabilities to launch remote executables
Problem Overview
• Things to look out for
  – Buffer overflows: extraordinarily long field values
  – Integer overflows: value encoded in 4 bytes is very large
    • Hard to spot!
    • But could be found by the absence of leading zeros in e.g. 4-byte length fields
  – Exploit code
    • May not resemble expected data
    • However raw data in some formats (JPEG, MP3) may be relatively indistinguishable from machine code.
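The leading-zeros heuristic for integer overflows can be written down directly: a genuine 4-byte length in a small file header almost always has a zero most-significant byte, so a non-zero one is suspicious. In this sketch the field offset and byte order are assumptions supplied by the caller, since they depend on the file format.

```python
def suspicious_length_field(data: bytes, offset: int, byteorder: str = "big") -> bool:
    """Flag a 4-byte length field whose most significant byte is non-zero."""
    field = data[offset:offset + 4]
    if len(field) < 4:
        raise ValueError("length field runs past the end of the data")
    # Position of the most significant byte depends on the byte order.
    most_significant = field[0] if byteorder == "big" else field[3]
    return most_significant != 0


print(suspicious_length_field(b"\x00\x00\x01\x2c", 0))   # False (length 300)
print(suspicious_length_field(b"\xff\xff\xff\xf0", 0))   # True
```

A learned model would discover this regularity from data rather than have it hard-coded, but the rule shows what such a model needs to pick up.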
Problem Specification
• Examine first 300 or so bytes of file
• Detect if it’s normal
  – If not normal, it’s an exploit
• System should infer file-structure itself to determine normalcy
  – Infeasible for us to manually break down every file-format into individually interesting features
Anomaly Detection
• Techniques used in machine learning and statistics to detect “outliers”: data-points (such as file content) which aren’t probable (normal)
• Two broad approaches
  – Non-Probabilistic Discriminative Classifiers
    • Learn a function that spits out positive or negative depending on some version of the data
  – Probabilistic Generative Classifiers
    • Find a way of estimating the actual probability of the file being what it appears to be and use that to make a decision
Anomaly Detection :: Techniques
• Non-Probabilistic Classifiers
  – One-Class Support-Vector Machine (SVM) using Sequence Kernels [Trialed, not implemented]
• Probability Density Estimation (PDE)
  – Hidden Markov Model (HMM) [Implemented]
  – Hierarchical Hidden Markov Model (HHMM) [Not Implemented, Not Planned]
  – Factorial Hidden Markov Model (FHMM) [Not Implemented, but an avenue for future work]
Anomaly Detection :: Classifiers
• One-Class Support-Vector Machine (SVM)
  – If a binary (2-class) classifier draws a line between two classes
  – A unary classifier draws a circle around the data – everything outside the circle is weird.
• SVMs try to find the best place to place the line, and can work around errors in the dataset
• They store the line in terms of the inputs it crosses (the “support-vectors”)
• They minimise the number of support-vectors they have to store to represent this line.
• SVMs use “kernels” to find a way of representing the data such that it’s easy to figure out where to place the line
Anomaly Detection :: Classifiers
• Kernels can also be used to convert symbolic data (such as strings and sequences) into a tractable numeric form.
• Kernels can also be chained together to help figure out “where to put the line”
ÿØÿà..JFIF.....H.H..ÿá.§Exif..MM.*........
[10, 23, 34, 0, 0, 0, 23, 0, 23, …, 0, 0, 2, 1]
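One simple way to turn a byte sequence into numbers that a kernel method can use is to count its n-grams; the inner product of two such count vectors is known as the spectrum kernel. The sketch below is a generic illustration of that idea. The talk does not say which string kernel was trialled, so the choice of n-grams and the function names are assumptions.

```python
from collections import Counter


def ngram_counts(data: bytes, n: int = 2) -> Counter:
    """Sparse count vector of all contiguous n-byte substrings."""
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def spectrum_kernel(a: bytes, b: bytes, n: int = 2) -> int:
    """Inner product of the two n-gram count vectors."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    return sum(count * cb[gram] for gram, count in ca.items())


print(spectrum_kernel(b"abab", b"abab"))  # 5
print(spectrum_kernel(b"abab", b"cdcd"))  # 0
```

Two headers that share many byte patterns score highly, while unrelated data scores near zero, which is the similarity signal a one-class SVM needs.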
Anomaly Detection :: Classifiers
• In testing, performance (using a string kernel) was quite poor
  – Needed to store a large number of support vectors to remember where the line was
  – Only detected buffer overflows, not integer overflows
  – Couldn’t be re-trained on the go
• Arguably all these problems could be solved by a more complex kernel function
  – But that would increase run-time
Anomaly Detection: PDE
• Probability Density Estimation
• Return a probability for each file-header indicating how typical it is
• Approach implemented is a simple Hidden Markov Model (HMM), using various heuristics to help it fit the file-types.
• What is a HMM?
Anomaly Detection :: HMMs
• Implementation Issues:
  – How to jointly determine the probabilities of
    • Certain characters appearing in each stage
    • Moving from one stage to another for all stages
    • Answer is the Expectation Maximisation (EM) algorithm
  – How to figure out the structure of the model in advance
    • The “Structural Learning” problem is a major problem in machine learning
    • In our case we use heuristics based on the reg-exp idea.
  – Variable Length Sequences
    • Multiply result by constant multiple of probability of file size (normally distributed)
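Once the stage-to-stage and character probabilities have been learned, scoring a header is a matter of running the forward algorithm. The sketch below computes the log-likelihood of a byte sequence under a small discrete HMM and, working in log space, adds the log of a constant and the log-density of the file size under a normal distribution, which corresponds to the multiplication described above. The toy parameters and the names `log_const`, `size_mean` and `size_std` are assumptions; the learned model and its heuristics are not shown in the talk.

```python
import math


def hmm_log_likelihood(obs, start, trans, emit) -> float:
    """Scaled forward algorithm: log P(obs) under a discrete HMM.

    start[s], trans[r][s] and emit[s][symbol] are probabilities.
    """
    if not obs:
        return 0.0
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                emit[s][obs[t]] * sum(alpha[r] * trans[r][s] for r in range(n_states))
                for s in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")       # sequence impossible under the model
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
        log_like += math.log(scale)
    return log_like


def log_normal_pdf(x: float, mean: float, std: float) -> float:
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def header_score(obs, file_size, start, trans, emit,
                 size_mean, size_std, log_const=0.0) -> float:
    """Log of: HMM likelihood x constant x normal density of the file size."""
    return (hmm_log_likelihood(obs, start, trans, emit)
            + log_const + log_normal_pdf(file_size, size_mean, size_std))


# Toy two-stage model over a two-symbol alphabet.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
print(round(math.exp(hmm_log_likelihood([0, 1], start, trans, emit)), 3))  # 0.209
```

A header would then be flagged when its score falls below a threshold chosen to respect the false-positive limit.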