Security Data Science
Prof. Tudor Dumitraș, Assistant Professor, ECE, University of Maryland, College Park
https://www.facebook.com/SDSAtUMD
Introducing Your Guest Lecturer
Tudor Dumitraș Office: AVW 3425 Email: [email protected]
My Background
• Ph.D. at Carnegie Mellon University – Research in distributed systems and fault-tolerant middleware
• Worked at Symantec Research Labs – Built WINE platform for Big Data experiments in security – WINE currently used by academic researchers and Symantec engineers
• Joined UMD faculty
• Research and teaching on applied security and systems – Focus on solving security problems with data analysis techniques
We Are Swimming in Data
• Data created/reproduced in 2010: 1,200 exabytes
• Data collected to find the Higgs boson: 1 gigabyte/s
• Yahoo: 200 petabytes across 20 clusters
• Security: – Global spam in 2011: 62 billion / day
– Malware variants created in 2011: 403 million
Why So Much Data?
• We can store it – 6¢/GB to 29¢/GB (SAS HDD)
• We can generate it – Most data is machine-generated – Most malware samples are variants of other malware, generated automatically (repacking, obfuscation)
What to do with all this data?
Three Stories about Data
WHAT QUESTIONS TO ASK ON A FIRST DATE? The Power of Big Data
If You Want to Know … Do my date and I have long-term potential?
Q Do you like horror movies?
Q Have you ever traveled around another country alone?
Q Wouldn't it be fun to chuck it all and go live on a sailboat?
[Chart: likelihood of coincidence, computed from 275,000 user-submitted questions and 34,260 real-world couples: 3.7×]
Data Psychology
… ask:
Top 3 user-rated questions, about: • God • Sex • Smoking
Source: CNN Money
• eHarmony – Analyzes hundreds of behavioral variables, most collected automatically
– CTO: former search engineer at Yahoo!
• OkCupid – "We do math to get you dates" – Founded by Harvard math & CS majors
• PlentyOfFish – "Building this matching system was harder than [being] cited in the paper that won the Fields Medal"
Online Dating and Big Data
Early 1900s: Most Factories Had Private Generators
Source: Nicholas Carr
Electricity was critical for business, but not widely available
[Chart, source: OkCupid: "Is he an engineer?" vs. "Does she date engineers?"]
Data analytics provide remarkable insight
Applications in many disciplines
What Is Data Science?
• Also known as … Big Data analytics … Machine intelligence … Data-intensive computing
… Data wrangling … Data munging … Data jujitsu
Source: Drew Conway
IMPROVING MACHINE TRANSLATION The Unreasonable Effectiveness of Data
2005 NIST Machine Translation Competition
• Google’s first entry – None of the engineers spoke Arabic
• Simple statistical approach
• Trained using United Nations documents – 200 million translated words
– 1 trillion monolingual words
English-Arabic competition
"For many hard problems there appears to be a threshold of sufficient data." (A. Halevy et al., CACM 2009)
Challenges for Dealing with Big Data
• Big Data is hard to move around
• Engineers must grasp parallel processing techniques – To access 1 TB in 1 min, must distribute data over 20 disks (see the sketch after the latency table below)
– MapReduce? Parallel DB? Dryad? Pregel? OpenMPI? PRAM?
• Engineers must understand how to interpret data correctly
Read 1 MB sequentially from main memory: 150 µs
Send 1 MB over 10 Gbps switch: 1,000 µs
Read 1 MB from 15K RPM disk: 1,000 µs
Compress 1 MB with a fast algorithm (e.g., QuickLZ, Snappy): 3,000 µs
Send 1 MB across a datacenter: 100,000 µs
Send 1 MB from a France datacenter to Los Angeles: 9,000,000 µs
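A minimal back-of-the-envelope sketch in Python of the 20-disk claim, using the approximate per-disk read rate from the table above; the numbers are illustrative, not a benchmark:

```python
# Disks needed to scan 1 TB in 1 minute, assuming the table's figure
# of ~1,000 µs to read 1 MB from disk (i.e., ~1 GB/s per disk).
MB, TB = 10**6, 10**12
read_1mb_s = 1_000e-6                # seconds per MB, from the table above
disk_rate = MB / read_1mb_s          # ~1 GB/s per disk

aggregate_rate = TB / 60             # ~16.7 GB/s needed to finish in 1 minute
disks = aggregate_rate / disk_rate   # ~17, i.e., on the order of 20 disks
print(f"~{disks:.0f} disks")
```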
Processing Data in Parallel
• How big is ‘Big Data’? (data volume) – Real answer: it depends – When your manager asks:
• Parallelism does not reduce asymptotic complexity – O(N log N) algorithm is still O(N log N) when run in parallel on K machines
– But the constants are divided by K (and can have K > 1000)
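A rough way to state this point (a sketch only: c is an implementation-dependent constant, and the model assumes perfect load balance and ignores communication overhead):

```latex
T_1(N) = c\,N\log N,
\qquad
T_K(N) \approx \frac{c\,N\log N}{K} = O(N\log N) \quad \text{for any fixed } K
```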
– Relational DB (MySQL, Postgres, etc.): up to ~1 TB
– Single-node parallel DB: ~5-8 TB
– Distributed system (parallel DB, MapReduce): ~10-20 TB
Data Collection Rate
• Sometimes the data collection rate is too high (data velocity) – It may be too expensive to store all the data
– The latency of data processing may not support interactivity
• Example: There are 600 million collisions per second in the Large Hadron Collider at CERN – This would amount to collecting ~1 PB/s (David Foster, CERN)
– They only record one in 10^13 (ten trillion) collisions (~100 MB/s to 1 GB/s)
• Techniques for dealing with data velocity – Sampling (as in the LHC; see the sketch after this list)
– Stream processing
– Compression (e.g., Snappy, QuickLZ, RLE) • In some cases operating on lightly compressed data reduces latency!
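A minimal sketch of the sampling technique (standard reservoir sampling; the event stream below is a hypothetical stand-in for a high-velocity source such as collision or telemetry records): keep a fixed-size, uniformly random subset of an unbounded stream without ever storing the whole stream.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a random slot with probability k/(i+1); this keeps
            # every item seen so far equally likely to be in the sample.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Hypothetical event source standing in for LHC collisions or AV telemetry.
events = (f"event-{i}" for i in range(1_000_000))
print(reservoir_sample(events, 5))
```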
The Curse of Many Data Formats
• Data comes from many sources and in many formats, often not standardized or even documented (data variety) – This is also known as the ‘data integration problem’
• Example: It is difficult for security products to analyze all the relevant data sources
• A good approach: schema-on-read – The DB way: data loaded must have a schema (columns, data types, constraints) • In practice, enforcing a schema on load means that some data is discarded
– The MapReduce way: store raw data, parse when analyzing (see the sketch below)
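A minimal schema-on-read sketch in Python (the log format and field names are hypothetical): raw lines are stored untouched, and a schema is imposed only when the data is read for analysis, so ill-formed records are flagged rather than silently discarded at load time.

```python
import re

# Raw, inconsistently formatted records are stored as-is (no schema on load).
raw_log = [
    "2014-02-24 10:01:02 alert host=alpha sig=CVE-2010-2568",
    "2014-02-24 10:01:05 status host=beta msg=heartbeat",  # different shape
    "corrupted-record",                                    # junk survives loading
]

LINE = re.compile(r"(?P<date>\S+) (?P<time>\S+) (?P<type>\S+) ?(?P<rest>.*)")

def parse(line):
    """Impose a schema at read time; unparseable lines are kept and flagged."""
    m = LINE.match(line)
    if not m:
        return {"raw": line, "parse_error": True}
    record = {"date": m["date"], "time": m["time"], "type": m["type"]}
    record.update(kv.split("=", 1) for kv in m["rest"].split() if "=" in kv)
    return record

for rec in map(parse, raw_log):
    print(rec)
```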
"In 84% of [targeted attacks between 2004-2012] clear evidence of the breach was present in local log files."
(DARPA ICAS/CAT BAA, 2013)
Junk Data is a Reality
• Data Quality (also called information quality, data veracity) – Can the data be trusted? • Example: information on vulnerabilities and attacks from Twitter
– Is there inherent uncertainty in the values recorded? • Example: anti-virus (AV) detections are often heuristic, not black-and-white
– Does the data collection procedure introduce noise or biases? • Example: data collected using an AV product is from security-minded people
Attributes of Big Data
• The 3 Vs of Big Data (source: ‘Challenges & Opportunities with Big Data,’ SIGMOD Community Whitepaper, 2012) – Data Volume: the size of the data – Data Velocity: the data collection rate – Data Variety: the diversity of sources and formats
• One more important attribute – Data Quality: • Are any data items corrupted or lost (e.g. owing to errors while loading)? • Is the data uncertain or unreliable? • What is the statistical profile of the data set? (e.g. distribution, outliers)
– You must understand how to interpret data correctly
What is Security Data Science?
• Also known as … Security analytics … Surveillance analytics
• Applying data science methods to security problems
Security Principles in 60 Seconds [J. Saltzer & M. Schroeder, SOSP 1973]
• Economy of mechanism: Keep the protection mechanism as simple and small as possible
• Fail-safe defaults: Base access decisions on permission rather than exclusion
• Complete mediation: Check every access to every object
• Open design: Do not keep the design secret • Separation of privilege: Require two keys to unlock, not one • Least privilege: Grant every program/user the least set of privileges necessary to complete the job
• Least common mechanism: Minimize the amount of mechanism common to more than one user and depended on by all users
• Psychological acceptability: Design interfaces for ease of use
Security in Practice (Source: C. Nachenberg, Symantec)
• 1986: Simple computer viruses – Defense: anti-virus
• 1990: Polymorphic viruses (decryption logic + encrypted malicious code) – Defense: "universal" decoder, emulation
• 1995: Macro viruses – Defense: AV vendor cooperation, digital signatures for macros
• 1999: Worms – Defense: vulnerability-specific signatures
• 2004: Web-based malware – Defense: behavior blocking
• 2006: Auto-generated malware – Defense: reputation-based security
• 2010 (but probably earlier): Targeted attacks (physical infrastructure, 0-day, etc.) – Defense: ??
UNDERSTANDING ZERO-DAY ATTACKS The Need for Security Data Science
Zero-Day Attacks: Recent Examples
2009: Operation Aurora against Google
2010: Stuxnet
2011: Attack against RSA
Zero-day attack = cyber attack exploiting a software vulnerability before the public disclosure of the vulnerability
Price of Zero-Day Exploits on the Black Market (The Economist, March 2013)
The Elderwood Project
The reuse of the identified components gives clues as to how the attackers may divide the labor amongst themselves. Technically skilled hackers (researchers) create exploits, document creation kits, re-usable trigger code (the SWF files), and compromise websites, and these are then made available to less technical attackers. These attackers (attack operators) are likely responsible for identifying targets and delivering the attack payload using the tools and infrastructure provided to them.
Once a target has been compromised, the less skilled attack operators can then proceed to move through the compromised network, identifying data of interest. The level of technical skill required to move through a compromised network is much lower than that required to establish the initial penetration.
Connecting the dots: The investigation into the various exploits began with a deep analysis of CVE-2012-0779. From this analysis, we identified several Trojans which were dropped from documents utilizing the exploit. These Trojans helped us begin the process of establishing links between the various zero-day exploits.
The code in one of those Trojans was obfuscated in a certain way. This same obfuscation was used on a Trojan dropped by CVE-2012-1875, establishing a link between the use of these two exploits. Going back in time, the Hydraq Trojan also displayed this obfuscation.
Additional links joining the various exploits together included a shared command-and-control infrastructure. Trojans dropped by different exploits were connecting to the same servers to retrieve commands from the attackers. Some compromised websites used in the watering hole attacks had two different exploits injected into them one after the other. Yet another connection is the use of similar encryption in documents and malicious executables. A technique used to pass data to a SWF file was re-used in multiple attacks. Finally, the same family of Trojan was dropped from multiple different exploits.
Figure 7 illustrates the connections between the various exploits.
[Figure 7: Links between different exploits]
Group with "seemingly unlimited" supply of zero-day exploits (Source: Symantec)
Zero-Day Attacks: Open Questions
Decade-long open questions:
• How common are zero-day attacks?
• How long can they remain undiscovered?
• What happens after disclosure?
[Vulnerability timeline: creation → zero-day attack → vulnerability disclosed ("day zero") → exploit used in attacks → security patch released → all hosts patched. Prior work: Arbaugh 2000, Frei 2008, McQueen 2009, Shahzad 2012]
WINE: Big Data Experiments in Cyber Security
• Challenge – Experimental results representative of worldwide trends [BADGERS’11] • High-volume security telemetry (e.g., 16B log entries/day)
• Approach – Parallel DB, queried using SQL or MapReduce – Distributed sampling: select representative subset of hosts (see the sketch below) • 25 TB storage, 19B reports/day peak throughput • 50 billion telemetry reports currently available on WINE
• Impact – Example experiment: measuring zero-day attacks [CCS’12]
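A minimal sketch of one way to do distributed sampling (illustrative only; this is not WINE's actual selection algorithm): hash each host identifier so that every collection site independently selects the same representative subset of hosts, with no coordination or shared state.

```python
import hashlib

def in_sample(host_id: str, rate: float = 0.01) -> bool:
    """Deterministically include ~rate of all hosts, with no shared state."""
    digest = hashlib.sha256(host_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64   # uniform in [0, 1)
    return bucket < rate

# Every collector applies the same test, so the sampled host population
# stays consistent across data sources and over time.
hosts = (f"host-{i}" for i in range(100_000))
print(sum(in_sample(h) for h in hosts))   # ~1,000
```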
Zero-Day Attack Findings
• Identified 18 zero-day vulnerabilities – 11 (61%) not known before
• Average attack duration: 312 days (~10 months) – Median: 239 days (~8 months); standard deviation: 246 days
– For comparison: ZDI & iDefense purchase-to-disclosure time: 187 days [NSS Labs, Dec 2013]
• Data available on WINE, for independent verification
[Chart: attack timelines over the months before disclosure (-30 to T0) and patch release; CCS’12]
Duration of Zero-Day Attacks [CCS’12]
[Chart: per-vulnerability attack timelines over the months before disclosure (-30 to T0), for CVE-2008-0015, CVE-2008-2249, CVE-2008-4250, CVE-2009-0084, CVE-2009-0561, CVE-2009-0658, CVE-2009-1134, CVE-2009-2501, CVE-2009-3126, CVE-2009-4324, CVE-2010-0028, CVE-2010-0480, CVE-2010-1241, CVE-2010-2568, CVE-2010-2862, CVE-2010-2883, CVE-2011-0618, CVE-2011-1331]
Exploits detected on fewer than 150 hosts out of 11M, requiring data analysis at scale
Zero-Day Attacks: Open Questions (revisited)
[Vulnerability timeline, as before: creation → disclosure ("day zero") → exploit used in attacks → security patch released → all hosts patched]
Decade-long questions: Why still open?
• Rare events, hard to observe in small data sets
• Need data analysis at scale
[Chart: malware variants per week (log scale, 1 to 100,000) vs. time in weeks relative to disclosure (t0, -100 to 150), for the CVEs above. Before disclosure: targeted attacks (rare events); after disclosure: large-scale attacks.]
Important Ideas and Findings in Security Data Science
• Why do crypto systems fail?
– Implementation errors, misconfigurations, usability issues [Anderson’93, Whitten’99, Clark’11, Heninger’12, Egele’13]
• Reputation-based security – Detecting malware in a content-agnostic manner [Chau’11, AbuRajab’13, Windows 8]
• Properties of passwords and the quest to replace them – Comparative evaluation, α-guesswork, human factors [Bonneau’12a, Bonneau’12b, Mazurek’13]
• Understanding and accounting for network-level behavior – Network telescopes, BGP security, DNS analytics [Moore’01, Kumar’05, Ramachandran’06, Antonakakis’10, Bilge’11]
• Attacking the business model of cyber criminals – Botnet hijacking, pay-per-install, spam value chain, exploit-as-a-service [Kanich’08, Caballero’11, Levchenko’11, Grier’12]
• Scanning / infecting the IPv4 Internet in a few minutes – Worms, ZMap [Staniford’02, Durumeric’13]
• Anonymity and de-anonymization – Tor, Telex, the Netflix Prize [Dingledine’04, Wustrow’11, Narayanan’08]
Papers available at http://www.umiacs.umd.edu/~tdumitra/courses/ENEE759D/Fall13/syllabus.html
Research in Security Data Science
Challenge 1: Find the needle in the haystack – Example: Identify and measure zero-day attacks
Challenge 2: Ensure generally applicable and repeatable results – The threat landscape changes frequently
Challenge 3: Deal with new and advanced threats – Skilled and persistent hackers can bypass firewalls, anti-virus, password-protected systems, two-factor authentication, physical isolation
[…]
[Chart, as before: malware variants (log scale, 10 to 10^5) vs. weeks relative to disclosure (T0, -100 to 150). 403 million new malware variants created in 2011; targeted attacks before disclosure are rare events.]
Your thesis topic goes here
Research in Security Data Science (cont’d)
• Data quality issues – Critical when dealing with field-gathered data – Need to build statistical profile of the data set (see the profiling sketch below)
• Helps with the design of the star schema • Helps with data cleaning for analytics
• Platform for federated data analysis – 84% of targeted attacks leave traces in local log files [DARPA ICAS/CAT BAA, 2013] – How to push analytics to the data source (e.g., enterprise data, personal mobile devices)? – How to ensure confidentiality and privacy?
• Difficulties of programming Big Data techniques – Combination of SQL, R, Perl, Map/Reduce
– No information hiding, no inheritance – After 1,000 LOC, code quickly becomes incomprehensible
Lessons Learned From WINE Analytics
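A minimal sketch of building such a statistical profile with pandas (the telemetry fields and values are hypothetical): summary statistics, missing-value counts, and a simple outlier test inform both the star-schema design and the cleaning step.

```python
import pandas as pd

# Hypothetical field-gathered telemetry; real data would come from WINE.
df = pd.DataFrame({
    "host_id": ["a", "b", "c", "d", "e"],
    "reports_per_day": [12, 9, 11, 10_000, 8],       # one suspicious outlier
    "av_version": ["11.0", "11.0", None, "10.1", "11.0"],
})

print(df.describe(include="all"))                     # statistical profile
print("missing:", df.isna().sum().to_dict())          # gaps to clean or impute

q1, q3 = df["reports_per_day"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)                          # Tukey outlier fence
print("outliers:", df.loc[df["reports_per_day"] > upper, "host_id"].tolist())
```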
What is Security Data Science? (revisited)
• Distributed systems knowledge: develop technologies needed to store and process massive data sets
• Statistics & machine learning knowledge: analyze the data and extract information
• Security knowledge: ask the right questions about cyber attacks
• Data scientists are in high demand in the cybersecurity industry
"Booz Allen may be recruiting more [data scientists] than Google or Facebook" (The Economist, June 2013)
ENEE 757: Security in Distributed Systems and Networks
• Shameless plug – ENEE 757 will be offered in Fall 2014 – Will cover many of the topics discussed here