Security Data Science

Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
https://www.facebook.com/SDSAtUMD
2/24/14

Introducing Your Guest Lecturer

Tudor Dumitraș
Office: AVW 3425
Email: [email protected]

dumitras14 enee759L guest lecture - University of Maryland ...cpap/course/enee759l/pdf/...




My Background

• Ph.D. at Carnegie Mellon University
  – Research in distributed systems and fault-tolerant middleware

• Worked at Symantec Research Labs
  – Built the WINE platform for Big Data experiments in security
  – WINE is currently used by academic researchers and Symantec engineers

• Joined UMD faculty

• Research and teaching on applied security and systems
  – Focus on solving security problems with data analysis techniques


We Are Swimming in Data

• Data created/reproduced in 2010: 1,200 exabytes
• Data collected to find the Higgs boson: 1 gigabyte/s
• Yahoo: 200 petabytes across 20 clusters

• Security:
  – Global spam in 2011: 62 billion/day
  – Malware variants created in 2011: 403 million


Why So Much Data?

• We can store it
  – 6¢ / GB
  – 29¢ / GB (SAS HDD)

• We can generate it
  – Most data is machine-generated
  – Most malware samples are variants of other malware, generated automatically (repacking, obfuscation)

What to do with all this data?

Three Stories about Data


WHAT QUESTIONS TO ASK ON A FIRST DATE?
The Power of Big Data

If You Want to Know …
Do my date and I have long-term potential?


… ask:

Q Do you like horror movies?
Q Have you ever traveled around another country alone?
Q Wouldn't it be fun to chuck it all and go live on a sailboat?

Likelihood of coincidence: 3.7×

275,000 user-submitted questions
34,260 real-world couples

Data Psychology

Top 3 user-rated questions, about:
• God
• Sex
• Smoking

Source: CNN Money

Online Dating and Big Data

• eHarmony
  – Analyzes hundreds of behavioral variables, most collected automatically
  – CTO: former search engineer at Yahoo!

• OkCupid: We do math to get you dates
  – Founded by Harvard math & CS majors

• PlentyOfFish: Building this matching system was harder than [being] cited in the paper that won the Fields Medal


Early 1900s: Most Factories Had Private Generators

Electricity was critical for business, but not widely available

Source: Nicholas Carr

Is he an engineer?
Does she date engineers?

Data analytics provide remarkable insight

Applications in many disciplines

Source: OkCupid


What Is Data Science?

• Also known as …
  … Big Data analytics
  … Machine intelligence
  … Data-intensive computing
  … Data wrangling
  … Data munging
  … Data jujitsu

Source: Drew Conway

IMPROVING MACHINE TRANSLATION
The Unreasonable Effectiveness of Data


2005 NIST Machine Translation Competition

English-Arabic competition

• Google’s first entry
  – None of the engineers spoke Arabic

• Simple statistical approach

• Trained using United Nations documents
  – 200 million translated words
  – 1 trillion monolingual words

For many hard problems there appears to be a threshold of sufficient data
A. Halevy, et al., CACM 2009.


Challenges for Dealing with Big Data

• Big Data is hard to move around

• Engineers must grasp parallel processing techniques
  – To access 1 TB in 1 min, must distribute data over 20 disks
  – MapReduce? Parallel DB? Dryad? Pregel? OpenMPI? PRAM?

• Engineers must understand how to interpret data correctly

  Operation                                                  Latency
  Read 1 MB sequentially from main memory                    150 µs
  Send 1 MB over 10 Gbps switch                              1,000 µs
  Read 1 MB from 15K RPM disk                                1,000 µs
  Compress 1 MB w/ fast algorithm (e.g., QuickLZ, Snappy)    3,000 µs
  Send 1 MB across datacenter                                100,000 µs
  Send 1 MB from France datacenter to Los Angeles            9,000,000 µs
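The "20 disks" estimate can be roughly reproduced from the disk latency listed above. This is a minimal sketch, assuming decimal units (1 TB taken as 1,000,000 MB), a sustained 1,000 µs per MB on every disk, and no coordination overhead; the function name is illustrative and not from the lecture.

```python
import math

def disks_needed(total_mb: float, deadline_s: float, us_per_mb: float) -> int:
    """Minimum number of disks, read in parallel, to scan total_mb within deadline_s."""
    # Time one disk would need to read the whole data set, in seconds.
    single_disk_s = total_mb * us_per_mb / 1_000_000
    return math.ceil(single_disk_s / deadline_s)

# 1 TB in 60 s at 1,000 µs per MB (the 15K RPM disk figure).
print(disks_needed(1_000_000, 60, 1_000))
```

Under these idealized assumptions the answer is 17 disks, which is in the same range as the slide's 20 once some headroom for overhead is allowed.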

Processing Data in Parallel

• How big is ‘Big Data’? (data volume)
  – Real answer: it depends
  – When your manager asks:
    [Diagram: Relational DB (MySQL, Postgres, etc.); single-node parallel DB; distributed system (parallel DB, MapReduce). Size markers: 1 TB, 5-8 TB, 10-20 TB]

• Parallelism does not reduce asymptotic complexity
  – An O(N log N) algorithm is still O(N log N) when run in parallel on K machines
  – But the constants are divided by K (and can have K > 1000)
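The point about constants versus asymptotic complexity can be illustrated with a toy cost model. This sketch assumes perfect speedup on K machines with no communication cost, which real systems do not achieve; both function names are hypothetical.

```python
import math

def sequential_cost(n: int) -> float:
    """Idealized step count of an O(N log N) algorithm on one machine."""
    return n * math.log2(n)

def parallel_cost(n: int, k: int) -> float:
    """Idealized per-machine step count with perfect speedup on k machines."""
    return sequential_cost(n) / k

for n in (10**6, 10**9):
    # The ratio is always 1/k: only the constant factor changes, not the growth rate.
    print(n, parallel_cost(n, 1000) / sequential_cost(n))
```

The per-machine cost still grows as N log N when N increases; K only scales the curve down.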


Data Collection Rate

• Sometimes the data collection rate is too high (data velocity)
  – It may be too expensive to store all the data
  – The latency of data processing may not support interactivity

• Example: There are 600 million collisions per second in the Large Hadron Collider at CERN
  – This would amount to collecting ~1 PB/s (David Foster, CERN)
  – They only record one in 10^13 (ten trillion) collisions (~100 MB/s – 1 GB/s)

• Techniques for dealing with data velocity
  – Sampling (as in the LHC)
  – Stream processing
  – Compression (e.g. Snappy, QuickLZ, RLE)
    • In some cases operating on lightly compressed data reduces latency!
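Sampling from a stream that is too fast to store can be sketched with reservoir sampling. This technique is swapped in here as one common choice; the slides do not say which sampling method the LHC or any other system actually uses, and the function name and `seed` parameter are illustrative.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")

def reservoir_sample(stream: Iterable[T], k: int, seed: int = 0) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), 5))
```

The sampler uses memory proportional to k and makes a single pass, so it fits the stream processing setting mentioned above.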

The Curse of Many Data Formats

• Data comes from many sources and in many formats, often not standardized or even documented (data variety)
  – This is also known as the ‘data integration problem’

• Example: It is difficult for security products to analyze all the relevant data sources

  In 84% of [targeted attacks between 2004-2012] clear evidence of the breach was present in local log files.
       DARPA ICAS/CAT BAA, 2013

• A good approach: schema-on-read
  – The DB way: data loaded must have a schema (columns, data types, constraints)
    • In practice, enforcing a schema on load means that some data is discarded
  – The MapReduce way: store raw data, parse when analyzing
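The schema-on-read approach described on this slide can be sketched as follows. The log format, field names, and sample lines are invented for illustration; the point is only that raw lines are stored unchanged and a schema is applied at query time, so lines that do not parse are skipped by a given query rather than discarded on load.

```python
import re
from typing import Dict, Iterable, Iterator, Optional

# Raw data is stored as-is, including lines that match no schema (hypothetical format).
RAW_LOG = [
    "2013-01-05 10:02:11 host=a17 event=login user=alice",
    "malformed line with no timestamp",
    "2013-01-05 10:02:15 host=b02 event=exec path=/tmp/x.bin",
]

LINE = re.compile(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (.*)$")

def parse(line: str) -> Optional[Dict[str, str]]:
    """Apply the schema at read time; return None if the line does not fit."""
    m = LINE.match(line)
    if m is None:
        return None
    fields = dict(kv.split("=", 1) for kv in m.group(3).split() if "=" in kv)
    return {"date": m.group(1), "time": m.group(2), **fields}

def query_events(raw: Iterable[str], event: str) -> Iterator[Dict[str, str]]:
    """Parse while analyzing: unparseable lines are skipped, not deleted."""
    for line in raw:
        rec = parse(line)
        if rec is not None and rec.get("event") == event:
            yield rec

print(list(query_events(RAW_LOG, "exec")))
```

Because the malformed line stays in `RAW_LOG`, a later query with a different parser can still make use of it.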


Junk Data is a Reality

• Data Quality (also called information quality, data veracity)
  – Can the data be trusted?
    • Example: information on vulnerabilities and attacks from Twitter
  – Is there inherent uncertainty in the values recorded?
    • Example: anti-virus (AV) detections are often heuristic, not black-and-white
  – Does the data collection procedure introduce noise or biases?
    • Example: data collected using an AV product is from security-minded people

Attributes of Big Data

• The 3 Vs of Big Data (source: ‘Challenges & Opportunities with Big Data,’ SIGMOD Community Whitepaper, 2012)
  – Data Volume: the size of the data
  – Data Velocity: the data collection rate
  – Data Variety: the diversity of sources and formats

• One more important attribute
  – Data Quality:
    • Are any data items corrupted or lost (e.g. owing to errors while loading)?
    • Is the data uncertain or unreliable?
    • What is the statistical profile of the data set? (e.g. distribution, outliers)
  – You must understand how to interpret data correctly
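A statistical profile of a numeric column might start with something like the sketch below. The 1.5 × IQR outlier rule is a common convention chosen here as an assumption, not something the slides prescribe, and the sample values are made up.

```python
import statistics
from typing import Dict, List

def profile(values: List[float]) -> Dict[str, object]:
    """Summarize a numeric column and flag outliers with the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "outliers": [v for v in values if v < low or v > high],
    }

# Hypothetical column: detections per host, with one suspicious value.
print(profile([3, 4, 5, 4, 3, 5, 4, 400]))
```

A flagged value is only a prompt for inspection: it may be a loading error, or it may be exactly the rare event worth studying.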


What is Security Data Science?

• Also known as …
  … Security analytics
  … Surveillance analytics

• Applying data science methods to security problems

Security Principles in 60 Seconds [J. Saltzer & M. Schroeder, SOSP 1973]

• Economy of mechanism: Keep the protection mechanism as simple and small as possible
• Fail-safe defaults: Base access decisions on permission rather than exclusion
• Complete mediation: Check every access to every object
• Open design: Do not keep the design secret
• Separation of privilege: Require two keys to unlock, not one
• Least privilege: Grant every program/user the least set of privileges necessary to complete the job
• Least common mechanism: Minimize the amount of mechanism common to more than one user and depended on by all users
• Psychological acceptability: Design interfaces for ease of use
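Two of the principles above, fail-safe defaults and complete mediation, can be illustrated with a toy access monitor. This is a sketch with invented class and function names, not a design from the lecture: access is denied unless explicitly granted, and every read goes through the same check.

```python
from typing import Dict, Set, Tuple

class AccessMonitor:
    """Toy reference monitor: anything not explicitly granted is denied."""

    def __init__(self) -> None:
        self._allowed: Set[Tuple[str, str, str]] = set()

    def grant(self, user: str, obj: str, action: str) -> None:
        self._allowed.add((user, obj, action))

    def check(self, user: str, obj: str, action: str) -> bool:
        # Fail-safe default: the decision is based on permission, not exclusion.
        return (user, obj, action) in self._allowed

def read_object(monitor: AccessMonitor, store: Dict[str, str], user: str, obj: str) -> str:
    # Complete mediation: every access is checked, with no cached decisions.
    if not monitor.check(user, obj, "read"):
        raise PermissionError(f"{user} may not read {obj}")
    return store[obj]

monitor = AccessMonitor()
monitor.grant("alice", "report.txt", "read")
store = {"report.txt": "quarterly numbers"}
print(read_object(monitor, store, "alice", "report.txt"))
```

A deny-list design would fail open whenever an entry is forgotten; the allow-list above fails closed.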


Security in Practice (Source: C. Nachenberg, Symantec)

• 1986: Simple computer viruses
  – Defense: anti-virus
• 1990: Polymorphic viruses (decryption logic + encrypted malicious code)
  – Defense: “universal” decoder, emulation
• 1995: Macro viruses
  – Defense: AV vendor cooperation, digital signatures for macros
• 1999: Worms
  – Defense: Vulnerability-specific signatures
• 2004: Web-based malware
  – Defense: behavior blocking
• 2006: Auto-generated malware
  – Defense: reputation-based security
• 2010 (but probably earlier): Targeted attacks (physical infrastructure, 0-day, etc.)
  – Defense: ??

UNDERSTANDING ZERO-DAY ATTACKS
The Need for Security Data Science


Zero-Day Attacks: Recent Examples

2009: Operation Aurora against Google
2010: Stuxnet
2011: Attack against RSA

Zero-day attack = cyber attack exploiting a software vulnerability before the public disclosure of the vulnerability

Price of Zero-Day Exploits on the Black Market
The Economist, March 2013


The Elderwood Project

Excerpt from "The Elderwood Project" (Security Response), page 9:

The reuse of the identified components gives clues as to how the attackers may divide the labor amongst themselves. Technically skilled hackers (researchers) create exploits, document creation kits, re-usable trigger code (the SWF files), and compromise websites, and these are then made available to less technical attackers. These attackers (attack operators) are likely responsible for identifying targets and delivering the attack payload using the tools and infrastructure provided to them.

Once a target has been compromised, the less skilled attack operators can then proceed to move through the compromised network, identifying data of interest. The level of technical skill required to move through a compromised network is much lower than that required to establish the initial penetration.

Connecting the dots

The investigation into the various exploits began with a deep analysis of CVE-2012-0779. From this analysis, we identified several Trojans which were dropped from documents utilizing the exploit. These Trojans helped us begin the process of establishing links between the various zero-day exploits.

The code in one of those Trojans was obfuscated in a certain way. This same obfuscation was used on a Trojan dropped by CVE-2012-1875, establishing a link between the use of these two exploits. Going back in time, the Hydraq Trojan also displayed this obfuscation.

Additional links joining the various exploits together included a shared command-and-control infrastructure. Trojans dropped by different exploits were connecting to the same servers to retrieve commands from the attackers. Some compromised websites used in the watering hole attacks had two different exploits injected into them one after the other. Yet another connection is the use of similar encryption in documents and malicious executables. A technique used to pass data to a SWF file was re-used in multiple attacks. Finally, the same family of Trojan was dropped from multiple different exploits.

Figure 7 illustrates the connections between the various exploits.

Figure 7: Links between different exploits

Group with “seemingly unlimited” supply of zero-day exploits (Source: Symantec)

Zero-Day Attacks: Open Questions

Decade-long open questions
• How common are zero-day attacks?
• How long can they remain undiscovered?
• What happens after disclosure?

[Figure: vulnerability timeline. Labels: Creation; Zero-day attack; Vulnerability disclosed (“day zero”); Exploit used in attacks; Security patch released; All hosts patched]

Prior work: [Arbaugh 2000, Frei 2008, McQueen 2009, Shahzad 2012]


WINE: Big Data Experiments in Cyber Security

• Challenge
  – Experimental results representative of worldwide trends [BADGERS’11]
    • High volume security telemetry (e.g., 16B log entries/day)

• Approach
  – Parallel DB, queried using SQL or MapReduce
  – Distributed sampling: select representative subset of hosts
    • 25 TB storage, 19B reports/day peak throughput
    • 50 billion telemetry reports currently available on WINE

• Impact
  – Example experiment: measuring zero-day attacks [CCS’12]
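One way to select a stable subset of hosts for sampling is to hash each host identifier and keep those that fall below a threshold. This is an illustrative assumption, not a description of how WINE selects hosts; the function name, the salt, and the host ID format are hypothetical, and a uniform hash does not by itself guarantee a representative sample.

```python
import hashlib

def in_sample(host_id: str, rate: float, salt: str = "demo-salt") -> bool:
    """Deterministically decide whether a host belongs to the sampled subset."""
    digest = hashlib.sha256((salt + host_id).encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

hosts = [f"host-{i}" for i in range(10_000)]
sampled = [h for h in hosts if in_sample(h, 0.10)]
print(len(sampled))
```

Because the decision depends only on the host ID, every collection point makes the same choice without coordination, and all reports from a sampled host are kept together.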

Zero-Day Attack Findings [CCS’12]

• Identified 18 zero-day vulnerabilities
  – 11 (61%) not known before

• Average attack duration: 312 days (~10 months)
  – Median: 239 days (~8 months); standard deviation: 246 days
  – For comparison: ZDI & iDefense purchase -> disclosure: 187 days [NSS Labs, Dec 2013]

• Data available on WINE, for independent verification

[Figure: timeline in months relative to disclosure (T0), with ticks at -30, -24, -18, -12, -6; labels: Disclosure, Patch]
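Duration statistics of this kind can be computed from two dates per vulnerability: when the exploit was first observed and when the vulnerability was disclosed. The sketch below uses invented identifiers and dates purely for illustration; it does not reproduce the CCS’12 data or method.

```python
from datetime import date
from statistics import mean, median, stdev

def zero_day_duration(first_seen: date, disclosed: date) -> int:
    """Days the exploit was observed before disclosure (0 if first seen afterwards)."""
    return max((disclosed - first_seen).days, 0)

# Hypothetical observations: (first seen in telemetry, public disclosure).
observations = {
    "VULN-A": (date(2010, 1, 1), date(2010, 1, 31)),
    "VULN-B": (date(2009, 6, 1), date(2010, 6, 1)),
    "VULN-C": (date(2011, 3, 1), date(2011, 9, 1)),
}

durations = [zero_day_duration(seen, disc) for seen, disc in observations.values()]
print(mean(durations), median(durations), stdev(durations))
```

Reporting mean, median, and standard deviation together, as the slide does, matters when the distribution is skewed by a few long-running attacks.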


Duration of Zero-Day Attacks [CCS’12]

[Figure: duration of zero-day attacks per vulnerability, in months before disclosure (axis ticks: -30, -24, -18, -12, -6), plus a PDF axis (ticks 0 to 3). Vulnerabilities shown: CVE-2008-0015, CVE-2009-0084, CVE-2009-0561, CVE-2009-0658, CVE-2010-0028, CVE-2010-1241, CVE-2010-2568, CVE-2010-2862, CVE-2011-0618, CVE-2011-1331, CVE-2010-0480, CVE-2008-2249, CVE-2008-4250, CVE-2009-1134, CVE-2009-2501, CVE-2009-3126, CVE-2009-4324, CVE-2010-2883]

Exploits detected on <150 hosts out of 11M

Require data analysis at scale

Zero-Day Attacks: Open Questions (re-visited)

[Figure: vulnerability timeline. Labels: Creation; Vulnerability disclosed (“day zero”); Exploit used in attacks; Security patch released; All hosts patched]

Decade-long questions: Why still open?
• Rare events, hard to observe in small data sets
• Need data analysis at scale

[Figure: malware variants (log scale, 1 to 100,000) over time in weeks relative to disclosure t0 (-100 to 150), one series per vulnerability: CVE-2011-1331, CVE-2010-0028, CVE-2009-2501, CVE-2009-0561, CVE-2009-0084, CVE-2008-0015, CVE-2010-2883, CVE-2009-4324, CVE-2009-3126, CVE-2009-1134, CVE-2008-2249, CVE-2009-0658, CVE-2010-1241, CVE-2010-0480, CVE-2010-2862]

Before disclosure: Targeted attacks

After disclosure: Large-scale attacks

Rare events


Important Ideas and Findings in Security Data Science

• Why do crypto systems fail?
  – Implementation errors, misconfigurations, usability issues [Anderson’93, Whitten’99, Clark’11, Heninger’12, Egele’13]
• Reputation-based security
  – Detecting malware in a content-agnostic manner [Chau’11, AbuRajab’13, Windows 8]
• Properties of passwords and the quest to replace them
  – Comparative evaluation, α-guesswork, human factors [Bonneau’12a, Bonneau’12b, Mazurek’13]
• Understanding and accounting for network-level behavior
  – Network telescopes, BGP security, DNS analytics [Moore’01, Kumar’05, Ramachandran’06, Antonakakis’10, Bilge’11]
• Attacking the business model of cyber criminals
  – Botnet hijacking, pay-per-install, spam value chain, exploit-as-a-service [Kanich’08, Caballero’11, Levchenko’11, Grier’12]
• Scanning / infecting the IPv4 Internet in a few minutes
  – Worms, ZMap [Staniford’02, Durumeric’13]
• Anonymity and de-anonymization
  – Tor, Telex, The Netflix Prize [Dingledine’04, Wustrow’11, Narayanan’08]

Papers available at http://www.umiacs.umd.edu/~tdumitra/courses/ENEE759D/Fall13/syllabus.html

Research in Security Data Science

Challenge 1: Find the needle in the haystack
  – Example: Identify and measure zero-day attacks

[Figure: malware variants (log scale; ticks 10, 10^3, 10^5) over time in weeks relative to disclosure T0 (-100 to 150). Annotations: 403 million new malware variants created in 2011; Targeted attacks before disclosure; Rare events]

Challenge 2: Ensure generally applicable and repeatable results
  – The threat landscape changes frequently

Challenge 3: Deal with new and advanced threats
  – Skilled and persistent hackers can bypass firewalls, anti-virus, password-protected systems, two-factor authentication, physical isolation

[…]

Your thesis topic goes here


Research in Security Data Science (cont’d)
Lessons Learned From WINE Analytics

• Data quality issues
  – Critical when dealing with field-gathered data
  – Need to build statistical profile of the data set
    • Helps with the design of the star schema
    • Helps with data cleaning for analytics

• Platform for federated data analysis
  – 84% of targeted attacks leave traces in local log files [DARPA ICAS/CAT BAA, 2013]
  – How to push analytics to the data source (e.g., enterprise data, personal mobile devices)?
  – How to ensure confidentiality and privacy?

• Difficulties for programming Big Data techniques
  – Combination of SQL, R, Perl, Map/Reduce
  – No information hiding, no inheritance
  – After 1000 LOC, code quickly becomes incomprehensible
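The question of pushing analytics to the data source can be made concrete with a small sketch: each site scans its own logs and ships only aggregate counts to a coordinator. The function names, indicator strings, and log lines are invented for illustration, and sending counts instead of raw logs does not by itself settle the confidentiality and privacy question raised above.

```python
from collections import Counter
from typing import Iterable, List

def local_summary(log_lines: Iterable[str], indicators: List[str]) -> Counter:
    """Runs at the data source: count indicator hits; raw lines never leave the site."""
    counts: Counter = Counter()
    for line in log_lines:
        for indicator in indicators:
            if indicator in line:
                counts[indicator] += 1
    return counts

def merge(summaries: Iterable[Counter]) -> Counter:
    """Runs at the coordinator: combine the per-site aggregates."""
    total: Counter = Counter()
    for summary in summaries:
        total.update(summary)
    return total

# Hypothetical indicators and per-site logs.
indicators = ["evil.example.com", "dropper.exe"]
site_a = ["GET http://evil.example.com/a", "run dropper.exe", "GET http://ok.example.org/"]
site_b = ["GET http://evil.example.com/b"]

print(merge([local_summary(site_a, indicators), local_summary(site_b, indicators)]))
```

The coordinator only ever sees one small `Counter` per site, so the volume moved across the network stays independent of the size of the local logs.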

What is Security Data Science? (re-visited)

• Distributed systems knowledge: develop technologies needed to store and process massive data sets
• Statistics & machine learning knowledge: analyze the data and extract information
• Security knowledge: ask the right questions about cyber attacks
• Data scientists are in high demand in the cybersecurity industry

Booz Allen may be recruiting more [data scientists] than Google or Facebook
     The Economist, June 2013


ENEE 757: Security in Distributed Systems and Networks

• Shameless plug
  – ENEE 757 will be offered in Fall 2014
  – Will cover many of the topics discussed here