CORE SECURITY

Don’t try to block out the sun with your fingers!

Information harvesting with Test-driven development tools and understanding how to avoid it

Nicolás Rodriguez ([email protected])
November 20th, 2012




Introduction

• Structure of the Talk
• Origin of the Talk
• Information Harvesting
• Web Scraping
• Techniques and tools
• Legal Issues
• Test-Driven Development
• Abusing a web site
• Code and Demos
• Mitigations and Tradeoffs
• Q&A


Who Am I?

• Programmer since I was 10 years old
• Network Administrator for the last 10 years
• Security Consultant at Core Security since 2006


Information Harvesting


Web Scraping


Web Scraping and Legal Issues

• Web scraping may be against the terms of use of some websites. Read the Terms of Use of the target site!
• The enforceability of these terms is unclear:
  • While outright duplication of original expression will in many cases be illegal, in the U.S. the courts ruled that duplication of facts is allowable.
  • U.S. courts have acknowledged that "scrapers" may be held liable for committing trespass.
  • In Denmark, systematic crawling, indexing and deep linking do not conflict with Danish law or the database directive of the European Union (2006).
  • In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses.


Scraping Techniques

• HTTP GET / POST (wget / curl)
  • Single request per page
  • Cookies and session handling
• XSRF Token / Authorization Tokens
  • Multiple requests per page
• JavaScript Rendering (e.g. Google Search)
  • Render page using a local JavaScript engine (e.g. V8)
• Test-Driven Development tools (e.g. Selenium)
  • DOM parsing: FindItem / XPath
  • Source code grepping
  • JavaScript Injection
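To make the difference between DOM parsing and source code grepping concrete, here is a minimal sketch using only the Python 3 standard library. The page content, the `LinkCollector` class and both helper functions are hypothetical illustrations, not code from the talk; in a real run the source would come from an HTTP response or from the automated browser.

```python
import re
from html.parser import HTMLParser

# Hypothetical page source; in practice this would come from an HTTP
# response body or from a browser automation tool's page source.
PAGE = """
<html><body>
  <ul id="contacts">
    <li class="contact"><a href="/users/1">Alice</a> alice@example.com</li>
    <li class="contact"><a href="/users/2">Bob</a> bob@example.com</li>
  </ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """DOM-style parsing: collect the href and text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(source):
    """Walk the markup structure and return (href, text) for each link."""
    collector = LinkCollector()
    collector.feed(source)
    return collector.links


def grep_emails(source):
    """Source code grepping: ignore structure, match a pattern in the raw text."""
    return re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", source)
```

Parsing follows the page structure, so it survives cosmetic changes but depends on the markup layout. Grepping is cruder but finds data wherever it appears in the source, including inside scripts and comments.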


Test-Driven Development Tools (TDD)

• What is TDD?
  • Selenium
    • Remote Automation
    • WebDriver
• Uses
  • Web Application test cases
  • Browser automation
    • Whatever a user is able to do, this tool could reproduce it
  • Web scraping


Code and Demos


Initial Setup

• Python 2.7 (http://www.python.org)
• Selenium WebDriver (http://seleniumhq.org/)
• jQuery (http://www.jquery.org)
• Mozilla Firefox (http://www.mozilla.com)


Code and Demos

• Basic Web Automation
• Element handling
• DOM parsing
• Source code grepping
• Web site screenshots
• Injecting JavaScript libraries into a running site
• Using jQuery to parse DOM elements
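The demo code itself is not part of these slides. As a rough idea of what the last two items might look like, the sketch below assumes `driver` is a Selenium WebDriver instance and relies only on its `execute_script(script, *args)` method. The function names and the `JQUERY_URL` placeholder are hypothetical, and the code is written for Python 3 rather than the Python 2.7 listed in the setup.

```python
import json

# Hypothetical placeholder; point this at whichever copy of jQuery you trust.
JQUERY_URL = "https://example.com/jquery.min.js"


def build_injection_script(library_url):
    """Return JavaScript that appends a <script> tag loading library_url."""
    # json.dumps yields a safely quoted JavaScript string literal.
    return (
        "var s = document.createElement('script');"
        "s.src = %s;"
        "document.head.appendChild(s);" % json.dumps(library_url)
    )


def inject_library(driver, library_url=JQUERY_URL):
    """Load a JavaScript library into the page the driver currently shows."""
    driver.execute_script(build_injection_script(library_url))


def texts_via_jquery(driver, css_selector):
    """Return the text of each element matching css_selector, using jQuery."""
    # Values passed after the script reach the browser as arguments[0], ...
    script = (
        "return jQuery(arguments[0]).map(function () {"
        " return jQuery(this).text(); }).get();"
    )
    return driver.execute_script(script, css_selector)
```

The injected script loads asynchronously, so a real run would need to wait until `jQuery` is defined in the page before calling `texts_via_jquery`. If the target site already ships jQuery, the injection step can be skipped.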


Possible Mitigations and Trade-offs

• Public Pages: FORGET ABOUT IT!
• Authenticated Pages:
  • Quotas: limit the number of operations in a given time span
  • Track behavior: any action outside the standard usage triggers a user validation (CAPTCHA, login validation or session removal)
  • Excess traffic filtering

Big Tradeoff: Performance vs Information Protection
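One possible shape for the quota mitigation is a sliding-window counter per authenticated user. This is a sketch under assumptions: the class name, the parameters and the choice of a sliding window are illustrative and not something the talk prescribes.

```python
import time
from collections import defaultdict, deque


class SlidingWindowQuota:
    """Allow at most `limit` operations per `window` seconds for each key."""

    def __init__(self, limit, window, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock                    # injectable so it can be tested
        self._events = defaultdict(deque)     # key -> timestamps of allowed operations

    def allow(self, key):
        now = self.clock()
        events = self._events[key]
        # Forget operations that have fallen out of the time window.
        while events and now - events[0] >= self.window:
            events.popleft()
        if len(events) >= self.limit:
            # Over quota: the caller can demand a CAPTCHA, a new login, etc.
            return False
        events.append(now)
        return True
```

For example, `SlidingWindowQuota(limit=100, window=3600).allow(user_id)` would cap each account at 100 operations per hour. Every check costs memory and time on each request, which is the performance side of the tradeoff.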


Things to watch out for…

• Don’t trust IP addresses
  • Anonymous services (Tor)
  • Multiple IP addresses
• Quotas could be useless
  • Adjust based on the type of information you manage
  • Web scraping could be automated to be as stealthy as possible
• Standard Behavior
  • There’s a fine line between a user and a good “scraper”
  • Avoid damaging the site’s usability. BIG ISSUE!
• No information is safe!
  • If a user can see it, it can be scraped!
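The warning about IP addresses can be illustrated with a small, hypothetical request log: one account spreads its requests over several addresses, so counting per IP sees nothing unusual while counting per account does. The log, the field names and the `over_quota` helper are all made up for this sketch.

```python
from collections import Counter


def over_quota(requests, key, limit):
    """Return the set of values of `key` that appear more than `limit` times."""
    counts = Counter(r[key] for r in requests)
    return {value for value, n in counts.items() if n > limit}


# Hypothetical log: one account fetching 6 pages through 3 rotating
# addresses (documentation-range IPs), plus one ordinary visitor.
LOG = [
    {"account": "harvester", "ip": "203.0.113.%d" % (i % 3 + 1)}
    for i in range(6)
] + [{"account": "visitor", "ip": "198.51.100.7"}]

flagged_by_ip = over_quota(LOG, "ip", limit=3)            # nobody stands out
flagged_by_account = over_quota(LOG, "account", limit=3)  # the harvester does
```

Keying on the account only helps on authenticated pages, and a scraper that also rotates accounts or slows down enough will still slip under the limit.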


Q&A  


Thank you
