CORE SECURITY
Don’t try to block out the sun with your fingers!
Information harvesting with test-driven development tools and understanding how to avoid it
Nicolás Rodriguez ([email protected]) November 20th, 2012
Introduction
• Structure of the Talk
• Origin of the Talk
• Information Harvesting
• Web Scraping
• Techniques and tools
• Legal Issues
• Test-Driven Development
• Abusing a web site
• Code and Demos
• Mitigations and Tradeoffs
• Q&A
Who Am I?
• Programmer since I was 10 years old
• Network Administrator for the last 10 years
• Security Consultant at Core Security since 2006
Information Harvesting
Web Scraping
Web Scraping and Legal Issues
• Web scraping may be against the terms of use of some websites. Read the Terms of Use of the target site!
• The enforceability of these terms is unclear:
  • While outright duplication of original expression will in many cases be illegal, U.S. courts have ruled that duplication of facts is allowable.
• U.S. courts have acknowledged that users of "scrapers" may be held liable for committing trespass.
• In Denmark, systematic crawling, indexing and deep linking do not conflict with Danish law or the database directive of the European Union (2006).
• In Australia, the Spam Act 2003 outlaws some forms of web harvesting, although this only applies to email addresses.
Scraping Techniques
• HTTP GET / POST (wget / curl)
• Single request per page
• Cookies and session handling
• XSRF Token / Authorization Tokens
• Multiple requests per page
• JavaScript Rendering (e.g. Google Search)
• Render page using a local JavaScript engine (e.g. V8)
• Test-Driven Development tools (e.g. Selenium)
• DOM parsing: FindItem / XPath
• Source code grepping
• JavaScript Injection
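The plain HTTP technique with cookies and XSRF tokens can be sketched roughly as follows. This is a minimal illustration, not the code from the talk: it assumes Python 3 and its standard library (the talk's setup used Python 2.7), and it assumes the target form carries its token in a hidden input field. The function names and the URLs passed to them are hypothetical.

```python
import http.cookiejar
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of hidden <input> fields (where XSRF tokens often live)."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        input_type = (attributes.get("type") or "").lower()
        name = attributes.get("name")
        if input_type == "hidden" and name:
            self.hidden_fields[name] = attributes.get("value") or ""


def extract_hidden_fields(html):
    """Return a dict of the hidden form fields found in an HTML string."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.hidden_fields


def build_session():
    """Return an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    return urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))


def submit_form(opener, form_url, post_url, extra_fields):
    """GET the form page, copy its hidden fields (token included), then POST."""
    with opener.open(form_url) as response:
        html = response.read().decode("utf-8", errors="replace")
    fields = extract_hidden_fields(html)
    fields.update(extra_fields)
    body = urllib.parse.urlencode(fields).encode("ascii")
    # Passing data= makes urllib issue a POST; the session cookie is sent automatically.
    with opener.open(post_url, data=body) as response:
        return response.read().decode("utf-8", errors="replace")
```

This covers only sites that work without JavaScript; pages that build their content or tokens in the browser need one of the rendering approaches listed above.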
Test-Driven Development (TDD) Tools
• What is TDD?
• Selenium
• Remote Automation
• WebDriver
• Uses
• Web Application test cases
• Browser automation
• Whatever a user is able to do, this tool could reproduce it
• Web scraping
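As a rough illustration of browser automation used for scraping, the sketch below drives a browser through WebDriver and reads every link from the rendered DOM. It assumes the Selenium 4 Python bindings, where the locator strategy is passed as a plain string such as "xpath"; the talk predates that API. The helper names and any URL passed to them are hypothetical.

```python
def collect_links(driver, url):
    """Load `url` in the automated browser and return (text, href) for each link."""
    driver.get(url)
    links = []
    # "xpath" is the string value behind selenium's By.XPATH constant.
    for anchor in driver.find_elements("xpath", "//a[@href]"):
        links.append((anchor.text.strip(), anchor.get_attribute("href")))
    return links


def open_firefox():
    """Start a real Firefox session; requires selenium and geckodriver to be installed."""
    # Imported lazily so collect_links can be exercised without a browser.
    from selenium import webdriver

    return webdriver.Firefox()
```

With Selenium available, a session would look like `driver = open_firefox()`, then `collect_links(driver, some_url)`, then `driver.quit()`. Because the page is rendered by a real browser, the site sees the same requests a human visitor would generate.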
Code and Demos
Initial Setup
• Python 2.7 (http://www.python.org)
• Selenium WebDriver (http://seleniumhq.org/)
• jQuery (http://www.jquery.org)
• Mozilla Firefox (http://www.mozilla.com)
Code and Demos
• Basic Web Automation
• Element handling
• DOM parsing
• Source code grepping
• Web site screenshots
• Injecting JavaScript libraries into a running site
• Using jQuery to parse DOM elements
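The demo topics above might be sketched along these lines. This is not the talk's demo code: it assumes a Selenium WebDriver object (`page_source`, `save_screenshot`, `execute_script` and `execute_async_script` are real WebDriver members), a deliberately simple email regex, and a jQuery CDN URL chosen only for illustration. All function names are hypothetical.

```python
import re

# Simplified pattern for illustration; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Assumed CDN location; any reachable copy of jQuery would do.
JQUERY_URL = "https://code.jquery.com/jquery-3.7.1.min.js"

# Runs inside the page: adds a <script> tag and signals back once it has loaded.
INJECT_JQUERY = """
var done = arguments[arguments.length - 1];
if (window.jQuery) { done(true); return; }
var script = document.createElement('script');
script.src = arguments[0];
script.onload = function () { done(true); };
script.onerror = function () { done(false); };
document.head.appendChild(script);
"""


def grep_emails(page_source):
    """Source code grepping: return the unique email-like strings, sorted."""
    return sorted(set(EMAIL_PATTERN.findall(page_source)))


def snapshot(driver, url, screenshot_path):
    """Load a page, save a screenshot of it, and grep its source for emails."""
    driver.get(url)
    driver.save_screenshot(screenshot_path)
    return grep_emails(driver.page_source)


def inject_jquery(driver, jquery_url=JQUERY_URL):
    """Inject jQuery into the currently loaded page; True if it is available."""
    return bool(driver.execute_async_script(INJECT_JQUERY, jquery_url))


def texts_by_selector(driver, css_selector):
    """Use the injected jQuery to return the text of every matching element."""
    script = (
        "return jQuery(arguments[0])"
        ".map(function () { return jQuery(this).text(); }).get();"
    )
    return driver.execute_script(script, css_selector)
```

Injection can fail on sites whose Content Security Policy blocks third-party scripts, in which case `inject_jquery` returns False and the WebDriver element lookups remain the fallback.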
Possible Mitigations and Trade-offs
Public Pages: FORGET ABOUT IT!
• Authenticated Pages:
  • Quotas: limit the number of operations in a given time span
  • Track behavior: any action outside the standard usage triggers a user validation (CAPTCHA, login validation or session removal)
  • Excess traffic filtering
Big Tradeoff: Performance vs Information Protection
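One possible shape for the quota mitigation is a sliding-window counter keyed by the authenticated account. The sketch below is an assumption about how such a quota could be implemented, not something taken from the talk; the class name, parameters and limits are hypothetical, and state is kept in memory for a single process only.

```python
import time
from collections import defaultdict, deque


class SlidingWindowQuota:
    """Allow at most `max_operations` per account within `window_seconds`."""

    def __init__(self, max_operations, window_seconds, clock=time.monotonic):
        self.max_operations = max_operations
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the behavior can be tested
        self._events = defaultdict(deque)  # account id -> recent timestamps

    def allow(self, account_id):
        """Record one operation and return True, or return False if over quota."""
        now = self._clock()
        events = self._events[account_id]
        # Drop timestamps that have fallen out of the window.
        while events and now - events[0] >= self.window_seconds:
            events.popleft()
        if len(events) >= self.max_operations:
            return False
        events.append(now)
        return True
```

A False result is the point where the application would trigger the user validation step (CAPTCHA, re-login or session removal). Every request pays for this bookkeeping, which is the performance side of the trade-off.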
Things to watch out for…
• Don’t trust IP addresses
• Anonymous services (Tor)
• Multiple IP addresses
• Quotas could be useless
• Adjust based on the type of information you manage
• Web scraping could be automated to be as stealthy as possible
• Standard Behavior
• There’s a fine line between a user and a good “scraper”
• Avoid damaging the site’s usability. BIG ISSUE!
• No information is safe!
• If a user can see it, it can be scraped!
Q&A
Thank you