1 Bibliography

1.1 Introduction

Client-side vulnerabilities, drive-by attacks, malicious servers: today we are literally overwhelmed, and the web becomes less secure every day. Many tools have been developed to counter this, and the one we deal with in this paper is obviously one of the most important, the client honeypot. A honeypot is a security device that is designed to lure malicious activity to itself. Capturing such malicious activity allows for studying it to understand the operations and motivations of attackers, and subsequently helps to better secure computers and networks. A honeypot does not have any production value. It is a security resource whose value lies in being probed, attacked, or compromised [6]. Since it does not have any production value, any new activity or network traffic that comes from the honeypot indicates that the honeypot has been successfully compromised. Theoretically speaking, false positives, as commonly found on traditional intrusion detection systems, do not exist on honeypots, even though there are particular cases which will be discussed later.

Honeypots can be split into two categories, server honeypots and client honeypots. The purpose of this bibliography is to clarify and understand the ins and outs of the latter, which came out recently with projects like HoneyC or Capture-HPC. Basically, client honeypots crawl the web to find and identify web servers that exploit client-side vulnerabilities. Instead of passively waiting to be attacked, client honeypots actively crawl the web to search for servers that exploit the client as part of the server response.

They belong either to "low-interaction client honeypots" or to "high-interaction client honeypots". The idea of client honeypots was first articulated by Lance Spitzner in June 2004. A few client honeypots have been created since then, namely HoneyMonkey, HoneyClient, HoneyC and Capture-HPC. Most of these implementations crawl the web with a real browser and analyse for exploits based on the state of the OS (such as monitoring changes to the file system, configuration, and process list). Since such implementations make use of real systems, they are classified as high-interaction client honeypots. Low-interaction honeypots, on the other hand, make use of emulated clients and an analysis engine that may rely on an algorithm other than OS state inspection, such as signature matching.

1.2 Low interaction honeypots

Low-interaction client honeypots have several advantages. They can be contained in a stand-alone application, so installation is highly simplified. They are also faster than high-interaction client honeypots because they are based on emulated services. We will evaluate this approach through three different implementations, focusing mainly on Monkey-Spider but also covering HoneyC [3] and SpyBye [2].

1.2.1 Monkey-Spider, a good example.

Monkey-Spider is a recent project led by Ali Ikinci [1]. As we will see, it appears to be the most complete of the three, so we will take it as an example to explain how the different building blocks work together, and we will complement the discussion with HoneyC and SpyBye.

This application can be divided into three parts:

The Seeder

The seeder block generates starting URLs for the crawler; this is what we call "the queue". There are three different ways to seed. First, you can use search engines (Yahoo, Google or MSN): you just have to specify a keyword like "warez" or "pirate", then configure how many URLs you want back. The communication is performed via web services. Second, you can use the URLs found in spam mails. If you set up a "spam trap", which is an email account established specially to collect spam, you can extract the URLs contained in the messages and use them as seeds for your crawl. However, Monkey-Spider does not examine attached files. Third, you can re-queue the URLs you have already processed thanks to the "monitorDB seeder". This feature is used to constantly reseed previously found malicious content over time from the malware database. Some other low-interaction honeyclients, such as HoneyC, offer the possibility to enter the URLs statically.
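
The spam-trap seeding step can be illustrated with a short sketch. The following minimal example uses only the Python standard library and assumes the spam is available as a local mbox file; the file name and the URL pattern are illustrative, not Monkey-Spider's actual code.

    import mailbox
    import re

    URL_RE = re.compile(r"https?://\S+")   # rough pattern, good enough for seeding

    def seed_from_spamtrap(mbox_path):
        """Extract candidate seed URLs from the bodies of spam messages."""
        seeds = set()
        for msg in mailbox.mbox(mbox_path):
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    payload = part.get_payload(decode=True) or b""
                    seeds.update(URL_RE.findall(payload.decode("utf-8", "replace")))
        return sorted(seeds)

    # The resulting list is what would be handed to the crawler as its queue.
    if __name__ == "__main__":
        for url in seed_from_spamtrap("spamtrap.mbox"):
            print(url)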

The Crawler

Monkey-Spider picked Heritrix as its crawler. Heritrix is the Internet Archive's open-source web crawler project. It is multi-threaded, very extensible and easily customizable. Heritrix can optionally crawl content through an interconnected web proxy, which can be used to increase performance and avoid duplicate crawling. Heritrix queues the URLs generated by the seeder and stores the crawled content on the file server while generating detailed log files. SpyBye, on the other hand, does not offer any crawler: it acts while you browse the web, inspecting each website before it is displayed.
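
Heritrix itself is configured rather than programmed, but the crawl loop it performs can be summarised in a few lines. The simplified sketch below is not Heritrix code; it only shows the idea: take the seeds, fetch each page, store its content, extract the links and queue the ones that have not been seen yet.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=100):
        """Breadth-first crawl: fetch each URL, store the content, queue new links."""
        queue, seen, store = deque(seeds), set(seeds), {}
        while queue and len(store) < max_pages:
            url = queue.popleft()
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                      # unreachable hosts are simply skipped
            store[url] = html                 # Heritrix would write this to the file server
            parser = LinkExtractor()
            parser.feed(html)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append(absolute)
        return store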

The Analysis engine

The analysis process can be executed on a different computer. Content is extracted from the file server and analysed with different anti-virus solutions and malware analysis tools; Monkey-Spider can be configured to use Avast, ClamAV or F-Prot. Identified malware and malicious websites are then saved in a special directory, and the related malicious information is also stored in a MySQL database. Finally, binary and JavaScript files are copied into an additional archive directory for further research purposes using CWSandbox. HoneyC's analysis engine is based on Snort signature matching, and SpyBye allows a web master to determine whether a web site is malicious through a set of heuristics and by scanning content against the ClamAV engine.
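
As an illustration of the scanning step, the sketch below runs ClamAV's clamscan command-line tool over a directory of crawled files. The directory layout and the verdict format are assumptions made for the example, not Monkey-Spider's actual schema.

    import subprocess
    from pathlib import Path

    def scan_archive(archive_dir):
        """Run clamscan over the crawled files and collect per-file verdicts."""
        results = {}
        for path in Path(archive_dir).rglob("*"):
            if not path.is_file():
                continue
            # clamscan exits with 0 for clean files and 1 when a virus is found
            proc = subprocess.run(["clamscan", "--no-summary", str(path)],
                                  capture_output=True, text=True)
            results[str(path)] = "malicious" if proc.returncode == 1 else "clean"
        return results

    # Malicious files would then be copied to a quarantine directory and recorded,
    # together with their source URL, in the MySQL database.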

1.2.2 Limitations and further work

Part of the web is completely hidden: what is commonly named "the Deep Web". According to Bergman (reference to be added), the public information on the hidden web was 400 to 550 times larger than the publicly indexable web in 2001. Those pages do not have static URLs; they are composed of forms and authentication mechanisms, and honeyclients do not process them.

We can also mention obfuscated code that is generated dynamically: because a low-interaction honeyclient does not execute scripts, we cannot catch dynamically generated links and content.

Finally, we have to mention that top sites are crawled more often than less-known ones, which can bias the results. Further work would be to improve the crawler so that it avoids always browsing the same websites (Amazon, YouTube, ...).

1.3 High interaction honeypots

Contrary to low-interaction honeypots, high-interaction honeypots aim to recreate a real environment: real systems and applications browse the web like a normal user would. Changes of state of the system are logged and analysed. These solutions rely on virtualization to easily revert the system to a clean state in case of corruption. The processing time of high-interaction honeypots is a matter of concern due to the delays implied by using real systems [5].

High-interaction client honeypots crawl the web in order to collect data that reveals variations of the host state. These variations can be file creation, registry modifications, or changes in the process list of the system. The data collected from these systems is more accurate than with low-interaction honeypots: these systems detect zero-day exploits [4], but at the same time they are more complex and require a lot of administration.

The three-component architecture previously introduced for low-interaction honeyclients can also be applied to high-interaction honeyclients: they use a queuer, a visitor, and an analysis engine to perform their task.

The queuer selects the URLs to browse and gives them to the visitor, which in our case is mainly a web browser. The last component monitors state changes and decides whether the attack has been successful or not.
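
A minimal sketch of this detection loop, on a Unix-flavoured system, could look as follows. The watched directory, the browser command and the polling approach are simplifications chosen for the example; real tools such as Capture-HPC intercept system events instead of comparing snapshots, and a real implementation would whitelist the changes the browser itself is expected to make.

    import os
    import subprocess
    from pathlib import Path

    def snapshot(watch_dir):
        """Record file modification times and the current process list."""
        files = {str(p): p.stat().st_mtime
                 for p in Path(watch_dir).rglob("*") if p.is_file()}
        procs = set(os.listdir("/proc"))             # one entry per running process id
        return files, procs

    def visit_is_malicious(url, watch_dir, browser="firefox"):
        before_files, before_procs = snapshot(watch_dir)
        visitor = subprocess.Popen([browser, url])   # the "visitor" component
        try:
            visitor.wait(timeout=30)                 # give the page time to load and run
        except subprocess.TimeoutExpired:
            visitor.terminate()
        after_files, after_procs = snapshot(watch_dir)
        new_files = set(after_files) - set(before_files)
        changed = {f for f in before_files if after_files.get(f) != before_files[f]}
        new_procs = after_procs - before_procs
        # Any unexpected change of state is treated as evidence of a successful exploit.
        return bool(new_files or changed or new_procs)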

1.3.1 Web honeypot

A honeypot like Capture-HPC is a client-server system that permits global administration of a set of client honeypots. The client implements the browser and the analysis engine; it uses the VMware Server solution to revert the virtual system in case of corruption. The server implements the queuer; it also receives results from the honeyclient systems and logs them, in real time. Capture-HPC is able to monitor multiple instances of the Internet Explorer browser and to detect which instances are compromising the system; this process is called the bulk navigation algorithm [4]. However, this does not work with Firefox. Capture-HPC also permits interaction with software other than web browsers, such as instant-messaging clients, PDF readers and office suites like Microsoft Office. Capture-HPC checks for changes in system files, the registry and processes.
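
Capture-HPC drives VMware Server through its own server component; the revert step it relies on can be reproduced, for illustration, with VMware's vmrun command-line utility. The VM path and snapshot name below are placeholders.

    import subprocess

    VMX_PATH = "/vmware/honeyclient/honeyclient.vmx"   # placeholder path to the guest
    CLEAN_SNAPSHOT = "clean"                            # snapshot taken before any browsing

    def revert_honeyclient():
        """Roll the honeyclient VM back to its clean snapshot and start it again."""
        subprocess.run(["vmrun", "revertToSnapshot", VMX_PATH, CLEAN_SNAPSHOT], check=True)
        subprocess.run(["vmrun", "start", VMX_PATH], check=True)

    # Called whenever a client reports that the guest system has been compromised.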

MITRE HoneyClient checks for changes in system and application files and also in the registry. It is one of the first high-interaction honeyclients developed.

The Spycrawler honeypot from the University of Washington is a proprietary crawler developed in 2005. It focuses on spyware and uses Google search results as an entry point. The major drawback of this solution is that it relies on the Ad-Aware spyware database to evaluate a website.

Zero-day exploit detection is also a purpose of high-interaction client honeypots. HoneyMonkey [7] uses several honeypot clients at different patch levels to detect this kind of attack. It implements a bulk and sequential algorithm to perform its analysis: a group of URLs is browsed in parallel, and if the system is compromised, each URL is checked individually by different systems with different levels of security patches applied. This permits the detection of zero-day exploits.
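
The sketch below illustrates that bulk-and-sequential strategy. It assumes two helpers, visit_group(urls, level) and visit_one(url, level), which visit URLs on a virtual machine at the given patch level and report whether the guest was compromised; both helpers and the patch-level names are placeholders for the real VM orchestration.

    PATCH_LEVELS = ["SP1", "SP2", "fully-patched"]         # illustrative names

    def scan_batch(urls, visit_group, visit_one):
        """Bulk stage on one unpatched VM, then per-URL rechecks at higher patch levels."""
        if not visit_group(urls, "unpatched"):
            return {}                                      # whole batch is clean after one visit
        results = {}
        for url in urls:
            if not visit_one(url, "unpatched"):
                continue                                   # this URL did not cause the compromise
            verdict = "possible zero-day exploit"          # survives every patch level below
            for level in PATCH_LEVELS:
                if not visit_one(url, level):
                    verdict = "exploit stopped by " + level
                    break
            results[url] = verdict
        return results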

The high-interaction honeyclient Pezzonavante uses this same idea of operating systems at different patch levels. It also uses integrity checks, security tool scans (anti-virus and anti-spyware), IDS, traffic analysis and snapshot comparisons. It is a fast honeyclient because it only performs monthly integrity checks, but it is known to miss some alerts.

There are also other solutions, like SiteAdvisor from McAfee, that use high-interaction honeyclients to produce their results. Table 1 summarizes the specificities of these high-interaction honeyclients.

1.3.2 Mail honeypot

SHELIA aims to provide a mail honeypot. It opens mails via an MUA like Outlook Express and typically goes through the spam folder searching for malicious content and URLs, which are then opened in web browsers.

1.3.3 Performance issue

The main issue when using high-interaction honeypots is speed. Usually, scans aim to determine whether a web server is malicious. Visiting multiple servers in a single scan improves speed but does not permit precise identification of the malicious servers. Algorithms like divide-and-conquer for high-interaction honeypots can improve speed by up to 72% [5].
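
The divide-and-conquer idea of [5] can be sketched as follows, again assuming a visit_group(urls) helper that visits a batch of URLs in one VM session and reports whether the guest was compromised. Recursively bisecting a compromised batch isolates the malicious URLs with far fewer visits than checking every URL one by one.

    def find_malicious(urls, visit_group):
        """Recursively bisect a batch of URLs to isolate those that compromise the VM."""
        if not urls or not visit_group(urls):
            return []                          # a clean batch is cleared with a single visit
        if len(urls) == 1:
            return list(urls)                  # a single compromising URL has been identified
        mid = len(urls) // 2
        return (find_malicious(urls[:mid], visit_group) +
                find_malicious(urls[mid:], visit_group))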

Name          | Creation date | Monitored entities         | Specificity
Capture-HPC   | unknown       | files, processes, registry | client/server architecture that allows parallel browsing and interaction with multiple browsers (Internet Explorer, Opera, Firefox) and other programs (Adobe Reader, OpenOffice)
HoneyMonkey   | unknown       | files, processes, registry | forwards a malicious URL from an unpatched honeyclient to a fully patched one, which permits the detection of zero-day exploits
HoneyClient   | 2004          | files, registry            | includes a scoring system that ranks the crawled URLs
UW Spycrawler | 2005          | spyware infections         | relies only on the Ad-Aware spyware database to detect spyware

Table 1: Specificities of high-interaction honeyclients

1.4 Honeyweb

In this part, we will deal with Honeyweb to introduce the project and to see how it could be useful for the honeyclient user community.

Honeyweb is a web platform based on Java technologies that allows honeyclient users to manage and run honeypots remotely over the web. It is also useful for centralizing and aggregating log files from heterogeneous honeyclients. Honeyweb is still in development and not all the functionalities are implemented yet, but eventually, thanks to web services, Honeyweb will be able to communicate with different honeyclients, on different systems, with parallel threads.

If we succeed in gathering a large number of honeyclient users around Honeyweb, we could get better results by cross-referencing their results. Here are some milestones that Honeyweb should reach.

First, the number of registered honeyclients is very important: the more we have, the larger the amount of URLs we can analyse. As we saw in the previous part, one of the limitations is that the web keeps changing all the time, so from a wide crawl of the web we could aggregate the logs to determine the history of websites, for example to figure out when they became malicious.

Second, the number of different honeyclients matters. Having many different types, high-interaction and low-interaction, but also honeyclients from different continents and countries, would be very useful. Indeed, this heterogeneity allows us to examine each URL in depth: for example, we can first browse a large range of IP addresses with low-interaction honeyclients and then redirect the malicious ones to high-interaction honeyclients for deeper analysis.
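
Such a two-tier setup can be sketched as a simple triage pipeline. The helpers low_interaction_scan(url) and high_interaction_scan(url) stand in for whatever honeyclients are registered with Honeyweb; their names and return values are assumptions made for the example.

    def triage(urls, low_interaction_scan, high_interaction_scan):
        """Fast first pass with low-interaction clients, deep second pass on the suspects."""
        suspects = [url for url in urls if low_interaction_scan(url) == "suspicious"]
        # Only the small set of suspects is handed to the slow high-interaction clients.
        confirmed = [url for url in suspects if high_interaction_scan(url)]
        return {"scanned": len(urls), "suspects": suspects, "confirmed": confirmed}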

1.5 Conclusion

The World Wide Web is a system of interlinked hypertext documents that runs over the Internet. We consider the whole web as the set of content available on computer systems with unique IP addresses at a fixed point in time; we call this set the web. The first drawback in any research on this set is that it changes significantly over time and cannot be captured in any database or similar structure. We therefore cannot "dump" the web at a particular point in time, so we crawl content, extract as many links as possible from it, and try to follow them as far as useful.

Finally, we should mention that malicious servers can hide themselves from honeyclient IP ranges using blacklists. That is why it is very important not to reveal IP addresses, locations or anything related to honeyclients that could compromise their efficiency. Additionally, future malware could use vulnerabilities in the malware analysis tools themselves to report back the honeyclients' IP ranges or even to detect the type of honeyclient. Mirroring the way the Internet community blacklists dangerous web sites, such blacklisting of "dangerous" web clients could become common among malicious web site operators.

References

[1] A. Ikinci, T. Holz, and F. Freiling. Monkey-Spider: Detecting Malicious Websites with Low-Interaction Honeyclients. Sicherheit 2008, Saarbruecken, 2008.

[2] N. Provos. SpyBye: Finding malware.

[3] C. Seifert, I. Welch, and P. Komisarczuk. HoneyC: The Low-Interaction Client Honeypot. In Proceedings of the 2007 NZCSRCS, Waikato University, Hamilton, New Zealand, 2007.

[4] C. Seifert. Cost-effective Detection of Drive-by-Download Attacks with Hybrid Client Honeypots. PhD thesis, Victoria University of Wellington, 2009.

[5] C. Seifert, I. Welch, and P. Komisarczuk. Application of divide-and-conquer algorithm paradigm to improve the detection speed of high interaction client honeypots. In SAC '08: Proceedings of the 2008 ACM Symposium on Applied Computing, pages 1426-1432, New York, NY, USA, 2008. ACM.

[6] L. Spitzner. Honeypots: Tracking Hackers. Addison-Wesley Professional, 2003.

[7] Y.-M. Wang, D. Beck, X. Jiang, R. Roussev, C. Verbowski, S. Chen, and S. King. Automated Web Patrol with Strider HoneyMonkeys. In Proceedings of the 2006 Network and Distributed System Security Symposium, pages 35-49, 2006.
