


Understanding Threats: a Prerequisite to Enhance Survivability

of Computing Systems

F. Pouget, M. Dacier, V.H. Pham
Institut Eurécom

B.P. 193, 06904 Sophia Antipolis, FRANCE
Email: {pouget, dacier, pham}@eurecom.fr

Abstract

This paper aims at showing the usefulness of simple honeypots to obtain data that can be used to derive analytical models of the attack processes present on the Internet. Built upon an environment which has been deployed for 18 months, we provide figures and analyses that enable us to better understand how attacks are carried out in the wild. Key contributions of this paper include a critical review of the geographical information provided by NetGeo, a study of the aftermath of the Deloder worm and an in-depth analysis of the interaction between the populations of compromised machines devoted to scanning the Internet and the ones in charge of actually running the attacks.

1. Introduction

The mere existence of the WORM Symposium indicates that Internet-wide infectious epidemics have emerged as one of the leading threats to information security and service availability. Important contributions have been made in the past that have proposed propagation models [1, 2, 3] or that have analyzed, usually by reverse engineering specific worms, their modus operandi [4, 5, 6, 7]. A few initiatives have been taken to monitor real-world data related to worm and attack propagation. The Internet telescopes and the DShield web site are among them. These approaches are extremely valuable but we will show in this paper that it is also worth using much simpler mechanisms, namely low-interaction honeypots, to complement the type of information they provide.

In this paper, we will show the usefulness of the data collected by simple honeypots to formulate and validate assumptions regarding the propagation not only of worms but also of other types of malicious tools. Thanks to the data gathered in an environment which has been deployed for more than 18 months, we will give a few examples of the kind of findings that can be derived from this data set. We also want to point out that the collection of systematic data on which various detection and reaction mechanisms can be tested is a prerequisite for the evaluation of novel techniques, and thus for enhancing survivability of IP networks.

The long-term goal of our research is to deploy similar setups in many different places to see if identified threats are identical on a worldwide basis or if, on the contrary, specificities exist. Results presented in this paper justify such a deployment by the richness of the information that can be obtained. The simplicity of the setup makes it easy to deploy at almost no cost in a large number of places. The results presented here are based on a VMWare environment which runs on a recent machine with at least one GB of memory [8]. This setup cannot be replicated at no cost. However, as our results indicate that most attacks do not make use of known OS fingerprinting techniques, we can, with a very low probability of introducing a bias in our experiments, replace this expensive platform by another one, based on honeyd, which is easier to fingerprint but which runs on an old PC equipped with only 256 MBytes of memory [9]. We have deployed such an environment, in parallel to the VMWare one, to successfully confirm this assumption. In this paper though, as we have accumulated a much longer period of data with the VMWare platform, we present the results obtained with it.

This paper is a follow-up to three others we have published in the past [10, 11, 12]. Earlier publications used a 10-month data set (March 2003 until December 2003) while this one is based on a 16-month data set (February 2003 until May 2004). With respect to previous work, we present four novel key contributions, namely:


1. The most recent months of observation confirm the findings discussed in previous publications. This long-term stability is a very strong argument in favor of those who try to build analytical models of Internet threats.

2. Previous results have presented Australia as the main source of attacks. We found out that this was an artifact caused by the tool chosen to obtain the geographical information, namely NetGeo [13]. Revised results obtained by means of another tool, MaxMind, are proposed.

3. We investigate the aftermath of the Deloder worm, which appears to be very atypical and which, as far as we know, has never been discussed publicly so far. This analysis opens some avenues for further investigation.

4. We present an experiment carried out to confirm the assumptions made in our previous publications: two distinct sets of compromised machines are used to run attacks. The first set is in charge of scanning the net without attacking machines while the second set uses the information provided by the first one to run the attacks. The results of this recent experiment are discussed.

The structure of the paper is as follows. Section 2 presents two other solutions (Internet telescopes and DShield) to collect data about ongoing attacks and discusses the added benefits of our approach. Section 3 briefly presents our own environment and summarizes previous results. Section 4 presents the four novel contributions of this paper. Section 5 concludes the paper.

2. State of the Art

For the sake of conciseness, we refer the reader interested in a complete treatment of the state of the art concerning honeypots and data collection to our previous publications [10, 11, 12]. Still, we briefly cover below two major ongoing initiatives related to the collection of data on real-world attacks: DShield and the Internet telescopes.

2.1. DShield Project

The main idea of the DShield project [14] is to gather in a central place logs from a large number of firewalls. A freely available web site offers various representations of the database, which can also be queried thanks to a user-friendly interface. DShield operates in association with the SysAdmin, Audit, Network, Security (SANS) Institute, which hosts the Internet Storm Center and a similar web site [15]. Another noticeable project is myNetWatchman (see [16]), a free service which aggregates firewall log records, backtraces the activity to its source (if possible) and automatically sends escalation emails to the responsible party (ISPs for instance). However, all these projects present similar restrictions that are explained below, taking DShield as a concrete, illustrative, example.

These approaches deliver trends of attacks occurring in the Internet at large but our experience shows that these macroscopic values can hide important local differences. Therefore, these data must be used with care. As an example of local vs. global differences, we present in Table 1 two different views of the 10 most attacked ports in France during the first week of June 2004, in decreasing order of appearance. The first column shows data obtained on the DShield site while the second one is based on what our honeypots have observed. This leads to the following comments:

1. Both environments consider port 445 as the most attacked port.

2. They only have 6 out of 10 ports in common and the ports that appear in both lists are ranked very differently. This highlights the fact that DShield only provides global trends that can be different from one place to another, depending on many factors (geography, network type, domain name, etc.) that remain to be identified and quantified.

3. Firewall logs miss some important information that is needed to identify attack tools. As explained in [14], different tools target the same ports or even the same sequences of ports. Furthermore, some ports will only be targeted if some others are open1. This explains the difference observed in the table for, e.g., port 4444, which is only scanned by the Blaster worm if port 135 on the same machine is open. Since that port is closed on most firewalls, DShield records do not show many hits against port 4444. This is different for our honeypots where port 135 is open and where, as a consequence,

1 See Section 5.3 of [17] for more details on Blaster worm attacks. The infection always follows the same general pattern: a small set of attack packets obtain initial results, and further network traffic follows, either from the egg development, or from subsequent scans.


DShield trends   Local trends
445              445
135              139
139              137
1433             135
9898             4444
3127             1026
1434             1027
5554             1433
1025             4899
137              9898

Table 1: DShield port trends compared to our local observations: ports are listed in decreasing order

many requests are sent to port 4444 as well. This highlights the fact that, in order to analyze data, the collecting environment must be precisely defined and tuned. Collecting logs from a variety of firewalls which are configured in many different ways, in very diverse environments, limits the analysis to very widespread phenomena in the best case, and prevents us from seeing others in the worst case.
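The overlap between the two rankings of Table 1 can be checked mechanically. The lists below are transcribed from the table; the computation is only a sketch of the comparison discussed above.

```python
# Top-10 attacked ports during the first week of June 2004,
# as transcribed from Table 1 (most attacked first).
dshield_top10 = [445, 135, 139, 1433, 9898, 3127, 1434, 5554, 1025, 137]
local_top10 = [445, 139, 137, 135, 4444, 1026, 1027, 1433, 4899, 9898]

# Ports appearing in both rankings.
common = set(dshield_top10) & set(local_top10)
print(sorted(common))  # 6 ports are shared between the two views

# Compare the rank (0 = most attacked) of each shared port.
for port in sorted(common):
    print(port, dshield_top10.index(port), local_top10.index(port))
```

Running this confirms the observation above: only 6 ports are common, and a port such as 137 ranks last in the DShield view but third locally.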

2.2. Internet Telescopes

Another approach concerns large telescopes. They consist basically of a large block of globally announced IPv4 addresses (most of them unused) and a dedicated engine that monitors all traffic to these addresses. The most well-known one is developed by the CAIDA project (see [18] for a good application of telescopes to the analysis of the Witty worm). This project was the very first one, to our knowledge, to start investigating rigorously some of the threats found on the Internet. Their seminal work on the widespread usage of DDoS attacks [19] was a great source of inspiration for the design of our own approach.

This telescope, and others such as the one described in [20], provide very large amounts of information. These data sets are very interesting for analyzing worm propagation [5]. However, this large volume prevents them from maintaining detailed information about each packet (such as the payload, flags, sequence numbers, etc.). Unfortunately, such information might be important to differentiate traces due to different tools. Also, the generalization of the results obtained thanks to a given telescope is subject to controversy. How can we be sure that the phenomena observed on a given

set of addresses are representative of those happening elsewhere? Since the trends provided by DShield indicate that differences exist between continents and countries, one should deploy telescopes in many different places in the world to come up with a good representation of the ongoing attacks. This is clearly not feasible due to the complexity and cost of such telescopes.

As we can see, both types of approaches have merits and drawbacks. The setup we are working on aims at complementing the data already available. By having a simple honeypot setup deployed in many places in the world, we can provide some refined analysis of the attacks observed and, hopefully, correlate or provide explanations for data sets collected by the other projects.

In the following, we detail the kind of results that can be derived from a single setup to justify the usefulness of deploying similar ones, yet not using VMWare anymore, on a large scale.

3. Experimental Setup

3.1. The Observation Platform

Our environmental setup consists of a virtual network built on top of VMWare [8] to which three virtual machines, or guests in the VMWare terminology, are connected (a Windows 98 workstation, a Windows NT server and a Linux RedHat 7.3 server). It is very important to understand that the setup is such that these machines can only be partially compromised. They can accept certain connections on open ports but they cannot, because of a firewall, initiate connections to the outside world. Furthermore, as they are built on non-persistent disks [8], changes are lost when virtual machines are powered off or reset. We do reboot the machines very frequently, almost on a daily basis, to clean them from any infection that they could have had. The three machines are attached to a virtual Ethernet switch. ARP spoofing is implemented so that they can be reached from the outside world. A fourth virtual machine is created to collect data in the virtual network. It is also attached to the virtual switch2 and tcpdump is used as a packet collector [21]. In addition, a dedicated firewall controls the access to all machines (iptables [22]) and tripwire regularly checks the integrity of the host files. More detailed information about the setup can be found in [10].

2 What VMWare calls a switch is in fact a hub.


3.2. Data Collection and Storage

In the following, we will make use of the expressions attack source and ports sequence, which we define as follows:

  • Attack Source: an IP address that targets our environment within one day. The same IP address seen on two different days counts for two different attack sources (see [12] for more on this).

  • Ports Sequence: an ordered list of ports targeted by an attack source. For instance, if source A sends requests on port 80 (HTTP), and then on ports 8080 (HTTP Alternate) and 1080 (SOCKS), the associated ports sequence will be {80;8080;1080}.
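To make the two definitions concrete, the following sketch derives attack sources and their ports sequences from a toy packet log. The log format (day, source IP, destination port) is a hypothetical simplification of what the database actually stores.

```python
from collections import OrderedDict

# Hypothetical packet log: (day, source IP, targeted port), in arrival order.
packets = [
    ("2004-06-01", "1.2.3.4", 80),
    ("2004-06-01", "1.2.3.4", 8080),
    ("2004-06-01", "1.2.3.4", 1080),
    ("2004-06-02", "1.2.3.4", 445),  # same IP, new day -> new attack source
]

# An attack source is an (IP, day) pair; its ports sequence is the
# ordered list of distinct ports it targeted that day.
sources = OrderedDict()
for day, ip, port in packets:
    seq = sources.setdefault((ip, day), [])
    if port not in seq:  # keep first-seen order, ignore repeated probes
        seq.append(port)

print(len(sources))                        # 2 attack sources
print(sources[("1.2.3.4", "2004-06-01")])  # [80, 8080, 1080]
```

Note how the same IP address observed on two different days yields two distinct attack sources, matching the definition above.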

All network packets are stored in a database and enriched with geographical information about the sources as well as their OS fingerprints. The database is made of several tables that are optimized to efficiently handle complicated queries. In other words, the computing cost of querying the database is marginally influenced by the number of logs in it. Of course, the price to pay is increased disk and memory space to handle redundant meta information (keys, counters, etc.) in several tables. We refer the interested reader to [12] for more information on the design of the database.

3.3. Attack Tools Identification

In [12], we have presented a clustering technique which enables us to identify clusters of traces caused by different attack tools. The different steps are basically:

1. Group attack sources per ports sequence.

2. For each group, gather important parameters in a DB table.

3. Some parameters are generalized by means of a hierarchy algorithm.

4. Clusters are extracted thanks to data mining techniques (association rules).

5. Cluster consistency is validated thanks to a phrase distance algorithm (Levenshtein); inconsistent clusters are split and checked again until they satisfy the validation criteria.

6. Output: one cluster corresponds to one attack tool.

7. Find the name of the tool associated with the cluster (tool fingerprinting).

The algorithm used lies beyond the scope of this paper. In the following, when we use the notion of clusters, we refer to the results obtained by applying this algorithm to our database. It is important to keep in mind that each cluster is due to only one attack tool but also that one given attack tool can generate different traces which could be grouped into several clusters. In other words, the relation that links clusters to attack tools is an N-to-1 relationship.
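Step 5 of the pipeline relies on the Levenshtein (edit) distance. The sketch below shows the classic dynamic-programming algorithm together with a naive all-pairs consistency check; the max_dist threshold and the all-pairs criterion are illustrative assumptions, not the paper's actual validation rule.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cluster_is_consistent(traces, max_dist=3):
    # Hypothetical criterion: a cluster is consistent if every pair of
    # traces is within max_dist edits of each other.
    return all(levenshtein(s, t) <= max_dist
               for i, s in enumerate(traces)
               for t in traces[i + 1:])
```

Under this criterion, a cluster whose traces differ by more than the threshold would be split and re-checked, as described in step 5.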

4. New Results

4.1. The Big Picture

In the previous publications, we have shown and discussed early results obtained by querying our database and applying the clustering algorithm on a database containing 10 months of data. With 50% more data, we are still able to confirm all results obtained so far. This surprising stability over a long period of time is extremely encouraging for those who wish to obtain macroscopic analytical models of Internet threats. Of course, as we will see hereafter, things are much more complicated and less stable when one tries to dig into the data to better understand the root causes of these stable processes.

To highlight this regularity, we provide a few key figures obtained over this 16-month period and we refer the reader to our previous work ([11, 10, 12]) to verify for themselves how stable these numbers are:

  • 4,102,635 packets from/to our virtual machines have been stored in the database.

  • 28,722 attack sources have been observed.

  • Only 205 different ports have been probed.

  • Only 604 different ports sequences have been observed.

  • All attack sources, apart from a few very rare exceptions, have sent requests for less than 30 seconds.

  • In 69% of the cases, attack sources have sent requests to the 3 honeypots, always in the same order.

  • In 6%, attack sources have contacted only two out of the three machines.


  • In 25% they have focused on only one of the three honeypots.

  • 99.5% of the IP addresses have been seen on only one day.

Last but not least, Figure 1 shows the amount of attack sources observed per country per month. The important thing to note here is the impressive uniformity of the distribution. A few countries systematically compose the top 7 of the attacking countries: China, the USA, Japan, Taiwan, France, Germany and Korea.

However, it is important to stress that we do note an important discrepancy during the last two months for the attacks originating from China and the USA. A dramatic increase is observed for these data sets which cannot be easily explained by the appearance of a single worm. Furthermore, a deeper analysis reveals that a large number of these attacks target only one of our three honeypots. This is a very new pattern of attack that we had not observed before. On the one hand, this seems to contradict our claim that regular patterns exist and that analytical models could therefore be proposed for the existing threats. On the other hand, we have been able to identify this new type of attack pattern precisely because it did not fall within the regular pattern. Thus, despite this new unstable activity, we maintain that data sets obtained by means of simple honeypots are amenable to analytical modeling of the attack processes present on the Internet.

4.2. NetGeo vs. MaxMind

We discussed in [10] some information on attacking machines, and more specifically on their geographical location. In that paper, we presented results obtained thanks to the NetGeo utility, developed in the context of the CAIDA project [13]. NetGeo is made of a database and a collection of perl scripts that map IP addresses to geographical locations. This utility is open-source and has been applied in several research papers, among which [23, 24, 25, 26]. In our case, running NetGeo scripts on our database indicated that 70% of the attacks originated from only three countries: Australia (27.7%), the USA (21.8%) and the Netherlands (21.1%). This was quite surprising but very consistent month after month. As a result of these first publications and discussions with our peers, we have decided to use another system that also provides geographical locations, namely MaxMind [27], to double-check the results obtained with NetGeo.

Figure 1: # of attack sources per country and per month

Results are presented in Figure 2 for the top 4 attacking countries according to NetGeo: Australia, Netherlands, USA and 'unknown'. For all IP addresses that NetGeo indicated as belonging to one of these countries, we have checked the results returned by MaxMind. The Figure must be read as follows: in the upper left pie chart, one can see that 29% of the IP addresses that NetGeo has located in Australia are located in China by MaxMind.

If we consider the results of MaxMind instead of those of NetGeo, our top 4 countries are now replaced by a group of 7 countries, namely: USA, Germany, China, France, Japan, Taiwan and Korea (see Table 2). This seems to better fit the expectations of the security community. For Australia and the Netherlands, MaxMind gives completely different results than NetGeo. According to MaxMind, almost no attacks are originating from Australia anymore; they seem instead to come from four other countries: China, Japan, Taiwan and South Korea. Moreover, the Netherlands are replaced by two main countries: Germany and France. It is also worth pointing out that, in our data set, both tools disagree for 61% of all IP addresses. Also, even if they agree on the total amount of IP addresses located in the USA, 21.8% vs. 22.3%, there are important differences regarding the specific state within the USA where they locate the IP addresses.
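The 61% disagreement figure is simply the fraction of IP addresses for which the two tools return different countries. A minimal sketch, using made-up per-IP country labels (not our actual data):

```python
# Hypothetical per-IP country labels returned by two geolocation tools.
netgeo = {"ip1": "AU", "ip2": "NL", "ip3": "US", "ip4": "AU", "ip5": "US"}
maxmind = {"ip1": "CN", "ip2": "DE", "ip3": "US", "ip4": "JP", "ip5": "US"}

# IPs on which the two tools disagree, and the disagreement rate.
disagree = [ip for ip in netgeo if netgeo[ip] != maxmind[ip]]
rate = 100 * len(disagree) / len(netgeo)
print(f"{rate:.0f}% disagreement")  # 60% for this toy data
```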

Page 6: Understanding Threats: a Prerequisite to Enhance Survivability of Computing Systems

Figure 2: top 4 attacking countries according to NetGeo compared to MaxMind results

The explanation of these differences lies in a precise understanding of the information returned by both tools. The structure of the Internet is based on inter-connected autonomous systems (ASs), where each AS is administered by a single authority with its own choice of routing protocols, configuration, and policies. For a few large, geographically dispersed ASs, NetGeo returns the geographical location of the authority of the AS to which the IP belongs instead of the real location of the IP itself. This is what happens for RIPE (Netherlands) and APNIC (Australia). However, this is not what NetGeo is supposed to return, as described in [13]. At the time of writing, this apparently erroneous behavior is not clear to us and might lead to misleading information3. It is perhaps due to some simple coding error that the people in charge of the NetGeo scripts could hopefully fix relatively easily. On the other hand, according to one of the MaxMind representatives, their system uses user-entered location data aggregated with 'whois' data to better estimate the locations of the end users. These data are obtained thanks to data exchange agreements with some of their customers.

3 06/28/2004, Bugtraq mailing list: information on the Scob Trojan indicates that the IP addresses of infected machines are mostly located in the USA and Australia [28]. The author indicates that the information is based on APNIC; it is quite likely that here too Australia is blamed for no good reason...

Countries       NetGeo   MaxMind
Australia        27.7      0.6
China             1.5     10.1
France            4.1      6.1
Germany           1.9     12.0
Great Britain     0.0      2.5
Italy             1.1      2.7
Japan             0.2      5.7
Netherlands      21.1      1.1
South Korea       0.7      4.8
Spain             0.5      2.0
Taiwan            0.7      5.5
US               21.8     22.3
Others           18.7     24.6

Table 2: NetGeo vs. MaxMind against our honeypot data (% of observed attack sources per country)

The most important consequence of using the information returned by MaxMind instead of NetGeo is that Australia, which was, by far, our top attacking country, now disappears from the map.

4.3. Deloder worm and aftermath

Figure 3: Australian attacks observed each month, based on the NetGeo utility

One of the questions left unanswered in [10] was the apparent decrease of attacks coming from Australia around July 2003. This is represented in Figure 3. Figure 4 represents the same curves, based on MaxMind data, for all addresses identified as Australian ones by NetGeo.

Figure 4: Addresses identified as Australian by NetGeo: repartition by country over the months

We notice that attacks are coming from four Asian countries, respectively China, Japan, Taiwan and South Korea. We were expecting to find the explanation of the 'Australian' decrease in a change due to one of these countries, as a consequence, for instance, of some increased control of traffic flowing out of the country. However, Figure 4 shows that the decrease exists for all countries in mid-2003. In addition, we have applied the clustering algorithm presented in [12], which shows that the decrease is mainly due to a few specific clusters, all of them involving the ports sequence {445}. The tool associated with these clusters has been identified as being the Deloder worm [6, 7]. In Figure 5, we represent, per month and per country, the amount of attack sources compromised by the Deloder worm that have tried to propagate to our honeypots.

Each represented cluster can be linked to one of its variants. This worm, which spreads over Windows 2K/XP machines, attempts to copy and execute itself on remote systems, via accessible network shares. It tries to connect to the IPC$ share4 and uses specific passwords. According to antivirus websites like [6, 7], this worm, which was initially detected in March 2003, originates from Asia, and more especially China. This is consistent with our curves.

4 Or the ADMIN$, C$, E$ shares, depending on the Deloder variant.

The surprising fact comes from the rapid decrease of its propagation around July 2003. [5] mentions that the shutdown of CodeRedII was pre-programmed for October 1, 2001. [29] mentions that the Welchia worm will self-terminate on June 1st, 2004, or after running 120 days. A similar mechanism could have been used for Deloder but, as far as we know, no one has ever made mention of it publicly. In the absence of such a mechanism, it is worth trying to imagine the reasons for such a sudden death. We have come up with the following scenarios:

1. Deloder is still active but our virtual machines are not scanned anymore, for some unknown reason. Statistically speaking, this seems unlikely and should be validated by means of other similar platforms.

2. All machines have been patched and Deloder has been eradicated. This is another unlikely scenario since Deloder targets a large number of platforms, many of them being personal computers which will probably never be patched. Newer successful worms targeting the same port (e.g. Sasser, Welchia, the Korgo family, etc.) tend to confirm this.

3. Deloder bots are listening on IRC channels for commands to run attacks. One of these commands might have told them to stop the propagation process. In this case, the Deloder worm is not visible anymore but its botnet remains as dangerous as before.

At this point in time, unless a pre-programmed shutdown is included in the Deloder worm code, we consider the third option as the most plausible one. A definitive answer to that question could be brought forward by someone who has access to the Deloder worm code, which we do not. If our assumption holds true, this would imply that worm writers have developed a new strategy. Instead of continuously trying to compromise more machines, they have decided to enter into a silent mode when the size of their botnets is sufficient. By doing so, they dramatically reduce the likelihood of seeing an in-depth study of their worm being done, as invisible worms are definitely less interesting to the security community than virulent ones. The bottom line of our findings is that such an in-depth analysis of that


worm is probably worth being done if it has not been done yet. Sleeping worms might actually be more sophisticated and nefarious than active ones.

Figure 5: Deloder activity (# associated attack sources) observed over the months

4.4. Scanners and Attackers

4.4.1. Preliminary Note

In [11], we have shown that approximately 70% of the observed IP addresses have sent requests to our 3 honeypots while 25% have sent packets to only one honeypot, with an unexpected success rate. Apart from some identified exceptions (essentially backscatter), all IPs belonging to this category have sent packets to ports that were actually open! Statistically speaking, it is very unlikely to see this phenomenon. As a consequence, we have postulated that among the 70% of machines talking to our three honeypots, the role of some of them was restricted to scanning our machines without trying any kind of attack against them. The result of this information gathering process was then later used by machines that we had never seen before, enabling them to systematically hit our machines on open ports.
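A back-of-the-envelope computation illustrates why hitting only open ports by chance is so unlikely. The numbers below (10 open ports out of 65535, 3 probes) are illustrative assumptions, not measurements from our platform:

```python
# If a honeypot has k open TCP ports out of 65535, a source probing p
# distinct ports uniformly at random hits only open ones with
# probability roughly (k / 65535) ** p.
k, total, probes = 10, 65535, 3
p_all_open = (k / total) ** probes
print(p_all_open)  # ~3.6e-12: random scanning cannot explain the observations
```

Even with generous assumptions, the probability is vanishingly small, which supports the postulate that these sources acted on previously gathered scan results.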

In order to validate that claim, to gain a better understanding of the ratio of machines involved in the scanning-only process and to try to figure out how many populations of scanners we are facing, we have designed an experiment, the results of which are presented below.

4.4.2. Experiment

Starting mid-October 2003, we have opened port 1433 on our Linux virtual machine5, which is the traditional port for the MS SQL service on Windows machines. Before that day, we had never observed a machine talking to that sole Linux box sending packets to that port. We were interested in verifying whether the situation would change once the 'hypothetical' scanners had figured out that this port was now open.
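As the footnote explains, only a bare daemon was bound to the port, with no service behind it. A minimal sketch of such a listener is shown below in Python; this is a hypothetical reconstruction (the paper does not say how its daemon was implemented), which simply accepts each connection, records the source address, and closes without answering. For demonstration it binds an ephemeral local port instead of 1433.

```python
import socket
import threading

def open_sink(host="127.0.0.1", port=0):
    """Bind a listening socket with no service behind it (port 0 = ephemeral, for testing)."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(5)
    return srv

def serve_once(srv, log):
    """Accept one connection, record the source IP, close immediately: no service attached."""
    conn, (ip, _src_port) = srv.accept()
    log.append(ip)
    conn.close()

# demo: one "scanner" connects, the sink records its address
log = []
srv = open_sink()
t = threading.Thread(target=serve_once, args=(srv, log))
t.start()
c = socket.create_connection(srv.getsockname())
t.join()
c.close()
srv.close()
print(log)  # prints ['127.0.0.1']
```

To a port scanner, such a sink is indistinguishable from a real open service at the TCP level, which is all this experiment requires.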

The results we have obtained are given in Figure 6. It represents the number of sources that targeted the sole port 1433 on that unique machine. We note that they are all new IPs: none had been observed previously. Point 1 in the figure corresponds to the date we opened this port. Point 2 shows the first explicit attack observed on that sole machine. It reveals that it takes around 15 days for such a precise attack to happen. Also, it is worth pointing out that port 1433 has been opened on a Linux machine while this port is normally used by a Microsoft service. Thus, this indicates that scanning machines provide some basic information regarding the open ports but fail to fingerprint the OS of the machines they have just probed. This might be too costly relative to the probability of finding a Windows port open on a Linux box. Moreover, the increasing number of specific attacks shows that we are facing different attacker communities. Otherwise, they would not try to attack the same machine without success again and again. Clearly, these attackers did not share the experience of their failures.

At the end of December 2003, the number of attacks still increases, but less abruptly. Thus, we have decided to close this port on January 12th, 2004. This is indicated by point 3 in Figure 6. We then note a very fast decrease of the observed attacks. They almost totally disappear in February and there are none of them in March. This simply means that some scanners have updated the shared information, and that it takes less than two weeks for the attackers' community to update their information and to stop the attacks.

In summary, this experiment allows us to deduce four major results:

1. The claim about scanner-only machines is validated.

5We started a daemon listening on that port. There is no service attached to it.


Figure 6: # attack sources having targeted port 1433 only, on the sole Linux virtual machine

2. The shared information is not as good as it could be. Many tools now implement OS fingerprinting techniques, actively [30] or passively [31, 32, 9]: either the scanners having targeted our machines are not using them, or the attacker communities do not check this information before launching their attacks. The second hypothesis seems to be the more likely one. Indeed, we observe many scans on port 1433 following scans on other Windows ports like 139, 445 or 135. As a consequence, such scanners know whether the machine is a Windows station or not.
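The inference behind point 2 can be made concrete. Assuming, purely for illustration, that each scanner's findings are summarized as the set of ports that answered on a target, a scanner that also probed the RPC/NetBIOS/SMB ports could already tell a Windows box from our Linux honeypot without any fingerprinting tool:

```python
# Windows-specific service ports: RPC (135), NetBIOS (139), SMB (445)
WINDOWS_PORTS = {135, 139, 445}

def looks_like_windows(open_ports):
    """Crude OS guess from answering ports alone (no fingerprinting):
    a box answering on any Windows-specific port is probably Windows."""
    return bool(WINDOWS_PORTS & set(open_ports))

# our Linux honeypot with port 1433 artificially opened
print(looks_like_windows([1433]))            # prints False
# a genuine Windows station running MS SQL
print(looks_like_windows([139, 445, 1433]))  # prints True
```

That the attacks on our Linux box kept coming for weeks suggests that even this trivial check was not applied to the shared scan data.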

3. The previous point justifies the usage of simpler honeypots than VMWare ones to replicate our environment. Indeed, if attackers do not bother distinguishing between Windows and Linux boxes, it is quite likely that they are not more interested in detecting honeypots. We had initially built our environment on a VMWare platform to avoid introducing any bias in the data collection process. Now armed with these results, we have good reasons to shift to a cheaper and simpler environment, despite its limitations with respect to fingerprinting. We will, though, maintain a few VMWare machines to verify that observations obtained in both environments keep being consistent.

4. This experiment gives a rough estimate of the number of communities of attackers. Indeed, we can assume each attack is independent and comes from a different person or group of persons, since it seems unlikely that someone would repeatedly retry an unsuccessful attack. In our case, 76 independent communities have attacked our machines without sharing any piece of information. We are in the process of refining these numbers thanks to a larger number of platforms over a longer period of time.

Having validated the four previous points, we can analyze this phenomenon a little further by using our clustering algorithm in order to determine the potential scanners that have made this shared information available. We can reasonably assume that the information about open ports is maintained up to date by a single tool. In other words, the same scanning tool is responsible for having identified the port as open and, later, for having changed the information regarding that port when it found it closed again. With this assumption in mind, we observe the following facts:

1. From the date we opened the port (point 1 in Figure 6, October 20th, 2003) to the date we observed the first explicit attack (point 2 in Figure 6, November 6th, 2003), the clustering algorithm shows that five different sets of machines have targeted port 1433, among others, on our 3 honeypots.

2. We observe only two of these five clusters be-tween the ’closing date’ of the port (point 3)and the decrease of the attacks (point 4).
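The clustering algorithm itself is not detailed in this section. As a rough illustration only, attack sources can be grouped by a fingerprint of their activity, here the ordered sequence of ports each source targeted; this is a simplified stand-in for the actual clustering features, and the IPs and schema below are invented:

```python
from collections import defaultdict

def cluster_by_port_sequence(events):
    """Group attack sources by the ordered sequence of ports they targeted.
    events: list of (source_ip, [port, ...]) records (hypothetical schema)."""
    clusters = defaultdict(list)
    for ip, ports in events:
        clusters[tuple(ports)].append(ip)
    return dict(clusters)

# toy data: two sources probing only port 1433,
# one source that first hits the Windows ports 139 and 445
events = [
    ("198.51.100.7", [1433]),
    ("203.0.113.9", [1433]),
    ("192.0.2.14", [139, 445, 1433]),
]
clusters = cluster_by_port_sequence(events)
print(len(clusters))  # prints 2: two distinct scan fingerprints
```

Sources sharing the same fingerprint are treated as instances of the same tool, which is the assumption used above to track the scanning tool across time.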

The scanning tool which shares information with attack communities is quite likely one of these two clusters (if not both). Figure 7 gives the observation of these two clusters from October 2003 to March 2004. Scans grouped in the first cluster (Cluster 1: the upper curve) are observed frequently and regularly. Those in the second cluster (Cluster 2) appear, on the contrary, rather sporadically.

Instances of Cluster 1 are seen every day. Thus, if Cluster 1 were the one containing the scans that led to the subsequent direct attacks, it would be difficult to understand why it took two weeks to see the first attack launched. On the other hand, the first attack is observed less than 5 days after the first scan belonging to Cluster 2. Moreover, the port has been closed on January 12th (point 3) and the first subsequent Cluster 2 scan has been observed a dozen days later. The decrease in the number of attacks was noticeable a few days after that specific scan, at the end of January, as can be seen in Figure 6.


Cluster 2 thus seems to be a good candidate. This is confirmed by looking at the packets corresponding to both clusters. Some packets associated with Cluster 1 contain a 42-byte data payload sent to our honeypot. In other words, these machines not only look for open ports but also try to send some data to them. On the other hand, packets associated with Cluster 2 are simple TCP SYN, half-open scans.
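This packet-level distinction can be expressed as a simple test. The record fields below are hypothetical (the paper does not give its capture format); the idea is just to separate bare SYN probes from probes that carry data:

```python
def classify_probe(flags, payload_len):
    """Tell a half-open scan from a data-bearing probe.
    flags: TCP flag string, e.g. "S" for SYN; payload_len: bytes of TCP data."""
    if payload_len > 0:
        return "data probe"      # Cluster-1 style: e.g. a 42-byte payload
    if flags == "S":
        return "half-open scan"  # Cluster-2 style: bare TCP SYN, no data
    return "other"

print(classify_probe("S", 0))    # prints half-open scan
print(classify_probe("PA", 42))  # prints data probe
```

A half-open scanner never completes the handshake, so it can enumerate open ports quickly and quietly, which fits the sporadic, information-gathering role attributed to Cluster 2.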

Similar experiments on different platforms would enable us to determine more precisely the modus operandi of these scan-only machines but, so far, this refined analysis leads to the fifth and final result of this experiment:

5. Information gathered by a given scan is shared between several communities of attackers. Indeed, we do not see a one-to-one relationship between a scan-only IP and an attacker. On the contrary, we have more attacks than scans: 76 different but precise attacks have been performed from October 20th, 2003 to January 12th, 2004 (points 1 and 3 on the figures, respectively), while scans belonging to Cluster 2 have only been observed five times over the same period. This tends to indicate that more than one community is using the information provided by a scan-only machine. Here too, we need more machines to get a better insight into the interactions between scan-only machines and attackers. This is part of our ongoing work.

Figure 7: Clusters activity (# attack sources) per day

5. Conclusion

Honeypots are very promising sensors to monitor local attack processes and to complement current tools which look at malicious activity at a higher level. These local observations bring different and relevant information that needs to be carefully analyzed. We have shown in this paper that they allow us to get a better understanding of the attack processes. This knowledge is currently lacking but is necessary for the design of efficient security systems.

More precisely, we have managed in this paper to explain some odd phenomena, such as the Australian activity which was dominant at first glance. We have also validated that machines are facing multiple independent communities of attackers, and that it is possible to estimate their number, as well as the kind of information they might exchange. Finally, we have justified the usage of honeypots as an interesting platform to collect data in order to build analytical models of Internet threats.

Some other questions arise and require more expertise. Numerous honeypot platforms placed in various locations will help answer them. We are now deploying such honeypots and we hope this work will open new avenues for the ongoing work related to honeypots. We invite all teams interested in using our full data set for analytical purposes to contact us. We have defined a simple model to share our data: we grant access to all partners that accept to put one honeypot, which we will configure remotely, in their premises. At the time of this writing, 12 environments are up and running, in various countries in Europe but also on two other continents. We expect more to join in the next few weeks.

References

[1] S. Staniford, V. Paxson, and N. Weaver. How to own the Internet in your spare time. In Proceedings of the 11th USENIX Security Symposium, pages 149–167. USENIX Association, 2002.

[2] Z. Chen, L. Gao, and K. Kwiat. Modeling the spread of active worms. In IEEE INFOCOM, 2003.

[3] C.C. Zou, W. Gong, and D. Towsley. Worm propagation modeling and analysis under dynamic quarantine defense. In ACM WORM 03, October 2003.


[4] E. Spafford. An analysis of the Internet worm. pages 446–468, September 1989.

[5] D. Moore, C. Shannon, G.M. Voelker, and S. Savage. Code-Red: a case study on the spread and victims of an Internet worm. In ACM/USENIX Internet Measurement Workshop, November 2002.

[6] McAfee Security Antivirus. Virus profile: W32/Deloder worm. URL: http://us.mcafee.com/virusInfo/.

[7] F-Secure Corporation. Deloder worm analysis. URL: http://www.f-secure.com.

[8] VMWare Corporation. User's manual, version 4.1. URL: http://www.vmware.com.

[9] Honeyd virtual honeypot from N. Provos. URL: http://www.honeyd.org.

[10] M. Dacier, F. Pouget, and H. Debar. Attack processes found on the Internet. In NATO Symposium IST-041/RSY-013, April 2004.

[11] M. Dacier, F. Pouget, and H. Debar. Honeypots, a practical mean to validate malicious fault assumptions. In The 10th Pacific Rim Dependable Computing Conference (PRDC04), February 2004.

[12] F. Pouget and M. Dacier. Honeypot-based forensics. In AusCERT Asia Pacific Information Technology Security Conference 2004 (AusCERT2004), May 2004.

[13] CAIDA Project. NetGeo utility - the Internet geographical database. URL: http://www.caida.org/tools/utilities/netgeo/.

[14] DShield Distributed Intrusion Detection System. URL: http://www.dshield.org.

[15] The SANS Institute Internet Storm Center. The trusted source for computer security training, certification and research. URL: http://isc.sans.org.

[16] myNetWatchman. Network intrusion detection and reporting. URL: http://www.mynetwatchman.com.

[17] X. Qin, D. Dagon, G. Gu, W. Lee, M. Warfield, and P. Allor. Worm detection using local networks. In Proceedings of the Recent Advances in Intrusion Detection (RAID'04), September 2004.

[18] CAIDA Project, The UCSD Network Telescope. The spread of the Witty worm. URL: http://www.caida.org/analysis/security/witty/.

[19] D. Moore, G. Voelker, and S. Savage. Inferring Internet denial-of-service activity. In The USENIX Security Symposium, August 2001.

[20] K.E. Giles. On the spectral analysis of backscatter data. In GMP - Hawaii 2004, 2004. URL: http://www.mts.jhu.edu/ priebe/FILES/gmp hawaii04.pdf.

[21] TCPDump utility. URL: http://www.tcpdump.org.

[22] IPTables Netfilter Project. URL: http://www.netfilter.org.

[23] A. Rosin. Measuring availability in peer-to-peer networks. September 2003. URL: http://www.wiwi.hu-berlin.de/ fis/p2pe/paper A Rosin.pdf.

[24] A. Zeitoun, C.N. Chuah, S. Bhattacharyya, and C. Diot. An AS-level study of Internet path delay characteristics. Technical report, 2003. URL: http://ipmon.sprint.com/pubs trs/trs/RR03-ATL-051699-AS-delay.pdf.

[25] S.H. Yook, H. Jeong, and A.L. Barabasi. Modeling the Internet's large scale topology. In PNAS, vol. 99, October 2002. URL: http://www.nd.edu/ networks/PDF/Modeling.

[26] T.S. Eugene Ng and H. Zhang. Predicting Internet network distance with coordinates-based approaches. In INFOCOM 2002, 2002. URL: http://www-2.cs.cmu.edu/ eugeneng/papers/INFOCOM02.pdf.

[27] MaxMind GeoIP Country Database commercial product. URL: http://www.maxmind.com/app/products.

[28] H. Hubbard. Scob infection statistics, etc., June 2004. URL: http://www.securityfocus.incidents.

[29] Symantec. Symantec security response: W32.Welchia.Worm, 2004. URL: http://response.symantec.com/avcentr/venc/data/w32.welchia.b.worm.html.

[30] NMap utility from insecure.org. URL: http://www.insecure.org/nmap.

[31] p0f passive fingerprinting tool. URL: http://lcamtuf.coredump.cx/p0f-beta.tgz.

[32] Disco passive fingerprinting tool. URL: http://www.altmode.com/disco.