Vulnerability Disclosure in the Age of Social Media: …tdumitra/papers/USENIX...Vulnerability Disclosure in the Age of Social Media: Exploiting Twitter for Predicting Real-World Exploits

Vulnerability Disclosure in the Age of Social MediaExploiting Twitter for Predicting Real-World Exploits

Carl Sabottke Octavian SuciuUniversity of Maryland

Tudor Dumitras

AbstractIn recent years the number of software vulnerabilitiesdiscovered has grown significantly This creates a needfor prioritizing the response to new disclosures by assess-ing which vulnerabilities are likely to be exploited and byquickly ruling out the vulnerabilities that are not actuallyexploited in the real world We conduct a quantitativeand qualitative exploration of the vulnerability-relatedinformation disseminated on Twitter We then describethe design of a Twitter-based exploit detector and we in-troduce a threat model specific to our problem In addi-tion to response prioritization our detection techniqueshave applications in risk modeling for cyber-insuranceand they highlight the value of information provided bythe victims of attacks

1 IntroductionThe number of software vulnerabilities discovered hasgrown significantly in recent years For example 2014marked the first appearance of a 5 digit CVE as the CVEdatabase [46] which assigns unique identifiers to vulner-abilities has adopted a new format that no longer capsthe number of CVE IDs at 10000 per year Additionallymany vulnerabilities are made public through a coordi-nated disclosure process [18] which specifies a periodwhen information about the vulnerability is kept confi-dential to allow vendors to create a patch However thisprocess results in multi-vendor disclosure schedules thatsometimes align causing a flood of disclosures For ex-ample 254 vulnerabilities were disclosed on 14 October2014 across a wide range of vendors including MicrosoftAdobe and Oracle [16]

To cope with the growing rate of vulnerability discov-ery the security community must prioritize the effort torespond to new disclosures by assessing the risk that thevulnerabilities will be exploited The existing scoringsystems that are recommended for this purpose such asFIRSTrsquos Common Vulnerability Scoring System (CVSS)

[54] Microsoftrsquos exploitability index [21] and Adobersquospriority ratings [19] err on the side of caution by mark-ing many vulnerabilities as likely to be exploited [24]The situation in the real world is more nuanced Whilethe disclosure process often produces proof of conceptexploits which are publicly available recent empiricalstudies reported that only a small fraction of vulnerabili-ties are exploited in the real world and this fraction hasdecreased over time [2247] At the same time some vul-nerabilities attract significant attention and are quicklyexploited for example exploits for the Heartbleed bugin OpenSSL were detected 21 hours after the vulnera-bilityrsquos public disclosure [41] To provide an adequateresponse on such a short time frame the security com-munity must quickly determine which vulnerabilities areexploited in the real world while minimizing false posi-tive detections

The security vendors system administrators andhackers who discuss vulnerabilities on social media siteslike Twitter constitute rich sources of information as theparticipants in coordinated disclosures discuss technicaldetails about exploits and the victims of attacks sharetheir experiences This paper explores the opportuni-ties for early exploit detection using information avail-able on Twitter We characterize the exploit-related dis-course on Twitter the information posted before vulner-ability disclosures and the users who post this informa-tion We also reexamine a prior experiment on predictingthe development of proof-of-concept exploits [36] andfind a considerable performance gap This illuminatesthe threat landscape evolution over the past decade andthe current challenges for early exploit detection

Building on these insights we describe techniquesfor detecting exploits that are active in the real worldOur techniques utilize supervised machine learning andground truth about exploits from ExploitDB [3] OS-VDB [9] Microsoft security advisories [21] and thedescriptions of Symantecrsquos anti-virus and intrusion-protection signatures [23] We collect an unsampled cor-

pus of tweets that contain the keyword ldquoCVErdquo postedbetween February 2014 and January 2015 and we ex-tract features for training and testing a support vectormachine (SVM) classifier We evaluate the false posi-tive and false negative rates and we assess the detectionlead time compared to existing data sets Because Twit-ter is an open and free service we introduce a threatmodel considering realistic adversaries that can poisonboth the training and the testing data sets but that may beresource-bound and we conduct simulations to evaluatethe resilience of our detector to such attacks Finally wediscuss the implications of our results for building secu-rity systems without secrets the applications of early ex-ploit detection and the value of sharing information aboutsuccessful attacks

In summary we make three contributions

bull We characterize the landscape of threats related toinformation leaks about vulnerabilities before theirpublic disclosure and we identify features that canbe extracted automatically from the Twitter dis-course to detect exploitsbull To our knowledge we describe the first technique

for early detection of real-world exploits using so-cial mediabull We introduce a threat model specific to our problem

and we evaluate the robustness of our detector toadversarial interference

Roadmap In Sections 2 and 3 we formulate the prob-lem of exploit detection and we describe the design ofour detector respectively Section 4 provides an empir-ical analysis of the exploit-related information dissemi-nated on Twitter Section 5 presents our detection resultsand Section 6 evaluates attacks against our exploit detec-tors Section 7 reviews the related work and Section 8discusses the implications of our results

2 The problem of exploit detectionWe consider a vulnerability to be a software bug that hassecurity implications and that has been assigned a uniqueidentifier in the CVE database [46] An exploit is a pieceof code that can be used by an attacker to subvert thefunctionality of the vulnerable software While many re-searchers have investigated the techniques for creatingexploits the utilization patterns of these exploits provideanother interesting dimension to their security implica-tions We consider real-world exploits to be the exploitsthat are being used in real attacks against hosts and net-works worldwide In contrast proof-of-concept (PoC)exploits are often developed as part of the vulnerabilitydisclosure process and are included in penetration test-ing suites We further distinguish between public PoC

exploits for which the exploit code is publicly availableand private PoC exploits for which we can find reliableinformation that the exploit was developed but it wasnot released to the public A PoC exploit may also be areal-world exploit if it is used in attacks

The existence of a real-world or PoC exploit givesurgency to fixing the corresponding vulnerability andthis knowledge can be utilized for prioritizing remedi-ation actions We investigate the opportunities for earlydetection of such exploits by using information that isavailable publicly but is not included in existing vul-nerability databases such as the National VulnerabilityDatabase (NVD) [7] or the Open Sourced Vulnerabil-ity Database (OSVDB) [9] Specifically we analyze theTwitter stream which exemplifies the information avail-able from social media feeds On Twitter a communityof hackers security vendors and system administratorsdiscuss security vulnerabilities In some cases the vic-tims of attacks report new vulnerability exploits In othercases information leaks from the coordinated disclosureprocess [18] through which the security community pre-pares the response to the impending public disclosure ofa vulnerability

The vulnerability-related discourse on Twitter is in-fluenced by trend-setting vulnerabilities such as Heart-bleed (CVE-2014-0160) Shellshock (CVE-2014-6271CVE-2014-7169 and CVE-2014-6277) or Drupalged-don (CVE-2014-3704) [41] Such vulnerabilities arementioned by many users who otherwise do not provideactionable information on exploits which introduces asignificant amount of noise in the information retrievedfrom the Twitter stream Additionally adversaries mayinject fake information into the Twitter stream in an at-tempt to poison our detector Our goals in this paper are(i) to identify the good sources of information about ex-ploits and (ii) to assess the opportunities for early detec-tion of exploits in the presence of benign and adversarialnoise Specifically we investigate techniques for mini-mizing false-positive detectionsmdashvulnerabilities that arenot actually exploitedmdashwhich is critical for prioritizingresponse actions

Non-goals We do not consider the detection of zero-day attacks [32] which exploit vulnerabilities beforetheir public disclosure instead we focus on detecting theuse of exploits against known vulnerabilities Becauseour aim is to assess the value of publicly available infor-mation for exploit detection we do not evaluate the ben-efits of incorporating commercial or private data feedsThe design of a complete system for early exploit detec-tion which likely requires mechanisms beyond the realmof Twitter analytics (eg for managing the reputation ofdata sources to prevent poisoning attacks) is also out ofscope for this paper

21 ChallengesTo put our contributions in context we review the threeprimary challenges for predicting exploits in the ab-sence of adversarial interference class imbalance datascarcity and ground truth biases

Class imbalance We aim to train a classifier that pro-duces binary predictions each vulnerability is classifiedas either exploited or not exploited If there are signifi-cantly more vulnerabilities in one class than in the otherclass this biases the output of supervised machine learn-ing algorithms Prior research on predicting the existenceof proof-of-concept exploits suggests that this bias is notlarge as over half of the vulnerabilities disclosed before2007 had such exploits [36] However few vulnerabili-ties are exploited in the real world and the exploitation ra-tios tend to decrease over time [47] In consequence ourdata set exhibits a severe class imbalance we were ableto find evidence of real-world exploitation for only 13of vulnerabilities disclosed during our observation pe-riod This class imbalance represents a significant chal-lenge for simultaneously reducing the false positive andfalse negative detections

Data scarcity Prior research efforts on Twitter ana-lytics have been able to extract information from mil-lions of tweets by focusing on popular topics likemovies [27] flu outbreaks [20 26] or large-scale threatslike spam [56] In contrast only a small subset of Twit-ter users discuss vulnerability exploits (approximately32000 users) and they do not always mention the CVEnumbers in their tweets which prevents us from identi-fying the vulnerability discussed In consequence 90of the CVE numbers disclosed during our observationperiod appear in fewer than 50 tweets Worse whenconsidering the known real-world exploits close to halfhave fewer than 50 associated tweets This data scarcitycompounds the challenge of class imbalance for reducingfalse positives and false negatives

Quality of ground truth Prior work on Twitter ana-lytics focused on predicting quantities for which goodpredictors are already available (modulo a time lag) theHollywood Stock Exchange for movie box-office rev-enues [27] CDC reports for flu trends [45] and Twitterrsquosinternal detectors for highjacked accounts which trig-ger account suspensions [56] These predictors can beused as ground truth for training high-performance clas-sifiers In contrast there is no comprehensive data set ofvulnerabilities that are exploited in the real world Weemploy as ground truth the set of vulnerabilities men-tioned in the descriptions of Symantecrsquos anti-virus andintrusion-protection signatures which is reportedly thebest available indicator for the exploits included in ex-ploit kits [23 47] However this dataset has coverage

biases since Symantec does not cover all platforms andproducts uniformly For example since Symantec doesnot provide a security product for Linux Linux kernelvulnerabilities are less likely to appear in our groundtruth dataset than exploits targeting software that runs onthe Windows platform

22 Threat model

Research in adversarial machine learning [28 29] dis-tinguishes between exploratory attacks which poison thetesting data and causative attacks which poison both thetesting and the training data sets Because Twitter is anopen and free service causative adversaries are a realis-tic threat to a system that accepts inputs from all Twitterusers We assume that these adversaries cannot preventthe victims of attacks from tweeting about their obser-vations but they can inject additional tweets in order tocompromise the performance of our classifier To testthe ramifications of these causative attacks we develop athreat model with three types of adversaries

Blabbering adversary Our weakest adversary is notaware of the statistical properties of the training featuresor labels This adversary simply sends tweets with ran-dom CVEs and random security-related keywords

Word copycat adversary A stronger adversary isaware of the features we use for training and has accessto our ground truth (which comes from public sources)This adversary uses fraudulent accounts to manipulatethe word features and total tweet counts in the trainingdata However this adversary is resource constrainedand cannot manipulate any user statistics which wouldrequire either more expensive or time intensive accountacquisition and setup (eg creation date verificationfollower and friend counts) The copycat adversary craftstweets by randomly selecting pairs of non-exploited andexploited vulnerabilities and then sending tweets so thatthe word feature distributions between these two classesbecome nearly identical

Full copycat adversary Our strongest adversary hasfull knowledge of our feature set Additionally this ad-versary has sufficient time and economic resources topurchase or create Twitter accounts with arbitrary userstatistics with the exception of verification and the ac-count creation date Therefore the full copycat adversarycan use a set of fraudulent Twitter accounts to fully ma-nipulate almost all word and user-based features whichcreates scenarios where relatively benign CVEs and real-world exploit CVEs appear to have nearly identical Twit-ter traffic at an abstracted statistical level

Twitter Stream

Classification Results

ExploitDB Symantec

WINE Processing

Classification

Microsoft Security

Advisories

Public PoC

Private PoC

Exploits in the Wild

Ground Truth

CVSS amp Database Features

Database Features

Word amp Traffic Features

Feature Selection

Figure 1 Overview of the system architecture

3 A Twitter-based exploit detectorWe present the design of a Twitter-based exploit detectorusing supervised machine learning techniques Our de-tector extracts vulnerability-related information from theTwitter stream and augments it with additional sourcesof data about vulnerabilities and exploits

31 Data collection

Figure 1 illustrates the architecture of our exploit detec-tor Twitter is an online social networking service thatenables users to send and read short 140-character mes-sages called ldquotweetsrdquo which then become publicly avail-able For collecting tweets mentioning vulnerabilitiesthe system monitors occurrences of the ldquoCVErdquo keywordusing Twitterrsquos Streaming API [15] The policy of theStreaming API implies that a client receives all the tweetsmatching a keyword as long as the result does not ex-ceed 1 of the entire Twitter hose when the tweets be-come samples of the entire matching volume Becausethe CVE tweeting volume is not high enough to reach1 of the hose (as the API signals rate limiting) we con-clude that our collection contains all references to CVEsexcept during the periods of downtime for our infrastruc-ture

We collect data over a period of one year from Febru-ary 2014 to January 2015 Out of the 11 billion tweetscollected during this period 287717 contain explicit ref-erences to CVE IDs We identify 7560 distinct CVEsAfter filtering out the vulnerabilities disclosed before thestart of our observation period for which we have missedmany tweets we are left with 5865 CVEs

To obtain context about the vulnerabilities discussedon Twitter we query the National Vulnerability Database(NVD) [7] for the CVSS scores the products affectedand additional references about these vulnerabilitiesAdditionally we crawl the Open Sourced VulnerabilityDatabase (OSVDB) [9] for a few additional attributes

including the disclosure dates and categories of the vul-nerabilities in our study1 Our data collection infrastruc-ture consists of Python scripts and the data is stored us-ing Hadoop Distributed File System From the raw datacollected we extract multiple features using Apache PIGand Spark which run on top of a local Hadoop cluster

Ground truth We use three sources of ground truthWe identify the set of vulnerabilities exploited in the realworld by extracting the CVE IDs mentioned in the de-scriptions of Symantecrsquos anti-virus (AV) signatures [12]and intrusion-protection (IPS) signatures [13] Priorwork has suggested that this approach produces the bestavailable indicator for the vulnerabilities targeted in ex-ploits kits available on the black market [23 47] Con-sidering only the vulnerabilities included in our studythis data set contains 77 vulnerabilities targeting prod-ucts from 31 different vendors We extract the creationdate from the descriptions of AV signatures to estimatethe date when the exploits were discovered Unfortu-nately the IPS signatures do not provide this informa-tion so we query Symantecrsquos Worldwide IntelligenceNetwork Environment (WINE) [40] for the dates whenthese signatures were triggered in the wild For each real-world exploit we use the earliest date across these datasources as an estimate for the date when the exploit be-came known to the security community

However as mentioned in Section 21 this groundtruth does not cover all platforms and products uni-formly Nevertheless we expect that some software ven-dors which have well established procedures for coor-dinated disclosure systematically notify security com-panies of impending vulnerability disclosures to allowthem to release detection signatures on the date of disclo-sure For example the members of Microsoftrsquos MAPPprogram [5] receive vulnerability information in advanceof the monthly publication of security advisories Thispractice provides defense-in-depth as system adminis-trators can react to vulnerability disclosures either by de-ploying the software patches or by updating their AV orIPS signatures To identify which products are well cov-ered in this data set we group the exploits by the ven-dor of the affected product Out of the 77 real-worldexploits 41 (53) target products from Microsoft andAdobe while no other vendor accounts for more than3 of exploits This suggests that our ground truth pro-vides the best coverage for vulnerabilities in Microsoftand Adobe products

We identify the set of vulnerabilities with public proof-of-concept exploits by querying ExploitDB [3] a collab-orative project that collects vulnerability exploits We

1In the past OSVDB was called the Open Source VulnerabilityDatabase and released full dumps of their database Since 2012 OS-VDB no longer provides public dumps and actively blocks attempts tocrawl the website for most of the information in the database

identify exploits for 387 vulnerabilities disclosed duringour observation period We use the date when the ex-ploits were added to ExploitDB as an indicator for whenthe vulnerabilities were exploited

We also identify the set of vulnerabilities in Mi-crosoftrsquos products for which private proof-of-concept ex-ploits have been developed by using the Exploitabil-ity Index [21] included in Microsoft security advisoriesThis index ranges from 0 to 3 0 for vulnerabilities thatare known to be exploited in the real world at the timeof release for a security bulletin2 and 1 for vulnerabil-ities that allowed the development of exploits with con-sistent behavior Vulnerabilities with scores of 2 and 3are considered less likely and unlikely to be exploitedrespectively We therefore consider that the vulnerabili-ties with an exploitability index of 0 or 1 have an privatePoC exploit and we identify 218 such vulnerabilities 22of these 218 vulnerabilities are considered real-world ex-ploits in our Symantec ground truth

32 Vulnerability categoriesTo quantify how these vulnerabilities and exploits arediscussed on Twitter we group them into 7 categoriesbased on their utility for an attacker Code Execution In-formation Disclosure Denial of Service Protection By-pass Script Injection Session Hijacking and SpoofingAlthough heterogeneous and unstructured the summaryfield from NVD entries provides sufficient informationfor assigning a category to most of the vulnerabilities inthe study using regular expressions comprised of domainvocabulary

Table 2 and Section 4 show how these categories inter-sect with POC and real-world exploits Since vulnerabil-ities may belong to several categories (a code executionexploit could also be used in a denial of service) the reg-ular expressions are applied in order If a match is foundfor one category the subsequent categories would not bematched

Aditionally the Unknown category contains vulnera-bilities not matched by the regular expressions and thosewhose summaries explicitly state that the consequencesare unknown or unspecified

33 Classifier feature selectionThe features considered in this study can be classifiedin 4 categories Twitter Text Twitter Statistics CVSSInformation and Database Information

For the Twitter features we started with a set of 1000keywords and 12 additional features based on the dis-tribution of tweets for the CVEs eg the total number

2We do not use this score as an indicator for the existence of real-world exploits because the 0 rating is available only since August 2014toward the end of our observation period

Keyword MI Wild MI PoC Keyword MI Wild MI PoC

advisory 00007 00005 ok 00015 00002beware 00007 00005 mcafee 00005 00002sample 00007 00005 windows 00012 00011exploit 00026 00016 w 00004 00002go 00007 00005 microsoft 00007 00005xp 00007 00005 info 00007 Xie 00015 00005 rce 00007 Xpoc 00004 00006 patch 00007 Xweb 00015 00005 piyolog 00007 Xjava 00007 00005 tested 00007 Xworking 00007 00005 and X 00005fix 00012 00002 rt X 00005bug 00007 00005 eset X 00005blog 00007 00005 for X 00005pc 00007 00005 redhat X 00002reading 00007 00005 kali X 00005iis 00007 00005 0day X 00009ssl 00005 00003 vs X 00005post 00007 00005 linux X 00009day 00015 00005 new X 00002bash 00015 00009

Table 1 Mutual information provided by the reduced setof keywords with respect to both sources of ground truthdata The ldquoXrdquo marks in the table indicate that the respec-tive words were excluded from the final feature set dueto MI below 00001 nats

of tweets related to the CVE the average age of the ac-counts posting about the vulnerability and the numberof retweets associated to the vulnerability For each ofthese initial features we compute the mutual information(MI) of the set of feature values X and the class labelsY isin exploitednot exploited

MI(YX) = sumxisinX

sumyisinY

p(xy) ln(

p(xy)p(x)p(y)

)Mutual information expressed in nats compares the fre-quencies of values from the joint distribution p(xy) (ievalues from X and Y that occur together) with the prod-uct of the frequencies from the two distributions p(x) andp(y) MI measures how much knowing X reduces uncer-tainty about Y and can single out useful features sug-gesting that the vulnerability is exploited as well as fea-tures suggesting it is not We prune the initial feature setby excluding all features with mutual information below00001 nats For numerical features we estimate proba-bility distributions using a resolution of 50 bins per fea-ture After this feature selection process we are left with38 word features for real-world exploits

Here rather than use a wrapper method for featureselection we use this mutual information-based filtermethod in order to facilitate the combination of au-tomatic feature selection with intuition-driven manualpruning For example keywords that correspond totrend-setting vulnerabilities from 2014 like Heartbleedand Shellshock exhibit a higher mutual information thanmany other potential keywords despite their relation toonly a small subset of vulnerabilities Yet such highly

specific keywords are undesirable for classification dueto their transitory utility Therefore in order to reducesusceptibility to concept drift we manually prune theseword features for our classifiers to generate a final set of31 out of 38 word features (listed in Table 1 with addi-tional keyword intuition described in Section 41)

In order to improve performance and increase classi-fier robustness to potential Twitter-based adversaries act-ing through account hijacking and Sybil attacks we alsoderive features from NVD and OSVDB We considerall 7 CVSS score components as well as features thatproved useful for predicting proof-of-concept exploits inprior work [36] such as the number of unique referencesthe presence of the token BUGTRAQ in the NVD refer-ences the vulnerability category from OSVDB and ourown vulnerability categories (see Section 32) This givesus 17 additional features The inclusion of these non-Twitter features is useful for boosting the classifierrsquos re-silience to adversarial noise Figure 2 illustrates the mostuseful features for detecting real-world or PoC exploitsalong with the corresponding mutual information

34 Classifier training and evaluationWe train linear support vector machine (SVM) classi-fiers [35 38 39 43] in a feature space with 67 dimen-sions that results from our feature selection step (Sec-tion 33) SVMs seek to determine the maximum marginhyperplane to separate the classes of exploited and non-exploited vulnerabilities When a hyperplane cannot per-fectly separate the positive and negative class samplesbased on the feature vectors used in training the basicSVM cost function is modified to include a regulariza-tion penalty C and non-negative slack variables ξi Byvarying C we explore the trade-off between false nega-tives and false positives in our classifiers

We train SVM classifiers using multiple rounds ofstratified random sampling We perform sampling be-cause of the large imbalance in the class sizes be-tween vulnerabilities exploited and vulnerabilities notexploited Typically our classifier training consists of 10random training shuffles where 50 of the available datais used for training and the remaining 50 is used fortesting We use the scikit-learn Python package [49]to train our classifiers

An important caveat though is that our one yearof data limits our ability to evaluate concept drift Inmost cases our cross-validation data is temporally in-termixed with the training data since restricting sets oftraining and testing CVEs to temporally adjacent blocksconfounds performance losses due to concept drift withperformance losses due to small sample sizes Further-more performance differences between the vulnerabil-ity database features of our classifiers and those exploredin [36] emphasize the benefit of periodically repeating

Category CVEs Real-World PoC BothAll Data Good Coverage

Code Execution 1249322 6639 19214 288Info Disclosure 191859 40 695 40Denial Of Service 65717 00 161 00Protection Bypass 20434 00 30 00Script Injection 68314 00 400 00Session Hijacking 1671 00 250 00Spoofing 554 00 00 00Unknown 98151 70 426 50

Total 5914502 7739 38726 378

Table 2 CVEs Categories and exploits summary Thefirst sub-column represents the whole dataset while thesecond sub-column is restricted to Adobe and Microsoftvulnerabilities for which our ground truth of real-worldexploits provides good coverage

previous experiments from the security literature in or-der to properly assess whether the results are subject tolong-term concept drift

Performance metrics When evaluating our classi-fiers we rely on two standard performance metrics pre-cision and recall3 Recall is equivalent to the true pos-itive rate Recall = T P

T P+FN where T P is the number oftrue positive classifications and FN is the number of falsenegatives The denominator is the total number of posi-tive samples in the testing data Precision is defined asPrecision = T P

T P+FP where FP is the total number of falsepositives identified by the classifier When optimizingclassifier performance based on these criteria the rela-tive importance of these quantities is dependent on theintended applications of the classifier If avoiding falsenegatives is priority then recall must be high Howeverif avoiding false positives is more critical then precisionis the more important metric Because we envision uti-lizing our classifier as a tool for prioritizing the responseto vulnerability disclosures we focus on improving theprecision rather than the recall

4 Exploit-related information on Twitter

Table 2 breaks down the vulnerabilities in our study ac-cording to the categories described in Section 32 1249vulnerabilities allowing code execution were disclosedduring our observation period 66 have real-world ex-ploits and 192 have public proof-of-concept exploitsthe intersection of these two sets includes 28 exploitsIf we consider only Microsoft and Adobe vulnerabilitiesfor which we expect that our ground truth has good cov-erage (see Section 31) the table shows that 322 code-

3We choose precision and recall because Receiver Operating Char-acteristic (ROC) curves can present an overly optimistic view of a clas-sifierrsquos performance when dealing with skewed data sets []

0 1 2 3 4 5 6 7 8 9

on in N

CVSS Information Twitter Traffic Database Information

Real World

Public POC

Figure 2 Mutual information between real world and public proof of concept exploits for CVSS Twitter user statisticsNVD and OSVDB features CVSS Information - 0 CVSS Score 1 Access Complexity 2 Access Vector 3Authentication 4 Availability Impact 5 Confidentiality Impact 6 Integrity Impact Twitter Traffic - 7 Numberof tweets 89 users with minimum T followersfriends 1011 retweetsreplies 12 tweets favorited 131415Avg hashtagsURLsuser mentions per tweet 16 verified accounts 17 Avg age of accounts 18 Avg of tweetsper account Database Information - 19 references in NVD 20 sources in NVD 2122 BUGTRAQSECUNIAin NVD sources 23 allow in NVD summay 24 NVD last modified date - NVD published date 25 NVD lastmodified date - OSVDB disclosed date 26 Number of tokens in OSVDB title 27 Current date - NVD last modifieddate 28 OSVDB in NVD sources 29 code in NVD summay 30 OSVDB entries 31 OSVDB Category 32Regex Category 33 First vendor in NVD 34 vendors in NVD 35 affected products in NVD

execution vulnerabilities were disclosed 39 have real-world exploits 14 have public PoC exploits and 8 haveboth real-world and public PoC exploits

Information disclosure is the largest category of vul-nerabilities from NVD and it has a large number ofPoC exploits but we find few of these vulnerabilitiesin our ground truth of real-world exploits (one excep-tion is Heartbleed) Instead most of the real-world ex-ploits focus on code execution vulnerabilities Howevermany proof-of-concept exploits for such vulnerabilitiesdo not seem to be utilized in real-world attacks To un-derstand the factors that drive the differences betweenreal-world and proof-of-concept exploits we examinethe CVSS base metrics which describe the character-istics of each vulnerability This analysis reveals thatmost of the real-world exploits allow remote code execu-tion while some PoC exploits require local host or localnetwork access Moreover while some PoC vulnerabil-ities require authentication before a successful exploitreal-world exploits focus on vulnerabilities that do notrequire bypassing authentication mechanisms In factthis is the only type of exploit we found in the segmentof our ground truth that has good coverage suggestingthat remote code-execution exploits with no authentica-tion required are strongly favored by real-world attack-ers Our ground truth does not provide good coverageof web exploits which explains the lack of Script In-jection Session Highjacking and Spoofing exploits fromour real-world data set

Surprisingly we find that among the remote executionvulnerabilities for which our ground truth provides goodcoverage there are more real-world exploits than public

PoC exploits This could be explained by the increas-ing prevalence of obfuscated disclosures as reflected inNVD vulnerability summaries that mention the possibil-ity of exploitation ldquovia unspecified vectorsrdquo (for exam-ple CVE-2014-8439) Such disclosures make it moredifficult to create PoC exploits as the technical informa-tion required is not readily available but they may notthwart determined attackers who have gained experiencein hacking the product in question

41 Exploit-related discourse on Twitter

The Twitter discourse is dominated by a few vulnerabil-ities Heartbleed (CVE-2014-0160) received the highestattention with more than 25000 tweets (8000 posted inthe first day after disclosure) 24 vulnerabilities receivedmore than 1000 tweets 16 of these vulnerabilities wereexploited 11 in the real-world 12 in public proofs ofconcept and 8 in private proofs of concept The mediannumber of tweets across all the vulnerabilities in our dataset is 14

The terms that Twitter users employ when discussingexploits also provide interesting insights Surprisinglythe distribution of the keyword ldquo0dayrdquo exhibits a highmutual information with public proof-of-concept ex-ploits but not with real-world exploits This could beexplained by confusion over the definition of the termzero-day vulnerability many Twitter users understandthis to mean simply a new vulnerability rather than a vul-nerability that was exploited in real-world attacks beforeits public disclosure [32] Conversely the distribution ofthe keyword ldquopatchrdquo has high mutual information only

with the real-world exploits because a common reasonfor posting tweets about vulnerabilities is to alert otherusers and point them to advisories for updating vulner-able software versions after exploits are detected in thewild Certain words like ldquoexploitrdquo or ldquoadvisoryrdquo are use-ful for detecting both real-world and PoC exploits

42 Information posted before disclosureFor the vulnerabilities in our data set Figure 3 com-pares the dates of the earliest tweets mentioning thecorresponding CVE numbers with the public disclosuredates for these vulnerabilities Using the disclosure daterecorded in OSVDB we identify 47 vulnerabilities thatwere mentioned in the Twitter stream before their pub-lic disclosure We investigate these cases manually todetermine the sources of this information 11 of thesecases represent misspelled CVE IDs (eg users mention-ing 6172 but talking about 6271 ndash Shellshock) and weare unable to determine the root cause for 5 additionalcases owing to the lack of sufficient information The re-maining cases can be classified into 3 general categoriesof information leaks

Disagreements about the planned disclosure dateThe vendor of the vulnerable software sometimes postslinks to a security bulletin ahead of the public disclosuredate These cases are typically benign as the securityadvisories provide instructions for patching the vulnera-bility A more dangerous situation occurs when the partywho discovers the vulnerability and the vendor disagreeabout the disclosure schedule resulting in the publica-tion of vulnerability details a few days before a patch ismade available [6141617] We have found 13 cases ofdisagreements about the disclosure date

Coordination of the response to vulnerabilities dis-covered in open-source software The developers ofopen-source software sometimes coordinate their re-sponse to new vulnerabilities through social media egmailing lists blogs and Twitter An example for this be-havior is a tweet about a wget patch for CVE-2014-4877posted by the patch developer followed by retweets andadvice to update the binaries If the public discussionstarts before a patch is completed then this is potentiallydangerous However in the 5 such cases we identifiedthe patching recommendations were first posted on Twit-ter and followed by an increased retweet volume

Leaks from the coordinated disclosure process Insome cases the participants in the coordinated disclosureprocess leak information before disclosure For examplesecurity researchers may tweet about having confirmed

Figure 3 Comparison of the disclosure dates with thedates when the first tweets are posted for all the vulner-abilities in our dataset Plotted in red is the identity linewhere the two dates coincide

that a vulnerability is exploitable along with the soft-ware affected This is the most dangerous situation asattackers may then contact the researcher with offers topurchase the exploit before the vendor is able to releasea patch We have identified 13 such cases

43 Users with information-rich tweetsThe tweets we have collected were posted by approxi-mately 32000 unique users but the messages posted bythese users are not equally informative Therefore wequantify utility on a per user basis by computing the ra-tio of CVE tweets related to real-world exploits as wellas the fraction of unique real-world exploits that a givenuser tweets about We rank user utility based on the har-monic mean of these two quantities This ranking penal-izes users that tweet about many CVEs indiscriminately(eg a security news bot) as well as the thousands ofusers that only tweet about the most popular vulnera-bilities (eg Shellshock and Heartbleed) We create awhitelist with the top 20 most informative users andwe use this whitelist in our experiments in the follow-ing sections as a means of isolating our classifier frompotential adversarial attacks Top ranked whitelist usersinclude computer repair servicemen posting about thelatest viruses discovered in their shops and security re-searchers and enthusiasts sharing information about thelatest blog and news postings related to vulnerabilitiesand exploits

Figure 4 provides an example of how the informationabout vulnerabilities and exploits propagates on Twitteramongst all users The ldquoFutexrdquo bug which enables unau-thorized root access on Linux systems was disclosed onJune 6 as CVE-2014-3153 After identifying users whoposted messages that had retweets counting for at least1 of the total tweet counts for this vulnerability and ap-plying a thresholding based on the number of retweets

Jul 2014 Aug 2014 Sep 2014 Oct 2014 Nov 2014 Dec 2014 Jan 2015Date

tweets

1 2 3 4 5 6 7 8

CVE-2014-3153 tweet volume

Tweets by influential users

Figure 4 Tweet volume for CVE-2014-3153 and tweetsfrom the most influential users These influential tweetsshape the volume of Twitter posts about the vulnerabilityThe 8 marks represent important events in the vulnerabil-ityrsquos lifecycle 1 - disclosure 2 - Android exploit calledTowelroot is reported and exploitation attempts are de-tected in the wild 345 - new technical details emergeincluding the Towelroot code 6 - new mobile phonescontinue to be vulnerable to this exploit 7 - advisoryabout the vulnerability is posted 8 - exploit is includedin ExploitDB

we identify 20 influential tweets that shaped the vol-ume of Twitter messages These tweets correspond to8 important milestones in the vulnerabilityrsquos lifecycle asmarked in the figure

While CVE-2014-3153 is known to be exploited in thewild it is not included in our ground truth for real-worldexploits which does not cover the Linux platform Thisexample illustrates that monitoring a subset of users canyield most of the vulnerability- and exploit-related infor-mation available on Twitter However over reliance on asmall number of user accounts even with manual anal-ysis can increase susceptibility to data manipulation viaadversarial account hijacking

5 Detection of proof-of-concept and real-world exploits

To provide a baseline for our ability to classify exploitswe first examine the performance of a classifier that usesonly the CVSS score which is currently recommendedas the reference assessment method for software secu-rity [50] We use the total CVSS score and the ex-ploitability subscore as a means of establishing baselineclassifier performances The exploitability subscore iscalculated as a combination of the CVSS access vectoraccess complexity and authentication components Boththe total score and exploitability subscore range from0-10 By varying a threshold across the full range ofvalues for each score we can generate putative labels

0 2 4 6 8 10CVSS Threshold

Exploitability Recall

Exploitability Precision

Total Recall

Total Precision

Figure 5 Precision and recall for classifying real worldexploits with CVSS score thresholds

where vulnerabilities with scores above the threshold aremarked as ldquoreal-world exploitsrdquo and vulnerabilities be-low the threshold are labeled as ldquonot exploitedrdquo Unsur-prisingly since CVSS is designed as a high recall sys-tem which errs on the side of caution for vulnerabilityseverity the maximum possible precision for this base-line classifier is less than 9 Figure 5 shows the recalland precision values for both total CVSS score thresh-olds and CVSS exploitability subscore thresholds

Thus this high recall low precision vulnerabilityscore is not useful by itself for real-world exploit iden-tification and boosting precision is a key area for im-provement

Classifiers for real-world exploits Classifiers forreal-world exploits have to deal with a severe class im-balance we have found evidence of real-world exploita-tion for only 13 of the vulnerabilities disclosed dur-ing our observation period To improve the classifica-tion precision we train linear SVM classifiers on a com-bination of CVSS metadata features features extractedfrom security-related tweets and features extracted fromNVD and OSVDB (see Figure 2) We tune these clas-sifiers by varying the regularization parameter C andwe illustrate the precision and recall achieved Valuesshown are for cross-validation testing averaged across10 stratified random shuffles Figure 6a shows the av-erage cross-validated precision and recall that are simul-taneously acheivable with our Twitter-enhanced featureset These classifiers can achieve higher precision thana baseline classifier that uses only the CVSS score butthere is still a tradeoff between precision and recall Wecan tune the classifier with regularization to decrease thenumber of false positives (increasing precision) but thiscomes at the cost of a larger number of false negatives(decreasing recall)

0 20 40 60 80 100Recall

All CVE

MS and Adobe Only

Min 50 Tweets

User Whitelist

(a) Real-world exploits

0 20 40 60 80 100Recall

All CVE

Min 10 Tweets

Min 20 Tweets

(b) Public proof-of-concept exploits

0 20 40 60 80 100Recall

Twitter Words

CVSS Features

Twitter User Features

DB Features

All Features

(c) Private proof-of-concept exploits for vulner-abilities in Microsoft products

Figure 6 Precision and recall of our classifiers

This result partially reflects an additional challenge forour classifier the fact that our ground truth is imperfectas Symantec does not have products for all the platformsthat are targeted in attacks If our Twitter-based classifierpredicts that a vulnerability is exploited and the exploitexists but is absent from the Symantec signatures wewill count this instance as a false positive penalizing thereported precision To assess the magnitude of this prob-lem we restrict the training and evaluation of the clas-sifier to the 477 vulnerabilities in Microsoft and Adobeproducts which are likely to have a good coverage in ourground truth for real-world exploits (see Section 31) 41of these 477 vulnerabilities are identified as real-worldexploits in our ground truth For comparison we includethe performance of this classifier in Figure 6a Improv-ing the quality of the ground truth allows us to bolsterthe values of precision and recall which are simultane-ously achievable while still enabling classification pre-cision an order of magnitude larger than a baseline CVSSscore-based classifier Additionally while restricting thetraining of our classifier to a whitelist made up of the top20 most informative Twitter users (as described in Sec-tion 43) does not enhance classifier performance it doesallow us to achieve a precision comparable to the previ-ous experiments (Figure 6a) This is helpful for prevent-ing an adversary from poisoning our classifier as dis-cussed in Section 6

These results illustrate the current potential and limi-tations for predicting real-world exploits using publicly-available information Further improvements in theclassification performance may be achieved through abroader effort for sharing information about exploits ac-tive in the wild in order to assemble a high-coverageground truth for training of classifiers

Classifiers for proof-of-concept exploits We exploretwo classification problems predicting public proof-of-

concept exploits for which the exploit code is publiclyavailable and predicting private proof-of-concept ex-ploits for which we can find reliable information thatthe exploit was developed but it was not released to thepublic We consider these problems separately as theyhave different security implications and the Twitter usersare likely to discuss them in distinct ways

First we train a classifier to predict the availabilityof exploits in ExploitDB [3] the largest archive of pub-lic exploits This is similar to the experiment reportedby Bozorgi et al in [36] except that our feature set isslightly differentmdashin particular we extract word featuresfrom Twitter messages rather than from the textual de-scriptions of the vulnerabilities However we include themost useful features for predicting proof-of-concept ex-ploits as reported in [36] Additionally Bozorgi et aldetermined the availability of proof-of-concept exploitsusing information from OSVDB [9] which is typicallypopulated using references to ExploitDB but may alsoinclude vulnerabilities for which the exploit is rumoredor private (approximately 17 of their exploit data set)After training a classifier with the information extractedabout the vulnerabilities disclosed between 1991ndash2007they achieved a precision of 875

Surprisingly we are not able to reproduce their per-formance results as seen in Figure 6b when analyz-ing the vulnerabilities that appear in ExploitDB in 2014The figure also illustrates the performance of a classi-fier trained with exclusion threshold for CVEs that lacksufficient quantities of tweets These volume thresholdsimprove performance but not dramatically4 In part thisis due to our smaller data set compared to [36] made upof vulnerabilities disclosed during one year rather thana 16-year period Moreover our ground truth for publicproof-of-concept exploits also exhibits a high class im-

4In this case we do not restrict the exploits to specific vendors asExploitDB will incorporate any exploit submitted

balance only 62 of the vulnerabilities disclosed dur-ing our observation period have exploits available in Ex-ploitDB This is in stark contrast to the prior work wheremore than half of vulnerabilities had proof of concept ex-ploits

This result provides an interesting insight into the evo-lution of the threat landscape today proof-of-conceptexploits are less centralized than in 2007 and ExploitDBdoes not have total coverage of public proof-of-conceptexploits We have found several tweets with links toPoC exploits published on blogs or mailing lists thatwere missing from ExploitDB (we further analyze theseinstances in Section 4) This suggests that informationabout public exploits is increasingly dispersed amongsocial media sources rather than being included in acentralized database like ExploitDB5 which representsa hurdle for prioritizing the response to vulnerability dis-closures

To address this concern we explore the potential forpredicting the existence of private proof-of-concept ex-ploits by considering only the vulnerabilities disclosed inMicrosoft products and by using Microsoftrsquos Exploitabil-ity Index to derive our ground truth Figure 6c illustratesthe performance of a classifier trained with a conserva-tive ground truth in which we treat vulnerabilities withscores of 1 or less as exploits This classifier achievesprecision and recall higher than 80 even when onlyrelying on database feature subsets Unlike in our priorexperiments for the Microsoft Exploitability Index theclasses are more balanced 67 of the vulnerabilities(218 out of 327) are labeled as having a private proof-of-concept exploit However only 8 of the Microsoft vul-nerabilities in our dataset are contained within our real-world exploit ground truth Thus by using a conservativeground truth that labels many vulnerabilities as exploitswe can achieve high precision and recall but this classi-fier performance does not readily translate to real-worldexploit prediction

Contribution of various feature groups to the clas-sifier performance To understand how our featurescontribute to the performance of our classifiers in Fig-ure 7 we compare the precision and recall of our real-world exploit classifier when using different subgroupsof features In particular incorporating Twitter data intothe classifiers allows for improving the precision be-yond the levels of precision achievable with data thatis currently available publicly in vulnerability databasesBoth user features and word features generated based ontweets are capable of bolstering classifier precision incomparison to CVSS and features extracted from NVD

5Indeed OSVDB no longer seems to provide the exploitation avail-ability flag

0 20 40 60 80 100Recall

All Features

Words Only

CVSS Features

User Features

DB Features

Figure 7 Precision and recall for classification of realworld exploits with different feature subsets Twitter fea-tures allow higher-precision classification of real-worldexploits

and OSVDB Consequently the analysis of social mediastreams like Twitter is useful for boosting identificationof exploits active in the real-world

51 Early detection of exploits

In this section we ask the question How soon can wedetect exploits active in the real world by monitoring theTwitter stream Without rapid detecton capabilities thatleverage the real-time data availability inherent to socialmedia platforms a Twiter-based vulnerability classifierhas little practical value While the first tweets abouta vulnerability precede the creation of IPS or AV sig-natures by a median time of 10 days these first tweetsare typically not informative enough to determine thatthe vulnerability is likely exploited We therefore sim-ulate a scenario where our classifier for real-world ex-ploits is used in an online manner in order to identifythe dates when the output of the classifier changes fromldquonot exploitedrdquo to ldquoexploitedrdquo for each vulnerability inour ground truth

We draw 10 stratified random samples each with 50coverage of ldquoexploitedrdquo and ldquonot exploitedrdquo vulnerabil-ities and we train a separate linear SVM classifier witheach one of these samples (C = 00003) We start testingeach of our 10 classifiers with a feature set that does notinclude features extracted from tweets to simulate theactivation of the online classifier We then continue totest the ensemble of classifiers incrementally by addingone tweet at a time to the testing set We update the ag-gregated prediction using a moving average with a win-dow of 1000 tweets Figure 8a highlights the tradeoffbetween precision and early detection for a range of ag-gregated SVM prediction thresholds Notably thoughlarge precision sacrifices do not necessarily lead to large

gains for the speed of detection Therefore we choose anaggregated prediction threshold of 095 which achieves45 precision and a median lead prediction time of twodays before the first Symantec AV or WINE IPS signa-ture dates Figure 8b shows how our classifier detectionlags behind first tweet appearances The solid blue lineshows the cumulative distribution function (CDF) for thenumber of days difference between the first tweet appear-ance for a CVE and the first Symantec AV or IPS attacksignature The green dashed line shows the CDF for theday difference between 45 precision classification andthe signature creation Negative day differences indicatethat Twitter events occur before the creation of the attacksignature In approximately 20 of cases early detec-tion is impossible because the first tweet for a CVE oc-curs after an attack signature has been created For theremaining vulnerabilities Twitter data provides valuableinsights into the likelihood of exploitation

Figure 8c illustrates early detection for the case ofHeartbleed (CVE-2014-1060) The green line indicatesthe first appearance of the vulnerability in ExploitDBand the red line indicates the date when a Symantec at-tack signature was published The dashed black linerepresents the earliest time when our online classifier isable to detect this vulnerability as a real-world exploit at45 precision Our Twitter-based classifier provides anldquoexploitedrdquo output 3 hours after the first tweet appearsrelated to this CVE on April 7 2014 Heartbleed ex-ploit traffic was detected 21 hours after the vulnerabil-ityrsquos public disclosure [41] Heartbleed appeared in Ex-ploitDB on the day after disclosure (April 8 2014) andSymantec published the creation of an attack signatureon April 9 2014 Additionally by accepting lower lev-els of precision our Twitter-based classifiers can achieveeven faster exploit detection For example with classi-fier precision set to approximately 25 Heartbleed canbe detected as an exploit within 10 minutes of its firstappearance on Twitter

6 Attacks against the exploit detectors

The public nature of Twitter data necessitates consider-ing classification problems not only in an ideal environ-ment but also in an environment where adversaries mayseek to poison the classifiers In causative adversarialmachine learning (AML) attacks the adversaries makeefforts to have a direct influence by corrupting and alter-ing the training data [28 29 31]

With Twitter data learning the statistics of the train-ing data is as simple as collecting tweets with either theREST or Streaming APIs Features that are likely to beused in classification can then be extracted and evaluatedusing criteria such as correlation entropy or mutual in-formation when ground truth data is publicly available

In this regard the most conservative assumption for se-curity is that an adversary has complete knowledge of aTwitter-based classifierrsquos training data as well as knowl-edge of the feature set

When we assume an adversary works to create bothfalse negatives and false positives (an availability AMLsecurity violation) practical implementation of a ba-sic causative AML attack on Twitter data is relativelystraightforward Because of the popularity of spam onTwitter websites such as buyaccscom cheaply selllarge volumes of fraudulent Twitter accounts For ex-ample on February 16 2015 on buyaccscom the base-line price for 1000 AOL email-based Twitter accountswas $17 with approximately 15000 accounts availablefor purchase This makes it relatively cheap (less than$300 as a base cost) to conduct an attack in which alarge number of users tweet fraudulent messages contain-ing CVEs and keywords which are likely to be used ina Twitter-based classifier as features Such an attackerhas two main limitations The first limitation is thatwhile the attacker can add an extremely large number oftweets to the Twitter stream via a large number of differ-ent accounts the attacker has no straightforward mech-anism for removing legitimate potentially informativetweets from the dataset The second limitation is thatadditional costs must be incurred if an attackerrsquos fraud-ulent accounts are to avoid identification Cheap Twitteraccounts purchased in bulk have low friend counts andlow follower counts A user profile-based preprocess-ing stage of analysis could easily eliminate such accountsfrom the dataset if an adversary attempts to attack a Twit-ter classification scheme in such a rudimentary mannerTherefore to help make fraudulent accounts seem morelegitimate and less readily detectable an adversary mustalso establish realistic user statistics for these accounts

Here we analyze the robustness of our Twitter-basedclassifiers when facing three distinct causative attackstrategies The first attack strategy is to launch acausative attack without any knowledge of the trainingdata or ground truth This blabbering adversary essen-tially amounts to injecting noise into the system Thesecond attack strategy corresponds to the word-copycatadversary who does not create a sophisticated networkbetween the fraudulent accounts and only manipulatesword features and the total tweet count for each CVEThis attacker sends malicious tweets so that the wordstatistics for tweets about non-exploited and exploitedCVEs appear identical at a user-naive level of abstrac-tion The third most powerful adversary we consider isthe full-copycat adversary This adversary manipulatesthe user statistics (friend follower and status counts)of a large number of fraudulent accounts as well as thetext content of these CVE-related tweets to launch amore sophisticated Sybil attack The only user statis-

5 10 15 20 25 30 35 40 45 50Percent Precision

d in A

dvance o

ttack S

ignatu

(a) Tradeoff between classification speed andprecision

200 150 100 50 0 50 100 150 200 250Day Difference between Twitter and Attack Signature Dates

First Tweet DifferenceTweet Detection Difference

(b) CDFs for day differences between firsttweet for a CVE first classification and attacksignature creation date

18 22 02 06 10 14 18 22 02 06Time (Hour)

lassifie

CVE-2014-0160

Symantec SignatureExploitDBTwitter Detection

(c) Heartbleed detection timeline betweenApril 7th and April 10th 2014

Figure 8 Early detection of real-world vulnerability exploits

tics which we assume this full copycat adversary can-not arbitrarily manipulate are account verification statusand account creation date since modifying these featureswould require account hijacking The goal of this adver-sary is for non-exploited and exploited CVEs to appearas statistically identical as possible on Twitter at a user-anonymized level of abstraction

For all strategies we assume that the attacker has pur-chased a large number of fraudulent accounts Fewerthan 1000 out of the more than 32000 Twitter users inour CVE tweet dataset send more than 20 CVE-relatedtweets in a year and only 75 accounts send 200 or moreCVE-related tweets Therefore if an attacker wishesto avoid tweet volume-based blacklisting then each ac-count cannot send a high number of CVE-related tweetsConsequently if the attacker sets a volume threshold of20-50 CVE tweets per account then 15000 purchasedaccounts would enable the attacker to send 300000-750000 adversarial tweets

The blabbering adversary even when sending 1 mil-lion fraudulent tweets is not able to force the preci-sion of our exploit detector below 50 This suggeststhat Twitter-based classifiers can be relatively robust tothis type of random noise-based attack (black circles inFig 9) When dealing with the word-copycat adversary(green squares in Fig 9) performance asymptoticallydegrades to 30 precision The full-copycat adversarycan cause the precision to drop to approximately 20 bysending over 300000 tweets from fraudulent accountsThe full-copycat adversary represents a practical upperbound for the precision loss that a realistic attacker caninflict on our system Here performance remains abovebaseline levels even for our strongest Sybil attacker dueto our use of non-Twitter features to increase classifierrobustness Nevertheless in order to recover perfor-

0 05 1 15 2 25 3 35 4Number of Fraudulent Tweets Sent (x100000)

Blabbering

Word Copycat

Full Copycat

Figure 9 Average linear SVM precision when trainingand testing data are poisoned by the three types of adver-saries from our threat model

mance implementing a Twitter-based vulnerability clas-sifier in a realistic setting is likely to require curation ofwhitelists and blacklists for informative and adversarialusers As shown in figure 6a restricting the classifier toonly consider the top 20 of users with the most relevanttweets about real-world exploits causes no performancedegradation and fortifies the classifier against low tier ad-versarial threats

7 Related work

Previous work by Allodi et al has highlighted multipledeficiencies in CVSS version 2 as a metric for predict-ing whether or not a vulnerability will be exploited in thewild [24] specifically because predicting the small frac-tion of vulnerabilities exploited in the wild is not one ofthe design goals of CVSS By analyzing vulnerabilitiesin exploit kits work by Allodi et al has also established

that Symantec threat signatures are an effective sourceof information for determining which vulnerabilities areexploited in the real world even if the coverage is notcomplete for all systems [23]

The closest work to our own is Bozorgi et al whoapplied linear support vector machines to vulnerabilityand exploit metadata in order predict the development ofproof-of-concept exploits [36] However the existenceof a POC exploit does not necessarily mean that an ex-ploit will be leveraged for attacks in the wild and realworld attacks occur in only of a small fraction of thecases for which a vulnerability has a proof of concept ex-ploit [25 47] In contrast we aim to detect exploits thatare active in the real world Hence our analysis expandson this prior work [36] by focusing classifier training onreal world exploit data rather than POC exploit data andby targeting social media data as a key source of featuresfor distinguishing real world exploits from vulnerabili-ties that are not exploited in the real world

In prior analysis of Twitter data success has beenfound in a wide variety of applications including earth-quake detection [53] epidemiology [26 37] and thestock market [34 58] In the security domain muchattention has been focused on detecting Twitter spamaccounts [57] and detecting malicious uses of Twitteraimed at gaining political influence [30 52 55] Thegoals of these works is distinct from our task of predict-ing whether or not vulnerabilities are exploited in thewild Nevertheless a practical implementation of ourvulnerability classification methodology would requirethe detection of fraudulent tweets and spam accounts toprevent poisoning attacks

8 Discussion

Security in Twitter analytics Twitter data is publiclyavailable and new users are free to join and start send-ing messages In consequence we cannot obfuscate orhide the features we use in our machine learning systemEven if we had not disclosed the features we found mostuseful for our problem an adversary can collect Twit-ter data as well as the data sets we use for ground truth(which are also public) and determine the most informa-tion rich features within the training data in the the sameway we do Our exploit detector is an example of a se-curity system without secrets where the integrity of thesystem does not depend on the secrecy of its design orof the features it uses for learning Instead the securityproperties of our system derive from the fact that the ad-versary can inject new messages in the Twitter streambut cannot remove any messages sent by the other usersOur threat model and our experimental results providepractical bounds for the damage the adversary can inflicton such a system This damage can be reduced further

by incorporating techniques for identifying adversarialTwitter accounts for example by assigning a reputationscore to each account [44 48 51]

Applications of early exploit detection Our resultssuggest that the information contained in security-related tweets is an important source for timely security-related information Twitter-based classifiers can be em-ployed to guide the prioritization of response actions af-ter vulnerability disclosures especially for organizationswith strict policies for testing patches prior to enterprise-wide deployment which makes patching a resource-intensive effort Another potential application is model-ing the risk associated with vulnerabilities for exampleby combining the likelihood of real-world exploitationproduced by our system with additional metrics for vul-nerability assessment such as the CVSS severity scoresor the odds that the vulnerable software is exploitablegiven its deployment context (eg whether it is attachedto a publicly-accessible network) Such models are keyfor the emerging area of cyber-insurance [33] and theywould benefit from an evidence-based approach for esti-mating the likelihood of real-world exploitation

Implications for information sharing efforts Our re-sults highlight the current challenges for the early detec-tion of exploits in particular the fact that the existingsources of information for exploits active in the wild donot cover all the platforms that are targeted by attackersThe discussions on security-related mailing lists such asBugtraq [10] Full Disclosure [4] and oss-security [8]focus on disclosing vulnerabilities and publishing ex-ploits rather than on reporting attacks in the wild Thismakes it difficult for security researchers to assemblea high-quality ground truth for training supervised ma-chine learning algorithms At the same time we illustratethe potential of this approach In particularour whitelistidentifies 4335 users who post information-rich mes-sages about exploits We also show that the classificationperformance can be improved significantly by utilizinga ground truth with better coverage We therefore en-courage the victims of attacks to share relevant technicalinformation perhaps through recent information-sharingplatforms such as Facebookrsquos ThreatExchange [42] orthe Defense Industrial Base voluntary information shar-ing program [1]

9 ConclusionsWe conduct a quantitative and qualitative explorationof information available on Twitter that provides earlywarnings for the existence of real-world exploits Amongthe products for which we have reliable ground truthwe identify more vulnerabilities that are exploited in

the real-world than vulnerabilities for which proof-of-concept exploits are available publicly We also iden-tify a group of 4335 users who post information-richmessages about real-world exploits We review severalunique challenges for the exploit detection problem in-cluding the skewed nature of vulnerability datasets thefrequent scarcity of data available at initial disclosuretimes and the low coverage of real world exploits in theground truth data sets that are publicly available Wecharacterize the threat of information leaks from the co-ordinated disclosure process and we identify featuresthat are useful for detecting exploits

Based on these insights we design and evaluate adetector for real-world exploits utilizing features ex-tracted from Twitter data (eg specific words number ofretweets and replies information about the users postingthese messages) Our system has fewer false positivesthan a CVSS-based detector boosting the detection pre-cision by one order of magnitude and can detect exploitsa median of 2 days ahead of existing data sets We alsointroduce a threat model with three types of adversariesseeking to poison our exploit detector and through sim-ulation we present practical bounds for the damage theycan inflict on a Twitter-based exploit detector

AcknowledgmentsWe thank Tanvir Arafin for assistance with our early datacollection efforts We also thank the anonymous review-ers Ben Ransford (our shepherd) Jimmy Lin JonathanKatz and Michelle Mazurek for feedback on this re-search This research was supported by the MarylandProcurement Office under contract H98230-14-C-0127and by the National Science Foundation which providedcluster resources under grant CNS-1405688 and gradu-ate student support with GRF 2014161339

References[1] Cyber security information assurance (CSIA) program http

dibnetdodmil

[2] Drupalgeddon vulnerability - cve-2014-3704 httpscve

mitreorgcgi-bincvenamecginame=CVE-2014-3704

[3] Exploits database by offensive security httpshttp

exploit-dbcom

[4] Full disclosure mailing list httpseclistsorg

fulldisclosure

[5] Microsoft active protections program httpstechnet

microsoftcomen-ussecuritydn467918

[6] Microsoftrsquos latest plea for cvd is as much propaganda as sin-cere httpblogosvdborg20150112microsofts-latest-plea-for-vcd-is-as-much-propaganda-as-sincere

[7] National vulnerability database (NVD) httpsnvdnist

[8] Open source security httposs-securityopenwall

orgwikimailing-listsoss-security

[9] Open sourced vulnerability database (OSVDB) httpwww

osvdborg

[10] Securityfocus bugtraq httpwwwsecurityfocuscom

archive1

[11] Shellshock vulnerability - cve-2014-6271 httpscve

[12] Symantec a-z listing of threats amp riskshttpwwwsymanteccomsecurity responselandingazlistingjsp

[13] Symantec attack signatures httpwwwsymanteccom

security_responseattacksignatures

[14] Technet blogs - a call for better coordinated vulnerability disclo-sure httpblogstechnetcombmsrcarchive20150111a-call-for-better-coordinated-vulnerability-disclosureaspx

[15] The twitter streaming api httpsdevtwittercom

streamingoverview

[16] Vendors sure like to wave the rdquocoordinationrdquo flaghttpblogosvdborg20150202vendors-sure-like-to-wave-the-coordination-flag-revisiting-the-perfect-storm

[17] Windows elevation of privilege in user profile service - googlesecurity research httpscodegooglecompgoogle-security-researchissuesdetailid=123

[18] Guidelines for security vulnerability reporting and responseTech rep Organization for Internet Safety September 2004httpwwwsymanteccomsecurityOIS_Guidelines

20for20responsible20disclosurepdf

[19] Adding priority ratings to adobe security bulletinshttpblogsadobecomsecurity201202when-do-i-need-to-apply-this-update-adding-priority-ratings-to-adobe-security-bulletins-2html 2012

[20] ACHREKAR H GANDHE A LAZARUS R YU S-H ANDLIU B Predicting flu trends using twitter data In ComputerCommunications Workshops (INFOCOM WKSHPS) 2011 IEEEConference on (2011) IEEE pp 702ndash707

[21] ALBERTS B AND RESEARCHER S A Bounds Check on theMicrosoft Exploitability Index The Value of an Exploitability In-dex Exploitability 1ndash7

[22] ALLODI L AND MASSACCI F A Preliminary Analysis of Vul-nerability Scores for Attacks in Wild In CCS BADGERS Work-shop (Raleigh NC Oct 2012)

[23] ALLODI L AND MASSACCI F A preliminary analysis of vul-nerability scores for attacks in wild Proceedings of the 2012ACM Workshop on Building analysis datasets and gathering ex-perience returns for security - BADGERS rsquo12 (2012) 17

[24] ALLODI L AND MASSACCI F Comparing vulnerabilityseverity and exploits using case-control studies (Rank B)ACMTransactions on Embedded Computing Systems 9 4 (2013)

[25] ALLODI L SHIM W AND MASSACCI F Quantitative As-sessment of Risk Reduction with Cybercrime Black Market Mon-itoring 2013 IEEE Security and Privacy Workshops (2013) 165ndash172

[26] ARAMAKI E MASKAWA S AND MORITA M TwitterCatches The Flu Detecting Influenza Epidemics using TwitterThe University of Tokyo In Proceedings of the 2011 Conferenceon Emperical Methods in Natural Language Processing (2011)pp 1568ndash1576

[27] ASUR S AND HUBERMAN B A Predicting the future withsocial media In Web Intelligence and Intelligent Agent Technol-ogy (WI-IAT) 2010 IEEEWICACM International Conference on(2010) vol 1 IEEE pp 492ndash499

[28] BARRENO M BARTLETT P L CHI F J JOSEPH A DNELSON B RUBINSTEIN B I SAINI U AND TYGAR JOpen problems in the security of learning In Proceedings of the1st (2008) p 19

[29] BARRENO M NELSON B JOSEPH A D AND TYGARJ D The security of machine learning Machine Learning 81(2010) 121ndash148

[30] BENEVENUTO F MAGNO G RODRIGUES T ANDALMEIDA V Detecting spammers on twitter Collaborationelectronic messaging anti-abuse and spam conference (CEAS) 6(2010) 12

[31] BIGGIO B NELSON B AND LASKOV P Support VectorMachines Under Adversarial Label Noise ACML 20 (2011) 97ndash112

[32] BILGE L AND DUMITRAS T Before we knew it An empir-ical study of zero-day attacks in the real world In ACM Confer-ence on Computer and Communications Security (2012) T YuG Danezis and V D Gligor Eds ACM pp 833ndash844

[33] BOHME R AND SCHWARTZ G Modeling Cyber-InsuranceTowards a Unifying Framework In WEIS (2010)

[34] BOLLEN J MAO H AND ZENG X Twitter mood predictsthe stock market Journal of Computational Science 2 (2011)1ndash8

[35] BOSER B E GUYON I M AND VAPNIK V N A TrainingAlgorithm for Optimal Margin Classifiers In Proceedings of the5th Annual ACM Workshop on Computational Learning Theory(1992) pp 144ndash152

[36] BOZORGI M SAUL L K SAVAGE S AND VOELKERG M Beyond Heuristics Learning to Classify Vulnerabilitiesand Predict Exploits In KDD rsquo10 Proceedings of the 16th ACMSIGKDD international conference on Knowledge discovery anddata mining (2010) p 9

[37] BRONIATOWSKI D A PAUL M J AND DREDZE M Na-tional and local influenza surveillance through twitter An analy-sis of the 2012-2013 influenza epidemic PLoS ONE 8 (2013)

[38] CHANG C-C AND LIN C-J LIBSVM A Library for Sup-port Vector Machines ACM Transactions on Intelligent Systemsand Technology 2 (2011) 271mdash-2727

[39] CORTES C CORTES C VAPNIK V AND VAPNIK V Sup-port Vector Networks Machine Learning 20 (1995) 273˜ndash˜297

[40] DUMITRAS T AND SHOU D Toward a Standard Benchmarkfor Computer Security Research The Worldwide IntelligenceNetwork Environment (WINE) In EuroSys BADGERS Work-shop (Salzburg Austria Apr 2011)

[41] DURUMERIC Z KASTEN J ADRIAN D HALDERMANJ A BAILEY M LI F WEAVER N AMANN J BEEK-MAN J PAYER M AND PAXSON V The matter of Heart-bleed In Proceedings of the Internet Measurement Conference(Vancouver Canada Nov 2014)

[42] FACEBOOK ThreatExchange httpsthreatexchange

[43] GUYON I GUYON I BOSER B BOSER B VAPNIK VAND VAPNIK V Automatic Capacity Tuning of Very Large VC-Dimension Classifiers In Advances in Neural Information Pro-cessing Systems (1993) vol 5 pp 147ndash155

[44] HORNG D CHAU P NACHENBERG C WILHELM JWRIGHT A AND FALOUTSOS C Polonium Tera-ScaleGraph Mining and Inference for Malware Detection Siam In-ternational Conference on Data Mining (Sdm) (2011) 131ndash142

[45] LAZER D M KENNEDY R KING G AND VESPIGNANIA The parable of Google Flu traps in big data analysis

[46] MITRE Common vulnerabilities and exposures (CVE) url-httpcvemitreorg

[47] NAYAK K MARINO D EFSTATHOPOULOS P ANDDUMITRAS T Some Vulnerabilities Are Different Than Oth-ers Studying Vulnerabilities and Attack Surfaces in the Wild InProceedings of the 17th International Symposium on Research inAttacks Intrusions and Defenses (Gothenburg Sweeden Sept2014)

[48] PANDIT S CHAU D WANG S AND FALOUTSOS C Net-probe a fast and scalable system for fraud detection in onlineauction networks Proceedings of the 16th 42 (2007) 210201

[49] PEDREGOSA F VAROQUAUX G GRAMFORT A MICHELV THIRION B GRISEL O BLONDEL M PRETTEN-HOFER P WEISS R DUBOURG V VANDERPLAS J PAS-SOS A COURNAPEAU D BRUCHER M PERROT M ANDDUCHESNAY E Scikit-learn Machine Learning in PythonJournal of Machine Learning Research 12 (2011) 2825ndash2830

[50] QUINN S SCARFONE K BARRETT M AND JOHNSON CGuide to Adopting and Using the Security Content AutomationProtocol (SCAP) Version 10 NIST Special Publication 800-117 July 2010

[51] RAJAB M A BALLARD L LUTZ N MAVROMMATIS PAND PROVOS N CAMP Content-agnostic malware protectionIn Network and Distributed System Security (NDSS) Symposium(San Diego CA Feb 2013)

[52] RATKIEWICZ J CONOVER M D MEISS M GONC BFLAMMINI A AND MENCZER F Detecting and TrackingPolitical Abuse in Social Media Artificial Intelligence (2011)297ndash304

[53] SAKAKI T OKAZAKI M AND MATSUO Y Earthquakeshakes Twitter users real-time event detection by social sensorsIn WWW rsquo10 Proceedings of the 19th international conferenceon World wide web (2010) p 851

[54] SCARFONE K AND MELL P An analysis of CVSS version 2vulnerability scoring In 2009 3rd International Symposium onEmpirical Software Engineering and Measurement ESEM 2009(2009) pp 516ndash525

[55] THOMAS K GRIER C AND PAXSON V Adapting SocialSpam Infrastructure for Political Censorship In LEET (2011)

[56] THOMAS K LI F GRIER C AND PAXSON V Conse-quences of Connectivity Characterizing Account Hijacking onTwitter ACM Press pp 489ndash500

[57] WANG A H Donrsquot follow me Spam detection in Twitter2010 International Conference on Security and Cryptography(SECRYPT) (2010) 1ndash10

[58] WOLFRAM M S A Modelling the Stock Market using Twittericcsinformaticsedacuk Master of (2010) 74

Introduction

The problem of exploit detection

Challenges

Threat model

A Twitter-based exploit detector

Data collection

Vulnerability categories

Classifier feature selection

Classifier training and evaluation

Exploit-related information on Twitter

Exploit-related discourse on Twitter

Information posted before disclosure

Users with information-rich tweets

Detection of proof-of-concept and real-world exploits

Early detection of exploits

Attacks against the exploit detectors

Related work

Discussion

Conclusions