New Project Proposal - S²ERC · 2018-09-12 · Therefore, predicting attacks to an organization is equivalent to predicting exploitable vulnerability of the software systems used

New Project Proposal

1. Long Term Goal(s)

This project aims to accurately predict, in real time, the types of attacks an organization may face in the near future. Accurate and early prediction of attacks on an organization enables preventive action before the attacks occur. The ultimate result of this project will be a standalone software tool. When an organization uses our tool, it continuously monitors and collects 1) the software used by the organization, 2) vulnerabilities of this software that are disclosed in public vulnerability databases (e.g., Common Vulnerabilities and Exposures (CVE)), and 3) tweets about these vulnerabilities on Twitter. The tool then predicts whether a vulnerability is exploitable. If it is, the tool further predicts when (e.g., within 1 day, 1 week, or 1 month) the vulnerability is likely to be exploited. Given these predictions, an organization can prioritize its efforts to update and patch its software systems before an attack happens.

2. Background for Long Term Goal(s)

Many attacks leverage software vulnerabilities. A vulnerability is a software bug that has security implications. An exploit is an instance of executing compromising code that leverages a vulnerability to subvert the software or system functionality. A software vulnerability is exploitable if we (or an attacker) can design an exploit based on the vulnerability.

Software systems inevitably have many vulnerabilities due to design flaws and implementation errors introduced by security‐unaware programmers. Indeed, the number of discovered software vulnerabilities has increased significantly in recent years [1,2]. However, a vulnerability does not by itself imply an imminent exploit; an attack requires a vulnerability that is actually exploitable. Therefore, predicting attacks on an organization amounts to predicting the exploitable vulnerabilities of the software systems used by the organization.

Existing approaches to predicting the exploitability of a vulnerability can be roughly classified into two categories: heuristics‐based and machine learning‐based. FIRST’s Common Vulnerability Scoring System (CVSS) [3], Microsoft’s exploitability index [4], and Adobe’s priority ratings [5] are examples of heuristics‐based approaches. These approaches assign a risk score to a vulnerability based on heuristics such as the complexity of the vulnerability, whether it requires user interaction to be exploited, and whether it requires authentication to be exploited. A higher risk score means a higher likelihood of being exploited in the near future. The key limitation of heuristics‐based approaches is that they assign high risk scores to too many vulnerabilities. In other words, they have a high false‐positive rate: they often label an unexploitable vulnerability as exploitable.

Machine learning‐based approaches [1,2] learn which vulnerabilities are exploitable from historical data. In particular, these approaches represent each vulnerability as a feature vector. Given a set of vulnerabilities with groundtruth labels (either exploitable or not), these approaches learn a machine learning classifier. The classifier is then used to predict whether a new vulnerability is exploitable. These machine learning‐based approaches were demonstrated to achieve higher accuracy than heuristics‐based approaches. However, these approaches 1) leverage either public vulnerability databases or (mainly) social‐media data, but not both; 2) rely on conventional machine learning classifiers (e.g., Support Vector Machines (SVM) [6]); and 3) are vulnerable to fake social‐media data. As a result, these approaches achieve unsatisfactory prediction accuracy and are vulnerable to adversarial manipulation.

Our proposed approach is machine learning‐based, but it differs from existing machine learning‐based approaches in three aspects: 1) we combine public vulnerability databases with social media, specifically Twitter; 2) we employ a more accurate prediction engine based on deep learning; and 3) we design a graph‐based method to filter out fake social‐media information and thereby avoid prediction spoofing.

Title: Predicting Organization Attacks via Mining Crowdsourcing Data

Date: 10/15/2016

Researcher Names: Neil Gong and Ratnesh Kumar

University: Iowa State University

3. Intermediate Term Objectives

In this project, our specific objectives are to design accurate and secure deep learning‐based methods to solve the following two problems, and to evaluate these methods using real‐world datasets.

Problem 1 (Exploitability Prediction). Given a software vulnerability, we predict whether the vulnerability is exploitable or not.

Problem 2 (Exploit‐time Prediction). Given an exploitable software vulnerability, we predict when (e.g., within 1 day, 1 week, or 1 month) the vulnerability will be exploited.

Figure 1 Learning phase

Figure 2 Prediction phase

Overview of our framework: Our framework has two phases: learning phase and prediction phase. Figure 1 shows our learning phase, while Figure 2 shows our prediction phase. In the learning phase, we learn two deep learning‐based classifiers to solve the above two problems. In the prediction phase, we use the classifiers to perform exploitability prediction and exploit‐time prediction for new vulnerabilities. The learning phase is periodically executed (e.g., every week) to update the classifiers.

Learning phase: First, we collect a set of vulnerabilities from a public vulnerability database (in our project we will use CVE). For each vulnerability, CVE provides rich information, including a unique ID, a description of the vulnerability, and comments from reviewers. As part of the same step, we retrieve tweets that mention CVE vulnerabilities from Twitter. Using the vulnerability ID, we can identify which vulnerability (or vulnerabilities) a tweet refers to. Before using these tweets, we apply a fake‐data filter to remove fake tweets. By combining the CVE data and tweets, we obtain rich information (e.g., CVE description and tweets) for each vulnerability.
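The step of linking tweets to vulnerabilities by ID can be sketched as follows. This is a minimal illustration, not the project's collector: the function name `group_tweets_by_cve` and the tweet texts are made up for the example, and the pattern assumes the standard `CVE-YYYY-NNNN` identifier format written with plain ASCII hyphens.

```python
import re
from collections import defaultdict

# CVE identifiers have the form CVE-YYYY-NNNN, with four or more digits in the
# sequence number. Matching is case-insensitive because tweets are informal.
CVE_ID = re.compile(r"\bCVE-(\d{4})-(\d{4,})\b", re.IGNORECASE)

def group_tweets_by_cve(tweets):
    """Map each CVE ID to the list of tweets that mention it."""
    by_cve = defaultdict(list)
    for tweet in tweets:
        # A tweet may mention several vulnerabilities; dedupe IDs within a tweet.
        ids = {f"CVE-{year}-{seq}" for year, seq in CVE_ID.findall(tweet)}
        for cve_id in sorted(ids):
            by_cve[cve_id].append(tweet)
    return dict(by_cve)

# Made-up tweet texts, used only to exercise the function.
tweets = [
    "PoC released for CVE-2014-0160, patch now",
    "cve-2014-0160 and CVE-2014-6271 both look bad",
    "no identifier in this tweet",
]
grouped = group_tweets_by_cve(tweets)
```

Tweets without an identifier are simply dropped here; a real collector might also resolve links or product names, which this sketch does not attempt.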

Second, we extract various features from the CVE data and tweets for each vulnerability. From the CVE text data, we will extract bag‐of‐words features [7], a standard technique in natural language processing. From tweets, we will extract features including bag‐of‐words, the number of tweets/retweets about the vulnerability, the number of users who tweet about the vulnerability, and the reputations of these users. After feature extraction, we can represent each vulnerability as a high‐dimensional vector. Often, only some of these manually engineered features are informative for the classification tasks. Therefore, we will further perform feature selection or dimension reduction to identify a subset of informative features. We note that feature selection and dimension reduction are widely studied in the machine learning community, and we do not aim to invent new methods. In particular, we will use widely adopted feature selection methods (e.g., [16] and information gain [8,9]) and dimension reduction methods (e.g., PCA [10]). After feature selection or dimension reduction, each vulnerability is represented as a lower‐dimensional but potentially more predictive feature vector.
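As a rough illustration of bag‐of‐words extraction followed by information‐gain feature selection, the sketch below uses only the Python standard library. The toy descriptions, the labels, and the whitespace tokenizer are assumptions made for the example; a real pipeline would use a proper tokenizer and far more data.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lowercased token counts for one document."""
    return Counter(text.lower().split())

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    conditional = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional

# Hypothetical toy data: descriptions and exploitability labels (1 = exploitable).
texts = [
    "remote code execution via buffer overflow",
    "buffer overflow allows remote attackers",
    "local denial of service",
    "information disclosure in local log file",
]
labels = [1, 1, 0, 0]
docs = [bag_of_words(t) for t in texts]

# Rank the vocabulary by information gain and keep the top few words.
vocab = sorted(set().union(*docs))
ranked = sorted(vocab, key=lambda w: information_gain(docs, labels, w), reverse=True)
top_features = ranked[:3]
```

On this toy data, words such as "remote" split the two classes perfectly and therefore rank above words such as "of", which appear in only one document.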

Third, we will collect groundtruth labels for vulnerabilities. The groundtruth label for a vulnerability includes 1) whether it is exploitable, and 2) if it is exploitable, the time between vulnerability exposure and exploitation. We will collect groundtruth labels from multiple sources such as Symantec’s public attack databases and the Twitter data stream. For instance, Symantec’s attack‐signature database [11] lists various exploits, and some of them are related to specific CVE vulnerabilities. Figure 3 shows an example exploit from Symantec’s attack‐signature database. This exploit is designed to leverage the vulnerability with an ID of CVE‐2002‐0005. Symantec’s security‐response database [12] further includes the time when an attack was discovered, from which we can obtain an approximate groundtruth time between vulnerability exposure and exploitation.
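One way to turn the two dates into an exploit‐time class label is sketched below. The bucket boundaries (1, 7, and 30 days) and the function name `exploit_time_label` are assumptions for illustration; the proposal names the classes but does not fix exact thresholds.

```python
from datetime import date

# Assumed boundaries in days; the proposal does not define exact thresholds.
BUCKETS = [(1, "1 day"), (7, "1 week"), (30, "1 month")]

def exploit_time_label(disclosed, exploited):
    """Bucket the delay between disclosure and first observed exploit."""
    if exploited is None:
        # No known exploit: there is no exploit-time label for this vulnerability.
        return None
    # Clamp to zero in case an exploit was observed before public disclosure.
    delay_days = max((exploited - disclosed).days, 0)
    for limit, name in BUCKETS:
        if delay_days <= limit:
            return name
    return ">1 month"

# Hypothetical dates, used only to exercise the function.
label = exploit_time_label(date(2016, 1, 1), date(2016, 1, 5))
```

Because the discovery time in an attack database only approximates the true first exploit, labels near a bucket boundary should be treated as noisy.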

[Figure 1 components: vulnerabilities from CVE; Twitter; fake‐data filter; vulnerability‐related tweets; feature extractor; groundtruth from multiple sources; deep learning engine; classifier for exploitability prediction; classifier for exploit‐time prediction]

[Figure 2 components: one vulnerability from CVE; Twitter; fake‐data filter; tweets about the vulnerability; feature extractor; classifier for exploitability prediction; classifier for exploit‐time prediction; outputs: Exploitable? When?]

Figure 3 An example exploit in Symantec's attack‐signature database.

Figure 4 Exploitability prediction

Figure 5 Exploit‐time prediction

Fourth, we will learn classifiers using the deep learning engine. Given a set of vulnerabilities, each of which is represented as a feature vector and has groundtruth labels, the deep learning engine outputs a classifier for exploitability prediction and a classifier for exploit‐time prediction. Figure 4 illustrates the deep neural network for exploitability prediction, and Figure 5 illustrates the deep neural network for exploit‐time prediction. We take exploitability prediction as an example to discuss the details. The first layer (the input layer) of the neural network represents the feature vector of a vulnerability, and the last layer (the output layer) represents whether the vulnerability is exploitable. The layers between these two are called hidden layers. Roughly speaking, recent deep learning studies [13,14] aim to (1) explore which kinds of hidden layers make deep neural networks more effective, and (2) develop efficient algorithms to learn the connection weights of deep neural networks with multiple hidden layers. In particular, recent results [14] demonstrate that deep neural networks whose hidden layers perform convolution, max pooling, and dropout achieve state‐of‐the‐art performance in various applications. Moreover, recently developed efficient algorithms [13,14] and big data infrastructures (e.g., Google’s TensorFlow [15]) allow scalable and efficient learning of the weights of a neural network with multiple layers. Suppose we are given a set of vulnerability‐exploitability pairs (i.e., a training dataset). For each vulnerability, the neural network computes a predicted exploitability. The weights of the neural network are learned such that the predicted outputs best match the desired outputs.
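To make "learning the weights so that predicted outputs match desired outputs" concrete, here is a minimal pure‐Python sketch of a network with one fully connected hidden layer, trained by stochastic gradient descent on cross‐entropy loss. It is deliberately much simpler than the convolution, max‐pooling, and dropout layers mentioned above, and it does not use TensorFlow; the two‐feature toy data, layer size, learning rate, and function names are all assumptions for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, w2, b2):
    """Return hidden activations and the predicted exploit probability."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    p = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
    return h, p

def mean_loss(data, params):
    """Average binary cross-entropy over (features, label) pairs."""
    total = 0.0
    for x, y in data:
        _, p = forward(x, *params)
        p = min(max(p, 1e-12), 1 - 1e-12)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(data)

def train(data, n_hidden=3, lr=0.5, epochs=300, seed=0):
    """Learn the weights by stochastic gradient descent with backpropagation."""
    rng = random.Random(seed)
    n_in = len(data[0][0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b1 = [0.0] * n_hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, y in data:
            h, p = forward(x, W1, b1, w2, b2)
            d_out = p - y  # gradient of the loss w.r.t. the output pre-activation
            for j in range(n_hidden):
                # Backpropagate through tanh before w2[j] is updated.
                d_hid = d_out * w2[j] * (1 - h[j] ** 2)
                w2[j] -= lr * d_out * h[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid * x[i]
                b1[j] -= lr * d_hid
            b2 -= lr * d_out
    return W1, b1, w2, b2

# Hypothetical toy data: two features per vulnerability, label 1 = exploitable.
data = [([1.0, 0.0], 1), ([1.0, 1.0], 1), ([0.0, 1.0], 0), ([0.0, 0.0], 0)]
params = train(data)
_, prob = forward([1.0, 0.0], *params)
```

The same structure extends to exploit‐time prediction by replacing the single sigmoid output with one output per time class.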

Prediction phase: In the prediction phase, for each new vulnerability, we use our learned classifiers to predict whether the vulnerability is exploitable and, if it is, when it will be exploited. Specifically, given a new vulnerability, we collect its text data from the CVE database and related tweets from Twitter. We then extract features from these data and perform feature selection/dimension reduction. Next, we use the exploitability classifier to predict whether the vulnerability is exploitable. If it is exploitable, we further use the exploit‐time classifier to predict when the vulnerability will be exploited.
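The two‐stage structure of the prediction phase can be sketched as a small cascade. The function `predict_vulnerability` and the two toy classifiers are hypothetical stand‐ins; in the proposed framework both stages would be the learned deep learning classifiers.

```python
def predict_vulnerability(features, is_exploitable, exploit_time):
    """Two-stage prediction: exploitability first, timing only if exploitable."""
    if not is_exploitable(features):
        return {"exploitable": False, "when": None}
    return {"exploitable": True, "when": exploit_time(features)}

# Hypothetical stand-ins for the two learned classifiers.
def toy_exploitability(features):
    return features["tweet_count"] >= 10

def toy_exploit_time(features):
    return "1 week" if features["tweet_count"] >= 100 else "1 month"

result = predict_vulnerability({"tweet_count": 250}, toy_exploitability, toy_exploit_time)
```

Running the timing classifier only on vulnerabilities predicted to be exploitable mirrors Problem 2, which is defined only for exploitable vulnerabilities.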

[Figure 3 annotations: Exploit; Vulnerability]

[Figure 4 layout: input layer (Feature 1, Feature 2, ..., Feature n‐1, Feature n); two hidden layers; output layer (Yes, No)]

[Figure 5 layout: input layer (Feature 1, Feature 2, ..., Feature n‐1, Feature n); two hidden layers; output layer (1 day, 1 week, 1 month, >1 month)]

Fake‐data filter: The fake‐data filter is an important component of our framework. Our framework leverages Twitter data, not all of which may be reliable: an attacker can register fake Twitter accounts and post fake tweets about vulnerabilities, which will influence the vulnerabilities’ feature vectors and eventually spoof our classifiers into making incorrect predictions. We note that CVE data is less likely to be spoofed because CVE requires editors to manually verify submitted vulnerabilities. We will leverage the social graph to detect fake users. The key observation behind our method is that fake users are unlikely to have the same level of connectivity with normal users as normal users have among themselves. For instance, on Twitter, although fake users can follow many normal users, few normal users will follow the fake users. Figure 6 illustrates our idea. An edge between two users means that they follow each other. Given some labeled normal users and labeled fake users, we propagate the label information through the social graph to predict the labels (normal or fake) of the unlabeled users.
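A simple form of label propagation over such a graph is sketched below: each unlabeled user repeatedly takes the average score of its neighbors while labeled users stay fixed. This averaging scheme is a generic stand‐in chosen for brevity, not necessarily the method the project will use; the user names, the graph, and the 0.5 decision threshold are made up for the example.

```python
def propagate_labels(edges, known, iterations=50):
    """Score each user in [0, 1]; 1.0 means normal, 0.0 means fake."""
    neighbors = {}
    for u, v in edges:
        # Edges are undirected: the two users follow each other.
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    # Unlabeled users start at 0.5 (no prior); labeled users keep their label.
    scores = {u: known.get(u, 0.5) for u in neighbors}
    for _ in range(iterations):
        updated = {}
        for u, nbrs in neighbors.items():
            if u in known:
                updated[u] = known[u]  # clamp the labeled users
            else:
                updated[u] = sum(scores[v] for v in nbrs) / len(nbrs)
        scores = updated
    return scores

# Hypothetical graph: two dense regions joined by a single edge.
edges = [
    ("alice", "bob"), ("bob", "carol"), ("alice", "carol"),  # normal region
    ("bot1", "bot2"), ("bot2", "bot3"), ("bot1", "bot3"),    # fake region
    ("carol", "bot3"),                                        # sparse connection
]
known = {"alice": 1.0, "bot1": 0.0}
scores = propagate_labels(edges, known)
predicted = {u: ("normal" if s >= 0.5 else "fake") for u, s in scores.items()}
```

Because only one edge crosses between the two regions, little label mass leaks across it, so the unlabeled users end up with the label that dominates their own region.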

Figure 6 Detecting fake users using social graph structure

4. Schedule of Major Steps:

Step 1 (0‐3 months): Collecting vulnerabilities from CVE, related tweets from Twitter, and groundtruth labels from Symantec’s public databases.

Step 2 (3‐6 months): Designing predictive features and learning deep learning classifiers.

Step 3 (6‐9 months): Evaluating and refining the classifiers.

Step 4 (9‐12 months): Designing and evaluating a method to detect fake users. We already have a Twitter dataset including both normal users and fake users, and we will use the dataset for evaluation.

5. Budget: 45K plus overhead is requested to support 1 PhD student for 1 year (including fringe benefits, tuition, and fees) and travel to attend two instances of the S2ERC showcase.

6. Staffing: We will support one graduate student.

7. Category of Current Stage: This is a new proposal seeking funding.

8. Contacts with Affiliates: We have not had any direct contact with any of the industry affiliates. We submitted a Snapshot of our new proposal idea to S2ERC in response to their proposal call in mid‐September, and were later selected to present our ideas at the Nov. S2ERC meeting.

9. Publications and Other Research Products

Publications on text processing and deep learning

Bing Hu, Bin Liu, Neil Zhenqiang Gong, Deguang Kong, Hongxia Jin. “Protecting Your Children from Inappropriate Content in Mobile Apps: An Automatic Maturity Rating Framework”. In ACM International Conference on Information and Knowledge Management (CIKM), 2015.

Mathias Payer, Ling Huang, Neil Zhenqiang Gong, Kevin Borgolte, Mario Frank. “What You Submit is Who You Are: A Multi‐Modal Approach for Deanonymizing Scientific Publications”. In IEEE Transactions on Information Forensics and Security (TIFS), 10(1), 2015.

[Figure 6 labels: normal region; fake region; known normal users; known fake users; unlabeled users (?); sparse connections]

Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Richard Shin, Emil Stefanov, Dawn Song. “On the Feasibility of Internet‐Scale Author Identification”. In IEEE Symposium on Security & Privacy (IEEE S&P), 2012.

Publications on fake‐user detection

Neil Zhenqiang Gong, Mario Frank, Prateek Mittal. “SybilBelief: A Semi‐supervised Learning Approach for Structure‐based Sybil Detection”. In IEEE Transactions on Information Forensics and Security (TIFS), 9(6), 2014.

Publications on social network analysis

Neil Zhenqiang Gong, Bin Liu. “You are Who You Know and How You Behave: Attribute Inference Attacks via Users' Social Friends and Behaviors”. In USENIX Security Symposium, 2016.

Shouling Ji, Weiqing Li, Neil Zhenqiang Gong, Prateek Mittal, Raheem Beyah. “Seed‐based De‐anonymizability Quantification of Social Networks”. In IEEE Transactions on Information Forensics and Security (TIFS), 11(7), 2016.

Shouling Ji, Weiqing Li, Neil Zhenqiang Gong, Prateek Mittal, Raheem Beyah. “On Your Social Network De‐anonymizablity: Quantification and Large Scale Evaluation with Seed Knowledge”. In ISOC Network and Distributed System Security Symposium (NDSS), 2015.

Neil Zhenqiang Gong, Di Wang. “On the Security of Trustee‐based Social Authentications”. In IEEE Transactions on Information Forensics and Security (TIFS), 9(8), 2014.

Neil Zhenqiang Gong, Wenchang Xu. “Reciprocal versus Parasocial Relationships in Online Social Networks”. In Springer Social Network Analysis and Mining (SNAM), 4(1), 2014.

Neil Zhenqiang Gong, Ameet Talwalkar, Lester Mackey, Ling Huang, Richard Shin, Emil Stefanov, Elaine Shi, Dawn Song. “Joint Link Prediction and Attribute Inference using a Social‐Attribute Network”. In ACM Transactions on Intelligent Systems and Technology (TIST), 5(2), 2014.

Neil Zhenqiang Gong, Wenchang Xu, Ling Huang, Prateek Mittal, Emil Stefanov, Vyas Sekar, Dawn Song. “Evolution of Social‐Attribute Networks: Measurements, Modeling, and Implications using Google+”. In ACM/USENIX Internet Measurement Conference (IMC), 2012.

10. References

[1] Bozorgi, M., Saul, L.K., Savage, S. and Voelker, G.M. Beyond heuristics: learning to classify vulnerabilities and predict exploits. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010.
[2] Sabottke, C., Suciu, O. and Dumitraș, T. Vulnerability disclosure in the age of social media: exploiting twitter for predicting real‐world exploits. In USENIX Security Symposium, 2015.
[3] Scarfone, K. and Mell, P. An analysis of CVSS version 2 vulnerability scoring. In 3rd International Symposium on Empirical Software Engineering and Measurement (ESEM), 2009, pp. 516–525.
[4] Alberts, B. and Researcher, S. A Bounds Check on the Microsoft Exploitability Index The Value of an Exploitability Index Exploitability.
[5] Adding priority ratings to adobe security bulletins. http://blogs.adobe.com/security/2012/02/when‐do‐i‐need‐toapply‐this‐update‐adding‐priority‐ratings‐to‐adobe‐securitybulletins‐2.html, 2012.
[6] Cortes, C. and Vapnik, V. Support‐vector networks. Machine Learning 20(3), 1995, pp. 273–297.
[7] Joachims, T. Learning to classify text using support vector machines: Methods, theory and algorithms. Kluwer Academic Publishers, 2002.
[8] Information gain for feature selection. https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
[9] Michalski, R.S., Carbonell, J.G. and Mitchell, T.M., eds. Machine learning: An artificial intelligence approach. Springer Science & Business Media, 2013.
[10] PCA. https://en.wikipedia.org/wiki/Principal_component_analysis
[11] https://www.symantec.com/security_response/attacksignatures/
[12] https://www.symantec.com/security_response/landing/vulnerabilities.jsp
[13] Hinton, G.E., Osindero, S. and Teh, Y.‐W. A fast learning algorithm for deep belief nets. Neural Computation 18(7), 2006, pp. 1527–1554.
[14] Krizhevsky, A., Sutskever, I. and Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[15] TensorFlow. https://www.tensorflow.org/
[16] Guyon, I. and Elisseeff, A. An introduction to variable and feature selection. Journal of Machine Learning Research 3, 2003, pp. 1157–1182.
