16
Automatic Simplification of Obfuscated JavaScript Code: A Semantics-Based Approach Gen Lu Saumya Debray Department of Computer Science The University of Arizona Tucson, AZ 85721, USA Email: {genlu, debray}@cs.arizona.edu Abstract—JavaScript is a scripting language that is commonly used to create sophisticated interactive client- side web applications. With the popularity growth in those “Web 2.0” sites, browsing the Internet without JavaScript support is impractical. However, JavaScript code can also be used to exploit vulnerabilities in the web browser and its extensions, and in recent years it has become a major mechanism for web-based malware delivery. In order to avoid detection, attackers often take advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes the identification of malicious JavaScript code challenging: current ap- proaches to analyzing such code typically either make assumptions that may not hold in practice, or else require manual intervention that can be tedious and time-consuming. This paper describes a semantics-based approach for automatic deobfuscation of JavaScript code using dynamic analysis and slicing. Experiments using a prototype implementation indicate that our approach is able to penetrate multiple layers of complex obfuscations and extract the core logic of the computation, which makes it easier to understand the behavior of the code. I. I NTRODUCTION Recent years have seen a dramatic rise in the role of the World-Wide Web in many aspects of our day-to- day lives. Unfortuately, there has been a concomitant increase in the use of the Web as a malware-delivery platform: even a single visit by an unwary user to an infected website may be sufficient to cause an arbitrary malicious payload to be downloaded and executed without user’s permission or knowledge. This process, known as drive-by-downloading, is carried out by exploiting vulnerabilities in web browsers and/or their extensions [25], [26] and can be delivered us- ing a variety of web-based mechanisms [24]. Drive- by-downloads often rely on JavaScript, a scripting language widely used for generating dynamic web content. Attackers often take advantage of the dy- namic nature of JavaScript to create highly obfuscated code to avoid detection [13]. The growing prevalence of malicious JavaScript code is exemplified by the Gumblar worm, which uses dynamically generated and heavily obfuscated JavaScript to avoid detection and identification, and which at one point was considered to be the fastest-growing threat on the Internet [17]. Identifying malicious JavaScript code is not easy, however. The mere presence of obfuscated JavaScript does not, in itself, signal the presence of malicious content, since benign web pages also to use code obfuscation to protect intellectual property [7], [16]. Moreover, attackers often use server-side scripting to deliver randomly obfuscated code where each instance is syntactically different from the next (server-side polymorphism). For these reasons, static signature- based heuristics (e.g., “eval(’ and ‘unescape(’ within 15 bytes of each other” [17]) have limited success when dealing with obfuscated JavaScript. Traditional anti-virus tools that process web pages typically rely on such syntactic heuristics and so tend to produce a high misidentification rate [7], [13], [16]. A better solution would be to use semantics-based techniques that focus on code behavior. Unfortunately, current behavioral analysis techniques for obfuscated JavaScript typically require considerable manual inter- vention, e.g., to modify the code in specific ways or to monitor its execution within a debugger [19], [22], [32]. There has been some recent work on automated behavioral analyses of obfuscated JavaScript that in many cases has a “deobfuscator” component [4], [6], [7], [9], [16], as well as standalone JavaScript deobfus- cation tools [1], [10], [23], [30]. These deobfuscators all rely on some simple and intuitive assumptions about the obfuscation and the structure of the obfuscated code. Although these assumptions seem plausible, it is not difficult to construct obfuscations that violate them and thereby defeat the corresponding deobfuscators. This is illustrated in Section IV.

Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

  • Upload
    others

  • View
    6

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

Automatic Simplification of Obfuscated JavaScript Code:A Semantics-Based Approach

Gen Lu Saumya DebrayDepartment of Computer Science

The University of ArizonaTucson, AZ 85721, USA

Email: {genlu, debray}@cs.arizona.edu

Abstract—JavaScript is a scripting language that iscommonly used to create sophisticated interactive client-side web applications. With the popularity growth inthose “Web 2.0” sites, browsing the Internet withoutJavaScript support is impractical. However, JavaScriptcode can also be used to exploit vulnerabilities in theweb browser and its extensions, and in recent years ithas become a major mechanism for web-based malwaredelivery. In order to avoid detection, attackers often takeadvantage of the dynamic nature of JavaScript to createhighly obfuscated code. This makes the identificationof malicious JavaScript code challenging: current ap-proaches to analyzing such code typically either makeassumptions that may not hold in practice, or elserequire manual intervention that can be tedious andtime-consuming. This paper describes a semantics-basedapproach for automatic deobfuscation of JavaScript codeusing dynamic analysis and slicing. Experiments using aprototype implementation indicate that our approach isable to penetrate multiple layers of complex obfuscationsand extract the core logic of the computation, whichmakes it easier to understand the behavior of the code.

I. I NTRODUCTION

Recent years have seen a dramatic rise in the role ofthe World-Wide Web in many aspects of our day-to-day lives. Unfortuately, there has been a concomitantincrease in the use of the Web as a malware-deliveryplatform: even a single visit by an unwary user toan infected website may be sufficient to cause anarbitrary malicious payload to be downloaded andexecuted without user’s permission or knowledge. Thisprocess, known asdrive-by-downloading, is carried outby exploiting vulnerabilities in web browsers and/ortheir extensions [25], [26] and can be delivered us-ing a variety of web-based mechanisms [24]. Drive-by-downloads often rely on JavaScript, a scriptinglanguage widely used for generating dynamic webcontent. Attackers often take advantage of the dy-namic nature of JavaScript to create highly obfuscated

code to avoid detection [13]. The growing prevalenceof malicious JavaScript code is exemplified by theGumblar worm, which uses dynamically generated andheavily obfuscated JavaScript to avoid detection andidentification, and which at one point was consideredto be the fastest-growing threat on the Internet [17].

Identifying malicious JavaScript code is not easy,however. The mere presence of obfuscated JavaScriptdoes not, in itself, signal the presence of maliciouscontent, since benign web pages also to use codeobfuscation to protect intellectual property [7], [16].Moreover, attackers often use server-side scripting todeliver randomly obfuscated code where each instanceis syntactically different from the next (server-sidepolymorphism). For these reasons, static signature-based heuristics (e.g., “‘eval(’ and ‘unescape(’ within15 bytes of each other” [17]) have limited successwhen dealing with obfuscated JavaScript. Traditionalanti-virus tools that process web pages typically relyon such syntactic heuristics and so tend to produce ahigh misidentification rate [7], [13], [16].

A better solution would be to use semantics-basedtechniques that focus on code behavior. Unfortunately,current behavioral analysis techniques for obfuscatedJavaScript typically require considerable manual inter-vention, e.g., to modify the code in specific ways orto monitor its execution within a debugger [19], [22],[32]. There has been some recent work on automatedbehavioral analyses of obfuscated JavaScript that inmany cases has a “deobfuscator” component [4], [6],[7], [9], [16], as well as standalone JavaScript deobfus-cation tools [1], [10], [23], [30]. These deobfuscatorsall rely on some simple and intuitive assumptions aboutthe obfuscation and the structure of the obfuscatedcode. Although these assumptions seem plausible, it isnot difficult to construct obfuscations that violate themand thereby defeat the corresponding deobfuscators.This is illustrated in Section IV.

Page 2: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

This paper proposes a different approach to analyzeobfuscated JavaScript code. We collect bytecode exe-cution traces from the target program and use dynamicslicing and semantics-preserving code transformationsto automatically simplify the trace, then reconstructdeobfuscated JavaScript code from simplified trace.The code so obtained isobservationally equivalenttothe original program for the execution path considered,but has obfuscations simplified away, thereby exposingthe core logic of the computation performed by theoriginal code. The resulting code can then be examinedmanually or fed to other analysis tools for further pro-cessing. Our approach differs from existing approachesin that it makes no assumptions about the structure ofthe obfuscation and uses semantics-based techniquesto reveal the behavior of the code.

In previous work [18], we have described asemantics-based approach to deobfuscating “core”JavaScript, i.e. code that runs on a standaloneJavaScript engine. This does not account for interac-tions between the executing JavaScript code and thehost browser, which provides the Document ObjectModel (DOM) for manipulating HTML documents andthe ability to interact with external websites. This pa-per, by contrast, focuses on the more challenging taskof deobfuscating JavaScript code in the full contextof the browser’s execution environment. This is muchmore complex than the “core” JavaScript considered in[18], and admits many more options for obfuscation,but is able to handle the full range of behaviors of realJavaScript malware. We evalute our approach usinga prototype implemention and test it againest bothsynthetic programs and real malware code. The resultsshow that our approach can penetrate multiple layersof complex obfuscations, some of which cannot behandled by existing techniques, and extract the corelogic of the underlying computation.

The rest of this paper is organized as follows.Section II provides the background information ofJavaScript malwares and obfuscation techniques. Sec-tion III describes the concept of semantic-based de-obfuscation and the implementation of our approach.Section IV presents our experimental evaluation. Sec-tions V and VI discusses related and future work, andSection VII concludes. Appendix A presents the com-plete code contexts for malware sampleP4 discussedin Section IV.

II. BACKGROUND

This section provides an overview of the JavaScriptlanguage and the host environment of web browser anddescribes some widely used real-world code obfusca-tion techniques.

A. JavaScript Basics

The term “JavaScript”, commonly used to refer toa scripting language used for client-side programmingof dynamic websites, actually encompasses several dis-tinct components. These include the core programminglanguage, an implementation of the ECMA-262 stan-dard; and the host enviroment, namely, the DocumentObject Model (DOM) provided by web browser.

The core JavaScript language provides a set ofdata types (e.g.Boolean, String, Object), a collectionof built-in objects and functions (e.g.RegExp, Math,Date), and a prototype-based inheritance mechanism,among other things. Like most scripting languages,JavaScript is highly dynamic in nature. It is dynam-ically typed, which means that a variable can takeon values of different types at different points in aprogram. Properties of (associative-array based) ob-jects can be added/deleted on the fly, and code can begenerated from strings at runtime using built-inevalfunction.

The Document Object Model (DOM) is an API thatabstracts HTML documents as a structural represen-tation of objects and provides a mechanism for ma-nipulating this abstraction, thereby enabling JavaScriptcode to modify and interact with the content of webpages dynamically. For example,write method of thedocument object can be used to dynamically writeHTML expressions or JavaScript code to a document.In contrast to the built-in objects defined in coreJavaScript, objects defined in the DOM specificationare called “host objects”, and are provided as a part ofthe host environment by the web browser.

B. JavaScript Runtime

At the implementation level, JavaScript typicallyuses an expression-stack-based byte-code interpreter;modern implementations of these interpreters usuallycome with JIT compilers. For example, Mozilla’spopular FireFox web browser uses an open sourceJavaScript interpreter, SpiderMonkey, written in C/C++[21]. This is a single dispatching function that stepsthrough the bytecode one instruction at a time.

Page 3: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

As discussed in previous sections, client-sideJavaScript programs have the ability of generating codeat runtime, using various mechanisms provided by bothinterpreter and browser (e.g.eval(), document.write()and iframe, etc). Further, dynamic code generation canbe multi-layered, e.g., a string that iseval-ed mayitself contain calls toeval, and such embedded callsto eval can be stacked several layers deep. We referto such dynamic code generating ascode unfolding,and for each piece of code generated by runtimeunfolding, we call it acode context. Native and non-native functions are also terms being used through outthis paper. In JavaScript programs, functions definedin JavaScript (using keywordfunction) are callednon-native functions; and functions provided by the inter-preter or browser (e.g. built-in functions and methodsof host objects) are callednative functions. Unlikenon-native functions, native functions do not generatea bytecode trace when executed. We will see thisdifference of native/non-native functions is importantto our deobfuscation technique.

C. JavaScript Obfuscation Techniques

The dynamic nature of Javascript code makes pos-sible a variety of obfuscation techniques. Particularlychallenging is the combination of the ability to executea string using code unfolding, as described above,and the fact that the string being “executed” maybe obfuscated in a wide variety of ways. Howarddiscusses several such techniques in more detail [13].For example, the characters in the string can be en-coded in various ways, e.g., using %-encoding (a as%61, b as %62, . . . ), Unicode (a as \u0061, b as\u0062, . . . ), Base-64, etc. The string can be kept inencrypted, compressed, or permuted form. It can beconstructed at runtime by concatenating other stringstogether. Besides, in addition to the traditional “dotnotation” (obj.property) for object access, one can usea “bracket notation” (obj[“property”]) instead. In thelatter case, moreover, the use of a string as an arrayindex makes applicable all of the string obfuscationtechniques mentioned earlier.

The host environment of web browser also providesvarious options for obfuscation. One approach is tosplit code into several parts, either in the same fileor even into multiple files stored among web servers.This technique is frequently seen with web-basedmalware. Another approach takes advantage of DOMinteraction. For example, data can be stored in theHTML file, outside the<script> block, then retrieved

1 <div id=’x’>HelloWorld</div>2 <script>3 a = document; b = "Id";4 var c = a[f()](’x’)[’i’+’nn’+’erH’+’TML’];5 function f(){6 var p=’ent’+’By’+b;7 var q=’get’+’Elem’;8 return q+p;9 }10 </script>

Figure 1. Example of obfuscated JavaScript code

using document.getElementById() at runtime. And, ofcourse,document.write() is a more powerful weaponthaneval(), which can be used in combination of thoseobfuscation techqniues mentioned above, to generatescript, document elements storing data and pointers toexternal documents, all at runtime.

Figure 1 presents an example of JavaScript codeobfuscated using some of the techniques discussedabove. Line 3 of this code snippet uses bracket notationto reference object property, using the strings ‘getEle-mentById’ (obtained as the concatenation of the strings‘get’, ‘ Elem’, ‘ ent’, ‘ By’, and ‘Id’) and ‘innerHTML’(obtained as the concatenation of the strings ‘i’, ‘ nn’,‘erH’, and ‘TML’) as array indices instead of usingthe more straightforward dot-notation to obtain thecorresponding property values. Furthermore, it uses theDOM methoddocument.getElementById() to retrievedata, namely, the string ‘HelloWorld.’ For simplicity ofexposition, this code uses a very straightforward obfus-cation of the array index strings, namely, concatenationof a few smaller strings; the code could, however, justas easily have used arbitrarily more complex obfusca-tions to construct these strings. The script is equivalentto var c = document.getElementById(’x’).innerHTML.The value finally assigned to variablec is a string’HelloWorld’ which is retrived from the HTML<div>element with ID’x’.

Those obfuscation techniques can be combined in ar-bitrary ways with multi-layered code unfolding, whichmakes it difficult to determine the intent of a JavaScriptprogram from a static examination of the programtext. To make it more challenging, the payload can bescattered in multiple code contexts at different levels,with each piece using various obfuscation techniquesand hidden in the garbage code whose only purpose isto confuse deobfuscators. We will show in Section IVthis trick can defeat existing JavaScript deobfuscators,which assume the unobfuscated, complete payload is

Page 4: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

revealed in one of the (typically the last) unfoldedJavaScript code contexts.

There are also tools available for reducing the sizeof scripts [11], [33], usually by removing unnecessarywhitespaces and comments, and renaming symbols.This technique is called code compression or mini-fication, although it makes code difficult to read, thebehavior is still apparent. Therefore, we don’t considercode minification as obfuscation.

III. SEMANTICS-BASED DEOBFUSCATION

In this section, we describe the concept of semantics-based deobfuscation and the architecture of our proto-type system in more detail. In particular, we discusshow we collect execution information of JavaScriptprograms at runtime, and how we use dynamic slicingtechnique to identify “semantically relevant” code fromobfuscated script.

A. Overview

Semantics-based approach.Deobfuscation refersto the process of simplifying a program to removeobfuscation code and produce a simpler and function-ally equivalent program. Static analysis of obfuscatedJavaScript code usually cannot penetrate (possibly mul-tiple levels of) dynamic code unfolding to examinethe actual code that is materialized at runtime. Mostexisting JavaScript deobfuscators therefore resort todynamic analyses [6], [7], [10], [23], [30]. These tech-niques work for programs that match their assumptionsabout the code structure, but would have difficultyto automatically handle programs using more generalobfuscations.

The motivation of our approach is from the semanticintuition behind any deobfuscation process. In general,we cannot expect deobfuscation to produce the originalsource code for the program, either because the sourcecode is unavailable, or due to code transformationsapplied during obfuscation. All we can require, then,is that the process of deobfuscation must be semantics-preserving: i.e., that the code resulting from deobfus-cation beequivalentto the original program. For theanalysis of potentially-malicious code, a reasonablenotion of equivalence is that ofobservational equiv-alence, where two programs are considered equiva-lent if they behave—i.e., interact with their executionenvironment—in the same way. Since a program’s run-time interactions with the external environment occurthrough system calls, this means that two programs

are observationally equivalent if they execute identicalsequences of system calls (together with the argumentvectors to these calls).

This notion of equivalence suggests a simple ap-proach to deobfuscation: identify code that directly orindirectly affects the values of the arguments to systemcalls; these instructions are “semantically relevant”.Any remaining instructions, which are by definitionsemantically irrelevant, may be discarded. For theJavaScript code considered in this paper, the actualsystem calls are typically made from built-in browserroutines that appear as native functions. Our imple-mentation therefore uses native functions as a proxy forsystem calls: this is sound, but potentially conservativesince not all native functions lead to system calls.Then, to identify instructions that affect the valuesof native function arguments, we use dynamic slicing,applied at the byte-code level. One of the advantages ofdoing analysis at byte-code level is that the JavaScriptcompiler does part of the job for us: many obfuscationtechniques used to confuse human analysts or auto-mated script parsers can be revealed or removed aftercompilation. Examples of such tricks are discussed in[32], [34].

System overview.Our approach to deobfuscatingJavaScript code consists of the following steps, asshown in Figure 2:

1) Use an instrumented web browser to obtain anexecution trace for the JavaScript code underanalysis.

2) Construct a dynamic control flow graph fromcollected trace to determine the structure of theexecuted code.

3) Use our deobfuscation slicing algorithm to iden-tify semantically relevant instructions, i.e., in-structions that affect the externally-observable be-havior of the program. As previously discussed,externally-observable behavior is carried out bynative functions.

4) Decompile the dynamic control flow graph toan abstract syntax tree (AST) and label all thenodes constructed from resulting set of relevantinstructions.

5) Use semantics-preserving transformations to elim-inate invalid source-level constructs such asgotostatements. Finally, generate deobfuscated sourcecode by traversing the AST and printing onlylabeled (relevant) syntax tree nodes.

Our current implementation seperates trace collection

Page 5: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

Figure 2. Semantics-based JavaScript deobfuscation: System overview

from the remaining steps: the generated trace is writtenout to a file, which is then read by the trace analyzer.This is purely for convenience, since it is conceptuallystraightforward to build the analysis facilities directlyinto the web browser. Our current implementationwrites out the abstract syntax tree obtained at the endof the above process in the form of JavaScript sourcecode, but one can also imagine directly applying othermalware analysis tools to the syntax tree itself.

B. Trace Collection

We use an instrumented Mozilla FireFox webbrowser to collect the program’s execution trace. Fire-Fox first compiles JavaScript source code to bytecodeand then executes it using its embeded SpiderMon-key interpreter. Since obfuscations commonly used bymalware take advantage of the built-in functionalityof JavaScript interpreter as well as document relatedoperations provided by the browser (see Section II-C),our instrumentation covers both the interpreter andDOM.

Each byte-code instruction instance generated byour instrumented web browser includes instruction’saddress, opcode mnemonic, length (in bytes), andoperands, together with any additional informationabout the instruction that may be relevant. In particular,we print the following information, which is used forsubsequent steps of the deobfuscation:

– expression stack: set of memory locations on thestack that are read and/or written by the instruc-tion;

– constants: encoded as part of the byte-code in-struction, can be an integer or a reference to anobject (i.e., string and floating point number), forthe latter case, the actual value of constant isretrieved and printed instead of the reference;

– conditional and unconditional branches: the offset(relative to the current instruction) of the branch

target.– global variables, array elements, and object prop-

erty accesses: which property of which object isbeing defined or used;

– local variables and arguments: an index specify-ing which local variable or argument amongst thefunction’s locals is being accessed;

– function calls: the reference to the callee (func-tion object) and the number of arguments beingpassed, together with a flag indicating whether thecallee is a native function;

– function references: the reference to the functionobject in which this intruction belongs to. Thisinformantion is used to identify functions whichis not called explicitly;

– document.write flags: a flag indicating currentinstruction is a call todocument.write() function;

– document elements: the reference to the documentelement that is created or accessed by functionssuch asgetElementById();

– unfolded code: string passed to code generatingoperations (i.e.eval(), document.write(), etc.)

As discussed in Section II-B, the execution of a non-native function generates a bytecode trace while a callto a native function does not. However, the call to thenon-native function can not be determined merely bythe existence of bytecode trace. There are native func-tions take other functions as arguments, i.e. callbacks.These callback functions are invoked implicitly by thenative function and generate bytecode trace, makes itsimilar to the execution of a non-native function. Onesuch example isstring.replace(), it takes a callbackfunction as argument, the callback will be invokedafter the match has been performed. The callback result(return value) will be used as the replacement string.Therefore, a flag is used to distinguish calls to nativeand non-native functions, and for each instructioninstance in the trace, we use the function reference

Page 6: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

to indicate in which function this instruction belongs,in order to associate the execution and definition ofcallback functions. The callback functions can also beinvoked inside a non-native function, in that case, nospecial handling is required, therefore, in this paper,the termcallback is only used to refer to non-nativecallback functions invoked implicitly by native func-tion.

The references to document elements anddocu-ment.write flags are used to handle obfuscations involv-ing DOM operations, which is opaque to JavaScriptinterpreter, but is crucial for the purpose of deob-fuscating JavaScript in web pages: HTML documentelements can be created and modified dynamically, andare often used for storing data by obfuscated JavaScriptprograms (e.g., see Figure 1). Unlikeeval(), which isdirectly translated to an “eval” bytecode instruction, acall to document.write() is indistinguishable from othernative function calls. Thedocument.write flag is usedby our deobfuscation slicing algorithm to establish theconnection between HTML document and JavaScriptcode.

C. Control Flow Graph Construction

In principle, obtaining the static control flow graph(CFG) for a JavaScript program is possible. JavaScriptsource code is compiled into bytecode before exe-cution, and it is straightforward to decompile thisbytecode to an abstract syntax tree. In practice, thecontrol flow graph so obtained may not be very usefulif the intent is to simplify obfuscations away. Thereason for this is that dynamic constructs such aseval(), commonly used to obfuscate JavaScript code,are essentially opaque in the static control flow graph:their runtime behavior—which is what we are reallyinterested in—cannot be easily determined from aninspection of the static control flow graph. For thisreason, we opt instead for a dynamic control flowgraph, which is obtained from an execution trace ofthe program. However, while the dynamic control flowgraph gives us more information about the runtimebehavior of constructs such aseval(), it does so at thecost of reduced code coverage.

The algorithm for constructing a dynamic CFG froman execution trace is adapted from the algorithm forstatic CFG construction, found in standard compilertexts [3], modified to deal with dynamic executiontraces, plus the standard dominator analysis to iden-tify the loops. A more detailed discussion of issuesabout recovering the CFG from dynamic execution

trace, such as handling different instances of the sameinstruction and the lack of information about functions,are given in our previous paper [18].

One particular challenge for JavaScript dynamicCFG construction is how to deal with code generateddynamically. In order to distinguish code used forgenerating other code at runtime, from code used forother computation, we treat dynamically unfolded codesimilar to the way we handle non-native functions: aseperate CFG is constructed for each piece of dynam-ically unfolded code, as the function body; and theunfolding instruction (e.g.eval()) is treated as a call tothe function. This turns out to be conceptually simpleand also reflects the way in which theeval() constructis handled in the underlying implementation.

Another modification we made on standard CFGconstruction algorithm is the processing of callbackfunctions. For the function call instruction labeled asnative call, if the instruction follows it in the trace isnot adjacent (determined by instruction’s address), thenall the instructions between the native call instructionand its adjacent instruction are generated by (one ormore execution of) callback function. In this sub-trace,the boundary of each execution is identified and it isthen used for the CFG construction of correspondingcallback function. Also, as a preliminary step forapplying deobfuscation-slicing (discussed in SectionIII-D), the sub-trace of callback execution is cut offfrom the trace and associated with the call instructionto the native function which invokes the callback.

D. Semantic-Based Deobfuscation

Given the execution trace and control flow graph, thenext step is to identify the instructions that are seman-tically relevant to the program’s externally observablebehavior. For this, we use a variation on a programanalysis technique known as dynamic slicing.

In general, dynamic slicing is the problem of identi-fying, for a given execution of a programP , whichinstructions (or statements) inP actually affect thevalue of a given variable at a given point inP .Intuitively, this involves tracing dependencies betweenuses and definitions of variables; the issue is somewhatmore complicated in stack-based interpreters becausethe instructions that use the expression stack typicallydo not have their operands represented explicitly. Dy-namic slicing in such situations has been investigatedby Wang and Roychoudhury in the context of slicingJava byte-code traces [31]. We adapt their algorithm

Page 7: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

Input : A dynamic traceT ; an instruction instanceinstr ∈ T ; a dynamic control flow graphG;

Output : A slice S;

1 S := ∅;2 currFrame := lastFrame := NULL;3 LiveSet :=∅;4 stack := a new empty stack;5 I := instruction instance at the last position in T;6 while true do7 inSlice := false;8 Uses := locations of data used byI;9 Defs := locations of data defined byI;

10 if I is property access by bracket notation∧ allinstances of the corresponding instruction accessthe same namethen

11 Uses := Uses - location of the string argument;12 end

13 inSlice := I is instr;14 if I is a return instruction then15 push a new frame on stack;16 else if I is an interpreted function callthen17 lastFrame := pop(stack);18 else19 lastFrame = NULL;20 end21 currFrame := top frame on stack;

22 if I is an interpreted function call∧ I is not eval∧ I is not code-unfoldingdocument.write then

23 inSlice := inSlice∨ lastFrame is not empty;24 else if I is a control transfer instructionthen25 for each instructionJ in currFrame s.t.J is

control-dependent onI do26 inSlice := true;27 removeJ from currFrame;28 end29 end

30 inSlice := inSlice∨ (LiveSet∩ Defs 6= ∅);31 LiveSet := LiveSet− Defs;32 if inSlice then33 addI into S;34 addI into currFrame;35 LiveSet := LiveSet∪ Uses;36 end

37 if I is not the first instruction instance in Tthen38 I := previous instruction instance in T;39 else40 break;41 end42 end

Algorithm 1: Deobfuscation-Slicing algorithm

in two ways, both having to do with the dynamicfeatures of JavaScript used extensively for obfuscation.The first is that while Wang and Roychoudhury use astatic control flow graph, we use the dynamic controlflow graph discussed in Section III-C. The reason forthis is that in our case a static control flow graphdoes not adequately capture the execution behaviorof exactly those dynamic code unfolding constructs,such aseval() and document.write(), that we needto handle when dealing with obfuscated JavaScript.The second is in the treatment of dynamic constructsduring slicing, such as code-unfolding and the bracketnotation. Consider a statementeval(s): in the contextof deobfuscation, we have to determine the behaviorof the code obtained from the strings; the actualconstruction of the strings, however, is simply partof the obfuscation process and is not directly relevantfor the purpose of understanding the functionality ofthe program. When slicing, therefore, we do not followdependencies through any code unfolding statements.

A different approach is used to handle object prop-erty accesses using bracket notation. This access mech-anism is useful in situations where different executionsof the same piece of bracket-notation code accessdifferent object properties, e.g., when iterating overa list of properties in a loop. In such situations, thecode used to construct the various strings used for thebracket-notation access is relevant to understanding thebehavior of the program, and should be included inthe slice. On the other hand, if every instance of thebracket-notation code accesses the same property, thenthis access mechanism provides no additional benefitcompared to the more common dot-notation access;arguably, it makes the code a little harder to understand(especially if the string being used for the bracket-notation access is constructed dynamically), and sois obfuscatory. For this reason, given an instructioninstanceI in the trace that uses a bracket-notationaccess, we check whether all the instances ofI inthe trace access the same property names: if so, thedependency fromI to the code that constructss is notfollowed; instead, the constructed names is used indecompilation stage. Otherwise, if different instancesof I use different property namess, the dependencyfrom I to the code that constructss is treated normallyand the string construction code is included.

The key to any dynamic slicing based techniqueis to accurately capture the data and control flowof the target program, which is especially challeng-ing for slicing JavaScript code in web pages: DOM

Page 8: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

enables the interaction between JavaScript code andthe host HTML document, but the document relatedinformation is opaque to the JavaScript interpreter. Forexample, a program can store data in a HTML elementand retrieve it later, similar to the usage of variables(e.g., see Figure 1). Furthermore, JavaScript programcan also perform code unfolding at runtime usingDOM methods (e.g.document.write()), either directlyor from an external source. Therefore, to precisely sliceJavaScript program, it is important to keep track of theconnection between the byte-code used by interpreterand DOM operations. To this end, our slicing algorithmconsiders extra information provided by instrumentedDOM, as discussed in Section III-B. More specifically,the reference to the document element is printed at thepoint where it is accessed by DOM methods, whichrecovers the data dependence hidden in the nativefunctions.

In additon, we treatdocument.write() specially.doc-ument.write() method writes a string of text to theHTML document, which could be used both for basicdocument manipulation and code unfolding. Those twocases are handled differently, depends on whether newcode is introduced and executed. If JavaScript codeis dynamically generated and executed by calling thismethod, then our slicing algorithm handles the callexactly the same way aseval(): cut the dependencebetweendocument.write() and the code generated. Oth-erwise, we consider it a regular native function callwhich is used for output purpose.

The handling of callback functions requires some ex-tra processing. Assuming there is a call to native func-tion f which takes callbackc as an argument, in orderfor our slicing algorithm to follow the dependencyaccurately, we need to explicitly make connectionsbetween the return value off and the return value fromeach execution ofc, as well as the arguments passed tof and parameters ofc. More specifically, if the call tof is included in the slice because of data dependencyon f ’s return value, then return instructions from allthe execution ofc should be included in the slice; ifan instruction included in slice is dependent on anyparameter ofc, then the call instruction invokesfshould also be included in slice.

We refer to this algorithm as deobfuscation-slicing.The pseudocode is shown in Algorithm 1. Lines1− 5

are initialization. The algorithm traverses the executiontrace backwards, processing each instruction in orderfrom the last instruction to the first. Lines8−9 extractsfrom the trace the set of memory locations on the

stack that are read and/or written by the instruction,and similarly for properties and DOM elements. Lines10− 12 cut the dependency for property access usingbracket notation, as discussed above. If we encountera return instruction, this instruction must be in acallee function, and since the trace is being traversedbackwards we push a new frame on the stack (line 14);analogously, when we encounter a call to an interpretedfunction (native functions are not traced), we pop thestack because the call instruction is in the caller (line16). The underlying implementation handles dynamiccode generation viaeval() anddocument.write() like afunction call; line 22 of our algorithm ignores codeunfolding, as discussed above. To keep the descriptionof deobfuscation-slicing algorothm clear and concise,the detail of callback processing is omitted from pseu-docode.

Input : A dynamic traceT ; a dynamic control flowgraphG;

Output : All relevant instructionsR;

1 R ← ∅;2 U ← ∅;3 for i← length of T to 1 do4 instr ← i-th instruction instance in T;5 if instr is a code unfolding instructionthen6 U ← U ∪ Deobfuscation-Slicing(T , instr,

G);7 end8 end9 for i← length of T to 1 do

10 instr ← i-th instruction instance in T;11 if instr is a native function call∧ instr /∈ U then12 R ← R ∪ Deobfuscation-Slicing(T , instr, G);13 end14 end

Algorithm 2: Semantics-based deobfuscation algo-rithm.

The deobfuscation-slicing algorithm only solves halfof the puzzle, we still have to determine which instruc-tions affect program’s behavior, i.e. on which instruc-tions to apply deobfuscation-slicing. Our approach ofsemantics-based deobfuscation consists of two basicsteps, both steps rely on the deobfuscation-slicingalgorithm. The pseudocode is shown in Algorithm 2:

1) identify all instructions relevant to code unfold-ing (line 3-8). First, the algorithm traverses theexecution trace in order from the last instructioninstance to the first, for each instance of dynamiccode unfolding instructions in trace (e.g.eval()

Page 9: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

and call todocument.write()), the deobfuscation-slicing algorithm is applied on it to identifyinstruction instances relevant to code unfolding,which include native function calls contribute todynamic code generation. After this step, setUcontains all the instructions in traceT that arerelevant to code unfolding.

2) identify all instructions relevant to observablebehavior (line 9-14). The algorithm traversesthe trace backwards, applying the deobfuscation-slicing algorithm on each call to the native func-tion which is irrelevant to code unfolding (thosenot in setU as identified in the first step). Theresulting setR contains instructions semanticallyrelevant to the observable behavior of the script.

E. Decompilation

The slicing step described above identifies instruc-tions in the dynamic trace that directly or indirectlyaffect arguments to native function calls, which in-cludes functions that invoke system calls. Instead of re-computing a control flow graph considering only thoserelevant instructions, we adopt a simpler approachfor decompilation: transform theoriginal control flowgraph to the higher-level representation such as anabstract syntax tree (AST), and label those AST nodesconstructed from relevant instrucions. This way, weavoid the complexity of handling protential problemscaused by slicing, for example, basic blocks might bescattered and the branching target instruction might notin the slice.

A program in the byte-code representation of Spi-derMonkey can not be directly converted into validJavaScript source code, due to the existence of thoselow level branch instructions, e.g.ifne, goto, etc.Therefore, as the first step, we usegoto statement torepresent those branch operations in AST. Since theCFG has already been processed using loop analysisand function indentification, we need to construct anabstract syntax tree for each function. The basic blocksof the CFG are traversed in depth first order on thecorresponding dominance tree,goto node is createdin two cases: at the end of basic block that doesn’tend with a brach instruction, or whenever a branchinstruction is encountered. In addition to storing in-formation of target block ingoto nodes, we also keeptrack of a list of precedinggoto nodes in each targetnode. Once every basic block has been translated to anAST node, loop structures are constructed by creatinginfinite while loop node which, initially, contains only

the nodes of corresponding natural loop obtained fromsection III-C. Once we have an extened AST withgoto nodes, additional code transformation is appliedto generate valid JavaScript soure code. Basic blocknode and loop node in AST will be refered asblocknode.

F. Code Transformation

Introducing goto statments during decompilationallows us to apply a straightforward algorithm toconstruct AST, but JavaScript source code generateddirectly from this AST is invalid. To recover valid code,we need to transform the extended AST to eliminategoto statements, without changing the logic of theprogram.

Joelsson proposed agoto removal algorithm for de-compilation of Java byte-code with irreducible CFGs,the algorithm traverses the AST over and over andapplies a set of transformations whenever possible [14].We adapt this algorithm to handle JavaScript and theinstruction set used by the SpiderMonkey JavaScriptengine [21]. The basic idea is to transform the programso that eachgoto is either replaced by some otherconstruct, or thegoto and its target are brought closertogether in a semantics-preserving transformation. Thetransformation stops when none of the rules above canbe applied to the AST. The resulting syntax tree istraversed one last time, for each node labeled by thedecompiler described in section III-E, correspondingsource code has been printed out. Again, the detaileddescription of our transformation rules can be found in[18] .

G. Attacking our Algorithm

Intuitively, there are two ways by which an attackermight attempt to evade our approach to deobfuscation.The first is to hide relevant instructions, by adding fakedependency between them and strings to be unfolded.Our approach is immune to this technique, becausean unfolded strings depends on some codev doesn’tautomatically excludev from the resulting slice; if thereal workload depends onv, thenv would be added toslice regardless of the connection with code unfoldingoperation. In other words, only code which is solelyused for obfuscation would be eliminated. The secondevasion technique is to disguise the obfuscation codeas relevant by adding extra irrelevant native functioncalls and creating dependencies between the obfusca-tion code and those irrelevant calls. Our semantics-based approach can not automically simplify away this

Page 10: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

Figure 3. The test programsP1 andP2

kind of disguised obfuscation because, in general, theadditional native function calls potentially change theobservable behavior of the program. One approach tomitigating such attacks is to select, either manually orautomatically, a (possibly proper) subset of the nativefunction calls in the program that are used as the basisfor the slicing process described above. A detaileddiscussion of this issue is beyond the scope of thispaper.

IV. EXPERIMENTAL EVALUATION

We evaluated the efficacy of our ideas using a pro-totype implementation based on Mozilla’s open sourceFireFox web browser, which uses SpiderMonkey as itsJavaScript engine. We tested this prototype on threesynthetic programs as well as an actual JavaScriptmalware sample obtained from the Internet. First, weused two versions of the familiar Fibonacci program:this was chosen, first because it contains a variety oflanguage constructs, including conditionals, recursivefunction calls, and arithmetic; and second because itis small and familiar, which makes it easy to assessthe quality of deobfuscation. Our third synthetic testcase is a very simple program that obfuscated so as todistribute its “payload” over multiple code contexts.This poses a problem for most existing JavaScriptdeobfuscators, which assume that the entirety of thepayload is contained in one of the unfolded JavaScriptcode contexts. Finally, we tested our prototype usinga sample of actual malicious code obtained from theInternet by using the ‘wget’ command to retrieve thecontents of a URL extracted from a spam email sentto one of the authors.

Figure 3 shows two version of Fibonacci numbercomputation programs. The first one,P1 is shown

Figure 4. Fragments of obfuscated versions of the programP1

Figure 5. Deobfuscator outputs for programsP1 andP2

in Figure 3(a), this program is hand-obfuscated toincorporate multiple nested levels of dynamic codegeneration usingeval for each level of recursion. Thesecond program,P2, as shown in Figure 3(b), isalso hand-obfuscated, in which we added dependencybetween real workload and the value used byeval(local variablex in function fib). Three versions ofeach of these programs are used—the program as-is as well as two obfuscated versions—one using anobfuscator we wrote ourselves that uses many of theobfuscation techniques described in Section II-C, in-cluding DOM operation; and an online obfuscator [2].Figures 4 shows the fragments of obfuscated programscorresponding toP1; the obfuscated code forP2 arevery similar and not shown separately due to spaceconstraints.

The output of our deobfuscator for these programs isshown in Figure 5. Figure 5(a) shows the deobfuscated

Page 11: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

Figure 6. Unfolded code contexts from obfuscated version ofprogramsP3. the original code ofP3 is highlighted. Some smallercode contexts are omitted.

code for all three versions ofP1 (the original code,shown in Figure 3(a), as well as the two obfuscatedversions shown in Figure 4). Figure 5(b) shows thedeobfuscated code for all three versions ofP2. Foreach ofP1 and P2, the deobfuscator outputs are thesame for all of the three versions. It can be seen thatthe recovered code is very close to the original, andexpresses the same functionality. The results obtainedshow that the technique we have described is effectivein simplifying away obfuscation code and extracingthe underlying logic of obfuscated JavaScript code.This holds even when the code is heavily obfuscatedwith multiple different kinds of obfuscations, includingruntime decryption of strings and multiple levels of dy-namic code generation and execution usingeval() anddocument.write(), in particular, from simplified codeof P2 (Figure 5(b)), we could see that our approachhandles those code intented to be “hidden” byevalcorrectly.

All obfuscations shown in Figure 3 and 4 are typicaltechniques widely used in the wild, they also satisfythe assumption made by current deobfuscators: theunobfuscated, complete payload is revealed in one

Figure 7. Deobfuscator outputs for programsP3

of the unfolded JavaScript code contexts, i.e. if thedeobfuscator simply examines every string passed tocode unfolding operations such aseval() and docu-ment.write(), the unobfuscated payload can be directlyidentified in one of them. Our next test program,P3, ispurposely constructed to violate this assumption. Forillustrative purpose, we make the original logic ofP3

very simple, which consists of only three statements:

b=0; ++b; alert(b);

This code is manually obfuscated by hiding eachstatement into obfuscated variants ofP1 and P2, infour steps. First, we remove the calls to native functionalert() in P1 and P2 in Figure 3. Next, the modifiedP1 andP2 are obfuscated using the online obfuscator[2]. Then we insert first two statements ofP3 intothese two obfuscated programs, as if a part of theobfuscation process, and concatenate them together.Last, we apply one more level of obfuscation to thecode from last step, and attach the third statementof P3 to its end. Figure 6 shows the unfolded codecontexts of obfuscatedP3 as described above. Figure6(a) is the topmost level obfuscation, statementalert(b)of P3 resides in this context. Figure 6(b) is the contextunfolded by theeval() in Figure 6(a). This context con-sists of obfuscated Fibonacci number programs, andthe first two statements ofP3 are hidden in these twoobfuscation processes consecutively. Figure 6(c) and(d) present the logic of Fibonacci number computation,from P1 andP2, both unfolded by context of Figure6(b). Some of the smaller code contexts generated arenot shown here.

Although P3 is extremenely simple, identifying itsoriginal logic from unfolded contexts in Figure 6 isstill challenging: the original code is scattered amongdifferent code contexts at different obfuscation levels,hidden in garbage code; and it is easy to misidentifyP3

as Fibonacci computation. Therefore, as we can see,deobfuscators adopt the simple “context-unfolding”technique is very ineffective against the obfuscationwhich does’t satisfy its assumption. In comparison,

Page 12: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

<script>laKKs=’mCha’;jJt=’’;n9gs="37G51G67G105G102G114G97G109G101G37G50G48G115G114G9

...7 lines deleted...

51G67G47G105G102G114G97G109G101G37G51G69";ngs=document;n9gs=n9gs["split"](’G’);for(i=0;i<n9gs.length;i++){jJt+=String[’fro’+laKKs+’rCode’](n9gs[i])};ngs["w"+"rite"](unescape(jJt))</script>

Figure 8. Source code for malware sampleP4

Figure 9. Execution flow of malware sampleP4

Figure 7 presents the output of our deobfuscator, inwhich most of the obfuscation and garbage code areremoved, recovered code is very close to the originalP3, and expresses the same functionality. The extracode (functionf0 and f4) is introduced because of thecontrol dependency, which is possible to be simplifiedaway by further analysis, e.g. in this case, since noneof the arguments is relevant, the invocation of thefunctions can be simply replaced by their body. Weleave this for future work.

Finally, we evaluated our prototype system using aJavaScript malware sample,P4, which we collectedfrom the Internet as described earlier. The completecode contexts ofP4 is presented in Appendix A; Figure8 shows the initial obfuscated JavaScript, while Figure9 shows its high-level dynamic structure. Context 1resides in the web page opened by web browser; itis a small piece of obfuscated JavaScript code (seeFigure 8) that invokesdocument.write() method todynamically insert a hidden iFrame, and cause theload of an external web page. This newly loaded webpage contains more obfuscated code, which consists

Figure 10. Deobfuscator output for malware sampleP4

context 2 in Figure 9. Similarly, context 2 causesone more level of code unfolding usingeval() andgenerates context 3. context 3 is the intended payload,it opens a PDF file that exploits a vulnerability inAdobe Reader, this action is also conducted using adynamically created hidden iFrame. Figure 10 showsthe output of our deobfuscator. The recovered code isvery close to hand deobfuscated result and capturesthe essence of the malicious behavior ofP4. Thecall to document.write() in context 3 is part of thesimplified code since it is used as a method foroutputas discussed in Section III-D.

V. RELATED WORK

Most current approaches to dealing with obfuscatedJavaScript typically require a significant amount ofmanual intervention, e.g., to modify the JavaScriptcode in specific ways or to monitor its executionwithin a debugger [19], [22], [32]. There are alsoapproaches, such as Caffeine Monkey [10], intendedto assist with analyzing obfuscated JavaScript code, byinstrumenting JavaScript engine and logging the actualstring passed toeval. Similar tools include severalbrowser extensions, such as the JavaScript Deobfus-cator extension for Firefox [23]. The disadvantage ofsuch approaches is that they show all the code that isexecuted and do not separate out the code that pertainsto the actual logic of the program from the code whoseonly purpose is to deal with obfuscation.

Recently a few authors have begun looking atautomatic analysis of obfuscated and/or maliciousJavaScript code. Covaat al. [4] and Curtsingeret al.[7] describe the use of machine learning techniquesbased on a variety of dynamic execution featuresto classify Javascript code as malicious or benign.Such techniques typically do not focus on automaticdeobfuscation, relying instead on the heuristics basedon behavioral characteristics. Since obfuscation canalso be found in benign code and really is simply anindicative of a desire to protect the code against casualinspection, classifiers that rely on obfuscation-orientedfeatures are not reliable indicators of malicious intent.Our automatic deobfuscation approach can potentiallyincrease the accuracy of such techniques by exposingthe actual logic of the code. Saxenaet al. discussdynamic symbolic execution of JavaScript code usingconstraint-solving over strings [28]. Hallaraker andVigna describe an approach to detecting maliciousJavaScript code by monitoring the execution of theprogram and comparing the execution to a set of high-

Page 13: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

level policies [12]. All of these works are very differentfrom the approach discussed in this paper.

There is a rich body of literature dealing with dy-namically generated (“unpacked”) code in the contextof native-code malware executables [5], [15], [20],[27]. Much of this work focuses on detecting unpack-ing and identifying the unpacked code. By contrast, thework described here is not concerned with the identi-fication and extraction of dynamically-generated codeper se, but focuses instead on identifying instructionsthat are relevant to the externally-observable behaviorof the program.

VI. D ISCUSSION ANDFUTURE WORK

Our approach requires instrumenting the JavaScriptinterpreter within a web browser to write out a traceof the byte code being executed. Because this requiresinserting code into the interpreter, our current prototypeis implemented in the context of the open-sourceFirefox web browser. However, none of our ideas areFirefox-specific, and in principle they could be adaptedin a straightforward way to any browser whose sourcecode is available.

A potential concern with dynamic approaches, suchas ours, is that of code coverage: in theory, staticanalyses can examine all of the code in a programwhile dynamic analyses can only examine the codethat lies on a particular execution path. In practice,however, current static analyses for JavaScript are notable to actually penetrate constructs such aseval()and analyze the code generated at runtime; rather,they rely on syntactic heuristics such as the presenceof redirection, calls toeval(), and code/data entropy,to classify whether or not the code is potentiallymalicious. Examination of the simplified JavaScriptcode obtained from a tool such as ours can make iteasier to identify inputs that would cause the code toexecute alternative execution paths, which could thenbe explored in a suitable browser environment. Weleave an exploration of this issue as future research.

Another area of future research is the extension ourapproach to allow identification of attack code thatdoes not have any other semantic significance. Forexample, code whose only purpose is to position heapbuffers for a heap spray attack [8], [29], or stringscrafted to contain machine code that is written intosuch buffers, is currently not detected by our slicingalgorithm unless there is also an explicit connectionwith an argument to a native function call. Intuitively,

what is going on is that our slicing algorithm startswith a notion of a set of “interesting” data values andidentifies all of the other code that affects these values;the problem is that our current notion of “interestingvalues,” limited to arguments to native function calls,is too narrow to capture such attack code. We intend toexplore ways to address this issue by broadening thenotion of interesting data values appropriately, e.g., byalso including strings that appear to contain machinecode [9].

Finally, there are a few minor aspects of JavaScriptand its DOM interactions that we have not yet hadtime to implement fully within our decompiler. Forexample, our JavaScript decompiler currently doesnot handle exception-handling viatry-catch statements.This is straightforward implementation work and doesnot present significant conceptual challenges.

VII. C ONCLUSIONS

The common use of JavaScript code for web-basedmalware delivery makes it important to be able analyzethe behavior of JavaScript programs and, possibly,classify them as benign or malicious. For maliciousJavaScript code, it is useful to have automated toolsthat can help identify the functionality of the code.However, such JavaScript code is usually highly obfus-cated, and use dynamic language constructs that makeprogram analysis difficult.

We present a semantics-based approach for auto-matic deobfuscation of JavaScript code. We use dy-namic analysis and program slicing techniques to sim-plify away the obfuscation and expose the underlyinglogic of the JavaScript code in web pages. Moreover,this approach does not make any assumptions aboutthe structure of the obfuscation. Experiments using aprototype implementation indicate that our techniqueis effective even against highly obfuscated JavaScriptprograms.

ACKNOWLEDGEMENT

This work was supported in part by the NationalScience Foundation via grant nos. CNS-1016058 andCNS-1115829, and the Air Force Office of ScientificResearch via grant no. FA9550-11-1-0191.

REFERENCES

[1] jsunpack: A generic JavaScript unpacker.http://jsunpack.jeek.org/.

Page 14: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

[2] Online javascript obfuscator.http://www.daftlogic.com/projects-online-javascript-obfuscator.htm.

[3] A.V. Aho, R. Sethi, and J.D. Ullman. Compilers:principles, techniques, and tools. Pearson/AddisonWesley, 1986.

[4] D. Canali, M. Cova, G. Vigna, and C. Kruegel.Prophiler: A fast filter for the large-scale detectionof malicious web pages. InProc. 20th InternationalConference on World Wide Web, pages 197–206, 2011.

[5] K. Coogan, S. Debray, T. Kaochar, and G. Townsend.Automatic static unpacking of malware binaries. InProc. 16th. IEEE Working Conference on ReverseEngineering, pages 167–176, October 2009.

[6] M. Cova, C. Kruegel, and G. Vigna. Detection andanalysis of drive-by-download attacks and maliciousJavaScript code. InProc. 19th International Confer-ence on World Wide Web, pages 281–290, 2010.

[7] C. Curtsinger, B. Livshits, B. Zorn, and C. Seifert.Zozzle: Fast and precise in-browser JavaScript malwaredetection. InUSENIX Security Symposium, 2011.

[8] M. Daniel, J. Honoroff, and C. Miller. Engineer-ing heap overflow exploits with javascript. InProc.Second USENIX Workshop on Offensive Technologies(WOOT), 2008.

[9] M. Egele, P. Wurzinger, C. Kruegel, and E. Kirda. De-fending browsers against drive-by downloads: Mitigat-ing heap-spraying code injection attacks. InProc. 6thInternational Conference on Detection of Intrusionsand Malware, and Vulnerability Assessment, pages 88–106, 2009.

[10] B. Feinstein and D Peck. Caffeine Monkey: Auto-mated collection, detection and analysis of maliciousjavaScript.Black Hat USA, 2007.

[11] Google. Minify. http://code.google.com/p/minify/.

[12] O. Hallaraker and G. Vigna. Detecting MaliciousJavaScript Code in Mozilla. InProceedings of theIEEE International Conference on Engineering ofComplex Computer Systems (ICECCS), pages 85–94,Shanghai, China, June 2005.

[13] F. Howard. Malware with your mocha:Obfuscation and anti emulation tricks inmalicious JavaScript, September 2010.http://www.sophos.com/security/technical-papers/malwarewith your mocha.pdf.

[14] E. Joelsson. Decompilation for visualization of codeoptimizations. Master’s thesis, Royal Institute of Tech-nology, 2003.

[15] M. G. Kang, P. Poosankam, and H. Yin. Renovo: Ahidden code extractor for packed executables. InProc.Fifth ACM Workshop on Recurring Malcode (WORM2007), November 2007.

[16] S. Kaplanet al. “NOFUS: Automatically Detecting”+String. fromCharCode (32)+ “ObFuSCateD ”.toLow-erCase()+ “JavaScript Code”. Technical report, Mi-crosoft Research, 2011.

[17] A. Kirk. Gumblar and more on Javascriptobfuscation. Sourcefire Vulnerability ResearchTeam. http://vrt-blog.snort.org/2009/05/gumblar-and-more-on-javascript.html. May 22, 2009.

[18] Gen Lu, Kevin Coogan, and Saumya Debray. Au-tomatic simplification of obfuscated JavaScript code(extended abstract). InProc. ICISTM-12 Work-shop on Program Protection and Reverse Engineer-ing (PPREW), March 2012. Full version availableat: http://www.cs.arizona.edu/˜debray/Publications/js-deobf-full.pdf.

[19] P. Markowski. ISC’s four methods ofdecoding JavaScript + 1, March 2010.http://blog.vodun.org/2010/03/iscs-four-methods-of-decoding.html.

[20] L. Martignoni, M. Christodorescu, and S. Jha. Omni-Unpack: Fast, Generic, and Safe Unpacking of Mal-ware. InProc. ACSAC ’07, December 2007.

[21] Mozilla. SpiderMonkey.https://developer.mozilla.org/en/SpiderMonkey.

[22] J. Nazario. Reverse engineering ma-licious JavaScript. CanSecWest 2007,http://cansecwest.com/csw07/csw07-nazario.pdf.

[23] Wladimir Palant. FireFox add-on: JavaScriptdeobfuscator 1.5.7. https://addons.mozilla.org/en-US/firefox/addon/javascript-deobfuscator/.

[24] N. Provos, P. Mavrommatis, M.A. Rajab, and F. Mon-rose. All your iFRAMEs point to us. InProc. 17thUSENIX Security Symposium, pages 1–15, 2008.

[25] N. Provos, D. McNamee, P. Mavrommatis, K. Wang,and N. Modadugu. The ghost in the browser analysisof web-based malware. InProceedings of the firstconference on First Workshop on Hot Topics in Un-derstanding Botnets, pages 4–4. USENIX Association,2007.

Page 15: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

[26] N. Provos, M.A. Rajab, and P. Mavrommatis. Cyber-crime 2.0: when the cloud turns dark.Communicationsof the ACM, 52(4):42–47, 2009.

[27] P. Royal, M. Halpin, D. Dagon, R. Edmonds, andW. Lee. Polyunpack: Automating the hidden-codeextraction of unpack-executing malware. InProc.ACSAC ’06, pages 289–300, 2006.

[28] P. Saxena, D. Akhawe, S. Hanna, F. Mao, S. McCa-mant, and D. Song. A symbolic execution frameworkfor JavaScript. InProceedings of the 31st IEEESymposium on Security and Privacy, pages 513–528,Oakland, CA, USA, 2010.

[29] A. Sotirov. Heap feng shui in JavaScript.In Black Hat USA 2007, July 2007.https://www.blackhat.com/presentations/bh-usa-07/Sotirov/Whitepaper/bh-usa-07-sotirov-WP.pdf.

[30] Boban Spasic. Malzilla.http://malzilla.sourceforge.net/.

[31] T. Wang and A. Roychoudhury. Dynamic slicing onjava bytecode traces.ACM TOPLAS, 30(2):10, 2008.

[32] D. Wesemann. Advanced obfus-cated JavaScript analysis, April 2008.http://isc.sans.org/diary.html?storyid=4246.

[33] Yahoo! YUI compressor.http://developer.yahoo.com/yui/compressor/.

[34] Bojan Zdrnja. Advanced JavaScript obfuscation (orwhy signature scanning is a failure), April 2009.http://isc.sans.edu/diary.html?storyid=6142.

APPENDIX

COMPLETE CODE CONTEXTS OFP4

This section presents the complete code contexts ofmalware sampleP4 discussed in SectionIV.

Figure 11 presents the first context, it creates ahidden iFrame dynamically which points to a web page(i.e. code context 2) from a malicious site. The secondcontext, as shown in Figure 12, is unfolded by context1. It assembles the exploit code by decrypting infor-mation retrieved from a pair of hidden DOM elements,then executes it usingeval. Figure 13 presents the realexploit code ofP4, which tries to detect whether youhave the Adobe reader plugin in your browser andloads the payload — a PDF file created to exploita vulnerability in the Adobe PDF reader JavaScriptengine.

Figure 11. The first code context ofP4

Figure 13. The third code context ofP4

Page 16: Automatic Simplification of Obfuscated JavaScript Code: A ...genlu/pub/js-deobf-web.pdf · advantage of the dynamic nature of JavaScript to create highly obfuscated code. This makes

Figure 12. The second code context ofP4