14
CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, [email protected] Theofilos Petsios, [email protected] Marios Pomonis, [email protected] Adrian Tang, [email protected] Abstract In the past decade, code obfuscation tech- niques have become increasingly popular due to their wide applications on malware and the numerous violations of intellectual property caused by reverse engineering. In this work, we examine common techniques used for code obfuscation and provide an outline of the design principles of our tool Confuse. Confuse is an LLVM tool which modifies the standard compilation steps to produce an obfuscated binary from C source code. Keywords: code obfuscation, CFG alter- ation, hardening, reverse engineering 1 Introduction The distribution of current software applica- tions is plagued with vast financial losses due to reverse engineering [1]. The term refers to the process of discovering the technologi- cal principles of a device, object, or system through the analysis of its structure, function, and operation. It involves recovering and mak- ing sense of the higher-level semantics of a compiled binary. Once a company develops an application and releases it, it ceases to have a strong control over who accesses their source code and who doesn’t. Rivals or adversaries can gain access to this code and use it for their own benefit without investing as much resources as the original authors. State-of-the- art disassemblers (eg. IDA Pro [2]) and de- compilers (eg. Hex-Rays Decompiler [3]) en- able people to readily reverse-engineer the in- ner workings of an executable. The effective- ness of the reverse engineers’ efforts is depen- dant on the amount of time and resources they are willing to put into reversing a particular piece of code. With such threat to intellectual property, there is a strong motivation to find effective ways to protect compiled software. Code obfuscation involves performing a se- ries of transformations on software, convert- ing it to a harder-to-be reverse-engineered form, which maintains the unmodified soft- ware’s functionality with a reasonable over- head. The primary motivation for code ob- fuscation is to protect as much as possible any intellectual property, such as algorithms and data structures, inherent in a software applica- tion that is being sold and distributed. Raising the bar for reverse-engineering, code obfusca- tion attempts try to harden, in terms of time and resources, the disassembling attempts per- formed by automated systems and possible ad- versaries. Our project aims to develop a code obfus- cating tool which takes C source code as input and produces an obfuscated binary, which will be more resistant to being reversed-engineered in comparison to the respective unobfuscated executable. 1

CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, [email protected] Theo los Petsios, [email protected] Marios

  • Upload
    others

  • View
    28

  • Download
    0

Embed Size (px)

Citation preview

Page 1: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

CONFUSE: LLVM-based Code Obfuscation

Chih-Fan Chen,[email protected]

Theofilos Petsios,[email protected]

Marios Pomonis,[email protected]

Adrian Tang,[email protected]

Abstract

In the past decade, code obfuscation tech-niques have become increasingly popular dueto their wide applications on malware and thenumerous violations of intellectual propertycaused by reverse engineering. In this work,we examine common techniques used forcode obfuscation and provide an outline ofthe design principles of our tool Confuse.Confuse is an LLVM tool which modifies thestandard compilation steps to produce anobfuscated binary from C source code.

Keywords: code obfuscation, CFG alter-ation, hardening, reverse engineering

1 Introduction

The distribution of current software applica-tions is plagued with vast financial losses dueto reverse engineering [1]. The term refersto the process of discovering the technologi-cal principles of a device, object, or systemthrough the analysis of its structure, function,and operation. It involves recovering and mak-ing sense of the higher-level semantics of acompiled binary. Once a company develops anapplication and releases it, it ceases to have astrong control over who accesses their sourcecode and who doesn’t. Rivals or adversariescan gain access to this code and use it for

their own benefit without investing as muchresources as the original authors. State-of-the-art disassemblers (eg. IDA Pro [2]) and de-compilers (eg. Hex-Rays Decompiler [3]) en-able people to readily reverse-engineer the in-ner workings of an executable. The effective-ness of the reverse engineers’ efforts is depen-dant on the amount of time and resources theyare willing to put into reversing a particularpiece of code. With such threat to intellectualproperty, there is a strong motivation to findeffective ways to protect compiled software.

Code obfuscation involves performing a se-ries of transformations on software, convert-ing it to a harder-to-be reverse-engineeredform, which maintains the unmodified soft-ware’s functionality with a reasonable over-head. The primary motivation for code ob-fuscation is to protect as much as possible anyintellectual property, such as algorithms anddata structures, inherent in a software applica-tion that is being sold and distributed. Raisingthe bar for reverse-engineering, code obfusca-tion attempts try to harden, in terms of timeand resources, the disassembling attempts per-formed by automated systems and possible ad-versaries.

Our project aims to develop a code obfus-cating tool which takes C source code as inputand produces an obfuscated binary, which willbe more resistant to being reversed-engineeredin comparison to the respective unobfuscatedexecutable.

1

Page 2: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

2 Related Work

Existing obfuscation methods and tools falltraditionally into one of the following cate-gories: layout, design, data and control ob-fuscation [4]. Layout obfuscation refers tothe process of modifying the layout of the soft-ware: deleting comments, changing the nameof variables, removing debugging informationetc. Design Obfuscation [5] refers to theeffort being made to obscure the design in-formation of the software. Such actions in-volve slitting and merging of classes etc. Dataobfuscation [4] techniques involve array andvariable splitting, conversion of data to proce-dures, changes in variables’ lifetime etc. Con-trol Flow Obfuscation (CFO) [6] techniquesuse abstract interpretation and opaque predi-cates to break up the program’s control flow.A very effective technique in this category isparallelization [4] and control flow flat-tening with history maintenance [7]. CFOcan also be achieved by hiding control flowinformation in the stack [8]. Various al-gorithms [9] [10] used in software containingself modifying code utilize dynamic ob-fuscation techniques. Recent works [11]propose a signals based approach: in thiscase the CFO is achieved through signals usedfor Inter Process Communication (IPC). Con-trol flow instructions are replaced with trap in-structions. When a trap instruction is raised,the respective signal triggers a signal handlerwhich invokes the restoring function which willtransfer the system control to the original tar-get address.

3 LLVM

For the purpose of our project we will usethe the LLVM[12] compiler framework withClang[13] as the front end. LLVM is one themost popular and frequently used compilerframeworks, largely because of the plethoraof its supported languages and architectures.

It utilizes the three phase design (front end -optimizer - back end) that provides flexibilityand modularity to the compiler and uses itsown Intermediate Representation (IR) as aninternal form of code presentation.

Figure 1: LLVM: Three Phase Design

As presented in figure 1, each front endtests the source code for errors and maps itto LLVM IR. Then a source and architectureindependent optimizer makes various transfor-mations on this IR in an effort to make execu-tion more efficient. Finally, depending on theprocessor family, a suitable back end translatesthe optimized IR to binary code. Clang is afront end that provides a unified parser for C-based languages and can be used for a varietyof purposes such as indexing, source analysis,refactoring etc.

In the following sections we will provide abrief description of the IR properties and ofthe optimizer’s architecture since Confuse willbe implemented as part of the optimizer.

3.1 LLVM IR

The aforesaid IR is a first class low-level lan-guage, with a small three-address-form virtualinstruction set and a potentially infinite num-ber of virtual registers. Unlike common as-sembly languages it provides some abstractionfrom the machine internals such as the ABI,even though it is strongly typed and uses la-bels. The IR can losslessly be transformed tothree distinct forms: the textual that can beunderstood easier by a human, an in mem-ory data structure used for the optimizations

2

Page 3: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

and an efficient and dense bitcode binary for-mat. Finally we should note that the IR fa-cilitates the communication between the vari-ous compilation phases since it allows a com-plete representation of the source code withoutthe need to extract information from previ-ous phases thus making each phase completelyseparate.

3.2 Optimizer Architecture

The optimizer is structured as a series of dis-tinct passes which are built into archive li-braries as loosely coupled as possible (i.e. theyare intended to be completely independentotherwise they explicitly request analysis in-formation from other passes). Each pass iswritten in C++, derives the Pass class andproduces a modified version of the IR it readsthat will be used as input for the next one.

This scheme gives the user the ability tochoose which passes she prefers to utilize whenoptimizing a program by making use of thePassManager. The PassManager makes surethat all necessary passes are included (e.g.all analysis passes needed by an optimizationpass) and tries to set the passes in the orderthat will yield the optimal result.

The most important advantage of this de-sign is that a programmer that wishes to createa new optimization pass can decide the passesthat need to be executed before and after theexecution of his new pass. This dynamic lay-out forces minimal overhead to new optimiza-tions since one can use only suitable passes astheir prelude and postlude while the not usedones will not be linked to them. Figure 2 showsan example of how a new PassXYZ that doesnot require all the passes can only be pairedwith PassA and PassB (and not PassC etc) forefficiency.

Figure 2: Pass Selection Using PassManager

4 Data Obfuscation

4.1 String Obfuscation

Strings are one of the most expressive datatypes used in high-level languages and the con-trol flow of many applications relies on the out-come of string comparisons. Let us considersnippet 1:

if (strcmp(cmd , "add") == 0)

{

/* handle add */

}

else if (strcmp(cmd , "remove") == 0)

{

/* handle remove */

}

else if (strcmp(cmd , "edit") == 0)

{

/* handle edit */

}

else

{

/* error */

}

Code 1: String comparisons

It is trivial for someone to infer that thecode of snippet 1 is used for the modificationof entities (e.g. files). Comparisons like theone of snippet 1 are very common especiallyin command-line applications that handle usercommands. At the same time they providevaluable insight to the application’s program-ming logic that can lead to a more efficient and

3

Page 4: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

accurate reverse-engineering.Confuse deals with this type of comparisons

by making use of cryptographic hash func-tions. Cryptographic one-way hash functionsare functions that map input from a large dataset to a bit string in a way that any change tothe input would result to a different bit-stringand that given the bit-string it is impossibleto compute the input. Examples of crypto-graphic hash functions include SHA-1 [14] andmd5[15].

We already know the possible (legitimate)values that the string can take, therefore wecan compute their hash values and then atrun-time use the hash function on the stringvariable and use its output for the compar-isons. The obfuscated version of snippet 1 canbe seen in snippet 2:

/* known a priori

Hash("add") = h1

Hash(" remove ") = h2

Hash("edit") = h3

*/

hash = Hash(cmd);

if (hash == h1)

{

/* handle add */

}

else if (hash == h2)

{

/* handle remove */

}

else if (hash == h3)

{

/* handle edit */

}

else

{

/* error */

}

Code 2: Obfuscated string comparisons

It is obvious that someone that tries to ana-lyze snippet 2 is unable to deduce its use bylooking at the predicates. Moreover in manycases after using this technique we can addmore cases and fill their body with junk bytesin order to force the reverse-engineer to waste

her time and resources in an effort to infer thebehavior of this branch.

4.2 Insertion Of Irrelevant Code

Another technique that is frequently used byobfuscating tools is the insertion of junk code.The purpose of this is to force mistaken con-clusions about the control flow of the programby adding unnecessary complexity (e.g. redun-dant operations) while simultaneously preserv-ing the semantics of the source code. Let usconsider snippet 3:

int x = 0;

x++;

return x;

Code 3: Original Code Sequence

A simplest way is to intersperse actual codewith some variables that never been used.This will make the code more complex, forcingthe reverse-engineer to waste time and effortsieving out the useless instructions. For ex-ample, the irrelevant variable y is introducedto code snippet 4. Since y is not used in theactual computation of x, we can introduce su-perfluous instructions on y without affectingthe original code. However, this kind of junkcode is easily found and will probably be elim-inate by code optimizer. We must use someother technique to insert junk code.

int x = 0, y;

x++;

y = 3;

return x;

Code 4: Obfuscated Code Sequence WithIrrelevant Variable

Instead of inserting the unused variable, wecan use some simple mechanisms to make thesequence significantly less visible. For exam-ple, in snippet 5 we added several mathemat-ical operations.

int x;

x++;

x = (3 * (x + 4) - 12 ) / 3;

4

Page 5: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

return x;

Code 5: Obfuscated Code Sequence

This transformation is based on the factthat (b ∗ (t+ b)− b ∗ a)/b = (bt+ ba− ba)/b =bt/b = t and makes the code much less prone toautomated reverse-engineering since the disas-sembler cannot know mathematical identities.Another advantage of this obfuscation is thatwe do not have to add any new variable intothe original program. The two numbers, a andb, is added in the IR level by our obfuscationcode. Since a and b are not bounded, they canbe used to reproduce any value needed andhave no effect to the final result.

The main disadvantage of the aforemen-tioned techniques is the space/time overheadintroduced into the program as these super-fluous instructions are executed at run-time.To alleviate this, we can choose to insert ir-relevant code as the body of an unconditionalbranch that is never taken. In this case werefer to this as dead code that will not be exe-cuted at run time. The insertion of such deadcode is made more effective when used in con-junction with the insertion of opaque predi-cates. We will describe this in more detail inSection 5.3.

5 Control Flow

Obfuscation

5.1 Opaque Predicates& Variables

We implement control flow obfuscation usingopaque predicates and variables. In generalterms, an opaque variable V has some prop-erty q that is known a priori to the obfuscatorbut which is difficult for the deobfuscator todeduce. Similarly, an opaque predicate is acondition that has an outcome that the obfus-cator knows a priori but which is hard for thedeobfuscator to compute, since the static anal-

ysis of all branches has exponentially largercost.

Definition: A variable V is opaque if atany point p in a program, V has a property qwhich is known at obfuscation time. We de-note this property by Vp

q. A predicate P isopaque if its outcome is known at obfuscationtime. If P always evaluates to True at point p,we write Pp

T , whereas if P always evaluate toFalse we write Pp

F . When it is uncertain if Pwill evaluate to true or false, we write Pp

?[4]:

A predicate is considered trivial if a deobfus-cator can crack it with local static analysis,that is, by examining a small number of pre-vious commands. A predicate is consideredweak if a deobfuscator can crack it with globalstatic analysis. The following examples areillustrative of the previous:

int x=5;

int y=8;

int v=x/y;

if (v>1) {...}

Code 6: Trivial Opaque Predicate

int x=5;

<...> // multiple commands where x

remains unchanged

if (x>1) {...}

Code 7: Weak Opaque Predicate

5.2 Choosing the properpredicate

The basic criteria for evaluating the quality ofan obfuscating transformation are [16]:

5

Page 6: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

• Potency, that is, how much obscurityan obfuscating transformation adds to theprogram

• Resilience, which refers to how difficultit is for an automatic deobfuscator tobreak the obfuscating transformation

• Stealth of Secrecy, which refers to howhard it is to tell if some part of code be-longs to the original program or the ob-fuscation part

• Overhead, which corresponds to the dif-ferences in the running time of the obfus-cated versus the normal program

5.2.1 Predicates based onmathematical identities

An easy way of producing opaque predicateswhich are always either true or false, is basedon number-theoretical identities. The follow-ing identities[17] hold for all values of x, y ∈ Z:

• 7y2 − 1 6= x2

• 3|(x3 − x)

• 2|x ∨ 8|(x2 − 1)

•2x−1∑

i=1, 2 6| ii = x2

The following hold for all values of x ∈ N

• 14|(3 ∗ 74x+2 + 5 ∗ 42x−1 − 5)

• 2|bx22c

If we initialize a variable a to a value Va atsome point in the execution of the program,then, say, the predicate Va|(x3 − x) will al-ways be true if Va is 3 and false if Va is not amultiple of 3. This category of opaque predi-cates is easy to produce but not very easy tobreak, though in general they are the strongestof weak predicates, granted that the optionsfor choosing the opaque predicate are limited,and thus, given time and effort, one can brute-force all possible options at runtime.

5.3 Transformations

5.3.1 Insertion Transformation

Consider the following figure [4]:

Consider the basic block S = S1;S2; ...Sn.We insert an opaque predicate P T into S, split-ting it in half (a). The predicate contains ir-relevant code since it will always evaluate toTrue. In figure (b) we split it into two halvesand then proceed to create two different ob-fuscated versions Sα, Sb, using different setsof transformations on each half and we selectbetween them at runtime. Finally in (c) weintroduce a bug to Sb and always select Sα.

5.3.2 Loop Condition InsertionTransformation

Consider the following figure [4]:

In this example we obfuscate a loop by mak-ing the termination condition more complex,adding a predicate that does not modify thenumber of times the loop gets executed.

The efficiency of the aforesaid methods canimprove significantly if we use it in conjunc-tion with the insertion of junk bytes as thecode to be executed in the (a priori known)

6

Page 7: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

non taken branch of an opaque predicate. Adeobfuscator in this case must evaluate bothbranches (since it cannot predict which one istaken) and therefore this will lead to the inef-fective use of time and resources and possiblyto mistaken conclusions.

6 Implementation

Our obfuscator will be implemented in C++since it is fully supported by Clang andwill modify C files using the techniques ex-plained above. This decision was based onthe fact that C is one of the most popularand frequently used programming languagesfor applications that are prone to be reverse-engineered and thus its obfuscation would bea valuable contribution to the community.

Confuse will consist of a series of optimiza-tion (i.e. obfuscation) passes that will trans-form the LLVM IR produced by Clang beforeit is translated to binary code suitable to theprocessor family.

We note that the modifications applied bythese passes depend on the source file given asinput. For example, if there is no string com-parison in the source code the string transfor-mation pass will not make any change. How-ever, since the opaque predicates mechanismdoes not rely on such assumptions, Confusecan employ this transformation on any inputsource code.

7 Testing

While our code obfuscation techniques aim totransform the input program to a form moreresistant to reverse-engineering efforts, theymust ensure that the original semantics of theprogram are preserved.

Being very extensive, the LLVM frameworkhas a very robust and comprehensive testinginfrastructure. It has two major categories oftests, namely whole programs and regression

tests. The whole programs are contained intest-suites which can be compiled using spe-cific compiler flags (to be tested). The outputof the compiled programs are then comparedto some reference output to verify the correct-ness of the LLVM functions. The regressiontests are small unit tests meant to test specificfunctionality.

We leverage on the LLVM testing infrastruc-ture to conduct our testing of our obfuscationpass libraries. We design both whole programsand regression tests to be executed with ourobfuscation passes to verify that the transfor-mations have not altered original behavior ofinput programs. The testing is mainly drivenby the LLVM Integrated Tester tool, lit, thatwe describe in the next sub-section.

Determining the correctness of a programafter program transformation is difficult. Tothis end, we choose to verify that the semanticsof the input program are preserved by compar-ing the output of a transformed program witha known reference output. We will use thisstrategy to do incremental testing of our ob-fuscation functionalities.

7.1 lit - LLVM Integrated Tester

Shipped together with the LLVM framework,lit is a portable tool used for executing LLVMand Clang regression and unit tests. It pro-vides a convenient way for developers of LLVMand Clang libraries to easily write and run testfiles during the development process. lit is de-signed to be a multi-threaded lightweight toolfor executing and reporting the success of theregression tests.

7.2 Example Regression Test

A regression test file can be any valid input file(such as C file, bitcode file, LLVM IR file) thatis accepted by any LLVM frontend tool. Sinceour code obfuscation targets C source files, wewill design the test files to be C source files,

7

Page 8: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

instrumented with lit-specific commands forthe test verification.

Each test file must contain comment linesstarting ”RUN:” that instruct lit how to ex-ecute this file. These comment lines containshell commands that when executed determineif the test is successful or not. lit will reporta failure if the verification commands returnfalse.

An example of a regression test file is as fol-lows:

// RUN: clang -emit -llvm %s -c -o -|

opt -load lib.dylib -objunk > %t1

// RUN: lli %t1 > %t2

// RUN: diff %t2 %s.out

#include <stdio.h>

void f_div(int i, int *k)

{

int j = 2;

*k = i / j;

}

int main()

{

int k;

f_div(8, &k);

printf("%d\n", k);

return 0;

}

Code 8: Simple regression test

This regression test is designed to verify thatthe following functionalities are preserved af-ter code obfuscation: (1) Simple printing ofstrings, (2) Integer division, (3) Parameterspassed by-value and by-ref. The set of RUNcommands compiles this C file into a LLVMbitcode file, transforms this bitcode file usingour obfuscation pass (specifically junk instruc-tions insertion), executes this obfuscated bit-code file and verifies that the output is 4. Ifthe output is not what we expect, the diff com-mand will return false, and lit will report atest failure.

7.3 Test Design

We have structured the test suite to consist ofboth set of tests specific to the individual ob-fuscation passes, and those that test the cor-rect functioning combining multiple obfusca-tion passes together. We detail the tests spe-cific to each obfuscation pass below.

The string obfuscation targets the hiding ofstrings used in string comparisons. We designthe tests by changing the various orders of theparameters we pass into the string comparefunction. Besides using strings as the actualparameters, we also include tests that passin const string buffers which should be obfus-cated by our string obfuscation pass. We alsouse variants of string comparison functions likestrcmp() and strncmp().

The junk-code obfuscation involves trans-forming arithmetic assignment instructionsinto more convoluted but semantically equiv-alent instructions. Naturally, we design thetests that make use of arithmetic assignmentin a range of situations. This includes usingthe assignments with different arithmetic op-erators in while loops, for loops and plain se-quence of arithmetic operations.

The control-flow obfuscation using opaquepredicates is much more flexible compared tothe above two types of obfuscation as it canpractically be used on any kind of code. Inview of this, we reuse the tests from the othertwo types of obfuscations.

8 Results

In this section we will present brief results forevery obfuscation technique we implementedin an effort to showcase some of the advantagesobtained in the obfuscated file.

8.1 String Transformation

The main advantage of this technique ispresented in the respective section, however

8

Page 9: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

there is a valuable side effect that can occur ina number of occasions. Let us assume snippet9:

char a[] = "Should be visible after

obfuscation";

const char b[] = "Should NOT be

visible after obfuscation";

if (strcmp(a, b) == 0) {

/* code that does not use b */

}

/* code that does not use b */

Code 9: Example Snippet

In this case b can be eliminated entirelyfrom the source code since its only use is in thestring comparison. As a result, someone thattries to disassemble the binary will not know itever existed thus making his effort even harder.

In order to showcase this behavior wewill use the program strings that returnsall printable strings in a binary and ask itto return all printable strings of at least 10characters.

Figure 3: Printed message from unobfuscated(unmodified) executable

Figure 4: Printed message from obfuscatedversion of executable

As we can see the string is visible in theunobfuscated file whereas after the pass it isremoved.

8.2 Junk Code Insertion

In general, we should add some instructionsbefore a variable is stored, so we targeted onstore instructions in junk code insertion. How-ever, we do not have to insert junk code in ev-ery store instructions because it will not onlyslow the speed of programs but also easilybe detected by reverse-engineer. To ease theproblem, we focus only on integer, which ismost widely used in arithmetic, for loop andwhile loop. The way to insert junk code aremany. In this paper, we shows three differentways to insert the junk code.

1. d ∗ (x + e) = d ∗ x + d ∗ e

2. (ax + b)− (cx + b) = (a− c) ∗ x

3. (ax + b)− ((a− 1)x + c) = x + (b− c)

The original three address code of snippet 3is shown in Fig. 5. The result of each methodis shown in Fig. 6(a), Fig. 6(b), and Fig.6(c),respectively. Note that we do not dealwith variable initialization (e.g., instruction 2& 3 in Fig. 5) because its instruction is of-ten at the beginning of the function, whichmakes obfuscation very distinct from the orig-inal code. For clarification, the junk codewe inserted are those instructions begin with%temp. As we can see in Fig. 6(a) - 6(c), weadded about four or five instructions.

Figure 5: Original Three Address Code

9

Page 10: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

(a) Type1 insertion

(b) Type2 insertion

(c) Type3 insertion

(d) Inserting junk code multiple times

In Fig. 6(d), we use Type2 insertion follow-ing by Type0 insertion, the junk code becomesmore complicated than the original code.

8.3 Control Flow Obfuscation

We successfully managed to alter the controlflow of the binary without altering its seman-tics, using the methods described section 5.We picked random spots to insert the opaquepredicates inside the binary. The variablesused as arguments in the equations formingthe predicates are placed in such positions inthe binary so that it won’t not be easy foran adversary to infer that they are used inan opaque predicate with local static analysis.The insertion of the opaque predicate alterssignificantly the control flow graph, as it canbe seen in Figures 6 & 7. These figures de-pict the connections between basic blocks fora simple if-statement, before and after obfus-cation. It is clear that the flow graph after theobfuscation is more complex.

Figure 6: CFG outline before obfuscation

Figure 7: CFG outline after obfuscation

In order to evaluate the change in the com-plexity of the control flow graph as a result

10

Page 11: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

of our obfuscation, we define the cyclomaticcomplexity (CC) of a graph as follows[18]:For a graph G=(V,E) which consists of P con-nected components, the cyclomatic complexityM is

M = E − V + 2P

To measure the cyclomatic complexity ofany given program, we wrote a Python scriptthat runs on the IDA Pro Disassembler to com-pute the CC score from the disassembled pro-gram. Thus the CC is highly dependent onhow well the IDA Pro Disassembler can disas-semble a given program. This is a reasonableassumption since it approximates the effort ofa reverse-engineer attempting to derive the se-mantics of a program using this disassembler.

In Figure 8 we present some results fromthe change in the cyclomatic complexity of testprograms used in the evaluation phase of Con-fuse. We notice that the increase in the CCvaries from 40% to 225%, depending on theoperations in the binary.

Figure 8: Cyclomatic Complexity Change

9 Conclusion

Confuse takes advantage of the LLVM IR andobfuscate the source code at compilation time.This is a very flexible scheme since it removesthe unnecessary steps of compiling and then

running an obfuscation application to the pro-duced binary or assembly code. The imple-mentation of the obfuscation in the form ofoptimization pass libraries allows users selectand pick the obfuscation they want to use. Itis also very scalable. More importantly, it al-lows our implementation to be universal for allthe architectures and processor families sup-ported by LLVM because we modify the LLVMIR and not the (architecture specific) binarycode. To the best of our knowledge we are thefirst to create LLVM-based obfuscation tool forthe aforementioned techniques. Balachandranet al [8] obfuscate at link time using C/C++binary programs. Liem et al [7] use an front-end by Edison Design Group [19] and use Fab-ric++ as the intermediate representation. Fi-nally the most similar work to ours is by Sharifet al [20] who have also used the LLVM IR, butemployed different obfuscation techniques.

References

[1] Wikipedia. Reverse engineering, Decem-ber 2012.

[2] Chris Eagle. The IDA Pro Book: The Un-official Guide to the World’s Most Popu-lar Disassembler. No Starch Press, SanFrancisco, CA, USA, 2008.

[3] E. N. Dolgova and A. V. Chernov. Auto-matic reconstruction of data types in thedecompilation problem. Program. Com-put. Softw., 35(2):105–119, March 2009.

[4] Christian Collberg, Clark Thomborson,and Douglas Low. A taxonomy of obfus-cating transformations. Technical Report148, Department of Computer Science,University of Auckland, New Zealand,July 1997.

[5] Mikhail Sosonkin, Gleb Naumovich, andNasir Memon. Obfuscation of design in-tent in object-oriented applications. In

11

Page 12: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

In DRM 03: Proceedings of the 3rd ACMworkshop on Digital rights management,pages 142–153. ACM Press, 2003.

[6] Mila Dalla Preda and Roberto Gia-cobazzi. Control code obfuscation byabstract interpretation. In In Proc.32nd ICALP, LNCS 3580, pages 301–310.IEEE Computer Society, 2005.

[7] Clifford Liem, Yuan Xiang Gu, andHarold Johnson. A compiler-based in-frastructure for software-protection. InPLAS, pages 33–44, 2008.

[8] V. Balachandran and S. Emmanuel. Soft-ware code obfuscation by hiding controlflow information in stack. In Informa-tion Forensics and Security (WIFS), 2011IEEE International Workshop on, pages 1–6, 29 2011-dec. 2 2011.

[9] Liang Shan and Sabu Emmanuel. Mo-bile agent protection with self-modifyingcode. J. Signal Process. Syst., 65(1):105–116, October 2011.

[10] S.M. Darwish, S.K. Guirguis, and M.S.Zalat. Stealthy code obfuscation tech-nique for software security. In ComputerEngineering and Systems (ICCES), 2010International Conference on, pages 93 –99, 30 2010-dec. 2 2010.

[11] Igor V. Popov, Saumya K. Debray, andGregory R. Andrews. Binary obfusca-tion using signals. In Proceedings of16th USENIX Security Symposium onUSENIX Security Symposium, Berkeley,CA, USA, 2007. USENIX Association.

[12] Chris Lattner and Vikram Adve. LLVM:A Compilation Framework for LifelongProgram Analysis & Transformation. InProceedings of the 2004 InternationalSymposium on Code Generation and Op-timization (CGO’04), Palo Alto, Califor-nia, Mar 2004.

[13] Clang/LLVM Maturity Report,Moltkestr. 30, 76133 Karlsruhe- Germany, June 2010. Seehttp://www.iwi.hs-karlsruhe.de.

[14] D. Eastlake and P. Jones. US Secure HashAlgorithm 1 (SHA1). RFC 3174, 9 2001.

[15] R. Rivest. The md5 message-digest algo-rithm. RFC 1321, 1992.

[16] Christian Collberg, Clark Thomborson,and Douglas Low. Manufacturing cheap,resilient, and stealthy opaque constructs.In IN PRINCIPLES OF PROGRAM-MING LANGUAGES 1998, POPL98,pages 184–196, 1998.

[17] Genevive Arboit. A method for water-marking java programs via opaque pred-icates. In In Proc. Int. Conf. ElectronicCommerce Research (ICECR-5, 2002.

[18] Wikipedia. Cyclomatic complexity, De-cember 2012.

[19] http://www.edg.com/.

[20] Monirul Sharif, Andrea Lanzi, JonathonGiffin, and Wenke Lee. Impeding mal-ware analysis using conditional code ob-fuscation. Informatica, 2008.

12

Page 13: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

A Appendix

A.1 Usage

We have designed each obfuscation techniqueto be in the form of an optimization passwithin one single library Obfuscation.so. Eachobfuscation technique can be invoked as apass by passing in its corresponding flag-obstring, -objunk or -opaque predicates. Tomake a turn-key solution to transform andcompile a given .c source file to an obfuscatedbinary executable, we wrote a Makefile thatallows us to invoke specific obfuscation, orapply all the obfuscation passes together.

In the confuser directory, use the followingcommand to perform individual obfuscationon a given .c source file, with filenametest.c. This will create two sets of binary andreadable LLVM IR listing, one each for theoriginal code and obfuscated code. The filescan be found in the directories original andconfused respectively. In this way, we caneasily compare the differences between theoriginal code and the generated obfuscatedversion.

make [obstring|obop|objunk] TARGET=test

To use all the obfuscation passes on a givenfile, use the following command:

make all TARGET=test

In Section 8.3, we describe using the CCscore as a means to measure the approximatecomplexity of the control flow graph of aprogram. We augmented the Makefile toperform the disassembly and CC computationof the programs automatically by invokingthe IDA Pro Disassembler and the Pythonscript. This may take a while if the programis large. We can perform this CC evaluationby using the following command. The CCscore is computed for each function of agiven program. The average, minimum, andmaximum CC scores of the functions in the

program will be displayed for both the originaland obfuscated versions.

make evalcc TARGET=test

A.2 Lessons Learnt

A.2.1 Chih-Fan Chen

I think the most important thing that I learnedis not the context of obfuscation but how towork with other people. Before the project, Iknow nothing about obfuscation. In the be-ginning, I was afraid that I cannot catch upwith the team. I am glad that I have verygood mentor and nice teammates to help methrough. Though this project, I learned howto use the LLVM to read, create, and changethe instructions. I know more about how thecomplier works and how we can use some tech-nique to obfuscate others from reassemble thecode. By comparing the project and the ma-terials learned from class, I think it is betterthan learning from instructors speech.

A.2.2 Theofilos Petsios

Through this project I had the chance to havea greater understanding of the inner parts of acompiler, the optimization phases and the con-siderations that have to be taken into accountfor maintaining the semantics any source codethat gets compiled. It was made clear to mehow important regression testing is in softwareengineering. I also learned a lot about LLVMand how to deliver a project in steps, beingpart of a small team, and also about how topresent your results and ask questions on thepros and cons of your software. Overall, theguidance from the teaching staff and my team-mates was a great aid. This project was a lotof fun and I believe we all have build a goodfoundation to continue this work in a researchlevel.

13

Page 14: CONFUSE: LLVM-based Code Obfuscationaho/cs4115_Spring-2013/... · CONFUSE: LLVM-based Code Obfuscation Chih-Fan Chen, cc3500@columbia.edu Theo los Petsios, tp2392@columbia.edu Marios

A.2.3 Marios Pomonis

This project gave me the opportunity to re-alize the importance of the unseen hero ofthe compilation process: the IR. Before westarted I considered having a strong back-endthe most crucial part of the compiler how-ever this changed drastically because I realizedthat without a well designed IR that will alloweffective transformation passes, the back-endwill only be able to make minor optimizationwhen producing assembly or the machine code.

I was also given the chance to study the in-ternals of the LLVM framework since I hadnever bothered doing it in the past. A futuretask that I have set for myself is to do thesame for the GCC something that illustratesmy amazement about those that I learned inthis process.

Last but not least, I was able to verify onceagain that given a hard-working and talentedcore of teammates any task becomes exponen-tially easier and simpler, especially if teamchemistry is there. Guidance and supportfrom the teaching staff is also vital. Since thiswas my first team project in Columbia, I be-lieve it is worth sharing.

A.2.4 Adrian Tang

Through this project, I have observed that theIR level seems to be an appropriate granular-ity to obfuscate a program given the availabil-ity of the source code. Working at this phasehas a good balance of being high-level enoughto have enough language semantic encoded inthe IR to perform meaningful obfuscation andyet low-level enough to ensure the obfuscationpersists to the target machine code. I havelearnt a tremendous amount ranging from thetechnical bits like the inner workings of LLVMand dissecting a program with disassembler, tothe general aspects of managing and collabo-rating in a project team. The learning curvewas steep but a great deal of fun. I also appre-ciate all the support and advice the teaching

staff has given us. I believe this project hasgiven us some foundation in the workings ofthe compiler that will certainly help us in thefurther pursuit of research in this area.

A.3 Future Works

Code obfuscation is still an active researchtopic of both academic and industry interest.While creating Confuse we were given the op-portunity to familiarize ourselves with manytechniques proposed in the bibliography. Tothe best of our knowledge only a small por-tion of the researchers implements their ob-fuscation in the optimizer level which in ouropinion yields significant benefits (with porta-bility and flexibility being the keywords in thiscase), thus we believe that if Confuse were tobe continued (as a research project) and be en-hanced with a novel technique that would takeadvantage of the its design it might produce aresearch paper.

Given the scope of the project, we regret notbeing able to conduct a more comprehensiveevaluation of the overhead introduced to anobfuscated binary. This is certainly an areathat we will look into for any future work.

14