Devietti 1
Joseph Devietti (devietti@seas)
Adviser: Dr. Milo Martin (milom@cis)
11 April 2006
Safe at Any Speed: Accelerating the CCured Programming Language
Abstract
The power and efficiency of the C programming language have historically made it the language of choice for writing low-level system and performance-critical code. Explicit memory management gives the programmer extensive control over how a program’s data will be laid out in memory, cutting away layers of abstraction that can sap performance. Of course, C is historically just as well known for the myriad security flaws that such low-level power has engendered. Buffer overruns (and related attacks) are constantly being discovered in real systems currently in production use. Yet such caveats have not diminished the importance of the C language, nor its widespread use. Several approaches have been proposed to ameliorate this situation. This project takes as its foundation the CCured system developed by Necula et al. [1]
CCured is a dialect of C that retains most of C’s expressivity – thus requiring little change to legacy code – while adding type safety. CCured uses “annotated” pointers (see Technical Discussion, below), allowing the memory-safe usage of pointers (as one finds in C# or Java) to be enforced, at the cost of performing some runtime checks, increased memory usage, and the use of a conservative garbage collector [2] to prevent dangling references.
My project is twofold. First, it characterizes the overhead that CCured entails, in terms of increased running time, dynamic instructions and memory references. Second, it explores alleviating those overheads by adding some additional structures to a modern microprocessor.
Related Work
The primary related work is the CCured system itself. Though the concept of adding annotations to C pointers has a rich history, CCured is one of the latest and most promising of such systems. The idea of annotations originated with Austin et al. [3] Numerous related implementations exist; for example, Jones and Kelly [4] use a separate tree structure to store annotations (instead of bundling them with the pointers themselves, as Austin et al. and CCured do). Some compilers have also implemented C security features, notably Steffen’s rtcc compiler [5], which adds array bounds checking, and the C Range Error Detector (CRED) [6] extension for gcc, which performs bounds checking for strings. CCured’s main contribution is its aggressive static optimization of checks based on type inference, which allows it to rigorously enforce memory safety – CCured eliminates both spatial violations, such as walking off the end of an array, and temporal violations, such as accessing memory that has previously been freed – with relatively low overhead compared to its predecessors. CCured is also designed with legacy support in mind; the porting of insecure code to CCured (a process known as “curing”) is highly automated.
In the realm of computer architecture, various proposals have been made to use hardware structures to provide security with minimal overhead. Xu et al. proposed hardware protection of function return addresses on the stack [7]. More involved proposals have been put forth as well, such as using the Intel x86 architecture’s segmented memory features to provide bounds checking for arrays [8], and an even more elaborate segmented memory system that couples very granular protection with an aggressive caching scheme for high performance (in which I have found much of the inspiration for my project, as discussed later) [9]. The disadvantages of targeting a specific kind of attack are clear; as an overview of buffer-overrun-style attacks [10] shows, “smashing” return addresses on the stack to arbitrarily redirect control flow is just one of a variety of attacks that have developed over the years. General memory safety will provide more robust defense in the face of evolving attack sophistication. But using a virtual memory mechanism to achieve memory safety, while sufficient, is arguably more involved than necessary. There seems to be room for a more balanced approach, providing a fast implementation of memory safety without invoking the full mechanism of virtual memory.
Technical Discussion
CCured’s Pointers
The first phase of my project seeks to answer the following questions: What are the general contributors to CCured’s performance overhead? What kinds of CCured pointer annotations are most frequent at runtime? Are these results robust across benchmark inputs, and across different benchmarks? How much does each kind of runtime check and annotation contribute to performance slowdown, both in terms of the running time of benchmarks and their memory usage? The second phase of this project entails an evaluation of various hardware techniques for reducing CCured’s performance overhead.
CCured has four main kinds of annotated pointers: SAFE, SEQuence, Forward-SEQuence and WILD. A diagram of the memory layout of each of these kinds of pointers is given in Figure 1. Pointers that are classified as SAFE are those which are never subject to pointer arithmetic, and for which CCured can statically infer type safety (that is, they are not subject to any bad casts between, say, an integer and a pointer). SAFE pointers require a null check on dereference, and nothing more. Pointers that are statically type-safe but have pointer arithmetic performed on them are classified as SEQ, short for “sequence,” pointers.
A SEQ pointer has base and end information encoded as two subsequent, adjacent pointers and, on every dereference, requires verifying that the memory location referred to is within its upper and lower bounds. FSEQ pointers are an optimization of SEQ pointers, leveraging the fact that virtually all pointer arithmetic is positive, and thus storing a base value for a pointer is not strictly necessary. FSEQ pointers have only end information encoded as the subsequent pointer; this end value is checked upon dereference. However, a check must be performed for every pointer arithmetic operation to ensure that the pointer value always increases and therefore will not overflow its machine representation and allow access to a memory location below the pointer’s initial target.
Finally, pointers for which type safety cannot be inferred, e.g. in cases when arbitrary casting is used, are classified as WILD. WILD pointers have a two-pointer representation: the pointer itself and a base pointer that points to a bounded region of memory (the length of which is encoded just before the target of the base pointer) which includes metadata tag bits that identify a possible additional WILD pointer to which the original WILD pointer points. The manipulation and verification of tag bits adds a high overhead to the use of WILD pointers, but such pointers can usually be avoided in practice by slightly modifying the original program.
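The SAFE, SEQ and FSEQ representations described above can be sketched in C as follows. This is a minimal illustration rather than CCured's actual runtime: the type names (safe_ptr, seq_ptr, fseq_ptr), the field order, and the helper functions are assumptions made for this example, and WILD pointers are left out because their tag-bit handling is not described in enough detail here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts; names and field order are illustrative only. */
typedef struct { int *p; } safe_ptr;                      /* one word    */
typedef struct { int *p; int *base; int *end; } seq_ptr;  /* three words */
typedef struct { int *p; int *end; } fseq_ptr;            /* two words   */

/* Each *_ok function returns 1 if a dereference would be allowed. */

/* SAFE: only a null check is needed. */
int safe_ok(safe_ptr s) { return s.p != NULL; }

/* SEQ: the pointer must lie in [base, end). */
int seq_ok(seq_ptr s) {
    uintptr_t p = (uintptr_t)s.p;
    return p >= (uintptr_t)s.base && p < (uintptr_t)s.end;
}

/* FSEQ: only the upper bound is checked on dereference. */
int fseq_ok(fseq_ptr f) { return (uintptr_t)f.p < (uintptr_t)f.end; }

/* FSEQ arithmetic: advance by n elements, refusing any step that would
 * make the pointer value decrease (a negative n wraps around as an
 * unsigned quantity and is caught by the same comparison).
 * Returns 1 on success, 0 if the check fails. */
int fseq_add(fseq_ptr *f, ptrdiff_t n) {
    uintptr_t old = (uintptr_t)f->p;
    uintptr_t next = old + (uintptr_t)n * sizeof(int);
    if (next < old) return 0;
    f->p = (int *)next;
    return 1;
}
```

With an array `int a[4]`, a seq_ptr of `{a + 1, a, a + 4}` passes seq_ok, while moving its pointer to `a + 4` fails. The struct sizes also make the space cost visible: a SEQ pointer occupies three words where a SAFE pointer occupies one.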
Figure 1: CCured pointer memory layouts
Benchmarks
Answering the performance questions outlined above required focusing on CCured’s runtime checks, specifically the bounds checking that it performs for SEQ pointers. Thus far, I have been able to successfully “cure” three SPECINT2000 benchmarks [11]: bzip2 (stream compression), vpr (circuit placement) and mcf (combinatorial optimization), and five benchmarks from the MiBench embedded benchmark suite [12]: susan (image recognition), dijkstra (graph traversal), blowfish (symmetric block cipher), sha (secure hash algorithm), and gsm (voice encoding). These benchmarks vary substantially in static code size, as Table 1 shows. A benchmark’s code size is a good heuristic for how difficult curing it will be. In some cases nothing at all is needed; in others, a few casts need to be made explicit, or function prototypes need to be included so that CCured can verify type safety. Sometimes a security violation in the original code has to be fixed, as was the case with blowfish – an array of size 8 was accessed up to index 31 – and with bzip2, where an array in the test suite was overflowed by about 230KB.
For more complex benchmarks, CCured’s complexity can become overwhelming. In the gzip benchmark, a member of the SPECINT2000 suite that I was not able to cure successfully, CCured’s garbage collector stops allocating memory before the benchmark can finish running. I have been unsuccessful so far in tracking down the source of the leak, which only appears when running the cured code; the benchmark runs fine without CCured, and also runs fine using CCured’s garbage collector by itself, as a replacement for C’s explicit memory management.
Time Overhead
CCured has two forms of overhead – the time of extra bounds checking, and a space
Figure 2: Running times. Normalized runtime (non-cured = 1) for bzip2 – graphic, bzip2 – program, bzip2 – source, mcf, vpr, susan, dijkstra and blowfish under four configurations: all checks, no null checks, no bounds checks, no runtime checks.
Benchmark    Lines of Code
bzip2        4641
vpr          16973
mcf          1915
susan        2122
dijkstra     350
blowfish     1658
sha          2984
gsm          5473
Table 1: Benchmark sizes
overhead of larger pointer representations. I examine the time overhead first. As Figure 2 demonstrates, CCured’s runtime overhead hovers around 20%, with some exceptions, such as susan and the pathological cases of sha and gsm shown in Figure 3. Turning off various of CCured’s runtime checks reveals that NULL checks contribute comparatively little to runtime overhead. The source of the time overhead lies instead with bounds checking. Disabling bounds checking yields a noticeable speedup over running with all checks enabled in almost all cases; the exception is the mcf benchmark, where CCured’s space overhead (instead of just its time overhead) seems to dominate. CCured’s wider pointer representations mean that, even if the pointer metadata is not being accessed (when all runtime checks are disabled), it still takes up space in the cache, and thus lowers the cache’s efficiency. This effect is only observable for benchmarks that use a large number of pointers, like mcf.
Runtime Checks
Courtesy of the 80/20 rule, which states that roughly 80% of a program’s running time is due to roughly 20% of its code, I expected significant portions of CCured’s overhead to lie in
Figure 4: Dynamic check counts. Percentage of total dynamic checks for bzip2 – graphic, bzip2 – source, bzip2 – program, vpr, mcf, susan, dijkstra, blowfish, sha and gsm, broken down into OTHER, SEQ2FSEQ, NULL, FSEQARITH, GEU, SEQ2SAFE and FSEQ2SAFE.
Figure 3: Pathological benchmarks. Normalized runtime (non-cured = 1) for sha and gsm under four configurations: all checks, no null checks, no bounds checks, no runtime checks.
just a few key areas, i.e. just a few kinds of runtime checks would contribute to the vast majority of total checks. This is very much the case, as Figure 4 shows. The six checks shown, out of the 23 checks that CCured provides, account for virtually 100% of CCured’s runtime checks across the eight benchmarks. While the distribution of checks varies considerably across the benchmarks (and even, in the case of bzip2, across different inputs to the same benchmark), the “solution space” of checks to optimize is promisingly small. The NULL check in particular is probably not worth implementing in hardware, given the small impact it has on runtime overhead. The other checks account for the bulk of CCured’s ~20% runtime overhead for this set of benchmark programs, and therefore warrant the most optimization effort. Moreover, there is some similarity to be exploited between the five remaining checks. Table 2
Check       C code                                                              x86 assembly
SEQ2FSEQ    if (ptr < base) { fail(); }                                         mov 0xC(%esp),%base
                                                                                mov 0x8(%esp),%ptr
                                                                                cmp %ptr,%base
                                                                                jnae fail
FSEQARITH   if (ptr + x < ptr) { fail(); }                                      mov 0x8(%esp),%ptr
                                                                                mov 0x8(%esp),%ptr2
                                                                                add x,%ptr2
                                                                                cmp %ptr2,%ptr
                                                                                jnae fail
GEU         if ((unsigned)a < (unsigned)b) { fail(); }                          same as SEQ2FSEQ
SEQ2SAFE    if ((unsigned)(ptr - base) >= (unsigned)(end - base)) { fail(); }   mov 0xC(%esp),%base
                                                                                mov 0x10(%esp),%end
                                                                                mov 0x8(%esp),%ptr
                                                                                sub %end,%base
                                                                                sub %base,%ptr
                                                                                cmp %base,%ptr
                                                                                jae fail
FSEQ2SAFE   if (ptr >= end) { fail(); }                                         mov 0x10(%esp),%end
                                                                                mov 0x8(%esp),%ptr
                                                                                cmp %end,%ptr
                                                                                jae fail
Table 2: Runtime checks in C and assembly (destination registers underlined)
makes these synergies more explicit; the C code and the assembly generated by gcc (with -O3 optimizations enabled) show just how similar these checks are (see footnote 1). For instance, the SEQ2FSEQ check, which constitutes less than 2% of the checks in vpr, and far less than 1% in every other benchmark, seems like a corner case, but would probably be worth optimizing as it is just the lower bound check of SEQ2SAFE. Similarly, GEU (greater-than-or-equal-to, unsigned) requires an unsigned subtraction to be performed, and the jae or jnae instructions branch based on the result of cmp’s unsigned subtraction of its operands, so the functionality of GEU is subsumed by other checks.
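The overlap between these checks rests on a property of unsigned arithmetic: a single unsigned comparison of the form used by SEQ2SAFE covers both the lower and the upper bound. The sketch below illustrates this. It was written for this discussion and is not code emitted by CCured; the function names are hypothetical, and addresses are modeled as uintptr_t values so that the wraparound is well defined.

```c
#include <assert.h>
#include <stdint.h>

/* Two-comparison form: explicit lower-bound and upper-bound checks. */
int in_bounds_two_cmp(uintptr_t ptr, uintptr_t base, uintptr_t end) {
    return ptr >= base && ptr < end;
}

/* One-comparison form, in the style of SEQ2SAFE. If ptr < base, the
 * unsigned difference ptr - base wraps around to a very large value,
 * which can never be below end - base (assuming base <= end), so the
 * lower-bound violation is caught by the same comparison. */
int in_bounds_one_cmp(uintptr_t ptr, uintptr_t base, uintptr_t end) {
    return (ptr - base) < (end - base);
}
```

For a region with base 0x1000 and end 0x1010, both functions accept 0x1004 and reject both 0x0FFC (below the base) and 0x1010 (at the end).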
Memory Overhead
Figure 5 presents the aggregate memory accesses (loads and stores) generated by the eight benchmarks, with all runtime checks enabled – the increased number of memory references due to wider pointer representations is an essential component of CCured, and so simulating CCured’s memory performance with checks disabled is not realistic. In most cases, CCured does not generate comparatively many more memory accesses than non-cured code. The exceptions are the pathological cases of sha and susan, shown in Figure 6, where there are substantially more memory accesses than in the non-cured code. This is especially so with sha, which requires almost an order of magnitude more references; susan requires many more stores, but only about 3 times as many references in general. (The benchmark gsm is not shown, because it produced spurious segmentation faults when running under cache simulation that I was unable to fix.)
Footnote 1: The x86 cmp instruction performs regular subtraction, but does not actually store the result anywhere. The condition codes are set just as if a regular subtraction were performed. The jae instruction branches if no borrow was required for the unsigned subtraction (carry flag clear); jnae branches if a borrow was required (carry flag set).
Figure 6: Memory references, pathological cases. Normalized memory references (non-cured = 1) for sha and susan, broken down into total references, loads and stores.
Figure 5: Memory references. Normalized memory references (non-cured = 1) for bzip2 – graphic, bzip2 – source, bzip2 – program, mcf, vpr, dijkstra and blowfish, broken down into total references, loads and stores.
CCured’s increased memory traffic is mostly unavoidable – given the assembly language versions of the runtime checks, presented previously, the mov instructions that load from memory are on the critical path. Memory simulation confirms that CCured adds more load instructions to a program than store instructions, which is the result of reading pointer metadata along with the pointer itself. While these loads can be somewhat parallelized (in the case of SEQ pointers, and given a cache with sufficient read ports), they cannot be elided.
Hardware Proposal
Our hardware proposal is aimed at reducing the raw instruction overhead of CCured, as well as some of the effects of its increased usage of memory bandwidth. The central idea is to cache pointers and their metadata (base and end pointers) in a “shadow” register file that mirrors the architectural register file. While each architectural register is 32 bits in size (for our simulated architecture), each shadow register is in fact three registers – a pointer register, which corresponds to the architectural register; a base register, which holds the base pointer for the value in the pointer register; and, similarly, an end register.
For the sake of simplicity and ease of implementation, FSEQ pointers were disabled when running CCured under simulation. Thus, of the five runtime checks listed in Table 2, only SEQ2SAFE is actually simulated. In the absence of FSEQ pointers, SEQ2FSEQ no longer exists, and FSEQARITH and FSEQ2SAFE are translated into SEQ2SAFE checks. The GEU check is not implemented because its operands are not necessarily pointers, which makes it difficult to accelerate in the general case using our proposed hardware.
With pointer values now stored in a fast parallel-access structure, performing a runtime check can operate on a single shadow register instead of three separate architectural registers. We model the expense of a runtime check under simulation as a single instruction – one could imagine it as something like bounds %eax (see footnote 2), implicitly operating on the shadowed copy of eax.
The shadow register file is best thought of as a cache – it does not always hold correct pointer metadata. If correct data is in a shadow register, then a bounds check can be executed in one instruction, reducing both the consumption of execution resources in the processor and the usage of memory bandwidth. If the shadow register is out of date, then the bounds check has to revert to its regular, serialized execution.
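The hit-or-fall-back behavior described above can be modeled in a few lines of C. This is a simplified software sketch under stated assumptions, not the simulator module used in the experiments: the type and function names are hypothetical, the register count of eight is chosen to match the x86 general-purpose registers, and a "hit" is taken to mean that the shadow entry is valid and its base and end agree with the metadata actually stored in memory.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_REGS 8  /* assumption: one shadow entry per x86 GPR */

/* Hypothetical shadow register: cached pointer, base and end. */
typedef struct {
    uint32_t ptr, base, end;
    int valid;
} shadow_reg;

typedef struct {
    shadow_reg regs[NUM_REGS];
    unsigned hits, misses;
} shadow_file;

/* Bounds check for the pointer held in architectural register r.
 * ptr is the register's current value; mem_base and mem_end are the
 * metadata as stored in memory (the actual program state).
 * Returns 1 if ptr lies in [base, end), 0 otherwise. */
int bounds_check(shadow_file *sf, int r, uint32_t ptr,
                 uint32_t mem_base, uint32_t mem_end) {
    shadow_reg *s = &sf->regs[r];
    if (s->valid && s->base == mem_base && s->end == mem_end) {
        sf->hits++;    /* fast path: modeled as a single instruction */
    } else {
        sf->misses++;  /* slow path: regular check, then refill */
        s->base = mem_base;
        s->end = mem_end;
        s->valid = 1;
    }
    s->ptr = ptr;      /* the pointer value itself changes frequently */
    return (uint32_t)(ptr - s->base) < (uint32_t)(s->end - s->base);
}
```

Starting from a zero-initialized shadow_file, the first check against a given region counts as a miss and refills the entry; later checks through the same register against the same region count as hits, even though the pointer value differs each time.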
Footnote 2: Interestingly, the x86 ISA already has a bounds-checking instruction (named BOUND), but its somewhat odd semantics take the value in a register as an integer index into an array (not a pointer to a memory location), and a value in memory that contains a packed integer representation of the minimum and maximum permissible indices, and checks for a violation. Presumably due to a general lack of use, the BOUND instruction is deprecated in the new 64-bit x86 ISA, x86-64.
We used the Simics full-system simulator to model this proposal [13]. The simulated architecture is a generic, in-order 5-stage pipeline attached to a standard memory hierarchy with separate level-1 instruction and data caches and a unified level-2 cache. The L1 instruction cache has a 0-cycle hit latency, the L1 data cache a 1-cycle hit latency. A hit in the L2 cache takes 10 cycles, and data memory 200 cycles. While not an accurate reproduction of any modern architecture, this model captures a first-order approximation of the primary sources of CCured's overhead: an increased dynamic instruction count, which will affect an in-order processor even more acutely than a modern out-of-order processor, and increased memory bandwidth, which will lower the efficiency of the caches.
Our proposal was inspired by Witchel et al.’s “Mondrian Memory Protection.” [9] Mondrian Memory Protection is a replacement for a segmented memory system; it allows for word-granularity control of the read-write-execute permissions of memory. High performance is achieved through aggressive caching in both a “permissions lookaside buffer” (analogous to a Translation Lookaside Buffer, but searching for a particular memory location’s permissions instead of physical address mapping) and “register sidecars” which augment each architectural register with the permissions of the address that was last looked up from that register.
The idea of a “sidecar” is pivotal to achieving parallel access to pointer metadata. This metadata is closely associated with a register’s value (the value of the pointer itself). Pointer metadata changes less frequently than the pointer itself does, and is accessed in a highly correlated fashion, so caching it based on the value in a register should exploit extant temporal locality to a large extent. This leads to a natural way of managing the shadow register file, which is much like that of a regular cache: update only on “misses.” When a bounds check occurs, the shadow register file is consulted. If its data is found to be out of date with the actual pointers (which are accessed “for free” to check the baseline of our system), then the shadow register file is updated to reflect the actual program
Address    Contents
0x1000     0xA
0x1004     0xB
0x1008     0xC
0x100C     0xD
0x1010     0xE
Table 3: Sample memory
Figure 7: dijkstra shadow register behavior. Number of accesses that hit and miss in the shadow register, shown separately for the pointer, base and extent fields.
state. Figure 7 demonstrates the extent to which such temporal locality is exploitable, by showing how often a bounds check against an architectural register “hits” in the corresponding shadow register – almost never for the pointer value itself, but quite frequently for the base and end values. The existence of valid metadata in the shadow register file can still allow a bounds check to take place quickly, requiring only the volatile pointer value itself to live in the architectural register file.
Though the metadata associated with a pointer almost never changes, trying to track the pointer value itself is quite difficult. We first implemented a “snooping” module to intercept all loads from memory and treat them as potential pointers. Given the contents of memory listed in Table 3, a load instruction reading from memory location 0x1008 into register %eax would cause the shadow register for %eax to take on 0xC as its value, 0xB as its base and 0xA as its end (CCured lays out its wide pointers such that the base is at the previous word in memory, and the end at the word before that). However, loads of non-pointers (as is the case in our example) occur so frequently that they cause the valid metadata written on a shadow register “miss” to be quickly evicted, leading to more misses. Examining some disassembled binaries revealed that pointer values are also frequently constructed not from reading memory directly but via the x86 LEA (load effective address) instruction, which despite its name is a purely arithmetic instruction that adds an offset to a register's value, without going to memory at all. Thus, LEA instructions are not seen by our memory-snooping module.
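The snooping example above can be replayed in C using the contents of Table 3. This is a minimal sketch and not the simulator module itself: the names (sample_mem, shadow_entry, snoop_load) are hypothetical, memory is modeled as an array of 32-bit words starting at address 0x1000, and the layout (base one word below the loaded address, end two words below) is the one stated in the text.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_START 0x1000u

/* Contents of Table 3: one 32-bit word per 4-byte address. */
static const uint32_t sample_mem[] = { 0xA, 0xB, 0xC, 0xD, 0xE };

/* Hypothetical shadow register entry. */
typedef struct {
    uint32_t ptr, base, end;
    int valid;
} shadow_entry;

static uint32_t read_word(uint32_t addr) {
    return sample_mem[(addr - MEM_START) / 4];
}

/* Treat a load from addr as a potential wide pointer and fill the
 * shadow entry from the loaded word and the two words before it.
 * Returns 1 on success; returns 0 and invalidates the entry if addr is
 * unaligned or any of the three words falls outside modeled memory. */
int snoop_load(shadow_entry *s, uint32_t addr) {
    uint32_t limit = MEM_START + (uint32_t)sizeof sample_mem;
    if (addr % 4 != 0 || addr < MEM_START + 8 || addr >= limit) {
        s->valid = 0;
        return 0;
    }
    s->ptr = read_word(addr);       /* the loaded value             */
    s->base = read_word(addr - 4);  /* previous word in memory      */
    s->end = read_word(addr - 8);   /* the word before the base     */
    s->valid = 1;
    return 1;
}
```

Calling snoop_load on a zeroed entry with address 0x1008 leaves it holding 0xC, 0xB and 0xA as pointer, base and end. None of these values is a real pointer, which is exactly the eviction problem the text describes: every ordinary load overwrites whatever valid metadata the entry held.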
Our next optimization involved trying to track pointers as they move around the architectural register file, through both register-to-register move instructions (MOV) and simple pointer arithmetic operations (ADD, SUB, INC, DEC), though neither of these yields any tangible improvement. Finally, our shadow registers maintain a “valid/invalid” bit to indicate whether they contain a pointer. This bit is set whenever a miss occurs and valid pointer data is written to the shadow register. The bit is typically cleared when a register-to-register move occurs, as valid pointer data is leaving a register. However, disabling this behavior so that the valid bit can never be cleared yields no improvement in the shadow register file's hit rate.
The current implementation, though simplistic, performs no worse than more sophisticated mechanisms. This is likely due to the fact that capturing all loads as potential pointers casts far too broad a net, while ignoring all loads is too restrictive. Determining what registers contain pointers at runtime is a difficult problem, beyond the scope of the heuristics here explored.
Conclusion and Future Work
The main opportunity for optimizing this system lies in determining what constitutes a pointer reference, so that such references can be tagged even before they cause a miss in the shadow register file. One possibility here is modeling another extension to the x86 ISA: a LOADPTR instruction that identifies the target of a load as being a pointer, as opposed to just a regular word in memory. However, as with the x86's deprecated BOUND instruction, LOADPTR would inevitably specify a memory layout for pointers that compilers would have to adopt. Another area of future work could explore adding hardware support for CCured's FSEQ pointers, seeing in particular if they could leverage the existing shadow registers with a bit of additional state to indicate the validity of the end value.
This system was designed to leverage CCured's security framework and to try to ameliorate CCured's memory and instruction overheads. In that respect, it has something to offer. Our initial attempts at being more proactive about finding pointers, through snooping the memory bus and tracking register file operations, were less successful, which leads us to conclude that the thorny problem of securing the C programming language is best tackled by a combination of smart software and aggressive hardware.
[1] Necula et al. “CCured: Type-Safe Retrofitting of Legacy Software.” ACM Transactions on Programming Languages and Systems (TOPLAS), 2004.
[2] Boehm, H.-J. and M. Weiser. “Garbage collection in an uncooperative environment.” Software – Practice and Experience, 1988.
[3] Austin et al. “Efficient Detection of All Pointer and Array Access Errors.” Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 1994.
[4] Jones, R. W. M. and P. H. J. Kelly. “Backwards-compatible bounds checking for arrays and pointers in C programs.” AADEBUG, 1997.
[5] Steffen, J. L. “Adding run-time checking to the portable C compiler.” Software – Practice and Experience 22, 4 (April) 1992.
[6] Ruwase, Olatunji and Monica Lam. “A Practical Dynamic Buffer Overflow Detector.” Proceedings of the 11th Annual Network and Distributed System Security Symposium, February 2004.
[7] Xu et al. “Architecture Support for Defending Against Buffer Overflow Attacks.” Evaluating and Architecting System dependabilitY (EASY-2) Workshop, October 2002.
[8] Lam, Lap-chung and Tzi-cker Chiueh. “Checking Array Bounds Violations Using Segmentation Hardware.” Proceedings of the 2005 International Conference on Dependable Systems and Networks, June 2005.
[9] Witchel, Emmett, Josh Cates and Krste Asanovic. “Mondrian Memory Protection.” Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), October 2002.
[10] Pincus, Jonathan and Brandon Baker. “Beyond Stack Smashing: Recent Advances in Exploiting Buffer Overruns.” IEEE Security and Privacy, July/Aug 2004.
[11] http://www.spec.org/osg/cpu2000/CINT2000/
[12] Guthaus, Matthew R., et al. “MiBench: A free, commercially representative embedded benchmark suite.” IEEE 4th Annual Workshop on Workload Characterization, December 2001.
[13] http://www.simics.net