
BranchTap: Reducing Branch Misprediction Penalty through Speculation Control

PATRICK AKL

and

ANDREAS MOSHOVOS

Department of Electrical and Computer Engineering

University of Toronto, Canada

Authors’ address: Department of Electrical and Computer Engineering, University of Toronto, Toronto, Canada. Authors’ emails: {pakl, moshovos}@eecg.toronto.edu

Extension of Conference Paper: P. Akl and A. Moshovos, “BranchTap: Improving Performance with Very Few Checkpoints Through Adaptive Speculation Control”, Proceedings of the International Conference on Supercomputing (ICS’06), Cairns, Queensland, Australia, June 2006. There is little overlap (less than 30%) with the conference version and this is limited to the front-end material (before Section 5).

The conference paper focused primarily on comparing previously proposed prediction-based checkpoint methods in order to identify the best performing method. This work omits this exploration and instead focuses on a more detailed exploration of BranchTap. Specifically, this work contains the following new sections and experiments that do not appear in the conference version:

—A formal definition of a class of BranchTap mechanisms as defined by a set of parameters

(Section 3.1).

—A study of how checkpointing performance is affected by the processor decode width in addition to the processor’s window size (Section 5.2).

—A study that demonstrates that focusing solely on reducing recovery cost is insufficient (Section 5.4).

—A study of various sampling and throttling parameters that determine BranchTap’s behavior (Section 5.4).

—A study that illustrates the best adaptation policy per benchmark (Section .

—A study that illustrates that BranchTap is orthogonal to prediction-based checkpointing techniques (Section 5.6.2).

—A detailed study of adaptive and non-adaptive BranchTap performance for processors with different instruction window sizes and decode widths (Section 5.6.3).

Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists requires prior specific permission and/or a fee. © 2006 ACM 1529-3785/2006/0700-0001 $5.00

ACM Transactions on Architecture and Code Optimization, Vol. 2, No. 3, 09 2006, Pages 1–0??.


This work proposes BranchTap, a novel checkpoint-aware speculation strategy that temporarily throttles control flow speculation to reduce recovery cost while allowing speculation to proceed when it is likely to boost performance. BranchTap targets high-performance architectures with limited checkpoint resources. This work differs from previous proposals for control flow speculation control primarily in that it accounts for the cost of mispeculation recovery, and in that it proposes dynamic adaptation. BranchTap is orthogonal to the recently proposed checkpoint prediction and intelligent management techniques. For example, this work demonstrates that for a 1K-entry window processor with a First-In-First-Out (FIFO) buffer of just four checkpoints and a confidence-based checkpoint predictor with 1K entries, BranchTap achieves performance that is within 2.53% of that possible with an infinite number of checkpoints. This represents an improvement of 35.6% over using just prediction-based checkpoint allocation.

Categories and Subject Descriptors: C.1.1 [Processor Architectures]: Single Data Stream

Architectures

General Terms: Performance, Design

Additional Key Words and Phrases: Branch Misprediction, Checkpointing, Speculation Control,

Superscalar Processors, Out-of-Order Execution, Processor State Recovery, Confidence Estimation

1. INTRODUCTION

Modern processors use control flow speculation to improve performance. To preserve correctness, recovery mechanisms restore the machine’s state on mispeculations. Modern processors utilize two such recovery mechanisms. The first is the reorder buffer (ROB), which allows recovery at any instruction including mispeculated branches. Recovering from the ROB amounts to squashing, i.e., reversing the effects of each mispeculated instruction, a process that requires time proportional to the number of squashed instructions. The second recovery mechanism uses a number of global checkpoints (GCs) which are allocated at decode time. A GC contains a complete snapshot of all relevant processor state. Recovery at an instruction with a GC is “instantaneous”, i.e., it requires a fixed, low latency.

Ideally, a GC would be allocated at every instruction such that the recovery latency is always constant. In practice, only a limited number of GCs can be implemented without impacting the clock cycle significantly and thus reducing overall performance (see Section 2.2). For processors with relatively small scheduling windows, using a few GCs is sufficient. For example, the MIPS R10000 had a 32-entry window and used just four GCs which were allocated to branches in-order [Yeager ]. Decoding stalled at a branch while no GCs were available.

Recent work demonstrates that many more GCs are needed for processors with larger scheduling windows [Akkary et al. 2003; 2004; Cristal et al. 2003; Moshovos 2003]. This requirement is at odds with the need for low complexity and low latency of operation. Instead of using more GCs, checkpoint prediction allocates few GCs judiciously to low confidence¹, or weak, branches (i.e., those that are likely to cause a mispeculation) [Akkary et al. 2003; 2004; Moshovos 2003]. In addition, advanced GC management methods further improve GC efficiency [Moshovos 2003]. Section 4 reviews these proposals in more detail along with other related work.

¹Because of the high frequency of branch mispredictions relative to interrupts and exceptions, we only consider control flow instructions as potential causes of machine state recovery.



Even with these advances, at least eight and often 16 GCs are needed to maintain performance within 2% of that possible with an infinite number of checkpoints even with a 256-entry window [Moshovos 2003]. A method that improves GC efficiency further is needed as it would lead to the following three benefits: i) it would reduce overall GC requirements, thus improving scalability for future wider window processors, ii) it would reduce the overall cost, and thus the power, of the GC mechanism, and iii) it would permit embedding those checkpoints within all relevant structures, thus eliminating the need for expensive interconnects for checkpoint content transfers (see Section 2.2).

This work proposes BranchTap, a resource-efficient technique to improve performance with very few GCs (four or fewer) for processors with large scheduling windows (i.e., 512 or 1K instructions). BranchTap complements existing checkpoint prediction methods by reducing the number of instructions that would need to be squashed on mispeculations. BranchTap preemptively stalls the front-end when it is likely that fetching additional instructions would only increase the cycles lost to recovering after a mispeculation. BranchTap tracks the number of in-flight weak branches without a GC and stalls the fetch stage while this number exceeds a threshold. This work demonstrates that a fixed-threshold BranchTap is suboptimal across programs, and often does not lead to any significant improvement, even when the threshold is carefully selected. Accordingly, this work proposes a BranchTap that dynamically adapts this threshold using short sampling intervals. This work presents a sensitivity analysis of BranchTap and examines its performance across several processor configurations to demonstrate its robustness. For example, this work demonstrates that BranchTap reduces average mispeculation recovery cost by 35.6% compared to a state-of-the-art technique that uses just checkpoint prediction, when only four checkpoints are available for a 1K-entry window processor. This work also shows that BranchTap is resource-efficient since complementing existing confidence-table-based checkpoint prediction methods with BranchTap consistently outperforms using a 64 times larger confidence table.

BranchTap builds upon previous work on speculation control for power reduction, e.g., [Grunwald et al. 1998; Manne et al. 1998]. However, as Section 4 explains in more detail, there are three significant differences: i) previous work assumed fixed recovery latencies at all mispeculations and thus ignored their impact on performance, ii) the metric, and iii) the policy used by BranchTap are very different. To the best of our knowledge, this is the first work that proposes: i) combining speculation control with checkpoint prediction, and ii) an adaptive method for adjusting speculation control.

The rest of this paper is organized as follows. Section 2 reviews existing checkpointing alternatives and discusses the underlying performance trade-offs. Section 3 presents BranchTap. Section 4 reviews related work. Section 5 presents the experimental analysis of BranchTap. Section 6 summarizes the key findings of this work.

2. CHECKPOINT/RECOVERY BACKGROUND

This section presents a brief overview of existing checkpoint/recovery mechanisms and a discussion of the underlying performance trade-offs. For clarity but without loss of generality, this discussion focuses on checkpointing for the register alias table (RAT). The same concepts are applicable to other processor structures.

2.1 RAT Checkpoint/Recovery

Register renaming eliminates false register dependencies, thus increasing instruction level parallelism. This work assumes the register renaming implementation used in the MIPS R10000 where architectural registers are dynamically mapped onto a larger set of physical registers [Yeager ]. The Register Alias Table (RAT) maintains the current mapping of architectural to physical registers. Every instruction accesses the RAT during the decode phase to first rename the input operand registers and then to rename the target register to a free physical register. To rename up to W instructions per cycle for an instruction set with two input and one output register operands, 3 × W read ports and W write ports are needed on the RAT. Furthermore, for an N-entry window processor, N physical registers are typically used. Thus, each RAT entry contains lg(N) bits.
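As a rough illustration of the port and storage arithmetic above, the following Python sketch computes the RAT requirements for a decode width W and a window size N. The function name and the default of 64 architectural registers are assumptions made for this example only.

```python
def rat_requirements(decode_width, window_size, arch_regs=64):
    """Port counts and storage for a RAT, following the text's assumptions.

    arch_regs is a hypothetical default; the text only fixes W and N.
    """
    # Ceiling of log2(N): bits needed to name one of N physical registers.
    bits_per_entry = (window_size - 1).bit_length()
    return {
        "read_ports": 3 * decode_width,   # two sources plus one destination lookup
        "write_ports": decode_width,      # one new mapping per renamed instruction
        "bits_per_entry": bits_per_entry,
        "total_rat_bits": arch_regs * bits_per_entry,
    }


print(rat_requirements(decode_width=8, window_size=512))
```

For W = 8 and N = 512 this gives 24 read ports, 8 write ports, and 9 bits per RAT entry.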

A decoded instruction with a destination register following a mispredicted branch corrupts the RAT. After the processor discovers the misprediction, and before it can rename instructions down the correct path, the RAT state needs to be restored to what it was when the mispredicted branch was decoded.

2.2 ROB and GC Checkpoint/Restore

There are two commonly used methods for RAT checkpointing and recovery. The first is the reorder buffer (ROB) and the second uses a set of global checkpoints.

The ROB is a circular buffer. Instructions allocate an ROB entry in program order as they enter decode and release it upon commit. The ROB entry contains sufficient information for reversing the effects of the corresponding instruction on a mispeculation. For RAT recovery it is sufficient to keep the previous mapping for the instruction’s destination register. The ROB is a fine-grain checkpoint mechanism as it allows recovery at any instruction as follows: starting from the most recently decoded instruction, we traverse the ROB in reverse order while writing back the previous mappings into the RAT until we reach the branch that was mispeculated. The number of cycles required for ROB recovery is proportional to the number of instructions being squashed. It is reasonable to assume that up to W instructions can be squashed per cycle if the machine is capable of decoding W instructions per cycle. The ROB facilitates recovery from any exception in addition to branch mispeculations. As Section 5.2.1 shows, processors that rely solely on the ROB for recovery incur a significant performance penalty which is worse for larger windows and narrower superscalar widths. In the rest of this study, we refer to this mechanism as ROB-Only checkpointing.
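The reverse ROB walk described above can be sketched as follows. This is a minimal software model, not the hardware implementation: the `RobEntry` fields, the use of a Python list ordered oldest to newest, and the function name are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    dest_arch_reg: Optional[int]   # None when the instruction has no destination
    prev_phys_reg: Optional[int]   # mapping that was overwritten at rename time


def rob_recover(rat, rob, branch_index, width):
    """Undo every entry younger than rob[branch_index]; return the cycle count.

    rob is ordered oldest to newest; up to `width` entries are undone per cycle.
    """
    squashed = 0
    while len(rob) - 1 > branch_index:
        entry = rob.pop()                        # newest instruction first
        if entry.dest_arch_reg is not None:
            rat[entry.dest_arch_reg] = entry.prev_phys_reg
        squashed += 1
    return -(-squashed // width)                 # ceiling division


# r1 -> p30, branch, r1 -> p31, r2 -> p32, an instruction without a destination
rat = {1: 31, 2: 32}
rob = [RobEntry(1, 10), RobEntry(None, None), RobEntry(1, 30),
       RobEntry(2, 20), RobEntry(None, None)]
cycles = rob_recover(rat, rob, branch_index=1, width=2)
print(cycles, rat)
```

Here three instructions are squashed at two per cycle, so the walk takes two cycles and the RAT returns to the mappings in effect when the branch was decoded.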

The second RAT recovery method uses global checkpoints (GCs) which contain complete RAT content snapshots. Conceptually, as shown in Figure 1(a), the GCs form a queue of complete RAT replicas. As implemented in the R10000 processor, each RAT bit has embedded next to it a small queue (shown in Figure 1(b)). A GC is taken by shifting into the queue a copy of the corresponding RAT bit (all RAT bits are copied in parallel into their own queues). Recovery amounts to copying the contents of one of the queue elements back to the corresponding RAT bit. This copying requires a fixed latency which is independent of the number of squashed instructions. Thus, the higher the number of squashed instructions, the more preferable it is to use GC over ROB recovery. GC is a coarse-grain mechanism as it allows recovery only at some instructions. In the rest of this study, we refer to this mechanism as GC checkpointing.

Fig. 1. GC RAT Checkpointing. (a) Conceptual organization. (b) Actual implementation.

Ideally, a GC would be taken at every instruction. Unfortunately, implementing more GCs impacts RAT power and latency and thus can impact overall performance. If the GCs are embedded inside the RAT next to each RAT bit cell, embedding more GC bits elongates either the wordlines or the bitlines or both. If GCs are implemented separately, additional bit lines are required to communicate all RAT content from or to the GC store. This also elongates the RAT bitlines or wordlines since additional wires will be needed to transfer the RAT content. For example, for a 64-entry RAT and a 512-entry physical register file, 64 × 9 wires are needed to checkpoint the RAT. Elongating the wordlines or the bitlines increases their capacitance, access latency, and power. Larger transistors in the RAT cells may also be needed to maintain stability, further increasing power dissipation. Accordingly, it is desirable to keep the number of GCs as small as possible. This work targets four or fewer GCs for processors with 512- or 1K-entry windows (the MIPS R10000 used four GCs too but it had a window of 32 instructions).

2.3 GC Prediction

Previous work has observed that checkpoint prediction can be used to improve efficiency over naive checkpoint allocation [Moshovos 2003]. Specifically, as originally used in R10000, GCs were allocated to all branches at the decode stage. If no GC was available, decode stalled. GCs were released in-order as the branches were resolved. Previous work showed that many GCs would be needed for larger window processors if this policy was used [Akkary et al. 2003; 2004; Moshovos 2003]. Accordingly, rather than allocating GCs to all branches, previous work suggested using a confidence mechanism to allocate GCs only for low confidence (weak) branches. [Moshovos 2003] used anyweak, a simple confidence mechanism relying on the bias of existing combined branch predictors, while [Akkary et al. 2003; 2004] used the dedicated confidence estimator proposed in [Jacobsen et al. 1996].



Fig. 2. Checkpoint recovery cost when not all branches have a GC and a ROB is available. (1) First, we find the next branch with a GC, if any. (2) Then we recover at that branch using the GC. (3) Finally, we use the ROB to roll back to the mispredicted branch.

2.4 Recovering using a GC

When GCs are allocated only to some branches there are two possible recovery scenarios. In the first “direct recovery” scenario, the mispeculation occurs at a branch with a GC. In this case recovery latency is fixed and independent of the number of squashed instructions. In the second “indirect recovery” scenario, the mispeculation occurs at a branch without a GC. In this case there are two possible recovery policies. Figure 2 shows the first policy, which requires a ROB and where recovery proceeds in two phases [Moshovos 2003]. First, the closest subsequent GC, if any exists, is used to partially recover the machine state at that instruction, and then the ROB is used to complete the recovery. In this case, the recovery latency is proportional to the number of instructions in-between the mispeculated branch and the closest subsequent branch with a GC (or the end of the ROB if no such branch exists). In the second policy, the closest preceding branch with a GC is located and used to restore machine state [Akkary et al. 2003; 2004]. Instructions following the GC and up to the mispeculated branch are re-executed². The advantage of this method is that it can be used without an ROB. However, it requires the re-execution of correct path instructions. This work focuses on the first recovery model. However, BranchTap could also be used with the second recovery model. In this case, BranchTap could help when a branch that would cause a mispeculation is delayed until a GC becomes available. An investigation of this aspect is beyond the scope of this work.

2.5 Performance Trade-offs with GC Prediction

When the mispredicted branch was not allocated a GC, the recovery cost can be expressed as the sum of the number of cycles required to perform the following three tasks in sequence:

(1) Time to locate the next GC.

(2) Time to recover partially at the next GC, if any is found.

²The closest earlier checkpoint can be detected early at decode time for all instructions if in-order checkpoint allocation and release is used as in [Akkary et al. 2003; 2004].



(3) Time to complete the machine state recovery using the ROB, starting at the next GC if one was found, or at the newest instruction in the ROB.
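The three-term cost model above can be sketched in a few lines of Python. The one-cycle defaults for locating and restoring a GC are placeholders rather than measured values, and counting only the entries strictly between the two branches is an assumption based on the description in Section 2.4.

```python
def recovery_cycles(has_gc, mispredicted, width,
                    gc_locate_cycles=1, gc_restore_cycles=1):
    """Estimate recovery cycles for a mispredicted branch.

    has_gc[i] is True when ROB entry i (oldest first) owns a GC.
    The locate/restore latencies are hypothetical parameters.
    """
    if has_gc[mispredicted]:
        return gc_restore_cycles                 # direct recovery: fixed latency

    # (1) Locate the closest subsequent entry that owns a GC, if any.
    next_gc = next((i for i in range(mispredicted + 1, len(has_gc)) if has_gc[i]),
                   None)
    if next_gc is not None:
        walked = next_gc - mispredicted - 1      # entries between the two branches
        partial = gc_restore_cycles              # (2) partial recovery at that GC
    else:
        walked = len(has_gc) - 1 - mispredicted  # walk back from the newest entry
        partial = 0
    # (3) Finish with the ROB, undoing up to `width` entries per cycle.
    return gc_locate_cycles + partial + -(-walked // width)


rob_gcs = [False] * 10
print(recovery_cycles(rob_gcs, mispredicted=2, width=2))   # no GC after the branch
rob_gcs[6] = True
print(recovery_cycles(rob_gcs, mispredicted=2, width=2))   # GC at entry 6
```

The second call is cheaper because the ROB walk covers three entries instead of seven, which is the effect both checkpoint prediction and BranchTap try to exploit.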

Checkpoint prediction can improve performance by eliminating as many indirect recoveries as possible. However, as the instruction window size increases and as the number of GCs is latency limited, this task becomes increasingly difficult. Alternatively, checkpoint prediction can attempt to allocate GCs in a manner that minimizes the number of instructions between a mispeculated branch without a GC and a subsequent branch with a GC. BranchTap achieves the same effect by controlling the rate at which instructions are introduced into the pipeline. As Section 5 demonstrates, using a larger and thus more accurate confidence estimator does improve performance. However, BranchTap offers a significantly better performance vs. cost tradeoff. BranchTap requires few counters and comparators whose cost is negligible. For most programs, at least an additional 124Kbits are needed for the confidence table to provide similar benefits.

3. BRANCHTAP

In developing BranchTap, we observe that instead of just trying to allocate GCs more effectively, we can also try to minimize the number of instructions that have to be recovered from the ROB when it is not possible to allocate a GC. This is the last factor in the aforementioned recovery cost model. BranchTap achieves this goal by temporarily stalling the fetch stage while the number of preceding unresolved weak branches that do not have a GC exceeds a threshold WT. The insight behind this approach is that when the aforementioned condition holds true, the delayed instructions follow a relatively long sequence of in-flight, unresolved weak branches. The probability that these instructions will not be squashed is thus relatively small and becomes smaller the more weak branches are in-flight. If we were to allow more instructions to proceed, we would most likely increase the number of instructions that will have to be squashed. Since all GCs are currently allocated to earlier branches, these instructions will have to be squashed from the ROB and hence recovery latency will only become longer.
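The stall condition can be modeled with a single counter and a comparison. The sketch below is a simplified illustration: the class and method names are hypothetical, and it omits the counter adjustment that a real pipeline would need when squashed branches leave the window.

```python
class BranchTapThrottle:
    """Tracks in-flight weak branches without a GC and decides when to stall fetch."""

    def __init__(self, threshold):
        self.threshold = threshold     # WT
        self.weak_without_gc = 0       # unresolved weak branches lacking a GC

    def on_branch_decoded(self, is_weak, got_gc):
        if is_weak and not got_gc:
            self.weak_without_gc += 1

    def on_branch_resolved(self, is_weak, had_gc):
        if is_weak and not had_gc:
            self.weak_without_gc -= 1

    def should_stall_fetch(self):
        # Stall only while the count exceeds WT.
        return self.weak_without_gc > self.threshold


tap = BranchTapThrottle(threshold=2)
for _ in range(3):
    tap.on_branch_decoded(is_weak=True, got_gc=False)
print(tap.should_stall_fetch())        # three weak branches without a GC, WT = 2
tap.on_branch_resolved(is_weak=True, had_gc=False)
print(tap.should_stall_fetch())        # back at the threshold, fetch resumes
```

Weak branches that did receive a GC do not count, since recovering at them has a fixed latency.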

However, focusing solely on recovery cost is simplistic. Sometimes, executing wrong path instructions could be beneficial because of instruction or data prefetching effects. Accordingly, a successful speculation control method has to account both for mispeculation recovery cost and for mispeculation performance side-effect opportunity loss. BranchTap meets both these requirements.

This work considers two BranchTap alternatives, non-adaptive and adaptive. In the non-adaptive BranchTap, a fixed, predetermined threshold is used to throttle the front-end. Section 5.5 demonstrates that this is suboptimal and sometimes worse than no speculation control. Adaptive BranchTap dynamically adjusts its threshold WT. This allows it to adapt across and within applications.

Figure 3 illustrates how BranchTap is integrated into the pipeline and the sampling process it uses. We have experimented with many different adaptive policies. The policy that performed well across all benchmarks works in two repeating phases. During the first phase, execution proceeds for a relatively long time of ε cycles with the current threshold WTcurrent. During the second phase, sampling is used to determine whether the threshold should change. The second phase consists of three relatively small sampling sub-phases of ρ cycles each (where ρ << ε) where we count the number of committed instructions for three possible threshold values: WTcurrent, WTcurrent-δ and WTcurrent+δ. Using the number of committed instructions to guide adaptation decisions allows BranchTap to account for both recovery cost and mispeculation performance side-effect opportunity loss since it is a direct measure of performance.

Fig. 3. (a) How BranchTap is integrated into the pipeline (threshold = WT). (b) Threshold adaptation policy.

The three sampling sub-phases determine the direction in which the threshold should change, i.e., whether WT should i) increase, ii) remain unchanged, or iii) decrease, based on which sample resulted in the highest number of committed instructions. The exact amount by which the threshold changes can differ from the amount δ used during the sampling sub-phases. Specifically, the threshold is adjusted to WTcurrent+ω, WTcurrent, or WTcurrent-ω. As we explain subsequently, the use of different δ and ω allows BranchTap to improve the accuracy of the sampling information while avoiding performance losses due to rapid adaptation.
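One adaptation cycle of this policy can be sketched as below. The function names, the clamp at a minimum threshold of zero, and the tie-breaking rules (keep the current threshold unless a sample is strictly better, and prefer the higher threshold when both alternatives tie) are assumptions for illustration and are not specified in the text.

```python
def adaptation_cycle_schedule(wt_current, delta, epsilon, rho, wt_min=0):
    """Return the (threshold, cycles) phases of one adaptation cycle."""
    return [
        (wt_current, epsilon),                       # long steady phase
        (wt_current, rho),                           # sample at WTcurrent
        (max(wt_min, wt_current - delta), rho),      # sample at WTcurrent - delta
        (wt_current + delta, rho),                   # sample at WTcurrent + delta
    ]


def next_threshold(wt_current, committed, omega, wt_min=0):
    """Pick the next WT from committed-instruction counts (same, minus, plus)."""
    same, minus, plus = committed
    if plus > same and plus >= minus:
        return wt_current + omega
    if minus > same and minus > plus:
        return max(wt_min, wt_current - omega)
    return wt_current


print(adaptation_cycle_schedule(8, delta=2, epsilon=925_000, rho=25_000))
print(next_threshold(8, committed=(1000, 900, 1200), omega=1))
```

Sampling with a step of δ but moving by ω keeps the measurement and the adjustment independent, which is the separation the paragraph above describes.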

BranchTap requires very few resources (a few counters and comparators, and the ability to temporarily stall the decode stage). Moreover, as Section 5 shows, BranchTap provides a significantly better cost vs. performance tradeoff than alternative ways to improve performance.

3.1 Performance Tradeoffs with BranchTap

BranchTap relies on sampling to predict the performance effects of throttling control flow speculation. There are three key factors that determine BranchTap’s success: i) sampling precision, ii) adaptation speed, and iii) adaptation precision.

(1) Sampling Precision: The sampling periods ρ produce the input information that BranchTap uses to determine how to adapt. To improve the accuracy of the collected information, relatively long sampling phases are desirable. However, since different thresholds are applied during these sampling phases, performance is affected and thus shorter sampling phases are desirable. We define the test time ratio ψ = 2ρ/(ε + 3ρ) as the portion of time in which we are affecting execution with sampling (the total duration of an adaptation cycle is ε + 3ρ). A good value for ψ would keep a balance between the two aforementioned considerations. We found experimentally that ψ = 5% worked well. Another factor which affects the sampling precision is δ. A small δ might yield noisy measurements, and hence might not give a good indication of the direction in which to change WT, while a large δ might overshoot the optimal threshold and sample at a threshold which yields lower performance than the current threshold, thus failing to indicate that WT should change to maximize performance.

(2) Adaptation Speed and Precision: BranchTap adjusts its threshold progressively. Adapting quickly to the optimal threshold is desirable for programs that exhibit different phases. On the other hand, adapting precisely is desirable as some programs are very sensitive to rapid threshold changes (see Section 5.5). Adapting quickly is possible via very short adaptation phases (of total length ε + 3ρ), or by using a large ω, or both. However, adapting precisely requires a small ω. For these reasons, we chose an adaptation phase of 1M cycles (ε + 3ρ) and hence the length of each sampling period is ρ = 25K cycles since we use ψ = 5%.

We determined that combining a small ω (for precise adaptation) with a medium δ (to avoid noisy measurements with small δ and to avoid overshooting the optimal threshold at sampling with large δ) generally works best.
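The relationship between the adaptation cycle length, ψ, ρ and ε can be checked with a short calculation. The sketch below solves ψ = 2ρ/(ε + 3ρ) for ρ and ε given the total cycle length; the function name is hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def sampling_parameters(adaptation_cycle, psi):
    """Solve psi = 2*rho / (epsilon + 3*rho) with epsilon + 3*rho = adaptation_cycle."""
    rho = psi * adaptation_cycle / 2
    epsilon = adaptation_cycle - 3 * rho
    return int(epsilon), int(rho)


epsilon, rho = sampling_parameters(1_000_000, Fraction(5, 100))
print(epsilon, rho)
```

With a 1M-cycle adaptation phase and ψ = 5%, this yields ρ = 25,000 cycles, matching the value quoted above, and leaves ε = 925,000 cycles for the steady phase.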

4. RELATED WORK

[Aragon et al. 2002] analyze the causes of performance loss due to branch mispredictions and find that the pipeline-fill penalty, which consists of the cycles lost between the time when the misprediction is discovered and the time when the first instruction from the correct path is renamed, contributes significantly to the overall performance lost due to mispredictions. As explained in Section 2.1, this is mainly caused by the processor state recovery latency. BranchTap reduces the pipeline-fill penalty.

BranchTap is orthogonal to checkpoint prediction methods [Akkary et al. 2003; 2004; Moshovos 2003]. Section 5.2.3 validates this observation experimentally. BranchTap can be used with an ROB as in [Moshovos 2003] or without one³ as in [Akkary et al. 2003; 2004]. In this study we focus only on using BranchTap with an ROB.

BranchTap builds upon previous work on speculation control for power reduction [Grunwald et al. 1998; Manne et al. 1998]. There are three key differences between BranchTap and this previous work. First, previous work assumed a fixed recovery cost from mispeculations. Accordingly, the performance trade-offs we are interested in were not accounted for in previous work, where speculating more could never hurt performance due to an increased recovery cost. Second, previous work relied on counting the number of unresolved, weak branches currently in flight. We instead count only those branches that do not have a GC since we combine speculation control with checkpointing. Finally, the most important difference is that previous work used a fixed threshold. We demonstrate that using a fixed threshold is suboptimal across programs and that the performance differences can be significant.

BranchTap relies on a confidence estimator for identifying weak branches. This work focuses on using the confidence estimator proposed in [Jacobsen et al. 1996]. Jimenez and Lin studied composite branch confidence estimators that are more accurate but require more resources [Jimenez and Lin ]. The Anyweak estimator is less accurate but requires virtually no resources [Moshovos 2003]. Section 5.6.1 shows that BranchTap is orthogonal to the choice of the confidence estimator and thus it can be used to boost performance as needed with little additional cost.

³In this case, BranchTap could improve performance if it delays a branch long enough for a GC to become available. Different performance tradeoffs apply in this case.

Another approach to reducing the cost of mispeculations is to execute multiple control flow paths at hard to predict branches using predication techniques, e.g., [August et al. 1997] or dynamically, e.g., [Wallace et al. 1998; Heil and Smith 1996; Tyson et al. ]. A simpler approach consists of only fetching and decoding instructions from the non-predicted path, while executing instructions from the predicted path normally [Aragon et al. 2002]. Also, control independence can be exploited to avoid complete re-execution of some instructions that are squashed [Cher and Vijaykumar 2001; Chou et al. 1999; Gandhi et al. 2004; Rotenberg et al. 1999; Sodani and Sohi 1997]. These approaches are orthogonal to BranchTap and require additional support.

Modern checkpoint/recovery mechanisms have evolved out of earlier proposals for supporting speculative execution [Hwu and Patt 1987; Smith and Pleszkun 1988; Sohi 1990].

Other proposals for improving scalability for wider window processors target early reclamation of processor resources [Cristal et al. 2003; Martínez et al. 2002]. In this work, no instructions can be renamed while a RAT recovery is in progress. A more aggressive method where some of the instructions down the correct control path may be renamed while RAT recovery is still in progress is proposed in [Zhou et al. 2005]. This method requires several changes in the RAT. It may be possible to use BranchTap with these techniques. However, this investigation is beyond the scope of this paper.

5. EVALUATION

Section 5.1 details our experimental methodology. Section 5.2 establishes the need for improving performance with very few GCs. Section 5.3 demonstrates that once checkpoint prediction is used most of the cycles lost to recovery are caused by branches without a GC. This result motivates BranchTap, which reduces the number of cycles lost to recovery. Section 5.4 demonstrates that focusing solely on reducing the number of recovery cycles is simplistic as speculation sometimes indirectly improves performance. Section 5.5 shows that different programs require different speculation control thresholds (WT) to perform best, and that fixed threshold speculation control is not robust. This result motivates the use of adaptive speculation control in BranchTap. Finally, Section 5.6 studies adaptive BranchTap, demonstrating that it is a robust performance enhancing technique and that it is orthogonal to existing checkpoint prediction-based methods.

5.1 Methodology

We used Simplescalar v3.0 [Burger and Austin ] to simulate the processor detailed in Table I. We compiled the SPEC CPU 2000 benchmarks for the Alpha 21264 architecture and for Digital Unix V4.0F using HP’s compilers and the SPEC suggested default flags for peak optimization. All benchmarks were run using a reference input data set. It was not possible to simulate some of the benchmarks due to insufficient memory resources. The following SPEC CPU 2000 benchmarks are included in our experiments: ammp, applu, apsi, art, bzip2, crafty, eon, equake, facerec, fma3d, galgel, gap, gcc, gzip, lucas, mcf, mesa, mgrid, parser, swim, twolf, vortex, vpr and wupwise. To obtain reasonable simulation times, samples were taken for one billion committed instructions per benchmark. We first skipped 100 billion committed instructions prior to collecting measurements for all benchmarks except for art and parser, for which we only skipped 20 billion instructions.

Table I. Base Processor Configuration

Branch Predictor: 8K-entry GShare and 8K-entry bi-modal; 8K Selector; 2 branches per cycle
Fetch Unit: Up to 4 or 8 instr. per cycle; 64-entry Fetch Buffer; Non-blocking I-Cache
Issue/Decode/Commit: any 8-instr./cycle
Scheduler: 512- or 1K-entry/half size LSQ
FU Latencies: Default simplescalar values
Main Memory: Infinite, 200 cycles
L1D/L1I Geometry: 64KBytes, 4-way set-associative with 64-byte blocks
UL2 Geometry: 1MByte, 8-way set-associative with 64-byte blocks
L1D/L1I/L2 Latencies: 3/3/16 Cycles
Cache Replacement: LRU
Fetch/Decode/Commit Latencies: 4 cycles + cache latency for fetch

5.1.1 Base Checkpoint Prediction Method. BranchTap can be used with or without an underlying checkpoint prediction-based method. In most experiments of this section, we use a state-of-the-art checkpoint prediction-based method. We determined the best performing checkpoint prediction-based method experimentally. In the interest of space, we omit this experimental analysis and simply summarize the results. The interested reader can consult [Akl and Moshovos 2006].

(1) Checkpoint prediction: A confidence estimator identifies hard-to-predictbranches. The confidence table consists of 1K, 4-bit resetting counters (e.g.,the equivalent to 16 bits of history per context [Jacobsen et al. 1996]).

(2) Checkpoint allocation: Checkpoint allocation occurs at the decode stage.Only low confidence branches are allowed to use GCs. If a low confidencebranch is decoded and no free checkpoint is available, decode does not stall.

(3) Checkpoint release: A checkpoint is released whenever all previous brancheshave been resolved.

(4) Stages stalled during recovery: After a mispredicted branch is resolved,the fetch stage is redirected in a single cycle. The rest of the front-end stallsuntil RAT recovery completes.

(5) Checkpoint management: Checkpoints are allocated and de-allocated in-order. In previous work, we found that using out-of-order checkpoint allocationand release offers small benefits when there are very few GCs [Akl and Moshovos2006].
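The checkpoint prediction and allocation rules in items (1) and (2) can be illustrated with a minimal sketch. The table size and counter width follow the text; the indexing function (low-order PC bits) and the rule that only a saturated counter counts as high confidence are assumptions made for illustration, and all names are hypothetical rather than taken from the simulator.

```python
class ResettingConfidenceEstimator:
    """Table of counters that count correct predictions and reset on a misprediction."""

    def __init__(self, entries=1024, bits=4):
        self.entries = entries
        self.max_count = (1 << bits) - 1  # 15 for 4-bit counters
        self.table = [0] * entries

    def _index(self, pc):
        # Assumption: index with the low-order bits of the (word-aligned) branch PC.
        return (pc >> 2) % self.entries

    def is_low_confidence(self, pc):
        # Assumption: a branch is high confidence only once its counter saturates.
        return self.table[self._index(pc)] < self.max_count

    def update(self, pc, prediction_correct):
        i = self._index(pc)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0  # resetting counter: one misprediction clears it


def should_allocate_checkpoint(estimator, pc, free_checkpoints):
    """Only low confidence branches may take a GC; decode never stalls waiting for one."""
    return estimator.is_low_confidence(pc) and free_checkpoints > 0
```

When `should_allocate_checkpoint` returns False for a branch that is later mispredicted, recovery has to proceed indirectly, which is the case the rest of this section focuses on.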

5.1.2 Performance Metrics: Average and Maximum Performance Deterioration. Unless otherwise noted, all performance results are normalized over the performance of an identical configuration which uses an infinite number of checkpoints which are allocated at all branches (INF-CHK). Moreover, this base configuration uses unrestricted speculation as it never throttles control flow speculation. Ignoring speculation side-effects, INF-CHK represents an upper bound on the performance possible with BranchTap.

Whenever possible we present per benchmark performance results. However, in the interest of space we often present two summary performance metrics: average performance deterioration and maximum deterioration over INF-CHK. Maximum performance deterioration is important for two reasons: i) poor performance on few benchmarks may not noticeably affect average performance over all benchmarks, and ii) we are interested in improving performance in a robust manner. It is important to avoid corner cases where the performance of few benchmarks is affected severely just by poor checkpointing performance.
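The following sketch shows one way the two summary metrics could be computed. It assumes that deterioration is the relative IPC loss with respect to INF-CHK and that the average is an arithmetic mean over benchmarks; neither formula is spelled out in this section, so both are assumptions, and the IPC values in the usage example are hypothetical.

```python
def deterioration_pct(ipc, ipc_inf_chk):
    """Relative performance loss versus INF-CHK, in percent (assumed definition)."""
    return 100.0 * (ipc_inf_chk - ipc) / ipc_inf_chk


def summarize(ipc, ipc_inf_chk):
    """Return per-benchmark, average, and maximum deterioration."""
    per_bench = {b: deterioration_pct(ipc[b], ipc_inf_chk[b]) for b in ipc}
    worst = max(per_bench, key=per_bench.get)
    return {
        "per_benchmark": per_bench,
        # Assumption: arithmetic mean; the averaging used for the reported results may differ.
        "average": sum(per_bench.values()) / len(per_bench),
        "worst_benchmark": worst,
        "maximum": per_bench[worst],
    }


# Hypothetical IPC values: one benchmark loses far more than the others.
summary = summarize(
    {"a": 1.98, "b": 0.99, "c": 1.20},
    {"a": 2.00, "b": 1.00, "c": 1.50},
)
```

In this made-up example the average hides most of the loss suffered by benchmark `c`, which is why the maximum is reported alongside it.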

5.2 Existing Checkpointing Alternatives

This section validates that ROB-Only checkpointing is not a viable alternative for future high performance processors, and that GC-based checkpointing requires a high number of checkpoints to perform well.

5.2.1 ROB-Only Checkpointing. Figure 4 reports the average and the maximum performance deterioration of ROB-Only with respect to INF-CHK as a function of scheduler window size (128-entry to 1K-entry windows) and instruction width (four and eight). Performance deterioration increases with instruction window size. This is because more wrong path instructions following a mispredicted branch make the processor state recovery longer. Performance deterioration is higher for smaller width processors. This is best understood by the additional results shown in Figure 5 which represent the average decode width utilization as a function of window size and decode width. Higher decode width processors utilize a smaller portion of the available decode width. Since the recovery speed increases proportionally with the decode width as explained in Section 2.1, this makes the recovery-speed to pipeline-filling-speed ratio smaller for smaller width processors.
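As a rough illustration of why ROB-Only recovery time grows with the window size and shrinks with the decode width, the sketch below assumes that recovery undoes one decode-width group of wrong path instructions per cycle. This simple cost model and the function name are assumptions for illustration only, not the simulator's recovery model.

```python
import math


def rob_only_recovery_cycles(wrong_path_instructions, decode_width):
    """Cycles to undo the wrong path, assuming decode_width instructions per cycle."""
    return math.ceil(wrong_path_instructions / decode_width)


# A full 1K-entry window of wrong path instructions on a 4-wide machine
# takes twice as long to recover as on an 8-wide machine under this model.
cycles_w4 = rob_only_recovery_cycles(1024, 4)
cycles_w8 = rob_only_recovery_cycles(1024, 8)
```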

Per benchmark behavior varies significantly. While this is not shown on the graph, swim, mgrid, applu, and lucas performed well even with ROB-Only checkpointing, across all configurations studied. These programs exhibit a relatively low rate of branches (e.g., 0.33% for applu). We do not omit these programs from the rest of this evaluation; however, none of the techniques we discuss can improve their performance. ROB-Only is inadequate for some programs as indicated by the worst case deterioration measurements (e.g., 33% performance loss on a 1K-entry window processor with a decode width of four). These results validate that ROB-Only recovery is not a viable alternative for wide-window processors.

5.2.2 GC and ROB Recovery. Since ROB-Only checkpointing is inadequate, GCs can be used to speed up recovery on mispeculated branches. We study the performance of the state-of-the-art GC-based checkpointing mechanism which we described in Section 5.1.1 and where checkpoint prediction guides GC allocation. Figure 6 shows the average performance deterioration relative to INF-CHK as a function of the number of GCs used (X-axis), the decode width (different curves), and the instruction window size (different graphs). Performance deterioration decreases with more GCs. Comparing the results with ROB-Only performance, even using a single GC reduces performance deterioration by more than 50%. With four checkpoints, average performance loss is 2% and 4% for 512- and 1K-entry window processors respectively. Performance does not saturate to INF-CHK even with 64 checkpoints. This is because a GC can only be allocated at a low confidence branch, and hence a mispredicted high confidence branch always leads to an indirect, slower recovery. Performance saturates to INF-CHK when no checkpoint prediction is used, or when out-of-order checkpoint allocation/release and checkpoint stealing are used [Moshovos 2003]. In the interest of space and because we focus on very few checkpoints in the rest of the evaluation we do not present these results. The interested reader can consult [Akl and Moshovos 2006].

Fig. 4. ROB-Only Recovery: Average and maximum performance deterioration as a function of the scheduler window size (X-axis) and the decode width (different curves). Lower is better.

Fig. 5. Average decode width utilization as a function of the scheduler window size (X-axis) and the decode width (curves) for ROB-Only recovery.

Figure 7 shows the per-benchmark performance deterioration as a function of the number of GCs and the instruction window size (different bars), for a processor with a decode width of four. Behavior varies significantly across benchmarks. In swim, mgrid, applu, and lucas, branch mispeculation recoveries are infrequent and do not impact performance noticeably. Galgel suffers significantly even with 64 checkpoints (3.3% and 7.4% respectively for 512- and 1K-entry window processors). Performance is almost completely insensitive to the number of GCs available for this benchmark. This is caused by expensive high confidence branch mispredictions. For this benchmark, it is better to allocate checkpoints directly to branches without considering the confidence information. Facerec exhibits similar behavior. For most benchmarks, however, performance improves with the number of GCs. Additional GCs either get assigned to mispeculated branches or help to reduce the cost of indirect recoveries. With a decode width of eight, the trends observed are the same; however, the absolute performance deteriorations are slightly smaller for most benchmarks.

Fig. 6. Average performance deterioration for GC-based checkpointing as a function of the decode width (different curves) and the number of available GCs and for instruction window sizes of 512 (left) and 1024 (right). Lower is better.

Fig. 7. Per benchmark performance deterioration for GC-based checkpointing relative to INF-CHK as a function of the number of GCs and the instruction window size (512- and 1K-entry on different bars) for a processor with a decode width of four. Lower is better.

Overall, eight checkpoints were enough to keep the average performance loss within 1% and 2% respectively for processors with 512- and 1K-entry window sizes. However, some benchmarks still suffered significantly even with eight checkpoints (e.g., twolf which suffers respectively from 5.1% and 10.1% deterioration). This result motivates the need for further improvement.

Fig. 8. Average performance deterioration relative to INF-CHK as a function of the confidence table number of entries, the processor width (different curves), and for a 512- (left graph) and 1K-entry (right graph) window processor with 4 GCs. Lower is better.

5.2.3 Improving Checkpoint Allocation. It is possible to improve checkpoint allocation, and hence reduce the performance loss by using a more accurate confidence estimator. Figure 8 shows that using larger confidence tables improves performance. However, even small performance improvements require significantly larger confidence tables. As Section 5.6 shows, using BranchTap with a 1K-entry table provides better performance than using a 64K-entry table alone.

5.3 Most of Recovery Cost is the Result of Indirect Recoveries

To motivate BranchTap, we first demonstrate that with a checkpoint prediction mechanism in place most mispeculations occur on branches that have a GC but most of the performance loss is caused by branches that do not have a GC. Figure 9 shows the percentage of mispredictions that lead to indirect recoveries (left bars), as well as the percentage of the total recovery cycles that are the result of indirect recoveries (right bars). The left bars show that for most benchmarks, most mispeculated branches do have a GC as the percentage of indirect recoveries is small. The right bars show that many and often most of the cycles lost on recovering from a mispeculation are caused by indirect recoveries. This result demonstrates that there is significant potential for improving performance by reducing recovery cost with BranchTap.

5.4 Minimizing Recovery Cost does not Maximize Performance

Section 3 discussed how recovery latency reduction is not a sufficient condition to improve performance. Wrong path instructions interfere with the correct path instructions and contend for memory and processor resources and hence can reduce performance. Wrong path instructions can also improve performance whenever they have data or instruction prefetching effects. In order to demonstrate that a speculation control method has to account for both performance effects and that it cannot focus solely on reducing recovery cost, we model a hypothetical exact throttling mechanism that stalls fetching instructions as soon as a mispredicted branch is decoded, and until that branch is resolved. This effectively reduces the number of wrong path instructions. This configuration assumes exact confidence prediction, which is presently impossible. In this experiment an infinite number of checkpoints is available. Thus, exact throttling represents the performance possible with a mechanism that minimizes recovery cost.

Fig. 9. Per-benchmark and average percentage of indirect recoveries (left bars) and percentage contribution of indirect recovery latency to total recovery cost (right bars) for a 512-entry window processor with four GCs and with a decode width of four.

Fig. 10. Per-benchmark and average percentage performance deterioration relative to INF-CHK when exact throttling is applied for a 512-entry window processor with four GCs and a decode width of four. Lower is better.

Figure 10 shows the per-benchmark and average performance deterioration of exact throttling relative to INF-CHK for 512- (left bars) and 1K-entry (right bars) window processors with a decode width of four. Ideally, exact throttling would perform at least as well as INF-CHK (performance deterioration of 0%) or better (negative performance deterioration) because of reduced resource contention and the elimination of recovery cycles. As the figure shows this is the case for many benchmarks. However, for several benchmarks (vpr, art, mcf, and bzip2) exact throttling performs significantly worse compared to INF-CHK. This result demonstrates that focusing solely on recovery cost reduction is simplistic and does not lead to best performance. Section 5.6 shows that BranchTap successfully accounts for both recovery cost and speculation side-effect opportunity loss.

5.5 Non-Adaptive BranchTap: Fixed-Threshold Speculation Control

This section demonstrates that the performance of the non-adaptive BranchTap is suboptimal across programs. We determined the best fixed threshold value for each benchmark experimentally by running each benchmark several times, each time with a different threshold including a run with no threshold (i.e., unrestricted speculation, or U Spec). We add speculation control on top of ROB-Only and GC-based checkpointing configurations. In the interest of space, we focus on a 512-entry window processor with a decode width of four and four GCs. Section 5.6.3 summarizes results for the other processor configurations. Table II shows the per-benchmark deterioration relative to INF-CHK as a function of the threshold WT (listed along the header row). The best performing fixed threshold policy per benchmark is highlighted in boldface and the corresponding best threshold is also listed under the BEST WT column for clarity. These results demonstrate that different programs prefer different thresholds and that the differences in threshold values are significant across programs. The second-to-last row shows the average performance deterioration across all benchmarks for each threshold value. The threshold which worked best on average (WT=32) led to an average performance deterioration of 2.30%, which is marginally better than the base performance of unrestricted speculation which results in a 2.39% performance deterioration.

Individual benchmarks behave quite differently under different thresholds. There are three classes of programs. First there are threshold insensitive programs such as swim, mgrid, applu and lucas. These programs exhibit few mispeculations whose overall performance impact is negligible. The second class of programs, exemplified by mcf, prefers unrestricted speculation. While unrestricted speculation increases the number of recovery cycles (this result is not shown), this cost is amortized by the significant performance benefits that result from the prefetching side-effects of mispeculated instructions. Finally, there are programs that prefer a specific threshold value. For example, twolf prefers limited speculation (WT = 4). A smaller threshold (WT < 4) throttles correct path speculation too often, while a larger threshold (WT > 4) causes excessively long recoveries. Other benchmarks, such as mesa and eon, perform better with very tight speculation throttling (WT < 4). Restricted speculation significantly decreases the total cycles lost to recoveries, and reduces resource contention by eliminating many wrong path instructions.

The last row of Table II reports average performance deterioration with an oracle non-adaptive BranchTap that uses a different threshold per benchmark. Specifically, this method uses the best fixed threshold per benchmark. This result demonstrates that if it were possible to dynamically adapt the threshold to suit the requirements of every program, then the performance benefits would be significant.



Table II. Fixed Threshold policy performance deterioration relative to INF-CHK. Lower is better.

                                          Fixed Threshold (WT)
Bench.      Best WT            1      2      4      6      8     16     32  Unrestricted
gzip        2               1.06   0.54   0.71   0.93   1.12   1.45   1.51          1.52
wupwise     1               1.57   1.81   1.89   1.87   1.90   1.87   1.87          1.87
swim        8               0.00   0.00   0.00   0.00   0.00   0.01   0.01          0.01
mgrid       2               0.01   0.01   0.01   0.01   0.01   0.01   0.01          0.01
applu       2               0.04   0.04   0.05   0.05   0.05   0.05   0.05          0.05
vpr         4               4.71   3.44   3.14   3.43   3.87   5.37   6.36          6.41
gcc         4               3.55   2.71   2.28   2.30   2.39   2.32   2.44          2.53
mesa        1               0.51   0.79   1.31   1.57   1.67   1.77   1.78          1.78
galgel      1               3.16   3.28   3.28   3.28   3.28   3.28   3.28          3.28
art         Unrestricted    4.19   3.25   0.43   0.35   0.34   0.34   0.34          0.34
mcf         32             16.24  12.61   8.78   6.47   5.03   2.60   1.65          1.90
equake      Unrestricted    3.48   2.99   3.00   2.55   2.64   2.61   2.59          2.52
crafty      2               1.30   0.47   0.86   1.44   1.82   2.36   2.60          2.64
facerec     1               1.61   1.72   1.62   1.67   1.67   1.63   1.68          1.64
ammp        2               2.31   2.04   2.06   2.21   2.36   2.42   2.47          2.46
lucas       F               0.01   0.00   0.00   0.00   0.00   0.00   0.00          0.00
fma3d       1              -0.55   0.02   0.17   1.78   0.59   0.74   0.89          0.89
parser      2               1.55   1.13   1.27   1.70   2.15   2.85   3.11          3.24
eon         1               0.12   0.36   1.09   0.54   0.86   1.13   1.22          1.22
gap         2               1.57   1.23   1.33   1.58   1.82   2.18   2.22          2.23
vortex      4               3.83   2.42   1.84   1.97   2.19   2.72   2.83          2.83
bzip2       32              6.23   5.00   4.04   3.67   3.44   2.89   2.64          2.64
twolf       4               4.17   2.96   2.00   2.01   2.50   4.74   6.17          6.45
apsi        2               0.26  -0.07   0.08   0.22   0.26   0.14   0.31          0.34
AVG         32              6.00   4.58   3.35   2.84   2.58   2.32   2.30          2.39
Oracle AVG  1.38                                                                    2.39

Specifically, it would be possible to decrease average deterioration from 2.39% down to 1.38% (a 42% reduction).
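The per-benchmark selection behind the oracle and the relative reduction quoted above can be sketched as follows. The four rows are copied from Table II; the helper names are hypothetical, and the sketch only picks the best fixed threshold per row and does not attempt to reproduce the AVG rows.

```python
THRESHOLDS = ["1", "2", "4", "6", "8", "16", "32", "Unrestricted"]

# Deterioration (%) relative to INF-CHK for a few benchmarks, copied from Table II.
TABLE_II_ROWS = {
    "gzip":  [1.06, 0.54, 0.71, 0.93, 1.12, 1.45, 1.51, 1.52],
    "mesa":  [0.51, 0.79, 1.31, 1.57, 1.67, 1.77, 1.78, 1.78],
    "mcf":   [16.24, 12.61, 8.78, 6.47, 5.03, 2.60, 1.65, 1.90],
    "twolf": [4.17, 2.96, 2.00, 2.01, 2.50, 4.74, 6.17, 6.45],
}


def best_threshold(row):
    """Return (threshold label, deterioration) of the best fixed threshold for one row."""
    best = min(range(len(row)), key=row.__getitem__)
    return THRESHOLDS[best], row[best]


def relative_reduction_pct(base, improved):
    """Reduction of `improved` with respect to `base`, as a percentage of `base`."""
    return 100.0 * (base - improved) / base


best_per_benchmark = {name: best_threshold(row) for name, row in TABLE_II_ROWS.items()}
oracle_gain = relative_reduction_pct(2.39, 1.38)  # average deterioration: unrestricted vs. oracle
```

For these rows the selected thresholds agree with the Best WT column (2, 1, 32, and 4 respectively), and `oracle_gain` rounds to the 42% figure.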

In the interest of space, we note that a similar analysis for the other processor configurations (1K-entry window size, decode width of eight, ROB-Only configuration) resulted in similar trends, often with larger absolute percentages. Section 5.6.3 summarizes these experiments.

5.6 Adaptive BranchTap: Adaptive-Threshold Speculation Control

Section 5.6.1 considers the tradeoff between sampling precision and adaptation speed and precision as per the discussion of Section 3.1. Section 5.6.2 shows that BranchTap can improve performance even when better checkpoint prediction is used and that it offers a much better resource vs. performance tradeoff. Section 5.6.3 summarizes performance for various processor configurations demonstrating that BranchTap robustly improves performance.

5.6.1 Adaptive BranchTap: Tuning δ and ω. This section considers the tradeoff between sampling precision and adaptation speed and precision as it is captured by the parameters δ and ω respectively (see Section 3.1). Figure 11 shows the per benchmark and average performance deterioration relative to INF-CHK for unrestricted speculation, and adaptive BranchTap configurations with ω = 1 (what we add or subtract to WT if we choose to adapt it) and δ (what we add or subtract to WT when sampling) ranging from 1 to 4, and for a 512-entry window processor with a decode width of four and four GCs. For clarity, the BranchTap configurations are marked as BT ω δ. We chose ω = 1 to be able to adapt precisely. In the interest of space, we note that we studied other configurations with different ω values, but we found that ω = 1 performed better. A small ω facilitates adaptation precision. To facilitate quick adaptation we chose a short adaptation cycle of 1M cycles.

Fig. 11. The sampling precision and adaptation precision and speed tradeoff. Per benchmark and average percentage performance deterioration relative to INF-CHK with unrestricted speculation (U Spec) and BranchTap (BT ω δ), and for a 512-entry window processor with a decode width of four and four GCs. Lower is better.

Average performance indicates that δ = 3 works best on the average and for most benchmarks. As explained in Section 3, lower values of δ yield more noisy sampling measurements (BranchTap with δ = 1 even deteriorates average performance), while higher values for δ are undesirable for two reasons: i) the sampling trials affect performance more, and ii) a large δ may overshoot the optimal threshold.
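To make the roles of δ and ω concrete, the sketch below shows one plausible adaptation step: the threshold is sampled at WT − δ and WT + δ, and WT then moves by ω toward whichever sample performed best. This is only an illustration of the parameters as described here; the exact policy is defined in Section 3.1, and the threshold bounds, the `measure` callback, and the noise-free performance curve are all assumptions.

```python
def adapt_threshold(wt, measure, omega=1, delta=3, wt_min=1, wt_max=64):
    """Return the next WT after one (assumed) adaptation cycle."""
    # Sampling phase: the current threshold plus trials at distance delta.
    candidates = [wt]
    for trial in (wt - delta, wt + delta):
        if wt_min <= trial <= wt_max:
            candidates.append(trial)
    # On ties, max() keeps the first candidate, i.e. the current threshold.
    best = max(candidates, key=measure)
    # Adaptation phase: step by omega toward the best sample, or stay put.
    if best > wt:
        return min(wt + omega, wt_max)
    if best < wt:
        return max(wt - omega, wt_min)
    return wt


def toy_ipc(wt, optimum=12):
    # Hypothetical, noise-free performance curve that peaks at `optimum`.
    return 2.0 - 0.01 * abs(wt - optimum)


wt = 32
for _ in range(40):  # 40 adaptation cycles
    wt = adapt_threshold(wt, toy_ipc)
```

With ω = 1 the threshold moves slowly but settles close to the peak of the toy curve. A larger δ samples points farther from the current threshold, which in this model can cause the search to stop short of, or step past, the best value.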

Figure 12 compares adaptive BranchTap with unrestricted speculation using two metrics. The left bars show the reduction in the total number of recovery cycles with adaptive BranchTap expressed as a percentage of the recovery cycles incurred by unrestricted speculation. The right bars show the reduction in the performance deterioration over INF-CHK with adaptive BranchTap (δ = 3 and ω = 1) expressed as a percentage of the performance deterioration of unrestricted speculation over INF-CHK. Both measurements are for a 512-entry window processor with a decode width of four and four GCs. On average, BranchTap decreases the cycles lost to recovery by 25.1% and reduces performance deterioration over INF-CHK by 29.2% (relative decrease). The per benchmark results illustrate that decreasing recovery cycles does not necessarily yield a proportional increase in performance. For example, eon sees a reduction in recovery cycles of just 18% which yields a reduction in performance deterioration of 88%. Gcc sees a reduction in recovery cycles of 18% but this does not result in a significant performance improvement.



Fig. 12. Per benchmark and average reduction in the total recovery cycles with BranchTap (BT 1 3) relative to unrestricted speculation (left bars), and reduction in performance deterioration relative to INF-CHK with BranchTap (right bars), and for a 512-entry window processor with 4 GCs. Higher is better.

5.6.2 Effect of the Confidence Estimator. This section demonstrates that BranchTap improves performance when used with different confidence estimators. Furthermore, it shows that using BranchTap offers better performance than simply increasing confidence table estimation accuracy (something that requires significantly more resources). Thus far, we have assumed a 1K-entry table of resetting counters [Jacobsen et al. 1996]. Figure 13 shows the average performance deterioration relative to INF-CHK with unrestricted speculation and BranchTap under the Anyweak estimator [Moshovos 2003] and several resetting counter confidence estimators with a different number of table entries, for a 512-entry window processor with a width of four and four GCs.

With larger confidence tables, unrestricted speculation performance improves because of a better allocation of the GCs. BranchTap always provides performance benefits, even when the Anyweak predictor is used, indicating that BranchTap is orthogonal to the choice of the underlying confidence estimator or its size. When using BranchTap, at least a 256-entry confidence table is needed to provide better performance than that achieved with an Anyweak estimator, indicating that Anyweak can be a viable, low cost alternative for confidence estimation. More importantly, the results show that complementing confidence-based checkpoint prediction with BranchTap consistently outperforms a confidence estimator with 64 times more entries (e.g., performance with a 1K-entry confidence table and BranchTap compared to the performance with the 64K-entry confidence table alone).

5.6.3 Adaptive BranchTap Performance Sensitivity Analysis. This section studies how BranchTap performs for various processor configurations to demonstrate its robustness. We vary the instruction window size, the decode width, and the number of checkpoints.

Table III compares unrestricted speculation, non-adaptive BranchTap, and adaptive BranchTap. For each processor configuration (window size, number of GCs, and decode width), we experimentally determined the best fixed threshold policy as well as the best adaptive BranchTap policy. These configurations are listed on the table. The best results per processor configuration are highlighted in boldface. The results show that the adaptive BranchTap robustly reduces average and worst case deterioration over unrestricted speculation. This is not the case for the non-adaptive BranchTap.

Fig. 13. Average performance deterioration relative to INF-CHK with unrestricted speculation and BranchTap under different underlying confidence estimators, and for a 512-entry window processor with a decode width of four and four GCs. We show the results with the resetting counters confidence estimator for various number of confidence table entries, and with the Anyweak estimator on the leftmost two bars. Lower is better.

Table III. BranchTap performance with 4 GCs. Lower is better.

WindowSize/DecodeWidth          512/4    512/8   1024/4   1024/8
Unrestricted Speculation
  Average Deterioration          2.39     1.66     3.93     2.81
  Worst Case Deterioration       6.45     7.73    12.00     8.99
Non-Adaptive BranchTap
  Optimal Fixed Threshold          32       32       16       16
  Average Deterioration          2.30     1.64     3.12     2.46
  Worst Case Deterioration       6.36     7.57     7.44     6.03
Adaptive BranchTap
  Optimal Configuration        BT 1 3   BT 1 3   BT 1 4   BT 1 3
  Average Deterioration          1.69     1.28     2.39     1.74
  Worst Case Deterioration       3.56     3.74     7.43     4.99

Tables IV, V, and VI show the same set of results with two, one, or no GCs. The adaptive BranchTap always performs best on average. For a 1K-entry window processor with a decode width of four and no GCs (ROB-Only recovery scheme), the average performance improvements can be as high as 12.6% over unrestricted speculation and 1% over the non-adaptive BranchTap. In terms of worst case performance deterioration, the adaptive BranchTap generally outperforms the non-adaptive BranchTap but in some cases it performs worse albeit only slightly. The performance benefits of adaptive BranchTap relative to unrestricted speculation in worst case deterioration can be as high as 18.2% for a 1K-entry window processor with no checkpoints.



Table IV. BranchTap performance with 2 GCs.

WindowSize/DecodeWidth          512/4    512/8   1024/4   1024/8
Non-Adaptive BranchTap
  Optimal Fixed Threshold           8       16        8       16
  Average Deterioration          3.76     2.52     4.57     3.40
Adaptive BranchTap
  Optimal Configuration        BT 1 3   BT 1 3   BT 1 4   BT 1 3
  Average Deterioration          3.01     1.79     3.71     2.63
  Worst Case Deterioration       7.45     4.82     7.30     5.92

Table V. BranchTap performance with 1 GC.

WindowSize/DecodeWidth          512/4    512/8   1024/4   1024/8
Non-Adaptive BranchTap
  Optimal Fixed Threshold           8        8        8        8
  Average Deterioration          5.16     3.61     6.11     4.44
  Worst Case Deterioration      10.13     6.62    10.22     7.37
Adaptive BranchTap
  Optimal Configuration        BT 1 2   BT 1 4   BT 1 2   BT 1 2
  Average Deterioration          4.27     2.71     4.81     3.39
  Worst Case Deterioration       7.00     5.75     8.40     7.25

Table VI. BranchTap performance with ROB-Only Recovery (no GCs).

WindowSize/DecodeWidth          512/4    512/8   1024/4   1024/8
Non-Adaptive BranchTap
  Optimal Fixed Threshold           4        6        4        6
  Average Deterioration          9.31     6.69    10.42     8.10
Adaptive BranchTap
  Optimal Configuration        BT 1 1   BT 1 2   BT 1 1   BT 1 2
  Average Deterioration          8.67     6.24     9.48     7.29


The optimal fixed threshold increases with the number of GCs. For example, with no GCs, the optimal threshold is in the range 4 ≤ WT ≤ 6, whereas with four GCs, the optimal threshold is in the range 16 ≤ WT ≤ 32. Similarly, configurations with a smaller number of GCs also generally perform better with smaller values for δ. For example, with no checkpoints the optimal δ was either one or two whereas with four checkpoints, it was either three or four.

6. CONCLUSIONS

We have presented BranchTap, a resource-efficient technique that combines adaptive speculation control and checkpoint prediction to improve performance when very few, or even no global checkpoints are available. Prediction-based methods improve performance by trying to allocate GCs to those branches that are likely to cause a mispeculation. However, this task becomes increasingly harder as the window increases and as the number of available GCs is kept low. This work demonstrated that there is significant potential for improvement by trying to also reduce the number of instructions that would have to be squashed. This approach is complementary to existing checkpoint prediction methods. This work proposed BranchTap, a method that preemptively throttles control flow speculation to reduce the number of cycles lost to recovering from mispeculations. BranchTap successfully tackles the complex performance tradeoffs that apply to speculation control, permitting deep speculation when this leads to performance enhancing side-effects. This work showed that BranchTap robustly improves performance over various processor configurations. For example, it showed that for a 1K-entry window processor and with just four checkpoints BranchTap reduces average and worst case performance deterioration by 35.6% and 38.3% respectively. Since BranchTap requires few additional resources and is relatively straightforward to implement, it is preferable over improving the accuracy of checkpoint prediction by increasing the size of the underlying confidence estimator. We have observed that BranchTap is better than fixed threshold policies in that it offers better average performance and also offers lower variability in the performance loss.

Acknowledgments

The authors would like to thank Ioana Burcea, Elham Safi, Jason Zebchuk, and the anonymous reviewers for their valuable comments. This research was supported by a CFI equipment grant, an Intel Corporation equipment donation, funds from the University of Toronto, an NSERC Discovery Grant, and an Intel Research Grant.

REFERENCES

Akkary, H., Rajwar, R., and Srinivasan, S. 2003. Checkpoint processing and recovery: Towards scalable instruction window processors. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE/ACM, San Diego, CA.

Akkary, H., Rajwar, R., and Srinivasan, S. 2004. An analysis of resource efficient checkpoint architecture. ACM Transactions on Architecture and Code Optimization (TACO) 1, 4 (Dec.), 418–444.

Akl, P. and Moshovos, A. 2006. Branchtap: Improving performance with very few checkpoints through adaptive speculation control. In Proceedings of the 20th International Conference on Supercomputing. ACM, Cairns, Australia.

Aragon, J. L., Gonzalez, J., Gonzalez, A., and Smith, J. E. 2002. Dual path instruction processing. In Proceedings of the 16th International Conference on Supercomputing. ACM, New York, NY, 220–229.

August, D. I., Hwu, W., and Mahlke, S. 1997. A framework for balancing control flow and predication. In Proceedings of the 30th International Symposium on Microarchitecture.

Burger, D. and Austin, T. The simplescalar tool set v2.0. Technical Report UW-CS-97-1342.

Cher, C. Y. and Vijaykumar, T. N. 2001. Skipper: A microarchitecture for exploiting control-flow independence. In Proceedings of the 34th International Symposium on Microarchitecture.

Chou, Y., Fung, J., and Shen, J. P. 1999. Reducing branch misprediction penalties via dynamic control independence detection. In Proceedings of the 13th International Conference on Supercomputing. ACM, Rhodes, Greece, 109–118.

Cristal, A., Ortega, D., Llosa, J., and Valero, M. 2003. Kilo-instruction processors. In Proceedings of the 5th International Symposium on High Performance Computing (ISHPC’V).

Gandhi, A., Akkary, H., and Srinivasan, S. T. 2004. Reducing branch misprediction penalty via selective branch recovery. In Proceedings of the 10th International Symposium on High Performance Computer Architecture (HPCA’04). IEEE, Madrid, Spain, 254–264.

Grunwald, D., Klauser, A., Manne, S., and Pleszkun, A. 1998. Confidence estimation for speculation control. In Proceedings of the 25th Annual International Symposium on Computer Architecture. IEEE, Barcelona, Spain, 122–131.

Heil, T. H. and Smith, J. E. 1996. Selective dual path execution. Technical Report, University of Wisconsin, Madison.

Hwu, W. W. and Patt, Y. N. 1987. Checkpoint repair for out-of-order execution machines. In Proceedings of the 14th Annual Symposium on Computer Architecture. IEEE, Pittsburgh, PA, 18–26.

Jacobsen, E., Rotenberg, E., and Smith, J. E. 1996. Assigning confidence to conditional branch predictions. In Proceedings of the 29th Annual International Symposium on Microarchitecture. IEEE, Paris, France, 142–152.

Jimenez, D. A. and Lin, C. Composite confidence estimators for enhanced speculation control. Technical Report TR-02-14, Department of Computer Sciences, The University of Texas at Austin.

Manne, S., Klauser, A., and Grunwald, D. 1998. Pipeline gating: Speculation control for energy reduction. In Proceedings of the 25th Annual International Symposium on Computer Architecture. IEEE, Barcelona, Spain, 132–141.

Martínez, J. F., Renau, J., Huang, M. C., Prvulovic, M., and Torrellas, J. 2002. Cherry: Checkpointed early resource recycling in out-of-order microprocessors. In Proceedings of the 35th International Symposium on Microarchitecture. IEEE, Anchorage, Alaska, 3–14.

Moshovos, A. 2003. Checkpointing alternatives for high performance, power-aware processors. In Proceedings of the 2003 International Symposium Low Power Electronic Devices and Design (ISLPED’03). ACM, Seoul, Korea, 318–321.

Rotenberg, E., Jacobsen, Q., and Smith, J. E. 1999. A study of control independence in superscalar processors. In Proceedings of the 5th International Symposium on High Performance Computer Architecture. IEEE, Orlando, FL, 115–124.

Smith, J. E. and Pleszkun, A. 1988. Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers 37, 5 (May), 562–573.

Sodani, A. and Sohi, G. S. 1997. Dynamic instruction reuse. In Proceedings of the 24th Annual International Symposium on Computer Architecture. IEEE, Denver, CO, 194–205.

Sohi, G. S. 1990. Instruction issue logic for high-performance, interruptible, multiple functional unit, pipelined computers. IEEE Transactions on Computers 39, 3 (Mar.), 349–359.

Tyson, G., Lick, K., and Farrens, M. Limited dual path execution. CSE-TR 346-97, University of Michigan.

Wallace, S., Calder, B., and Tullsen, D. 1998. Threaded multiple path execution. In Proceedings of the 25th International Symposium on Computer Architecture (ISCA’98). IEEE, Barcelona, Spain, 238–249.

Yeager, K. C. The mips r10000 superscalar microprocessor. IEEE MICRO 16, 2.

Zhou, P., Onder, S., and Carr, S. 2005. Fast branch misprediction recovery in out-of-order superscalar processors. In Proceedings of the 19th International Conference on Supercomputing. ACM, Cambridge, MA, 41–50.

