Achieving reliable system performance by fast recovery of branch miss prediction

Journal of Network and Computer Applications 35 (2012) 982–991


Min Choi a, Jong Hyuk Park b, Seungho Lim c, Young-Sik Jeong d,*

a School of Information and Communication Engineering, Chungbuk National University, Cheongju 361-763, South Korea
b Department of Computer Science and Engineering, Seoul National University of Science and Technology, Seoul 139-742, South Korea
c Department of Digital Information Engineering, Hankuk University of Foreign Studies, Yongin 449-791, South Korea
d Department of Computer Engineering, Wonkwang University, Iksan 305-701, South Korea

Article info

Article history: Received 1 August 2010; received in revised form 13 January 2011; accepted 9 March 2011; available online 21 March 2011.

Keywords: System reliability; Fault-tolerant system; Branch misprediction recovery

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0022589 and 2010-0025748).

* Corresponding author.

doi:10.1016/j.jnca.2011.03.015

Abstract

Today's technology evolution provides users with inexpensive and powerful computer systems. However, it is argued that system reliability and fault tolerance are necessary in these systems as well. A proper design for a reliable and fault-tolerant computer system requires a trade-off among cost, reliability, and availability. In this paper, we propose a low-cost recovery scheme for reliable system performance. This approach completely eliminates the roll-back overhead on branch misprediction. Thus, the instruction fetcher does not stop, and it fetches instructions from the correct path immediately after the misprediction is detected. The approach therefore prevents a processor from flushing the pipeline, even under branch misprediction, by allowing the instruction fetcher to work continuously. Our approach reduces the branch misprediction penalty to achieve reliable system performance. It instantly reconstructs the map table for any mispredicted branch, and it outperforms the conventional RMT by an average of 10.93%.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

The rapid rise in the complexity of computer systems is why reliability and fault tolerance are so important in system design. In particular, branch prediction is a well-known technique for improving performance in microprocessors. Although today's branch predictors show high accuracy, branch prediction is still a probabilistic approach and not perfect. Moreover, the misprediction penalty is getting larger due to aggressive speculation and deeper pipelining. The penalty severely affects reliable system performance because it causes frequent performance drops, longer state recovery time, and low system reliability. Therefore, it must be handled properly in order to achieve reliable system performance.

There are two main sources of performance degradation from branch mispredictions: the cycles to resolve the branch, and the cycles to recover the correct architectural state. We focus on reducing the recovery overhead of the branch misprediction penalty. The recovery process involves not only canceling the effects of instructions that were speculatively executed, but also reestablishing the correct architectural state. Typically, branch misprediction recovery requires flushing the pipeline, reconstructing the architectural state, and then restarting the process of instruction fetching and register renaming. The renaming of correct-path instructions cannot restart until the register rename map table corresponding to the mispredicted branch is restored. The most common solutions for restoring the map table are the checkpoint processing and recovery (CPR) method (Akkary et al., 2003), the retirement map table (RMT) method (Hinton et al., 2001), and the history buffer (HB) method (Ranganathan et al., 1997; Smith and Pleszkun, 1985). In the checkpoint repair method, backup copies of the in-order state are created periodically, either at every branch or every few cycles. The RMT method uses a retirement map table in addition to the frontend map table. When the mispredicted branch reaches the retire point, the retirement map table is copied to the frontend map table. The HB method uses a stack-like structure to maintain the in-order state superseded by speculative values. On a branch misprediction, all mappings are popped from the HB and written back into the rename map table in reverse order. The Intel Pentium 4 uses the RMT method to track register mappings. The Alpha 21264 (Kessler, 1999) and MIPS R10000 (Yeager, 1996) use the CPR method to reestablish the rename table. With these three methods remaining the basis of branch recovery studies, several more recent proposals have been developed: the eager misprediction recovery (EMR) method (Zhou et al., 2005) and selective checkpointing (Gandhi et al., 2004; Akkary et al., 2004; Cristal et al., 2005; Kirman et al., 2005). The EMR method hides the latency of long branch recovery by allowing instructions that access correct values to continue executing, while forcing instructions that reference incorrect speculative values to wait until the correct data are restored. Selective checkpointing minimizes the CPR overhead and enables fast branch misprediction recovery by creating checkpoints only at low-confidence branches, whereas conventional checkpointing allocates checkpoints for all predicted branches.

In this paper, we overcome the complex recovery overhead by proposing a fast and efficient recovery mechanism. To this end, we propose incremental register renaming (IRR), which is distinct from conventional register renaming because it uses a different reclaiming policy to rename registers. The IRR enforces that the destination register numbers of the instruction stream appear in non-decreasing order. We then develop a fast and efficient structure, called the bit-vector based rename map table (BVMT), to reduce the branch recovery overhead. With the incremental property of the IRR, the BVMT recovery scheme completely eliminates the roll-back overhead on branch misprediction. The proposed mechanism does not have to wait until all other pending instructions are completed before starting the recovery process. Thus, the instruction fetcher does not stop, and it fetches instructions from the correct path immediately after the misprediction is detected. The proposed mechanism creates checkpoints efficiently and instantly rolls the map table back to the correct state.

The rest of this paper is organized as follows. Section 2 presents a brief review of the existing approaches related to register renaming and rename table structures. Sections 3 and 4 present the key ideas of this paper, namely the IRR strategy and the BVMT structure. Section 5 evaluates the performance. Finally, we conclude by summarizing our results in Section 6.

2. Related work

The related literature is discussed in this section. We begin by describing the rename table structures and then briefly introduce related work on branch misprediction recovery.

There are two possibilities for keeping track of the actual mapping of particular architectural registers to allocated rename buffers: the SRAM¹-structured RMT and the CAM²-structured RMT (Sima, 2000). The SRAM-structured RMT has as many entries as there are architectural registers provided by the instruction set architecture. Each valid entry supplies the index of the rename buffer allocated to the architectural register belonging to that entry. The CAM-structured RMT relies on an associative mechanism to maintain the actual register mappings. In this case no mapping table exists; instead, each rename buffer holds the identifier of the associated architectural register and additional status bits.

¹ SRAM stands for static random access memory. It is a type of semiconductor memory where the word static indicates that, unlike dynamic RAM (DRAM), it does not need to be periodically refreshed, as SRAM uses bistable latching circuitry to store each bit.

² Content-addressable memory (CAM) is a special type of computer memory used in certain very high speed searching applications. It is also known as associative memory, associative storage, or associative array, although the last term is more often used for a programming data structure.

The most common solutions for restoring the map table are as follows: the CPR method (Akkary et al., 2003), the RMT method (Hinton et al., 2001), and the HB method (Ranganathan et al., 1997; Smith and Pleszkun, 1985). In the checkpoint repair method, backup copies of the in-order state are created periodically, either at every branch or every few cycles. This method requires a large amount of storage to save the in-order state for each branch. Moreover, periodic checkpointing is impractical to implement because a tremendous number of checkpoints may be required as the instruction window scales up. The RMT method uses a retirement map table in addition to the frontend map table. Once a misprediction is resolved, the processor applies the rename mapping of each in-flight instruction to the RMT. When the mispredicted branch reaches the retire point, the retirement map table is copied to the frontend map table. This technique wastes several cycles because the recovery process can only be initiated after all instructions prior to the mispredicted branch are retired. In particular, when a cache miss occurs before the mispredicted branch reaches the head of the reorder buffer, this approach results in delays and performance degradation. The HB method uses a stack-like structure to maintain the in-order state superseded by speculative values. The current mapping of the destination register is pushed onto the stack for each instruction. On a branch misprediction, all mappings are popped from the HB and written back into the rename map table in reverse order. This method requires several cycles to reconstruct the register map table, and the renaming of new instructions is delayed.
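The history buffer recovery described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited designs; the class name `HistoryBufferMap`, its methods, and the register numbers are hypothetical and chosen only to show why recovery cost grows with the number of wrong-path mappings.

```python
class HistoryBufferMap:
    """Rename map table backed by a history buffer (illustrative sketch only)."""

    def __init__(self, num_arch_regs):
        # Assumption: architectural register i initially maps to rename register i.
        self.map_table = list(range(num_arch_regs))
        # Stack of (architectural register, superseded mapping) pairs.
        self.history = []

    def rename(self, arch_reg, new_reg):
        # Push the mapping being superseded, then install the new one.
        self.history.append((arch_reg, self.map_table[arch_reg]))
        self.map_table[arch_reg] = new_reg

    def mark(self):
        # Stack depth at the moment a branch is renamed.
        return len(self.history)

    def recover(self, depth):
        # Pop every entry younger than the branch, newest first,
        # restoring one mapping per step.
        steps = 0
        while len(self.history) > depth:
            arch_reg, old_reg = self.history.pop()
            self.map_table[arch_reg] = old_reg
            steps += 1
        return steps


hb = HistoryBufferMap(num_arch_regs=8)
hb.rename(3, 9)
branch_mark = hb.mark()      # a branch is renamed at this point
hb.rename(3, 10)             # wrong-path instructions
hb.rename(1, 11)
undo_steps = hb.recover(branch_mark)
```

In this sketch the number of undo steps equals the number of wrong-path mappings, which is the sequential cost the HB method pays before renaming can resume.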

The EMR method (Zhou et al., 2005) and selective checkpointing (Gandhi et al., 2004; Akkary et al., 2004; Cristal et al., 2005; Kirman et al., 2005; Lee and Zomaya, 2009; Sorin et al., 2000; Moshovos, 2003) are variations of the previous works. The EMR method hides the latency of long branch recovery by allowing instructions that access correct values to continue executing, while forcing instructions that reference incorrect speculative values to wait until the correct data are restored. This optimization achieves better performance than traditional sequential misprediction recovery. However, as the instruction window gets larger, the distance between the most recent checkpoint and the current instruction pointer increases, and many instructions have to be re-executed. Moreover, recovery latency is still incurred for the instructions that use incorrect speculative values. Selective checkpointing minimizes the CPR overhead and enables fast branch misprediction recovery by creating checkpoints only at low-confidence branches, whereas conventional checkpointing allocates checkpoints for all predicted branches. The results of correct-path instructions can be reused. Thus, the recovery penalty is reduced since those instructions do not need to be fetched and renamed again. However, the selection of low-confidence branches itself depends on prediction, which is subject to varying degrees of uncertainty and is not reliable enough.

3. Incremental register renaming

Register renaming is a technique to avoid unnecessary serialization of program operations imposed by the reuse of the same destination registers. Register renaming resolves the false data dependencies (anti-dependency and output dependency) that occur in a code segment between operands of subsequent instructions. The core of the renaming activity takes place in the register map table (RMT). The RMT keeps track of the register mapping from architectural registers to logical rename registers. In this paper, we propose incremental register renaming (IRR) to make the recovery process quick and efficient. Figure 1 shows an example of IRR for a real code sequence (a part of quick sort), assuming a MIPS-like instruction set architecture (ISA).

The IRR is distinct from conventional register renaming because it uses a different reclaiming policy to rename registers. Even when a physical register recently freed by a preceding instruction is available, the IRR does not reuse that register for the mapping of the next instruction. The IRR maps the architectural registers of each instruction to rename registers sequentially. Thus, the IRR enforces that the destination register numbers of the instruction stream appear in non-decreasing order. Using the IRR, the rename registers are allocated sequentially and contiguously on the list of rename buffers between the head and tail pointers. When the pointer registers overflow, the register values reset to zero. Therefore, the allocation of rename buffers proceeds in round-robin order. In rename buffer allocation, the cycle length equals the number of rename registers, because at the end of the rename buffers the allocation wraps back around to the first entry. This incremental property of the IRR has great potential to eliminate the roll-back overhead on branch misprediction. We take advantage of the IRR to create low-cost recovery of rename map tables.

Fig. 1. An example of incremental register renaming and the usage of bit-vector based RMT.

Fig. 2. Conventional pipeline flushing due to branch misprediction.

Fig. 3. BVMT recovery.
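The round-robin allocation policy of the IRR described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `IncrementalAllocator` and its method names are hypothetical, retirement is assumed to happen in order from the head pointer, and a four-entry rename buffer is used only to make the wrap-around visible.

```python
class IncrementalAllocator:
    """Sequential (round-robin) rename register allocation, sketch only."""

    def __init__(self, num_rename_regs=32):
        self.size = num_rename_regs
        self.head = 0        # oldest in-flight rename register
        self.tail = 0        # next rename register to hand out
        self.in_flight = 0   # number of registers between head and tail

    def allocate(self):
        if self.in_flight == self.size:
            raise RuntimeError("no free rename register; renaming must stall")
        reg = self.tail
        # The pointer resets to zero when it overflows.
        self.tail = (self.tail + 1) % self.size
        self.in_flight += 1
        return reg

    def retire(self):
        # Assumption: instructions retire in order, so the head advances by one.
        if self.in_flight == 0:
            raise RuntimeError("nothing to retire")
        reg = self.head
        self.head = (self.head + 1) % self.size
        self.in_flight -= 1
        return reg


alloc = IncrementalAllocator(num_rename_regs=4)
first = [alloc.allocate(), alloc.allocate()]   # registers 0 and 1
freed = alloc.retire()                         # register 0 becomes free
after_free = alloc.allocate()                  # next in sequence, not the freed one
rest = [alloc.allocate(), alloc.allocate()]    # reaches the end, then wraps
```

Note that the allocation after the retirement returns the next register in sequence rather than the one just freed; the freed register is handed out again only after the tail pointer wraps around.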

4. A low-cost recovery of incorrect speculative state

4.1. Why flushing pipeline on branch recovery?

When a branch misprediction occurs, retiring all instructions prior to the mispredicted branch is the state-of-the-art solution to the branch recovery problem. After recovering from the branch, the processor resumes executing instructions on the correct path. However, this solution severely affects performance because it takes a long time to recover the speculative machine state to the correct state. The recovery process involves squashing all instructions on the mispredicted path and restarting the fetching and renaming of instructions from the correct path. A processor that exploits the RMT method (Sima, 2000) uses a retirement map table in addition to the frontend map table. Once a misprediction is resolved, the processor continues to execute the instructions prior to the mispredicted branch, as illustrated in (a) through (c) of Fig. 2. It applies the rename mapping of each in-flight instruction to the RMT. When the mispredicted branch reaches the retire point, the retirement map table is copied to the frontend map table. This technique wastes several cycles because the recovery process can only be initiated after all instructions prior to the mispredicted branch are retired. Moreover, the pipeline is flushed at this time, as shown in (d) of Fig. 2. This incurs severe performance degradation.
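The retirement map table recovery just described can be sketched in Python to show where the waiting comes from. The sketch is a simplification under stated assumptions: the class `RetirementMapRenamer` is hypothetical, the branch itself is not stored in the reorder buffer model, and each retirement is counted as one step rather than as a measured cycle count.

```python
from collections import deque


class RetirementMapRenamer:
    """Frontend map plus retirement map, illustrative sketch only."""

    def __init__(self, num_arch_regs):
        self.frontend = list(range(num_arch_regs))     # speculative mappings
        self.retirement = list(range(num_arch_regs))   # in-order mappings
        self.rob = deque()                             # in-flight (arch_reg, rename_reg)

    def rename(self, arch_reg, rename_reg):
        self.frontend[arch_reg] = rename_reg
        self.rob.append((arch_reg, rename_reg))

    def retire_oldest(self):
        # Retirement applies the oldest in-flight mapping to the retirement map.
        arch_reg, rename_reg = self.rob.popleft()
        self.retirement[arch_reg] = rename_reg

    def recover(self, older_than_branch):
        # Recovery cannot start until every instruction older than the
        # mispredicted branch has retired.
        waited = 0
        for _ in range(older_than_branch):
            self.retire_oldest()
            waited += 1
        self.rob.clear()                         # squash wrong-path entries
        self.frontend = list(self.retirement)    # copy retirement map to frontend
        return waited


rmt = RetirementMapRenamer(num_arch_regs=8)
rmt.rename(3, 9)      # older than the branch
rmt.rename(2, 10)     # older than the branch
rmt.rename(3, 11)     # wrong path
rmt.rename(1, 12)     # wrong path
waited = rmt.recover(older_than_branch=2)
```

The returned count grows with the number of instructions still in flight ahead of the branch, which is the delay that the remainder of this section aims to remove.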

Once the misprediction is resolved, the lower part (wrong path) of (a) in Fig. 2 is squashed from the pipeline. Fortunately, squashing those instructions is not expensive, since all that is needed is to clear the valid bits. However, the top part (correct path) of the figure must be executed regardless of the misprediction resolution. In the meantime, the instruction fetcher and the register renaming unit cannot work at all. This is because the rename map table is still in an incorrect speculative state, and the table can be reconstructed only when the mispredicted branch reaches the retire point. While squashing instructions on the mispredicted path is necessary for correct execution, flushing the pipeline is not. Although flushing the pipeline on a branch misprediction is merely an avoidable side effect, conventional processors still suffer performance degradation due to the late recovery. In the following section, we propose a novel structure for the rename map table, called the BVMT, to avoid the unnecessary pipeline flushing.

Fig. 4. BVMT encoding scheme.

Fig. 5. The BVMT decoding logic.

4.2. The BVMT structure

To handle recovery from exceptions or branch mispredictions quickly and efficiently, we change the conventional rename map table structure to the bit-vector based map table (BVMT). The BVMT has head and tail pointers like a circular queue. The tail index is increased by one for the mapping of each instruction, and the head index is increased when an instruction retires. A processor writes new register mappings sequentially at the tail of the BVMT. Figure 3 demonstrates the update and recovery of the BVMT at each renaming step. In the BVMT structure, each row corresponds to an architectural register number, and the position of a set bit in a row translates to the renamed register number. If we assume that the misprediction occurs on bgez, then the processor just clears all bits after the 8th column, as shown in Fig. 3, since the BVMT guarantees that those are the only parts modified after the branch.

That is all we need to recover from a misprediction using the BVMT structure. For simplicity of explanation, Fig. 4 shows only the third row of the BVMT. The left side of the figure shows that architectural register 3 is initially mapped to renamed register 3. After that, two subsequent instructions that write architectural register 3 use renamed registers 9 and 10, respectively. The mapping to register 10 is the most up-to-date value at that time. The right side of Fig. 4 shows the state after the branch recovery. It indicates that the most up-to-date mapping is from 3 to 3, because the last two mappings are squashed during the recovery process.
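The update and recovery steps described above can be mimicked in software with one integer per row. The following Python sketch is only an approximation of the hardware idea: the class `BitVectorMapTable`, zero-based column numbering, and the instruction sequence are assumptions made for illustration, and tail wrap-around and retirement are deliberately not modeled. In hardware the clearing would be a single gang-clear over all rows, whereas the sketch loops over the rows.

```python
class BitVectorMapTable:
    """Software model of a bit-vector based map table (sketch only)."""

    def __init__(self, num_arch_regs=32, num_rename_regs=32):
        self.width = num_rename_regs
        # Bit c of rows[r] is set when architectural register r
        # was mapped to rename register c.
        self.rows = [0] * num_arch_regs
        self.tail = 0   # next column to be written

    def rename(self, dest_reg=None):
        # Every instruction takes the next column; only instructions
        # with a destination register set a bit in it.
        if self.tail >= self.width:
            raise RuntimeError("wrap-around is not modeled in this sketch")
        column = self.tail
        if dest_reg is not None:
            self.rows[dest_reg] |= 1 << column
        self.tail += 1
        return column

    def lookup(self, arch_reg):
        # Assumption: without wrap-around, the highest set bit is the newest mapping.
        row = self.rows[arch_reg]
        return row.bit_length() - 1 if row else None

    def recover(self, branch_column):
        # Clear every column younger than the branch in all rows.
        keep = (1 << (branch_column + 1)) - 1
        self.rows = [row & keep for row in self.rows]
        self.tail = branch_column + 1


bvmt = BitVectorMapTable()
for dest in (0, 1, 2, 3, 4, 5, 6, 7):   # columns 0..7; register 3 lands in column 3
    bvmt.rename(dest)
branch_column = bvmt.rename()           # the branch takes column 8, no destination
bvmt.rename(3)                          # column 9
bvmt.rename(3)                          # column 10
before = bvmt.lookup(3)
bvmt.recover(branch_column)
after = bvmt.lookup(3)
```

With this sequence, architectural register 3 is mapped to rename register 10 before recovery and back to rename register 3 afterwards, mirroring the situation described for Fig. 4.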

To support the BVMT, we need to consider the decoding logic. From Fig. 4, it is straightforward to use a priority decoder so that the logic translates the location of a set bit into a physical register number. Figure 5 shows the entire structure of the BVMT decoder on the left side, while the right side of the figure shows the detailed logic of a priority decoder. Because we set the rename buffer size to 32, we need a 32-to-1 decoder. However, due to the logic complexity of a 32-to-1 decoder, we use four 8-to-1 decoders and one 4-to-1 decoder. On the left side of Fig. 5, four decoders take a total of 32 bits as input from a BVMT row and output a 4-bit array. This 4-bit array is then fed to the 4-to-1 decoder. Because of its logical simplicity, the 32-to-1 BVMT decoder is easy to implement in this way. In particular, it is a priority decoder because only the right-most set bit in the binary representation of a BVMT row is up-to-date. Here, we use four general 8-bit priority decoders and one 4-bit priority decoder in which no power optimization techniques are exploited. If we adopt optimization techniques such as those of Kun et al. (2004), the priority encoders will be faster and more power efficient. We measure the additional cost of the priority decoder in terms of power dissipation, area estimation, and access delay, as shown in Table 1. The supplementary logic has been implemented in Verilog HDL and synthesized with the Synopsys Design Compiler (Synopsys Design Compiler, 2010) targeting a 0.18 micron TSMC library. For simplicity, the logic structure is designed considering only a single row of the BVMT.
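The two-level decoding described above, with four 8-bit priority decoders feeding one 4-bit priority decoder, can be sketched behaviorally in Python. This is not the synthesized Verilog; the function names `priority_encode` and `decode_row` are hypothetical, and the sketch assumes that bit c of the integer stands for column c, so the most up-to-date mapping is the highest set bit.

```python
def priority_encode(bits, width):
    """Return (valid, index of the highest set bit) for a width-bit input."""
    for index in range(width - 1, -1, -1):
        if (bits >> index) & 1:
            return True, index
    return False, 0


def decode_row(row):
    """Two-level decode of a 32-bit BVMT row into a rename register number."""
    # First level: four 8-bit priority decoders, one per byte of the row.
    first_level = [priority_encode((row >> (8 * g)) & 0xFF, 8) for g in range(4)]
    # The four valid flags form the 4-bit input of the second-level decoder.
    group_bits = sum(1 << g for g, (valid, _) in enumerate(first_level) if valid)
    any_valid, group = priority_encode(group_bits, 4)
    if not any_valid:
        return None   # no mapping recorded in this row
    # Combine the selected group with the index found inside that group.
    return 8 * group + first_level[group][1]


# Row with bits set at columns 3, 9 and 10, as in the Fig. 4 discussion.
row_r3 = (1 << 3) | (1 << 9) | (1 << 10)
latest = decode_row(row_r3)
```

Splitting the decode this way keeps each stage small: the first level looks at 8 bits at a time, and the second level only chooses among four groups.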

Table 1 shows the statistics obtained from synthesizing the 8-to-1 BVMT decoder. A net is a network of cells linked by wires connecting their ports: given several instances of cells, we use wires to connect their ports and build a net. The power dissipation in a CMOS circuit can be expressed as the sum of roughly two components (Choi and Maeng, 2008): (1) static power dissipation (due to the static or continuous current drawn from the power supply) and (2) dynamic power dissipation (due to the switching currents required to charge and discharge capacitors). Since the logic is purely combinational, the overhead is negligible compared to accessing an SRAM cell or checkpointing the entire contents of a map table. The overhead is also negligible compared to the retirement map table method, in which two map tables are required.

Fig. 6. The BVMT eliminates pipeline flushing even under branch misprediction.

Table 1. The logic overhead of the BVMT support structure.

Area                   Power                    Timing (ns)
Num. of ports   38     Dynamic   909.6226 µW    Min   0.74
Num. of nets    88     Leakage   2.5549 nW      Max   1.49
Num. of cells   56

4.3. The advantages of the BVMT structure

The BVMT offers the following key advantages: update history preservation and low-cost recovery. First, the update history in the BVMT is maintained entirely within a very small storage space, even though register mappings of the in-order state are superseded by new speculative register mappings.

To this end, we exploit the IRR to store the new register mappings incrementally without replacing the previous mapping information. Thus, a processor can roll back to the register map table state corresponding to any given branch. Second, the BVMT eliminates the reconstruction overhead of the register map table by quickly recovering the machine state to the last consistent point before an exception or a branch misprediction. The recovery in the BVMT is completed in a single cycle by squashing the bits set by successor instructions. The squashing operation can be done instantly using inexpensive custom circuitry, such as gang-clearing (Martinez et al., 2002). This low-cost recovery is possible because the BVMT guarantees that register mappings preceding a branch will never change and that new register mappings for succeeding instructions will be added incrementally after the location of the branch (Fig. 6).

Figure 7 shows the number of valid instructions in spite of branch misprediction. As shown in the figure, the BVMT approach keeps more valid instructions in the pipeline even when a branch misprediction occurs. In contrast, pipeline flushing in the conventional approach squashes the entire pipeline except for the branch instruction. So the curve for the conventional approach slopes down steeply and stays at a ROB usage of 1 after the branch misprediction, as depicted in Fig. 7.

Fig. 7. The BVMT decoding logic. (Plot: average rate of ROB usage; series: BVMT, conventional.)

Table 2. Architectural parameters.

Architectural configuration
  Fetch, decode, issue, commit     4, 4, 4, 4
  rn:size                          32
  rob:size                         32
  ialu, imult, fpalu, fpmult       8, 2, 2, 2

Branch prediction
  bpredictor                       2lev. Bimodal, 2 K
  btb                              128, 4-way
  Return address stack             16 entry
  Branch resolution                2 cycles
  Mispred penalty                  2 cycles + α

Memory hierarchy
  L1 D-cache, L1 I-cache           1024 sets, 32B blocks, 2-way, LRU
  L2 cache                         2048, 64B blocks, 4-way, LRU
  Memory latency                   20 cycles

5. Experimental results

All tests and evaluations were performed with programs from the SPEC CPU2000 benchmark suite (The Standard Performance Evaluation Corporation) on SimpleScalar. The SimpleScalar tool set is a system software infrastructure used to build modeling applications for program performance analysis and detailed microarchitectural modeling. Using the SimpleScalar tools, users can build modeling applications that simulate real programs running on a range of modern processors and systems. The SimpleScalar tools are widely used for research and instruction; for example, in 2000 more than one third of all papers published in top computer architecture conferences used the SimpleScalar tools to evaluate their designs. SimpleScalar is a cycle-accurate, architecture-level simulator. The simulator models the pipeline architecture of the Alpha 21264 shown in Fig. 8.

The SPEC benchmarks are compiled and statically linked for the Alpha instruction set and include all linked libraries. For each program, we skip the first 1 million instructions to avoid unrepresentative behavior at the beginning of the program's execution. This is because the initialization parts of applications are very different from the rest of them (Sair and Charney, 2000). The architecture parameters used in our simulations are listed in Table 2. Three models with different misprediction recovery schemes are evaluated and compared: the retirement map table (RMT), the history buffer (HB), and the bit-vector based rename map table (BVMT). Parameters such as iw:size, stt:size, and rn:size in Table 2 are only available in the DLT architecture. The iw:size parameter is the number of entries in the issue queue, and stt:size is the number of queue entries in the STT. We abbreviate the renaming size as rn:size; it is the number of rename buffers. In our experiments, we set the number of rename buffers to 32.

Figures 9 and 10 show the performance (IPC) of each benchmark under the three different rename map table architectures. As can be seen from the figures, the BVMT outperforms the conventional RMT across all benchmarks by an average of 10.93%, while the RMT shows better performance than the HB method. The performance enhancement comes from the fact that the BVMT method saves the clock cycles needed to repair the map tables, whereas the RMT and the HB require additional time for map table recovery, ranging from a few cycles to tens of cycles. In particular, the BVMT is more effective for integer applications than for floating-point applications. This is because floating-point applications have relatively better branch prediction accuracy in most cases, and thus branch mispredictions occur more frequently in integer applications. In general, the higher the taken frequency, the higher the prediction accuracy. From Cheng's research, we see that floating-point applications have a higher taken-branch frequency. This suggests that branches in floating-point applications are easier to predict than those in integer programs, given the integer programs' low frequency of taken branches.

Fig. 8. Machine model. (Block diagram: Fetch, Decode, RR with BVMT, Issue Queue, Reorder Buffer, FU, RF, Data Cache, LSQ, Common Data Bus, Commit.)

To analyze the experimental results, we provide a complexity analysis of the map table operations in Table 3. For the write operation at the commit stage, the HB and the RMT both have O(2) complexity, i.e., two table updates. This is because they have to update not only the rename map table, but also the history buffer and the retirement map table in the HB and the RMT, respectively. In contrast, the BVMT only has to clear one bit, resulting in O(1). For the recovery operation, the HB pops all mappings from the history buffer and writes them back into the rename map table in reverse order. Thus, in the worst case all of the elements in the map table need to be updated, and the time complexity of the HB method is O(n). In the RMT method, the retirement map table is copied to the frontend map table when the mispredicted branch reaches the retire point. Therefore, the asymptotic time complexity of the RMT method is O(n). In addition to the copying overhead, the processor must wait additional cycles until all other pending instructions are completed. In contrast, recovery in the BVMT works by the gang-clear operation, which clears all bits after the position of the mispredicted branch. This means that the complexity is O(1). Moreover, one main advantage of the BVMT approach is that it does not cause the instruction fetcher to stop, and register renaming is restarted from the correct path right after the misprediction is detected.

Fig. 9. IPC (instructions per cycle) results for integer applications. (Bar chart: IPC for BVMT, RMT, and HB; benchmarks: vortex, vpr, gcc, bzip2, gap, twolf, mcf, eon, crafty, parser, gzip, avg.)

Fig. 10. IPC (instructions per cycle) results for floating point applications. (Bar chart: IPC for BVMT, RMT, and HB; benchmarks: ammp, apsi, equake, fma3d, lucas, mgrid, swim, art, galgel, mesa, avg.)

Table 3. Complexity analysis.

                     BVMT   Checkpoint   HB     RMT
Write at dispatch    O(1)   –            O(1)   O(1)
Write at commit      O(1)   –            O(2)   O(2)
Checkpoint           O(1)   O(n)         –      –
Recovery             O(1)   O(n)         O(n)   O(n)

To help understand the performance results for the three models, Figs. 11 and 12 show the ROB occupancies. The ROB occupancy represents the average number of in-flight instructions in a processor pipeline for each execution. A larger ROB occupancy means a higher utilization of the processor pipeline. Recall from Section 4.1 that the frontend, including the instruction fetcher and the register renaming unit, stalls when a branch misprediction is detected. This results in pipeline flushing and performance degradation. Thus, the changes in the ROB occupancy largely correlate with the IPC changes. Indeed, the IPC improvement of the BVMT in Figs. 9 and 10 is accompanied by higher ROB occupancy in Figs. 11 and 12, and vice versa. Especially for vpr, twolf, crafty, ammp, fma3d, and mesa, there are relatively large improvements of the BVMT over the RMT. This is because these applications are more sensitive than the others to the misprediction recovery mechanism. Those applications are classified into subsets of SPEC CPU2000 programs using data locality characteristics, as shown on p. 773 of Joshi's paper (Joshi et al., 2006).

In addition, we present further results in Figs. 13 and 14. The graphs show the average rate of ROB usage when branch mispredictions are detected. The left-hand side of Figs. 13 and 14 shows the results for the RMT method, while the graphs on the right-hand side show the results for the BVMT method. In graphs (a), (c), (e), and (g) of Figs. 13 and 14, there are steep gradients with a sharp decline due to the pipeline flushing on every branch misprediction. On the other hand, graphs (b), (d), (f), and (h) frequently show a gradual increase or a plateau in the ROB usage rate, since the BVMT allows the instruction fetcher to continue working even on a branch misprediction.

In order to analyze more detail of the graphs in Figs. 13 and 14,we provide a simplified graph that is average or typical behavior andthe plot has been smoothed for clarity as shown in Fig. 15. ThisFigure shows the behaviors for a branch misprediction event. At thebeginning of the program execution, instructions enter the instruc-tion window. Then, at some point, the branch misprediction occurs.At that point, the window begins to drain of the dependentinstructions following the mispredicted branch, because of the

0

5

10

15

20

25

30

35

vortex vpr gcc bzip2 gap twolf mcf eon crafty parser perlbmk gzip avg.

RMT BVMT

RO

B o

ccup

ancy

Fig. 11. The ROB occupancy for integer applications.

0

5

10

15

20

25

30

35

ammp apsi equake fma3d lucas mgrid swim art galgel mesa avg.

RMT BVMT

RO

B o

ccup

ancy

Fig. 12. The ROB occupancy for floating point applications.

05

101520253035

05

101520253035

05

101520253035

05

101520253035

vpr (RMT)

vortex (RMT)

vpr (BVMT)

vortex (BVMT)

Fig. 13. The ROB occupancy for integer and floating point applications.


05

101520253035

05

101520253035

05

101520253035

05

101520253035

fma3d (RMT)

crafty (RMT)

fma3d (BVMT)

crafty (BVMT)

Fig. 14. The ROB occupancy for integer and floating point applications.

Fig. 15. Branch misprediction behavior between conventional and BVMT approaches. [Two panels, each annotated with: Branch misprediction occurs; Pipeline squashed; Pipeline refill (instructions enter issue queue).]


pipeline flushing. An important observation is that the branch misprediction coincides with the window drain; i.e., the mispredicted branch instruction is one of the last valid instructions to be executed. Figure 15 shows the difference between the window drain patterns of the two approaches. Here, the emphasis is on the slope during the period from "branch misprediction occurs" to "pipeline squashed". The steep slope of Fig. 15(a) indicates that pipeline utilization declines rapidly, whereas the gradual slope of Fig. 15(b) indicates that the processor does not stop fetching during the recovery process, even on a branch misprediction.
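The contrast between a steep and a gradual drain can be illustrated with a toy occupancy model. The sketch below is not the simulator used for the measurements; the function name, the per-cycle widths, and the recovery length are all hypothetical values chosen only to make the two slopes visible. It assumes that, during recovery, entries leave the window at a fixed drain rate, and that the only difference between the two policies is whether fetch keeps running.

```python
def simulate_occupancy(keep_fetching, rob_size=32, fetch_width=4,
                       retire_width=2, drain_width=6,
                       mispredict_cycle=12, recovery_cycles=4,
                       total_cycles=24):
    """Return the window occupancy per cycle for a toy pipeline.

    All parameters are illustrative assumptions, not measured values.
    keep_fetching=False models a conventional flush (fetch stalls while
    the window drains); keep_fetching=True models a frontend that keeps
    fetching during recovery.
    """
    occupancy = 0
    trace = []
    for cycle in range(total_cycles):
        recovering = (mispredict_cycle <= cycle
                      < mispredict_cycle + recovery_cycles)
        # Entries leave faster during recovery (retire plus squash).
        leaving = drain_width if recovering else retire_width
        occupancy -= min(leaving, occupancy)
        # New entries arrive unless fetch is stalled by the recovery.
        if keep_fetching or not recovering:
            occupancy += min(fetch_width, rob_size - occupancy)
        trace.append(occupancy)
    return trace


if __name__ == "__main__":
    print("stalled fetch:  ", simulate_occupancy(keep_fetching=False))
    print("continued fetch:", simulate_occupancy(keep_fetching=True))
```

With these assumed parameters, both traces are identical up to the misprediction; afterwards the stalled-fetch trace falls by the full drain width each cycle, while the continued-fetch trace falls only by the difference between drain and fetch widths.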

6. Concluding remarks

In this work, we prevent a processor from flushing the pipeline even under branch misprediction by allowing the instruction fetcher to work continuously. To this end, we propose a fast and low-cost branch recovery scheme using incremental register renaming and the bit-vector based rename map table (BVMT). The BVMT instantly reconstructs the map table corresponding to any branch and rolls back to the correct state, so that the frontend does not stall during the recovery process. Consequently, our approach enables the frontend to fetch instructions from the correct path immediately after a misprediction is detected.
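As a rough software analogy of reconstructing the map table at an arbitrary branch, the following sketch keeps one bit vector per logical register and hands out rename tags in increasing order. It is a hypothetical illustration, not the hardware design proposed here: the class name, the method names, the unbounded tag space, and the absence of tag freeing at retirement are all assumptions made to keep the example short.

```python
class BitVectorMapTable:
    """Toy rename table with one bit vector per logical register.

    Assumption: tags are allocated incrementally and never wrap, so a
    lower tag is always an older mapping. Real hardware would recycle tags.
    """

    def __init__(self, num_logical):
        self.vectors = [0] * num_logical  # bit t set: tag t maps this register
        self.next_tag = 0

    def rename(self, logical_reg):
        """Allocate the next tag for logical_reg and record it."""
        tag = self.next_tag
        self.next_tag += 1
        self.vectors[logical_reg] |= 1 << tag
        return tag

    def mark_branch(self):
        """Return a boundary: tags below it are older than the branch."""
        return self.next_tag

    def map_at(self, boundary):
        """Rebuild the logical-to-tag map as it stood at the boundary."""
        mask = (1 << boundary) - 1
        mapping = []
        for vector in self.vectors:
            older = vector & mask
            # The highest set bit below the boundary is the newest valid tag.
            mapping.append(older.bit_length() - 1 if older else None)
        return mapping

    def recover(self, boundary):
        """Discard every mapping made at or after the boundary."""
        mask = (1 << boundary) - 1
        self.vectors = [vector & mask for vector in self.vectors]
        self.next_tag = boundary


if __name__ == "__main__":
    table = BitVectorMapTable(num_logical=4)
    table.rename(1)
    table.rename(2)
    branch = table.mark_branch()
    table.rename(1)          # wrong-path rename
    table.rename(3)          # wrong-path rename
    print(table.map_at(branch))
    table.recover(branch)
    print(table.map_at(table.next_tag))
```

In this sketch no per-branch checkpoint is stored: the state at any branch is derived from the bit vectors and the boundary alone, which is the property the analogy is meant to show.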

Appendix

BVMT          Bit-vector based rename map table
CPR           Checkpoint processing and recovery method
RMT           Retirement map table
EMR           Eager misprediction recovery
IRR           Incremental register renaming
HB            History buffer method
CAM           Content addressable memory
SRAM          Static random access memory
iw:size (a)   Issue queue entry size
stt:size      Snoopy tag translator size
rn:size (b)   Renaming size
ialu:size     Integer ALU
imult         Integer multiplication
fpalu         Floating-point ALU
fpmult        Floating-point multiplication
bpredictor    Type of branch predictor
btb           Branch target buffer

(a) Parameters such as iw:size, stt:size, and rn:size are only available in the DLT architecture.
(b) In our experiments, we set the size of the rename buffers to 32.


References

Akkary H, Rajwar R, Srinivasan S. Checkpoint processing and recovery: towardsscalable large instruction window processors. In: IEEE/ACM InternationalSymposium on Microarchitecture (MICRO); 2003. p. 423–34.

Akkary H, Rajwar R, Srinivasan S. An analysis of a resource efficient checkpointarchitecture. ACM Transactions on Architecture and Code Optimization2004;1(4):418–44.

Cristal A, Santana O, Cazorla F, Galluzzi M, Ramirez T, Pericas M, et al. Kilo-instruction processors: overcoming the memory wall. IEEE Micro 2005;25(3):48–57.

Cheng C. The schemes and performances of dynamic branch prediction. TechnicalReport, 2000.

Choi M, Maeng S. An energy efficient instruction window for scalable processorarchitecture. IEICE Transactions on Electronics 2008;E91-C(9).

Gandhi A, Akkary H, Srinivasan T. Reducing branch misprediction penalty viaselective branch recovery. In: IEEE international conference of high perfor-mance computer architecture (HPCA); 2004. p. 254–64.

Hinton G, Sager D, Upton M, Boggs D, Carmean D, Kyker A, et al. The Micro-architecture of the Pentium 4 processor. Intel Technology Journal 2001;5(1).

Joshi A, Phansalkar A, Eeckhout L, Jone LK. Measuring benchmark similarity usinginherent program characteristics. IEEE Transactions on Computers 2006;55(6).

Kessler R. The alpha 21264 microprocessor. IEEE Micro 1999;19:24–36.Kirman N, Kirman M, Chaudhuri M, Martinez J. Checkpointed early load retire-

ment. In: IEEE international symposium on high-performance computerarchitecture (HPCA), 2005.

Kun C, Quan S, Mason A. A power-optimized 64-bit priority encoder utilizingparallel priority look-ahead. In: IEEE international symposium on circuits andsystems (ISCAS); 2004. p. 753–6.

Lee Y, Zomaya AY. On effective slack reclamation in task scheduling for energyreduction. Journal of Information Processing Systems 2009;4(4).

Martinez J, Renau J, Huang M, Prvulovic M, Torrelas J. Cherry: checkpointed earlyresource recycling in out-of-order microprocessors. In: IEEE/ACM interna-tional symposium on microarchitecture (MICRO); 2002.

Moshovos A. Checkpointing alternatives for high performance, power-awareprocessors. In: IEEE international symposium on low-power electronics anddesign (ISPLED), 2003.

Ranganathan P, Pai V, Adve S. Using speculative retirement and larger instructionwindows to narrow the performance gap between memory consistencymodels. In: ACM symposium on parallel algorithms and architectures (SPAA);1997. p. 199–210.

Sair S, Charney M. Memory behavior of the SPEC2000 benchmark suite. TechnicalReport, IBM; 2000.

Sima D. The design space of register renaming techniques. IEEE Micro 2000;20(5):70–83.

Simplescalar /http://www.simplescalar.com/S.Smith J, Pleszkun A. Implementation of precise interrupts in pipelined processors.

In: IEEE international symposium on computer architecture (ISCA); 1985.Sorin D, Martin M, Hill M, Wood D. Fast checkpoint/recovery to support kilo-

instruction speculation and hardware fault tolerance. University of WisconsinMadison, Computer Science Deparment, Technical Report, CS-TR-2000-1420;2000.

Synopsys Design Compiler /http://www.synopsys.com/S; 2010.The Standard Performance Evaluation Corporation /http://www.spec.org/S.Yeager K. The MIPS R10000 superscalar microprocessor. IEEE Micro 1996;16(2):

28–40.Zhou P, Onder S, Carr S. Fast branch misprediction recovery in out-of-order

superscalar processor. In: ACM international conference on supercomputing(ICS), 2005.
