Tuning the Continual Flow Pipeline Architecture

SoC CAD


徐子傑 (Hsu, Zi Jei)

Department of Electrical Engineering, National Cheng Kung University

Tainan, Taiwan, R.O.C.

NCKU

SoC & ASIC Lab 2 Hsu, Zi Jei

INTRODUCTION (1/5)

To improve superscalar processor performance on difficult-to-parallelize applications, architects have been increasing the capacity of reorder buffers, reservation stations (RS), physical register files, and load and store queues [22] with every new out-of-order processor core.

For two decades, increasing instruction buffer sizes has provided good performance improvement. However, this approach does not work anymore.

A different design strategy is to size the instruction buffers to the minimum capacity necessary to handle the common case of L1 data cache hit and to use new scalable out-of-order execution algorithms to handle code that misses the L1 data cache.




The Continual Flow Pipeline (CFP) architecture [23] was proposed as an energy-efficient large instruction window architecture that reduces the impact of data cache misses on performance without increasing instruction buffer and physical register file sizes.

CFP handles data cache misses as follows. When a load misses the data cache, a poison bit is set in the destination register of the load. Load dependent instructions in the reservation stations (RS) are then woken up, as if the load had completed.




Poison bits propagate through instruction dependences, and identify all instructions that depend on the load miss and their descendants.

The miss load and its dependents, identified by the poison bits in the ROB, pseudo-commit in program order and move from the ROB into a waiting buffer (WB) outside the pipeline.

Since dependent instructions do not tie up pipeline resources, the core can execute far ahead into the program without stalling on the cache miss.

When the miss data is fetched, the dependent instructions wake up and replay from the WB into the pipeline to complete their execution.

When the WB is emptied and all miss dependent instructions complete, independent and dependent instruction results are merged using a flash copy operation in the retirement register file. Execution then resumes normally.
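The poison-bit mechanism described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Instr` record, the register names, and the `mark_poisoned` helper are hypothetical, and the model walks instructions in program order rather than modeling the RS wakeup hardware.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Instr:
    name: str
    dest: Optional[str]          # destination register, or None
    srcs: Tuple[str, ...]        # source registers
    is_miss_load: bool = False   # True if this load misses the data cache


def mark_poisoned(program: List[Instr]) -> List[str]:
    """Return the names of instructions that end up poisoned, in program order."""
    poisoned_regs: Set[str] = set()
    poisoned: List[str] = []
    for ins in program:
        depends_on_miss = any(src in poisoned_regs for src in ins.srcs)
        if ins.is_miss_load or depends_on_miss:
            poisoned.append(ins.name)
            if ins.dest is not None:
                poisoned_regs.add(ins.dest)      # poison propagates to the result
        elif ins.dest is not None:
            poisoned_regs.discard(ins.dest)      # a real value overwrites the poison
    return poisoned


program = [
    Instr("ld_A", "r1", ("r9",), is_miss_load=True),
    Instr("add_B", "r2", ("r1", "r3")),   # depends on the miss load
    Instr("mul_C", "r4", ("r5", "r6")),   # miss independent
    Instr("sub_D", "r7", ("r2", "r4")),   # depends on add_B
    Instr("mov_E", "r1", ("r5",)),        # overwrites r1 with a real value
    Instr("or_F", "r8", ("r1", "r4")),    # reads the new r1, so not poisoned
]
poisoned_names = mark_poisoned(program)
print(poisoned_names)
```

In this model only the miss load and its transitive dependents are marked, which mirrors how the poisoned instructions are the ones that pseudo-commit into the WB while the rest keep executing.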




In that work, the miss independent and dependent instructions execute at different times, based on the timing of the load miss event and the data arrival event.

Switching between the two executions is costly because it involves a pipeline flush, making this proposal unsuitable for L1 misses that hit the on-chip cache.

In a more recent work, simultaneous CFP (S-CFP) [15] executes the independent and dependent instructions simultaneously to avoid the costly pipeline flush, thus making S-CFP more suitable for first-level data cache misses.




In this paper, we use a novel virtual register renaming substrate [21] and fine-tune the replay policies to mitigate excessive replays and rollbacks to the checkpoint.

In previous CFP architectures, the miss load releases its renamed destination register when it pseudo-commits.

This breaks the dependence links between the miss load and its dependents, requiring the dependent instructions to be renamed again to re-establish the dependence relation.

In this work, we use virtual register names, which persist for the full lifetime of the instructions, to specify the dependences between instructions.

This work also introduces an improved CFP policy that keeps miss dependent instructions in the reservation stations as long as they do not block the pipeline.

However, when the instruction buffers become full with miss dependent instructions, thus stalling the pipeline, CFP moves the miss dependent instructions into the waiting buffer.



Continual Flow Pipeline Architecture (1/3)

S-CFP Architecture Overview: Figure 1 shows a block diagram of the S-CFP microarchitecture.

Unlike previous latency tolerant out-of-order architectures, the S-CFP core executes cache miss dependent and independent instructions concurrently using two different hardware thread contexts.

S-CFP also has two retirement register file contexts (RRF), one for retiring miss independent instruction results and the other for retiring miss dependent instruction results.

The independent hardware thread is the main execution thread. It is responsible for instruction fetch and decode of all instructions, branch prediction, memory dependence prediction, identifying miss dependent instructions, and moving them into the waiting buffer (WB).



Continual Flow Pipeline Architecture (2/3)

At the end of dependent execution, when all the instructions from the WB have retired (i.e., committed) without any mispredictions or exceptions, the independent and dependent execution results are merged with a flash copy of the dependent and independent register contexts within the retirement register file.

The dependent thread execution starts when the load miss data is brought into the cache, waking up the load instruction in the WB, and continues until the WB empties.

To maintain proper memory ordering of loads and stores from independent and dependent thread execution, S-CFP uses load and store queues (LSQ), a Store Redo Log (SRL) [10], and a store-set memory dependence predictor [5].



Continual Flow Pipeline Architecture (3/3)

Figure 1. Simultaneous CFP architecture block diagram

WB: waiting buffer; RAT: register alias table; RRF: retirement register file contexts; RS: reservation station; SRL: store redo log; LSQ: load and store queues


S-CFP and Tuned CFP Execution Examples (1/10)

Figure 2. Execution sequence showing S-CFP moving a dependent into WB eagerly

X: load miss serviced from DRAM; A: load miss that hits in the L2 cache


S-CFP Execution Examples: In Figure 2(a), the WB has a load miss X at the head waiting for its wakeup. Instruction A misses the first-level cache and is marked as a potential candidate to be moved into the WB. When A reaches the head of the ROB, there are still free entries available in the ROB, yet S-CFP moves A into the WB eagerly.

Figure 2(b) shows that the load miss hits the L2 data cache and A is woken up from the L1 data cache shortly after it enters the WB.

However, A is stuck in the WB behind instruction X, which has missed to DRAM, and it remains there for a long time, until the miss data of load X is fetched from DRAM.



Tuned CFP Execution Examples: Figures 2(c) and 2(d) show the ROB and WB states in the Tuned CFP architecture. Instruction A misses the first-level cache and is marked as poisoned. However, unlike in S-CFP, it does not release its RS entry until it becomes a blocking instruction, in case the load hits the L2 cache and the miss data reaches the CFP core shortly.

A is kept in the RS and ROB by stalling pseudo-retirement as long as there are free entries in the ROB and other instruction buffers for the pipeline to continue executing other instructions without blocking.

If the miss data arrives before the pipeline blocks, A is woken up in the RS and ROB by clearing its poison bits, as shown in Figure 2(d).

In this example, A and its dependents do not need to go through the replay loop at all, saving significant time and energy.




Figure 3. Execution sequence showing a scenario leading to rollback in S-CFP which is avoided in Tuned CFP

X: load miss serviced from DRAM; A: load miss that hits in the L2 cache; F: mispredicted branch that depends on A


S-CFP Execution Examples: Figures 3(a)-3(c) show an execution sequence that illustrates a situation in S-CFP that leads to a rollback to the checkpoint. Instruction A misses the L1 cache. In this example, F is a branch that depends on A. A is moved into the WB from the head of the ROB, as shown in Figure 3(a).

F also follows A into the waiting buffer, even though the wakeup for A arrives while F is still in the ROB/RS, as shown in Figure 3(b).

Both A and F are replayed behind instruction X, as shown in Figure 3(c).

On replay, branch F is found to be mispredicted, and branch misprediction recovery has to be performed by rolling back execution to the checkpoint, since by then the sequential state in the register file has been corrupted by the out-of-order pseudo-retirement of instructions during cache miss processing.



Tuned CFP Execution Examples: Figures 3(d) and 3(e) show how the rollback situation is avoided in the Tuned CFP architecture. As in the previous example, instruction A stays in the ROB, even if it reaches the head, as long as it is not blocking execution.

A gets its wakeup before it moves into the WB, as shown in Figure 3(e). Even though F is a miss dependent and mispredicted branch, it executes before it pseudo-retires.

When F reaches the head of the ROB, the ROB flushes the pipeline to clear all the wrong-path instructions that were fetched after the branch, and signals the fetch unit to restart fetch and execution from the corrected target.

The costly S-CFP branch recovery from the checkpoint is thereby avoided.




Figure 4. Execution sequence showing full replay in S-CFP and partial replay in Tuned CFP

A: load miss; B: dependent on A; x: don’t care


S-CFP Execution Examples: Figures 4(a)-4(d) show another execution sequence that illustrates why S-CFP needs to replay a load and all its dependents once the load enters the WB.

In this example, A is a load miss and B is dependent on A. The two instructions are separated by miss independent instructions, shown as dotted lines.

In Figure 4(a), A reaches the head of the ROB. It pseudo-retires and moves into the WB, releasing all its pipeline resources, including its ROB ID #3, as shown in Figure 4(b).

When A wakes up and replays, it is allocated a new entry at the tail of the ROB, as shown in Figure 4(c).

Notice that A gets a new ROB ID #24 when it is reintroduced into the pipeline.



S-CFP Execution Examples (continued): Because of this new ID, even though B is still in the RS and the ROB while A is being replayed, A’s data writeback cannot wake up B, because B still has the physical register destination ID #3 as its source operand. B reaches the ROB head, pseudo-retires, and moves into the WB.

When B is replayed and reintroduced into the pipeline, it goes through the rename stage, gets a new ROB ID #28, and receives the correct physical source register ID #24 from the dependent RAT, re-establishing its link with A, as shown in Figure 4(d).



Tuned CFP Execution Examples: Figures 4(e)-4(g) illustrate a partial replay in the Tuned CFP architecture for the same scenario introduced in Figure 4(a).

In Tuned CFP, virtual register IDs that are not associated with any physical locations are used for register renaming and in the RS wakeup and scheduling logic.

The virtual register IDs of instructions A and B are shown under their ROB entries, in addition to the renamed source and destination virtual register IDs.

As before, when A reaches the head of the ROB, it pseudo-retires and moves into the WB, as shown in Figure 4(e).

However, unlike in S-CFP, A releases only its RS entry and carries its virtual register ID #3 with it into the WB, as shown in Figure 4(f).

Later, when A wakes up and replays, it still carries its original virtual register ID #3, keeping its link with its dependent instruction B intact. This allows B to be woken up and scheduled by the RS without having to be replayed and renamed again, as shown in Figure 4(g).
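The contrast between the two Figure 4 scenarios comes down to whether the tag that A broadcasts after replay still matches the tag B is waiting on. The sketch below models that with a plain tag comparison; the function names are hypothetical, and treating wakeup as a single equality check is a simplifying assumption, not the actual RS logic.

```python
def broadcast_tag_after_replay(original_tag, reallocated_tag=None):
    """Tag the producer drives on the writeback bus after it replays.

    A reallocated_tag models S-CFP, where the replayed load gets a new ROB ID.
    None models Tuned CFP, where the virtual register ID is kept.
    """
    return original_tag if reallocated_tag is None else reallocated_tag


def consumer_wakes_up(source_tag, broadcast_tag):
    """RS wakeup modeled as a plain tag match (an assumption of this sketch)."""
    return source_tag == broadcast_tag


# B was renamed while A still had tag #3, so B waits on tag #3 in both cases.
b_source_tag = 3

# S-CFP: A replays with the new ROB ID #24, so B misses the wakeup.
scfp_wakeup = consumer_wakes_up(
    b_source_tag, broadcast_tag_after_replay(3, reallocated_tag=24)
)

# Tuned CFP: A replays with its original virtual register ID #3.
tuned_wakeup = consumer_wakes_up(b_source_tag, broadcast_tag_after_replay(3))

print(scfp_wakeup, tuned_wakeup)  # prints: False True
```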



Tuned CFP Architecture Overview (1/9)

Figure 5 shows a block diagram of the Tuned CFP core. Tuned CFP uses Tomasulo’s algorithm [24] and reservation stations to perform data-driven, out-of-order execution.

Like conventional out-of-order superscalar architectures, Tuned CFP uses a reorder buffer to commit instructions and update architectural register and memory state in program order.

However, Tuned CFP does not use the reorder buffer for register renaming. Instead, it performs register renaming using virtual register IDs generated by a special counter.

These virtual register IDs are not mapped to any fixed storage locations in the core, and therefore can be large in number and remain allocated to instructions throughout their lifetime, including miss dependent instructions evicted to the waiting buffer.
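A counter-based renamer of this kind might be sketched as follows. The `VirtualRenamer` class, its `rat` dictionary, and the choice to give every instruction a fresh ID are assumptions made for illustration; the slides do not specify the counter width or the RAT organization.

```python
import itertools


class VirtualRenamer:
    """Toy renamer that hands out virtual register IDs from a counter."""

    def __init__(self):
        self._next_vid = itertools.count(1)   # stands in for the "special counter"
        self.rat = {}                          # architectural register -> producer VID

    def rename(self, dest, srcs):
        """Return (dest_vid, src_vids) for one instruction.

        A source VID of None means no in-flight producer is recorded, so the
        value would come from the architectural register state.
        """
        src_vids = tuple(self.rat.get(src) for src in srcs)
        dest_vid = next(self._next_vid)        # never tied to a storage location
        if dest is not None:
            self.rat[dest] = dest_vid
        return dest_vid, src_vids


renamer = VirtualRenamer()
load_a = renamer.rename("r1", ("r9",))        # miss load A
dep_b = renamer.rename("r2", ("r1", "r3"))    # B reads A's result
redef = renamer.rename("r1", ("r2",))         # later writer of r1 gets a new VID
reader = renamer.rename("r4", ("r1",))        # reads the later writer, not A
print(load_a, dep_b, redef, reader)
```

Because the IDs come from a counter rather than from ROB or physical register slots, an instruction can keep its ID after it leaves the pipeline, which is the property the waiting buffer relies on.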



Virtual register renaming gives Tuned CFP a significant advantage over previous CFP architectures.

Past CFP architectures require all miss dependent instructions to be replayed and renamed again to re-establish dependence links, which is necessary for the reservation stations to re-dispatch the miss dependent instructions in correct data flow order.

In contrast, since the virtual register IDs are permanent from the time the miss dependent instructions are renamed until they execute and commit, Tuned CFP can do partial replay of dependent instructions.

This means that if the load miss data is fetched from memory after the load is moved to the waiting buffer but before its dependents have been moved, Tuned CFP replays only the load. This saves the significant execution time that would be spent if all the miss dependent instructions in the reservation stations had to be replayed through the waiting buffer to be renamed again.



Figure 5. Tuned CFP block diagram

WB: waiting buffer; RAT: register alias table; RRF: retirement register file contexts; RS: reservation station; SRL: store redo log; LSQ: load and store queues; VID: virtual register ID



Miss Independent Execution: When an L1 data cache load miss occurs, a poison bit is set in the destination reorder buffer entry of the load. Load dependent instructions in the reservation stations (RS) capture the poison bit from the common writeback data bus. They are scheduled by the reservation station control logic for pseudo-execution.

Pseudo-execution of poisoned instructions does not actually use any execution units. However, pseudo-execution consumes RS dispatch ports and writeback bus cycles to propagate poison bits through instruction dependences and to identify all instructions in the reservation stations that depend on the load miss data.

After pseudo-execution, miss dependent instructions stay in their reservation stations until they are woken up for real execution when the load miss data arrives, or until they are moved into the waiting buffer because their resources are needed to unblock the execution pipeline and execute miss independent instructions.



Replay Loop and Miss Dependent Execution: Figure 5 shows the reduced replay loop in Tuned CFP, consisting of two stages: the reservation stations (RS) and the waiting buffer (WB).

The waiting buffer acts as second-level storage for the reservation stations. With virtual register renaming, entries can be freely evicted from the RS to the WB and later loaded back into the RS to be scheduled for execution.

Evicting miss dependent instructions to the WB only when their resources are needed significantly reduces the number of replayed instructions, especially for medium-latency load misses, which are those that miss the L1 data cache but hit the on-chip L2 cache.

For a load miss to DRAM, the long miss latency often causes the instruction buffers to fill up.

When the load miss is serviced, the miss load and its dependents are re-inserted from the waiting buffer into the reservation stations, from which they are scheduled for execution.



Tuned CFP Reservation Stations: The poison bits propagate the dependences from L1 data cache misses to later instructions in the program to identify instructions that may encounter long data cache miss delays.

These instructions are candidates to move to the waiting buffer to avoid pipeline stalls that could occur if any of the reservation station, reorder buffer, load queue, or store queue arrays becomes full.

Four conditions are checked to determine whether an instruction should be moved to the waiting buffer:

1) the instruction is at the head of the RS list,
2) the instruction is poisoned,
3) one of the RS, reorder buffer, load queue, or store queue arrays is full, and
4) every source operand of the instruction is either poisoned or ready.

The last condition ensures that miss dependent instructions carry their non-poisoned input values with them.
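The four conditions above can be written as a single predicate. The sketch below is illustrative only: `RSEntry`, `BufferStatus`, and the three operand state strings are hypothetical names chosen for this example, not structures taken from the Tuned CFP design.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RSEntry:
    poisoned: bool
    # Each source operand is "ready", "poisoned", or "pending" (still in flight).
    source_states: Tuple[str, ...]


@dataclass
class BufferStatus:
    rs_full: bool = False
    rob_full: bool = False
    load_queue_full: bool = False
    store_queue_full: bool = False

    def any_full(self) -> bool:
        return (self.rs_full or self.rob_full
                or self.load_queue_full or self.store_queue_full)


def should_move_to_wb(entry: RSEntry, at_rs_head: bool, buffers: BufferStatus) -> bool:
    """Apply conditions 1) to 4) for evicting an RS entry to the waiting buffer."""
    sources_settled = all(s in ("ready", "poisoned") for s in entry.source_states)
    return (at_rs_head               # 1) at the head of the RS list
            and entry.poisoned       # 2) the instruction is poisoned
            and buffers.any_full()   # 3) some instruction buffer is full
            and sources_settled)     # 4) every source is poisoned or ready


candidate = RSEntry(poisoned=True, source_states=("poisoned", "ready"))
print(should_move_to_wb(candidate, True, BufferStatus(rob_full=True)))   # True
print(should_move_to_wb(candidate, True, BufferStatus()))                # False
```

The second call returns False because no buffer is full, which corresponds to the tuned policy of leaving poisoned instructions in the RS while the pipeline can still make progress.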



Waiting Buffer: The waiting buffer is a wide, single-ported SRAM array managed as a circular buffer using head and tail pointers. Miss dependent RS entries at the head of the RS array move to the tail of the waiting buffer when any of the instruction buffers fills up due to data cache misses.

When a data cache miss completes, Tuned CFP replays the miss dependent entries by loading them back from the head of the waiting buffer to the tail of the RS.

These replayed instructions do not need to be renamed again. Their virtual register names are still valid, so the RS can use them to schedule these instructions and to capture their results from the writeback bus into the reservation stations of any dependent instructions, including instructions that have not been replayed but are still waiting in the RS.
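A head/tail circular buffer like the one described can be modeled in a few lines. This is a behavioral sketch, not a description of the SRAM design: the `WaitingBuffer` class, its capacity, and the `(name, virtual_id)` tuples used as entries are assumptions for illustration.

```python
class WaitingBuffer:
    """Fixed-capacity FIFO managed with head and tail pointers."""

    def __init__(self, capacity):
        self._slots = [None] * capacity
        self._head = 0     # next entry to replay
        self._tail = 0     # next free slot for an evicted RS entry
        self._count = 0

    def push(self, entry):
        """Evict an entry from the RS head into the WB tail."""
        if self._count == len(self._slots):
            raise OverflowError("waiting buffer is full")
        self._slots[self._tail] = entry
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1

    def pop(self):
        """Replay the entry at the WB head back toward the RS."""
        if self._count == 0:
            raise IndexError("waiting buffer is empty")
        entry = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return entry

    def __len__(self):
        return self._count


wb = WaitingBuffer(capacity=3)
wb.push(("A", 3))          # entries keep their virtual register ID
wb.push(("B", 7))
first = wb.pop()           # A replays first, still tagged with ID 3
wb.push(("C", 9))
wb.push(("D", 12))         # tail pointer wraps around to slot 0
drained = [wb.pop() for _ in range(len(wb))]
print(first, drained)
```

Entries leave in the order they arrived, and each one comes back with the same virtual ID it went in with, so no rename step is needed on the way back to the RS.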



Register File and Results Integration: Tuned CFP has a specialized register file for checkpointing register state at the load miss, for later use in handling miss dependent branch mispredictions and exceptions.

Figure 6 shows the Tuned CFP retirement register file cell with checkpoint flash copy support.

Tuned CFP uses a flash copy of the RRF to create checkpoints. In one cycle, every independent RRF state bit (leftmost latch) is shifted into a checkpoint latch within the register cell (center latch).

The register file can be restored from the checkpoint in one cycle by asserting RSTR_CLK.

The Tuned CFP register file cell also contains one context bit for the dependent RRF state (rightmost latch).

To integrate results back into one context, a restore cycle is performed from the dependent context into the independent context. However, not all registers are copied.

Figure 6 shows that only poisoned registers are copied, by using the poison bits to enable the clock of the copy operation.
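The checkpoint, restore, and poison-gated integration steps can be mimicked at register granularity in software. The model below is a sketch under assumptions: the `RetirementRegisterFile` class and its method names are hypothetical, whole registers stand in for the per-bit latches, and clearing poison bits after integration is an assumption the slides do not state.

```python
class RetirementRegisterFile:
    """Register-level model of the three contexts in each RRF cell."""

    def __init__(self, num_regs):
        self.independent = [0] * num_regs   # leftmost latch
        self.checkpoint = [0] * num_regs    # center latch
        self.dependent = [0] * num_regs     # rightmost latch
        self.poison = [False] * num_regs

    def take_checkpoint(self):
        """Flash copy: independent context -> checkpoint, all registers at once."""
        self.checkpoint = list(self.independent)

    def restore(self):
        """Rollback: checkpoint -> independent context (poison handling not modeled)."""
        self.independent = list(self.checkpoint)

    def integrate(self):
        """Copy dependent results into the independent context, poisoned registers only."""
        for reg, is_poisoned in enumerate(self.poison):
            if is_poisoned:                  # poison bit gates the copy
                self.independent[reg] = self.dependent[reg]
                self.poison[reg] = False     # assumption: poison is cleared here


rrf = RetirementRegisterFile(4)
rrf.independent = [10, 20, 30, 40]
rrf.take_checkpoint()
rrf.independent[1] = 21      # a miss independent result retires
rrf.poison[2] = True         # register 2 awaits a miss dependent result
rrf.dependent[2] = 99        # dependent thread retires its value
rrf.dependent[3] = 77        # not poisoned, so it must not be copied
rrf.integrate()
merged = list(rrf.independent)
rrf.restore()
rolled_back = list(rrf.independent)
print(merged, rolled_back)
```

The merged state keeps the independent result in register 1 and picks up only the poisoned register 2 from the dependent context, while a rollback discards both.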



Figure 6. RRF cell checkpoint and result integration



Load and Store Execution in Tuned CFP: To maintain proper memory ordering of loads and stores from independent and dependent instruction execution, Tuned CFP, like previous CFP proposals, uses load and store queues (LSQ), a Store Redo Log (SRL) [10], and a store-set memory dependence predictor [5].

All stores, dependent and independent, are allocated entries (and IDs) in the SRL in program order at the rename stage of the pipeline. Every load, dependent or independent, carries the SRL ID of the last prior store.
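The SRL ID assignment just described can be sketched as a single pass over the memory operations in program order. The `assign_srl_ids` helper and the use of `None` for a load with no earlier store are assumptions for illustration; the slides do not say how that case is encoded.

```python
def assign_srl_ids(ops):
    """Tag each memory op, given in program order, with an SRL ID.

    A store receives the next SRL entry. A load carries the ID of the last
    prior store, or None if no store precedes it (an assumption of this sketch).
    """
    next_id = 0
    last_store_id = None
    tagged = []
    for op in ops:
        if op == "store":
            last_store_id = next_id
            next_id += 1
            tagged.append((op, last_store_id))
        elif op == "load":
            tagged.append((op, last_store_id))
        else:
            raise ValueError(f"unknown memory operation: {op!r}")
    return tagged


tagged_ops = assign_srl_ids(["load", "store", "load", "store", "store", "load"])
print(tagged_ops)
```

With these tags, any load can be positioned relative to every store in the log by comparing IDs, regardless of which thread executes it.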

To support concurrent, speculative execution of dependent and independent loads and stores, the Tuned CFP L1 data cache has two new states: Speculative Independent (Spec_Ind) and Speculative Dependent (Spec_Dep).

A block that is not in one of these two states is considered committed and is in one of the states defined by the cache coherence protocol, e.g., MESI.


SIMULATION RESULTS (1/7)

We built our Tuned CFP architecture model on the SimpleScalar ARM ISA simulation infrastructure (www.simplescalar.com) and used all 14 “C” benchmarks from SPEC 2000 and SPEC 2006 that we succeeded in compiling with the SimpleScalar cross-compiler.

Table 1 shows the simulated machine configurations.

Table 2 and Table 3 show various relevant execution statistics of S-CFP and Tuned CFP.

Figure 7 shows the speedup of Tuned CFP over S-CFP and over a similar sized conventional superscalar core.



Table 1. Simulated machine configurations



Figure 7. Tuned CFP speedup over baseline and S-CFP



Table 2. Tuned CFP execution statistics



Table 3. S-CFP and Tuned CFP replay, rollback and wrong path statistics



Figure 8. Power increase of S-CFP and Tuned CFP over baseline



Table 4. S-CFP and Tuned CFP power consumption relative to baseline averaged over all benchmarks


CONCLUSION

This paper presents a Tuned Continual Flow Pipeline architecture that uses virtual register renaming and optimized replay policies to improve performance and reduce replay loop circuit activity and checkpoint rollback execution compared to previous CFP designs.

Our Tuned CFP architecture improves performance and power consumption over previous CFP architectures by ~15% and ~9%, respectively.
