
EEAL: Processors’ Performance Enhancement Through Early Execution of Aliased Loads

Abhishek Rajgadia, Newton and Virendra Singh

Computer Architecture and Dependable Systems Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay

Mumbai, India

Presenter: Ankit Jindal

27th ACM Great Lakes Symposium on VLSI (GLSVLSI)

10th May, 2017

GLSVLSI 2017 (Banff, Canada) Early Execution of Aliased Loads 10th May, 2017 1 / 27

Introduction

Improving single thread performance has been a challenge

Frequency scaling, which was responsible for exponential improvements in single thread performance, has stopped.

Saturation in extraction of Instruction Level Parallelism (ILP).

Efficient execution of memory instructions can be an option to increase single thread performance.

Memory instructions comprise 20-30% of total instructions. [1]

Load instructions, being at the top of the critical path, should be executed as early as possible.

Goal: To execute load instructions as early as possible.

An Out-of-Order processor uses a Memory Dependence Predictor (MDP) to execute loads as early as possible.

[1] John Paul Shen and Mikko H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013.


Motivation

I1: LDR R1, [R2]      ; Miss (assumption)
I2: ADD R3, R4, R1    ; Stalled, dependent on I1
I3: STR R3, [R4]      ; Stalled, R3 is not ready
I4: LDR R5, [R4]      ; Stalled, memory-aliases with I3

For early execution of loads, an OoO CPU with an MDP will issue I4 before I3.

However, the LDR in I4 memory-aliases with the STR in I3.

There is a performance penalty due to this misprediction by the MDP.

Register R4, used for address computation, was ready, yet the MDP still made a misprediction.

In a conventional OoO processor, I3 and I4 wait in the RS for their source operands to be ready before being issued to the Address Computation Stage (ACS).

Can we issue I3 and I4 early for their address computation?
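
The question above can be made concrete with a minimal sketch, assuming a toy instruction model that is not part of the original slides: `Instr`, `conventional_ready`, and `address_ready` are hypothetical names. It contrasts the conventional RS check (all source operands ready) with a check that looks only at the registers used for address computation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    addr_regs: frozenset   # registers needed to compute the memory address
    data_regs: frozenset   # other source registers (e.g. store data)

def conventional_ready(instr, ready):
    """Conventional RS: every source operand must be ready."""
    return (instr.addr_regs | instr.data_regs) <= ready

def address_ready(instr, ready):
    """Address-only check: just the address registers must be ready."""
    return instr.addr_regs <= ready

# I1 is assumed to miss, so R1 (and therefore R3) is not ready yet.
ready = frozenset({"R2", "R4"})
i3 = Instr("I3: STR R3, [R4]", frozenset({"R4"}), frozenset({"R3"}))
i4 = Instr("I4: LDR R5, [R4]", frozenset({"R4"}), frozenset())

for instr in (i3, i4):
    print(instr.name,
          "| conventional:", conventional_ready(instr, ready),
          "| address-only:", address_ready(instr, ready))
```

In this toy model, I3 fails the conventional check only because its store data (R3) is pending, while its address register (R4) is already available.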


Execution of Memory Instructions

Execution of a memory instruction has 3 stages:

1. Address Computation (Addr)
2. Address Translation (TLB)
3. Memory Access (Mem)

Memory instructions wait in RS to respect true data dependencies.




Proposed Modification in Memory Pipeline

[Figure, built in three steps: the conventional memory pipeline (RS, Addr, TLB, Mem) and the proposed pipeline, in which the RS is split into a 1st RS and a 2nd RS.]


Early Execution of Aliased Loads

If the registers required for address computation are ready, memory instructions are issued from the 1st RS to the Address Computation Stage.

I3: STR R3, [R4]

I4: LDR R5, [R4]

In our modified architecture:

I3 can be issued if R4 is ready, even if R3 is not ready.

I4 can be issued if R4 is ready.

Such instructions, with their addresses calculated, wait in the 2nd RS.

Aliasing store/load pairs present in the 2nd RS offer the following advantages:

Data forwarding from an aliasing in-flight store to the corresponding load.

Forwarded loads can bypass the remaining 2 execution stages, i.e. Address Translation and Memory Access.

Instructions dependent on forwarded loads become ready early.

Replay loads (mis-speculated aliased loads/stores) take 1 cycle less for re-execution, because mis-speculated memory instructions are re-issued from the 2nd RS rather than from the 1st RS as in a conventional OoO processor.
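
A minimal sketch of the bypass idea above, assuming a simplified software model of the 2nd RS. The `Entry` record, the `remaining_stages` function, and the example addresses are hypothetical and only for illustration; the sketch also ignores when the store data actually becomes available. A load whose computed address matches an older in-flight store is treated as forwarded and skips Address Translation and Memory Access; any other load still needs both stages.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    seq: int    # program order (smaller = older)
    kind: str   # "store" or "load"
    addr: int   # address already computed in the Addr stage

def remaining_stages(load, second_rs):
    """Stages a load waiting in the 2nd RS still has to pass through."""
    aliased = any(e.kind == "store" and e.seq < load.seq and e.addr == load.addr
                  for e in second_rs)
    # A forwarded load bypasses both TLB and Mem.
    return [] if aliased else ["TLB", "Mem"]

second_rs = [
    Entry(seq=3, kind="store", addr=0x1000),  # I3: STR R3, [R4]
    Entry(seq=4, kind="load", addr=0x1000),   # I4: LDR R5, [R4]
    Entry(seq=5, kind="load", addr=0x2000),   # unrelated load (assumed)
]

for e in second_rs:
    if e.kind == "load":
        print(f"load seq={e.seq}: remaining stages = {remaining_stages(e, second_rs)}")
```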









Effect of Decreasing Execution Latency

[Figure: speedup relative to the 4-cycle configuration for execution latencies of 3, 4, and 5 cycles, across bzip2, milc, hmmer, lbm, libquantum, mcf, namd, sjeng, soplex, zeusmp, and gmean.]

Maximum gain of 11.53%, maximum loss of 7.57%.

Thus, eliminating one stage (which the 2-level RS architecture makes possible) can have a substantial impact on performance.


Proposed Architecture with 2-Level RS

Forwarding data from aliased store to load is done in 2nd RS itself.

Bypassing of remaining stages for such forwarded loads.


Early Store To Load Forwarding

I3: STR R3, [R4]

I4: LDR R5, [R4]





Potential Cases For Early Forwarding

Percentage of 100 million instructions that could have bypassed the Address Translation and Memory Access stages.

Benchmark   % Cases
bzip2       2.2
mcf         1.33
lbm         0.79
milc        0.53
cactus      0.2
soplex      0.05
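
For a sense of scale, the percentages above can be turned into approximate instruction counts. The short calculation below assumes only the 100 million instruction window stated on the slide; the variable names are illustrative.

```python
WINDOW = 100_000_000  # instructions in the window, as stated above

percent_cases = {
    "bzip2": 2.2, "mcf": 1.33, "lbm": 0.79,
    "milc": 0.53, "cactus": 0.2, "soplex": 0.05,
}

# Convert each percentage into an approximate instruction count.
counts = {name: round(WINDOW * pct / 100) for name, pct in percent_cases.items()}

for name, n in counts.items():
    print(f"{name:8s} {n:>10,d}")
```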


Evaluation Configuration

Simulation done using gem5 simulator for 64-bit ARM ISA

18 SPEC CPU2006 benchmarks were used in the study.

1B instructions were fast-forwarded and 100M were executed in detailed mode.

We compared Early-Fwd EEAL architecture with 3 baselines.

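Parameters        Baselines                                   Early-Fwd
                  Aggressive    Less Aggressive
                  6issue-4exe   4issue-4exe   4issue-2exe     4issue-2exe
Fetch Width       4 (all configurations)
Issue Width       6             4             4               4
Commit Width      4 (all configurations)
Mem Exe Units     4             4             2               2
RS                128 (all configurations)
Reorder Buffer    192 (all configurations)
LSQs              32 (all configurations)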
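
The same configurations can be written down as data, which makes the differences between them easy to check. This is only a sketch: the `CoreConfig` class and its field names are hypothetical and do not correspond to gem5 parameter names.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CoreConfig:
    name: str
    issue_width: int
    mem_exe_units: int
    # Values shared by all four configurations in the table above.
    fetch_width: int = 4
    commit_width: int = 4
    rs_entries: int = 128
    rob_entries: int = 192
    lsq_entries: int = 32

configs = [
    CoreConfig("Aggressive 6issue-4exe", issue_width=6, mem_exe_units=4),
    CoreConfig("Less Aggressive 4issue-4exe", issue_width=4, mem_exe_units=4),
    CoreConfig("Less Aggressive 4issue-2exe", issue_width=4, mem_exe_units=2),
    CoreConfig("Early-Fwd 4issue-2exe", issue_width=4, mem_exe_units=2),
]

early_fwd = configs[-1]
for base in configs[:-1]:
    same = (base.issue_width, base.mem_exe_units) == (
        early_fwd.issue_width, early_fwd.mem_exe_units)
    print(f"{base.name}: same widths as Early-Fwd = {same}")
```

Written this way, it is visible that Early-Fwd shares its issue width and memory execution unit count with the smallest baseline (4issue-2exe).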


Results

[Figure: normalized speedup of Early-Fwd EEAL over three baselines (Less Aggressive: 4issue-2exe, Less Aggressive: 4issue-4exe, Aggressive: 6issue-4exe) across bzip2, mcf, milc, h264ref, libquantum, astar, omnetpp, xalancbmk, sjeng, bwaves, zeusmp, cactusADM, calculix, GemsFDTD, lbm, namd, soplex, tonto, and gmean.]

Early-Fwd EEAL architecture outperforms all the 3 baselines.

EEAL outperforms the Aggressive 6issue-4exe and Less Aggressive 4issue-4exe architectures by 1.6% and 2.2% on average, respectively.




Memory Order Violation

I3: STR R3, [R4]

I4: STR R4, [R4]

I5: LDR R5, [R4]

Instructions I3, I4 and I5 memory-alias with each other.

Initially, all three instructions wait in the 2nd RS.

When source operand R3 becomes ready, I3 executes and also forwards its data to the waiting load I5.

However, this is a memory order violation: I5 should take its data from I4, because I4 is the last store that will write to location [R4].

Such memory order violations cause performance penalties.

Solution: In EEAL, dynamic forwarding helps in avoiding such violations.
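
The ordering rule behind this example can be sketched as follows, assuming a simplified model in which every store's address and data are already known (in hardware the data arrives at different times, which is what creates the violation). `Store`, `forward_source`, and the concrete values are hypothetical. The load must take its value from the youngest store that is older than it and writes the same address.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Store:
    seq: int    # program order (smaller = older)
    addr: int
    data: int

def forward_source(load_seq: int, load_addr: int,
                   stores: List[Store]) -> Optional[Store]:
    """Return the youngest store older than the load with a matching address."""
    older = [s for s in stores if s.seq < load_seq and s.addr == load_addr]
    return max(older, key=lambda s: s.seq) if older else None

ADDR = 0x1000  # value of R4 (assumed for illustration)
stores = [
    Store(seq=3, addr=ADDR, data=111),   # I3: STR R3, [R4] (R3 value assumed)
    Store(seq=4, addr=ADDR, data=ADDR),  # I4: STR R4, [R4] (stores R4 itself)
]

src = forward_source(load_seq=5, load_addr=ADDR, stores=stores)  # I5: LDR R5, [R4]
print(f"I5 must receive data from store seq={src.seq}")
```

Forwarding from I3 as soon as R3 is ready picks `seq=3` instead, which is the violation described above.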











Dynamic Forwarding

[Figure: normalized reduction in violations for h264ref and sjeng, comparing early-fwd and dynamic-fwd.]

Loads that repeatedly cause memory order violations are dynamically identified.

For example, 4 loads in h264ref and 9 loads in sjeng caused about 90% of total violations.

Such identified loads do not participate in further data forwarding.

With dynamic forwarding, violations were reduced by about 90%.
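
One way to picture this filtering is a small per-load violation counter. The sketch below is an illustration under stated assumptions, not the mechanism from the slides: the class name `ForwardingFilter`, the use of the load's PC as the key, and the threshold of 2 violations are all hypothetical.

```python
from collections import Counter

class ForwardingFilter:
    """Tracks loads that repeatedly cause memory order violations."""

    def __init__(self, threshold: int = 2):  # threshold is an assumed value
        self.threshold = threshold
        self.violations = Counter()

    def record_violation(self, load_pc: int) -> None:
        self.violations[load_pc] += 1

    def may_forward(self, load_pc: int) -> bool:
        # Loads at or above the threshold no longer take part in forwarding.
        return self.violations[load_pc] < self.threshold

flt = ForwardingFilter()
for pc in (0x400, 0x400, 0x500):  # hypothetical load PCs that violated
    flt.record_violation(pc)

print(hex(0x400), flt.may_forward(0x400))  # violated twice
print(hex(0x500), flt.may_forward(0x500))  # violated once
print(hex(0x600), flt.may_forward(0x600))  # never violated
```

Because a handful of loads account for most violations, even a small table of this kind would cover the bulk of the cases in the two benchmarks shown.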


Dynamic Forwarding

[Figure: normalized IPC improvement for sjeng and h264ref, comparing early-fwd and dynamic-fwd.]

Performance improved by a further 0.5% and 0.3% for the h264ref and sjeng benchmarks, respectively.


Area and Power Analysis

[Figure: normalized power for bzip2, GemsFDTD, tonto, libquantum, mcf, milc, namd, bwaves, soplex, astar, calculix, h264ref, lbm, sjeng, zeusmp, and gmean.]

Extra hardware: two Address Generation Units (AGU), a forwarding table (32 entries), and bypassing logic.

Area and power analysis was done using McPAT for 22nm technology

0.1% area overhead and 0.05% power overhead


Conclusions

A novel architecture using a 2-level RS was proposed.

Targeted towards improving single thread performance.

Can be built upon existing state-of-the-art architecture.

Within the pipeline itself, data can be forwarded early from a store to the corresponding aliasing load.

Decreased the execution latency for replay loads.

Performance improvement is up to 7.32% (2.2% on average) compared to the less aggressive baseline architecture.

The power overhead is 0.05% and the area overhead is 0.1%.










Thank You

For Queries, Please Contact

[email protected]


SIMULATION DATA


INTEGER BENCHMARKS


FLOATING POINT BENCHMARKS

