EEAL: Processors’ Performance Enhancement Through Early Execution of Aliased Loads
Abhishek Rajgadia, Newton and Virendra Singh
Computer Architecture and Dependable Systems Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay
Mumbai, India
Presenter: Ankit Jindal
27th ACM Great Lakes Symposium on VLSI (GLSVLSI)
10th May, 2017
GLSVLSI 2017 (Banff, Canada) Early Execution of Aliased Loads 10th May, 2017 1 / 27
Introduction
Improving single-thread performance has been a challenge.
Frequency scaling, which drove exponential improvements in single-thread performance, has stopped.
Extraction of Instruction Level Parallelism (ILP) has saturated.
Efficient execution of memory instructions is one option for increasing single-thread performance.
Memory instructions make up 20-30% of all instructions. [1]
Load instructions sit at the top of the critical path and should be executed as early as possible.
Goal: To execute load instructions as early as possible.
An Out-of-Order (OoO) processor uses a Memory Dependence Predictor (MDP) to execute loads as early as possible.
[1] John Paul Shen and Mikko H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013.
Motivation
I1: LDR R1, [R2]     ; Miss (assumption)
I2: ADD R3, R4, R1   ; Stalled, dependent on I1
I3: STR R3, [R4]     ; Stalled, R3 is not ready
I4: LDR R5, [R4]     ; Stalled, memory-aliases with I3
For early execution of loads, an OoO CPU with an MDP will issue I4 before I3.
However, the LDR in I4 memory-aliases with the STR in I3, so this MDP misprediction incurs a performance penalty.
The MDP mispredicted even though register R4, used for address computation, was ready.
In a conventional OoO processor, I3 and I4 wait in the Reservation Station (RS) for their source operands to be ready before being issued to the Address Computation Stage (ACS).
Can we issue I3 and I4 early for their address computation?
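The hazard in this example can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used in the study: `MemOp`, `run`, the address 0x40 and the stored values are hypothetical names and numbers chosen for the example. A load that executes before an older store to the same address reads stale data and has to be replayed.

```python
from dataclasses import dataclass


@dataclass
class MemOp:
    name: str       # instruction label, e.g. "I3"
    kind: str       # "store" or "load"
    addr: int       # effective address (assumed already computed)
    seq: int        # program-order sequence number
    value: int = 0  # data to write (stores only)


def run(issue_order, memory):
    """Execute ops in the given issue order; report loads that must replay."""
    executed_loads = []  # loads that have already read memory
    loaded = {}          # load name -> value it observed
    replays = []         # loads that ran before an older aliasing store
    for op in issue_order:
        if op.kind == "load":
            loaded[op.name] = memory.get(op.addr, 0)
            executed_loads.append(op)
        else:
            # A store exposes every younger load to the same address
            # that has already executed: that load read stale data.
            for ld in executed_loads:
                if ld.addr == op.addr and ld.seq > op.seq:
                    replays.append(ld.name)
            memory[op.addr] = op.value
    return loaded, replays


i3 = MemOp("I3", "store", addr=0x40, seq=3, value=7)
i4 = MemOp("I4", "load", addr=0x40, seq=4)

in_order = run([i3, i4], {0x40: 1})      # I4 sees the stored value
speculative = run([i4, i3], {0x40: 1})   # I4 issued first, reads stale data
```

In the speculative order, `I4` observes the old memory contents and is flagged for replay, which is the penalty the slide attributes to the misprediction.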
Execution of Memory Instructions
Execution of a memory instruction has 3 stages:
1 Address Computation (Addr)
2 Address Translation (TLB)
3 Memory Access (Mem)
Memory instructions wait in the RS to respect true data dependencies.
Proposed Modification in Memory Pipeline
[Figure, built up in three steps (1, 2, 3): memory pipeline with stages RS, Addr, TLB and Mem. Steps 1 and 2 are labeled with a single RS; step 3 is labeled with a 1st RS and a 2nd RS.]
Early Execution of Aliased Loads
If the registers required for address computation are ready, memory instructions are issued from the 1st RS to the Address Computation Stage.
I3: STR R3, [R4]
I4: LDR R5, [R4]
In our modified architecture:
I3 can be issued if R4 is ready, even if R3 is not ready.
I4 can be issued if R4 is ready.
Such instructions wait in the 2nd RS with their addresses calculated.
Aliasing store/load pairs present in the 2nd RS offer the following advantages:
Data is forwarded from the aliasing in-flight store to the corresponding load.
Forwarded loads can bypass the remaining 2 execution stages, i.e. Address Translation and Memory Access.
Instructions dependent on forwarded loads become ready early.
Replay loads (mis-speculated aliased loads/stores) take 1 cycle less to re-execute, because mis-speculated memory instructions are re-issued from the 2nd RS rather than from the 1st RS as in a conventional OoO processor.
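The issue and forwarding rules above can be sketched as follows. This is a simplified software model under stated assumptions, not the hardware described in the paper: `Entry`, `issue_from_first_rs`, `forward_in_second_rs` and the register values are hypothetical, the address is taken to be the base register value, and forwarding is assumed to come from the youngest older store with a matching address.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    name: str
    kind: str                   # "store" or "load"
    base_reg: str               # register holding the address
    data_reg: Optional[str]     # store data register (None for loads)
    seq: int                    # program order
    addr: Optional[int] = None  # filled in by address computation


def issue_from_first_rs(first_rs, regs):
    """Move entries whose base register is ready into the 2nd RS.

    A store does not wait for its data register at this point.
    """
    second_rs, waiting = [], []
    for e in first_rs:
        if e.base_reg in regs:
            e.addr = regs[e.base_reg]  # address computation (Addr stage)
            second_rs.append(e)
        else:
            waiting.append(e)
    return second_rs, waiting


def forward_in_second_rs(second_rs, regs):
    """Return {load name: value} for loads that can take store data early."""
    forwarded = {}
    for ld in (e for e in second_rs if e.kind == "load"):
        older = [s for s in second_rs
                 if s.kind == "store" and s.addr == ld.addr and s.seq < ld.seq]
        if not older:
            continue
        src = max(older, key=lambda s: s.seq)  # youngest older aliasing store
        if src.data_reg in regs:               # store data has arrived
            forwarded[ld.name] = regs[src.data_reg]
    return forwarded


regs = {"R4": 0x80}  # R4 is ready, R3 is still pending
first_rs = [Entry("I3", "store", "R4", "R3", 3),
            Entry("I4", "load", "R4", None, 4)]
second_rs, waiting = issue_from_first_rs(first_rs, regs)
before = forward_in_second_rs(second_rs, regs)  # R3 not ready: no forwarding
regs["R3"] = 42
after = forward_in_second_rs(second_rs, regs)   # I4 receives I3's data
```

Both instructions reach the 2nd RS as soon as R4 is ready, and the load receives its value when R3 arrives, without going through address translation or memory access in this model.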
Effect of Decreasing Execution Latency
[Figure: speedup relative to 4 cycles for execution latencies of 3, 4 and 5 cycles; benchmarks bzip2, milc, hmmer, lbm, libquantum, mcf, namd, sjeng, soplex, zeusmp and gmean; y-axis from 0.9 to 1.15.]
Maximum gain of 11.53%, maximum loss of 7.57%.
Thus, eliminating one stage (which the 2-level RS architecture can achieve) can have a large impact on performance.
Proposed Architecture with 2-Level RS
Data is forwarded from the aliased store to the load in the 2nd RS itself.
Such forwarded loads bypass the remaining stages.
Early Store To Load Forwarding
I3: STR R3, [R4]
I4: LDR R5, [R4]
Potential Cases For Early Forwarding
Percentage of 100 million instructions that could have bypassed the Address Translation and Memory Access stages.
Benchmark   % Cases
bzip2       2.2
mcf         1.33
lbm         0.79
milc        0.53
cactus      0.2
soplex      0.05
Evaluation Configuration
Simulation was done using the gem5 simulator for the 64-bit ARM ISA.
18 SPEC CPU2006 benchmarks were used in the study.
1B instructions were fast-forwarded and 100M were executed in detailed mode.
We compared the Early-Fwd EEAL architecture with 3 baselines.
                 Baseline      Baseline          Baseline          Early-Fwd
                 Aggressive    Less Aggressive   Less Aggressive
Parameters       6issue-4exe   4issue-4exe       4issue-2exe       4issue-2exe
Fetch Width      4 (all configurations)
Issue Width      6             4                 4                 4
Commit Width     4 (all configurations)
Mem Exe Units    4             4                 2                 2
RS               128 (all configurations)
Reorder Buffer   192 (all configurations)
LSQs             32 (all configurations)
Results
[Figure: normalized speedup per benchmark (y-axis from 0.99 to 1.08); legend: Less Aggressive: 4issue-2exe, Less Aggressive: 4issue-4exe, Aggressive: 6issue-4exe. Benchmarks: bzip2, mcf, milc, h264ref, libquantum, astar, omnetpp, xalancbmk, sjeng, bwaves, zeusmp, cactusADM, calculix, GemsFDTD, lbm, namd, soplex, tonto and gmean.]
The Early-Fwd EEAL architecture outperforms all 3 baselines.
EEAL outperforms the Aggressive 6issue-4exe and Less Aggressive 4issue-4exe architectures by 1.6% and 2.2% on average, respectively.
Memory Order Violation
I3: STR R3, [R4]
I4: STR R4, [R4]
I5: LDR R5, [R4]
Instructions I3, I4 and I5 memory-alias with each other.
Initially, all three instructions wait in the 2nd RS.
When source operand R3 becomes ready, I3 executes and also forwards its data to the waiting load I5.
However, this is a memory order violation: I5 should take its data from I4, because I4 is the last store that will write to location [R4].
Such memory order violations cause performance penalties.
Solution: In EEAL, dynamic forwarding helps avoid such violations.
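The ordering rule behind this example can be written down as a small check. This sketch only illustrates what counts as a violation; it is not EEAL's dynamic forwarding mechanism, and `correct_source` and `is_order_violation` are hypothetical helper names. Stores and loads are identified by their program-order numbers and are all assumed to alias the same address.

```python
from typing import List, Optional


def correct_source(store_seqs: List[int], load_seq: int) -> Optional[int]:
    """Youngest store that is older than the load in program order."""
    older = [s for s in store_seqs if s < load_seq]
    return max(older) if older else None


def is_order_violation(forwarding_store: int,
                       store_seqs: List[int],
                       load_seq: int) -> bool:
    """True if the load took data from any store other than the correct one."""
    return forwarding_store != correct_source(store_seqs, load_seq)


stores = [3, 4]  # I3 and I4 both write [R4]
load = 5         # I5 reads [R4]

from_i3 = is_order_violation(3, stores, load)  # I3 forwarded first: violation
from_i4 = is_order_violation(4, stores, load)  # I4 is the correct source
```

Forwarding from I3 is flagged because I4 sits between I3 and I5 in program order and writes the same location.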
Dynamic Forwarding
[Figure: normalized reduction in violations for h264ref and sjeng; series: early-fwd, dynamic-fwd]
Loads that repeatedly cause memory order violations are identified dynamically.
For example, 4 loads in h264ref and 9 loads in sjeng caused about 90% of total violations.
Such identified loads do not participate in further data forwarding.
With dynamic forwarding, violations are reduced by about 90%.
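One way to picture this filtering is a per-load violation counter that excludes a load from early forwarding once it has violated repeatedly. The sketch below is a software analogy under stated assumptions, not the hardware mechanism from the paper: the class name, the `threshold` parameter, the load PCs, and the synthetic trace are all hypothetical.

```python
from collections import Counter

class DynamicForwardingFilter:
    """Tracks violations per load PC and blocks repeat offenders."""

    def __init__(self, threshold=2):
        self.threshold = threshold    # hypothetical cut-off, not from the slides
        self.violations = Counter()   # load PC -> violations seen so far
        self.blocked = set()          # load PCs excluded from forwarding

    def may_forward(self, load_pc):
        return load_pc not in self.blocked

    def record_violation(self, load_pc):
        self.violations[load_pc] += 1
        if self.violations[load_pc] >= self.threshold:
            self.blocked.add(load_pc)

def count_violations(trace, filt=None):
    """trace: list of (load_pc, would_violate) early-forwarding events."""
    total = 0
    for load_pc, would_violate in trace:
        if filt is not None and not filt.may_forward(load_pc):
            continue  # load is not forwarded early, so it cannot violate
        if would_violate:
            total += 1
            if filt is not None:
                filt.record_violation(load_pc)
    return total

# Synthetic trace: one load that always violates, one that never does.
trace = [(0x40, True)] * 10 + [(0x80, False)] * 10
baseline = count_violations(trace)
filtered = count_violations(trace, DynamicForwardingFilter(threshold=2))
print(baseline, filtered)
```

The trace is made up purely to exercise the filter; it is not meant to reproduce the measured reduction reported for h264ref and sjeng.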
Dynamic Forwarding
[Figure: normalized IPC improvement for sjeng and h264ref; series: early-fwd, dynamic-fwd]
Performance improved further by 0.5% and 0.3% for the h264ref and sjeng benchmarks, respectively.
Area and Power Analysis
[Figure: normalized power per benchmark (bzip2, GemsFDTD, tonto, libquantum, mcf, milc, namd, bwaves, soplex, astar, calculix, h264ref, lbm, sjeng, zeusmp) and gmean]
Extra hardware: two Address Generation Units (AGUs), a forwarding table (32 entries), and bypassing logic.
Area and power analysis was done using McPAT for 22 nm technology.
0.1% area overhead and 0.05% power overhead.
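For readers relating the normalized-power axis to the quoted overhead, the conversion can be sketched as follows. The helper names are assumptions, and the per-benchmark values in `normalized_power` are made-up placeholders, not data from the plot; only the relation between a normalized value of 1.0005 and a 0.05% overhead follows from the slide.

```python
import math

def geometric_mean(values):
    """Geometric mean, the usual way to aggregate normalized ratios."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def overhead_percent(normalized):
    """Convert a ratio normalized to the baseline into a percent overhead."""
    return (normalized - 1.0) * 100.0

# Hypothetical normalized power values (NOT taken from the figure).
normalized_power = [1.0002, 1.0008, 1.0005]
g = geometric_mean(normalized_power)

print(f"gmean = {g:.6f}, overhead = {overhead_percent(g):.4f}%")
print(f"1.0005 corresponds to {overhead_percent(1.0005):.2f}% overhead")
```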
Conclusions
A novel architecture using a 2-level RS was proposed.
Targeted towards improving single-thread performance.
Can be built upon existing state-of-the-art architecture.
Data can be forwarded early, within the pipeline itself, from a store to the corresponding aliasing load.
Decreased the execution latency for replay loads.
Performance improvement is up to 7.32% (2.2% on average) compared to a less aggressive baseline architecture.
The power overhead is 0.05% and the area overhead is 0.1%.
Thank You
For Queries, Please Contact
SIMULATION DATA
INTEGER BENCHMARKS
FLOATING POINT BENCHMARKS