EEAL: Processors' Performance Enhancement Through Early Execution of Aliased Loads

Abhishek Rajgadia, Newton and Virendra Singh

Computer Architecture and Dependable Systems Lab
Department of Electrical Engineering
Indian Institute of Technology Bombay
Mumbai, India

Presenter: Ankit Jindal

27th ACM Great Lakes Symposium on VLSI (GLSVLSI 2017), Banff, Canada

10th May, 2017


Introduction

Improving single-thread performance has been a challenge:

Frequency scaling, which was responsible for exponential improvements in single-thread performance, has stopped.
Extraction of Instruction Level Parallelism (ILP) has saturated.

Efficient execution of memory instructions is one option for increasing single-thread performance:

Memory instructions make up 20-30% of all instructions. [1]
Load instructions sit at the top of the critical path and should be executed as early as possible.

Goal: execute load instructions as early as possible.

An Out-of-Order (OoO) processor uses a Memory Dependence Predictor (MDP) to execute loads as early as possible.

[1] John Paul Shen and Mikko H. Lipasti. Modern Processor Design: Fundamentals of Superscalar Processors. Waveland Press, 2013.


Motivation

I1: LDR R1, [R2]     ; Miss (assumption)
I2: ADD R3, R4, R1   ; Stalled, dependent on I1
I3: STR R3, [R4]     ; Stalled, R3 is not ready
I4: LDR R5, [R4]     ; Stalled, memory-aliases with I3

For early execution of loads, an OoO CPU with an MDP will issue I4 before I3.

However, the LDR in I4 memory-aliases with the STR in I3, so this misprediction by the MDP incurs a performance penalty.

Register R4, used for address computation, was ready, yet the MDP still made a misprediction.

In a conventional OoO processor, I3 and I4 wait in the Reservation Station (RS) for their source operands to be ready before being issued to the Address Computation Stage (ACS).

Can we issue I3 and I4 early for their address computation?
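The difference between the two issue conditions can be sketched in a few lines of Python. This is only an illustrative model, not the authors' implementation: the `MemOp` class, the two `can_issue_*` helpers and the `ready` set are hypothetical names introduced here, and the register state assumes that R1 and R3 are still pending behind the I1 miss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemOp:
    """Hypothetical record for one memory instruction from the example."""
    name: str
    addr_regs: tuple  # registers needed to compute the address
    data_regs: tuple  # registers holding store data (empty for loads)


def can_issue_conventional(op, ready):
    # Conventional RS: every source operand must be ready before issue.
    return all(r in ready for r in op.addr_regs + op.data_regs)


def can_issue_address_only(op, ready):
    # Early address computation: only the address registers must be ready.
    return all(r in ready for r in op.addr_regs)


i3 = MemOp("I3: STR R3, [R4]", addr_regs=("R4",), data_regs=("R3",))
i4 = MemOp("I4: LDR R5, [R4]", addr_regs=("R4",), data_regs=())

# Assumed register state: R1 (and therefore R3) is waiting on the I1 miss.
ready = {"R2", "R4"}

for op in (i3, i4):
    print(op.name,
          "| conventional:", can_issue_conventional(op, ready),
          "| address-only:", can_issue_address_only(op, ready))
```

Under these assumptions the store I3 is blocked only by its data register R3, so its address could already be computed and compared against the address of I4.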


Execution of Memory Instructions

Execution of a memory instruction has 3 stages:

1 Address Computation (Addr)
2 Address Translation (TLB)
3 Memory Access (Mem)

Memory instructions wait in RS to respect true data dependencies.
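As a small aid for reasoning about the pipeline, the three stages can be written as an ordered enumeration. The `Stage` enum and the `stages_left` helper are hypothetical names used only for this sketch; they model the stage order and nothing about timing.

```python
from enum import IntEnum


class Stage(IntEnum):
    ADDR = 1  # Address Computation
    TLB = 2   # Address Translation
    MEM = 3   # Memory Access


def stages_left(last_done=None):
    """Return the stages a memory instruction still has to pass, in order."""
    start = 0 if last_done is None else int(last_done)
    return [s for s in Stage if s > start]


# An instruction that has not started yet versus one whose address is computed.
print([s.name for s in stages_left()])
print([s.name for s in stages_left(Stage.ADDR)])
```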


Proposed Modification in Memory Pipeline

[Figure: three numbered memory-pipeline diagrams (1, 2, 3). Diagrams 1 and 2 are built from the blocks RS, Addr, TLB and Mem; diagram 3 is built from the blocks 1st RS, Addr, TLB, Mem and 2nd RS.]


Early Execution of Aliased Loads

If the registers required for address computation are ready, memory instructions are issued from the 1st RS to the Address Computation Stage.

I3: STR R3, [R4]
I4: LDR R5, [R4]

In our modified architecture:

I3 can be issued if R4 is ready, even if R3 is not ready.
I4 can be issued if R4 is ready.

Such instructions, with their addresses calculated, wait in the 2nd RS.

Aliasing store/load pairs present in the 2nd RS offer the following advantages:

Data forwarding from an aliasing in-flight store to the corresponding load.
Forwarded loads can bypass the remaining 2 execution stages, i.e. Address Translation and Memory Access.
Instructions dependent on forwarded loads become ready early.
Replay loads (mis-speculated aliased loads/stores) take 1 cycle less to re-execute, because mis-speculated memory instructions are re-issued from the 2nd RS rather than from the 1st RS as in a conventional OoO processor.
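The forwarding step in the 2nd RS can be illustrated with a minimal Python sketch. It is a simplified model under stated assumptions, not the hardware described in the talk: `Entry` and `forward_in_second_rs` are hypothetical names, addresses are compared for exact equality (no partial overlaps or access sizes), and a load takes its value from the youngest older store to the same address once that store's data is available.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    """Hypothetical 2nd-RS entry: the address is already computed."""
    seq: int                    # program order (smaller = older)
    kind: str                   # "store" or "load"
    addr: int                   # computed address
    data: Optional[int] = None  # store data, or forwarded load value


def forward_in_second_rs(entries):
    """Forward data from aliasing older stores to loads.

    Returns a dict mapping each forwarded load's seq to the store's seq.
    """
    forwarded = {}
    for load in entries:
        if load.kind != "load":
            continue
        older = [e for e in entries
                 if e.kind == "store" and e.seq < load.seq and e.addr == load.addr]
        if not older:
            continue  # no alias: the load continues to the TLB and Mem stages
        src = max(older, key=lambda e: e.seq)  # youngest older aliasing store
        if src.data is not None:
            load.data = src.data  # the load can now skip TLB and Mem
            forwarded[load.seq] = src.seq
    return forwarded


# I3: STR R3, [R4] and I4: LDR R5, [R4]; the address 0x1000 is made up.
i3 = Entry(seq=3, kind="store", addr=0x1000)
i4 = Entry(seq=4, kind="load", addr=0x1000)

print(forward_in_second_rs([i3, i4]))  # R3 not ready yet: nothing to forward
i3.data = 42                           # R3 becomes ready
print(forward_in_second_rs([i3, i4]), i4.data)
```

In this model the alias is detected as soon as both addresses are known, so the load simply waits for the store data instead of being speculatively issued and later replayed.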


Effect of Decreasing Execution Latency

[Figure: speedup relative to 4 cycles (y-axis from 0.9 to 1.15) for execution latencies of 3, 4 and 5 cycles. Benchmarks: bzip2, milc, hmmer, lbm, libquantum, mcf, namd, sjeng, soplex, zeusmp, and the geometric mean (gmean).]

Maximum gain of 11.53%, maximum loss of 7.57%.

Thus, eliminating one stage (which the 2-level RS architecture can achieve) can have a significant impact on performance.
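The slide does not state how speedup is computed. Assuming the usual definition as a ratio of execution times against the 4-cycle configuration, and reading the quoted gain and loss as relative changes in that ratio, the numbers correspond to the following (the symbols $T_n$, $S_n$ and $N$ are introduced here for illustration):

```latex
% T_n: execution time with an n-cycle execution latency (assumed definition)
S_n = \frac{T_4}{T_n}, \qquad S_4 = 1

% Reported extremes expressed as speedups
S_{\max} = 1 + 0.1153 = 1.1153, \qquad S_{\min} = 1 - 0.0757 = 0.9243

% Geometric mean over N benchmarks
S_{\text{gmean}} = \Bigl( \prod_{i=1}^{N} S_{n,i} \Bigr)^{1/N}
```

Both extremes fall inside the 0.9 to 1.15 range of the plot's y-axis.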

Proposed Architecture with 2-Level RS

Data is forwarded from the aliased store to the load within the 2nd RS itself.

Such forwarded loads bypass the remaining stages.
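A minimal sketch of forwarding inside the 2nd RS, using the I3/I4 pair from these slides. The `Entry` class, the function name and the addresses are hypothetical, and this naive version deliberately ignores intervening stores to the same address.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    seq: int                    # program order (smaller is older)
    kind: str                   # "store" or "load"
    addr: int                   # address from the Address Computation stage
    data: Optional[int] = None  # store data, or the value a load received
    forwarded: bool = False     # True once a load has been served by a store


def forward_in_second_rs(entries: List[Entry], store_seq: int, value: int) -> List[int]:
    """The store's data arrives; forward it to younger loads with the same address.

    Returns the sequence numbers of the loads that were served. Such loads
    would skip Address Translation and Memory Access.
    """
    store = next(e for e in entries if e.kind == "store" and e.seq == store_seq)
    store.data = value
    served = []
    for e in entries:
        if e.kind == "load" and e.seq > store.seq and e.addr == store.addr and not e.forwarded:
            e.data = value
            e.forwarded = True
            served.append(e.seq)
    return served


second_rs = [
    Entry(seq=3, kind="store", addr=0x100),  # I3: STR R3, [R4]
    Entry(seq=4, kind="load", addr=0x100),   # I4: LDR R5, [R4]
    Entry(seq=5, kind="load", addr=0x200),   # unrelated load (hypothetical)
]
print(forward_in_second_rs(second_rs, store_seq=3, value=42))  # [4]
```

Only the aliasing load I4 receives the value; the load to a different address still has to go through the remaining stages.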

Early Store To Load Forwarding

I3: STR R3, [R4]

I4: LDR R5, [R4]

Potential Cases For Early Forwarding

Percentage of 100 million instructions that could have bypassed the Address Translation and Memory Access stages.

Benchmark    % Cases
bzip2        2.2
mcf          1.33
lbm          0.79
milc         0.53
cactus       0.2
soplex       0.05
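To put these percentages in absolute terms, the short sketch below converts each one into an instruction count out of the 100 million simulated instructions. The percentages are taken from the table; the names `TOTAL_INSTRUCTIONS`, `PERCENT_CASES` and `bypass_count` are introduced here for illustration.

```python
TOTAL_INSTRUCTIONS = 100_000_000

# Percentages from the table above.
PERCENT_CASES = {
    "bzip2": 2.2,
    "mcf": 1.33,
    "lbm": 0.79,
    "milc": 0.53,
    "cactus": 0.2,
    "soplex": 0.05,
}


def bypass_count(benchmark: str) -> int:
    """Number of instructions that could have bypassed the two stages."""
    return round(TOTAL_INSTRUCTIONS * PERCENT_CASES[benchmark] / 100)


for name in PERCENT_CASES:
    print(f"{name}: {bypass_count(name):,}")
```

For bzip2, for instance, 2.2% of 100 million corresponds to 2.2 million instructions.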

Evaluation Configuration

Simulation was done using the gem5 simulator for the 64-bit ARM ISA.

18 SPEC CPU2006 benchmarks were used in the study.

1B instructions were fast-forwarded and 100M were executed in detailed mode.

We compared the Early-Fwd EEAL architecture with 3 baselines.

Parameters       Baseline       Baseline          Baseline          Early-Fwd
                 Aggressive     Less Aggressive   Less Aggressive
                 6issue-4exe    4issue-4exe       4issue-2exe       4issue-2exe
Fetch Width      4              4                 4                 4
Issue Width      6              4                 4                 4
Commit Width     4              4                 4                 4
Mem Exe Units    4              4                 2                 2
RS               128            128               128               128
Reorder Buffer   192            192               192               192
LSQs             32             32                32                32
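The four configurations can be written down as plain data, for example as below. The values come from the table above; the `CoreConfig` class, its field names and the dictionary keys are hypothetical and are not gem5 parameter names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreConfig:
    issue_width: int
    mem_exe_units: int
    # Parameters shared by all four configurations in the table.
    fetch_width: int = 4
    commit_width: int = 4
    rs_entries: int = 128
    rob_entries: int = 192
    lsq_entries: int = 32


CONFIGS = {
    "aggressive-6issue-4exe": CoreConfig(issue_width=6, mem_exe_units=4),
    "less-aggressive-4issue-4exe": CoreConfig(issue_width=4, mem_exe_units=4),
    "less-aggressive-4issue-2exe": CoreConfig(issue_width=4, mem_exe_units=2),
    "early-fwd-4issue-2exe": CoreConfig(issue_width=4, mem_exe_units=2),
}

for name, cfg in CONFIGS.items():
    print(name, cfg.issue_width, cfg.mem_exe_units)
```

Written this way, it is easy to see that the Early-Fwd configuration has the same widths and unit counts as the least aggressive baseline.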

Results

[Figure: Normalized Speedup (y-axis, 0.99 to 1.08) per benchmark; legend: Less Aggressive: 4issue-2exe, Less Aggressive: 4issue-4exe, Aggressive: 6issue-4exe; benchmarks: bzip2, mcf, milc, h264ref, libquantum, astar, omnetpp, xalancbmk, sjeng, bwaves, zeusmp, cactusADM, calculix, GemsFDTD, lbm, namd, soplex, tonto, gmean.]

The Early-Fwd EEAL architecture outperforms all 3 baselines.

EEAL outperforms the Aggressive 6issue-4exe and Less Aggressive 4issue-4exe architectures by 1.6% and 2.2% on average, respectively.
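The charts summarize the benchmarks with a "gmean" bar. As a reminder of how a geometric mean of normalized speedups is usually computed, here is a small sketch; the function name is hypothetical and the example inputs are arbitrary numbers, not measured results.

```python
import math
from typing import Sequence


def geometric_mean(speedups: Sequence[float]) -> float:
    """Geometric mean of positive ratios, computed in log space."""
    if not speedups:
        raise ValueError("need at least one value")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Arbitrary example values (not taken from the slides).
print(geometric_mean([1.0, 4.0]))
```

The geometric mean is the conventional choice for averaging ratios such as speedups, because it treats a 2x gain and a 2x loss symmetrically.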

Memory Order Violation

I3: STR R3, [R4]

I4: STR R4, [R4]

I5: LDR R5, [R4]

Instructions I3, I4 and I5 memory-alias with each other.

Initially, all three instructions wait in the 2nd RS.

When I3's source operand R3 becomes ready, I3 executes and also forwards its data to the waiting load I5.

However, this is a memory order violation: I5 should take its data from I4, because I4 is the last store before I5 that writes to location [R4].

Such memory order violations cause performance penalties.

Solution: In EEAL, dynamic forwarding helps in avoiding such violations.
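The violation can be reproduced with a small model of the three instructions. This sketch only illustrates the problem and is not EEAL's mechanism: the `Op` class, both functions and the address value are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Op:
    name: str
    seq: int                  # program order (smaller is older)
    kind: str                 # "store" or "load"
    addr: int
    data_ready: bool = False  # stores only: is the data operand available?


def naive_source(ops: List[Op], load: Op) -> Optional[str]:
    """Forward from the first older aliasing store whose data is ready."""
    for op in ops:
        if op.kind == "store" and op.seq < load.seq and op.addr == load.addr and op.data_ready:
            return op.name
    return None


def required_source(ops: List[Op], load: Op) -> Optional[str]:
    """The store the load must read from: the youngest older aliasing store."""
    older = [op for op in ops
             if op.kind == "store" and op.seq < load.seq and op.addr == load.addr]
    return max(older, key=lambda op: op.seq).name if older else None


i3 = Op("I3", 3, "store", 0x40, data_ready=True)   # STR R3, [R4], R3 ready
i4 = Op("I4", 4, "store", 0x40, data_ready=False)  # STR R4, [R4], data pending
i5 = Op("I5", 5, "load", 0x40)                     # LDR R5, [R4]
ops = [i3, i4, i5]

print(naive_source(ops, i5))     # I3
print(required_source(ops, i5))  # I4
```

The two answers differ, which is exactly the situation the slide describes: I5 received I3's data although program order requires I4's.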

Dynamic Forwarding

[Figure: Normalized Reduction in Violations (y-axis, 0 to 1.2) for h264ref and sjeng; legend: early-fwd, dynamic-fwd.]

Loads that repeatedly cause memory order violations are dynamically identified.

For example, 4 loads in h264ref and 9 loads in sjeng caused about 90% of the total violations.

Such identified loads do not participate in further data forwarding.

With dynamic forwarding, violations are reduced by about 90%.
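One way to picture this filtering is a per-load violation counter. The sketch below is an assumption-laden illustration: the class name, the use of a load PC as the key and the threshold of 2 violations are all hypothetical, since the slides do not give these details.

```python
from collections import Counter


class DynamicForwardingFilter:
    """Excludes loads that repeatedly cause memory order violations."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold   # assumed cut-off, not from the slides
        self.violations = Counter()  # load PC -> number of violations seen
        self.blocked = set()         # load PCs excluded from early forwarding

    def record_violation(self, load_pc: int) -> None:
        self.violations[load_pc] += 1
        if self.violations[load_pc] >= self.threshold:
            self.blocked.add(load_pc)

    def may_forward(self, load_pc: int) -> bool:
        return load_pc not in self.blocked


flt = DynamicForwardingFilter(threshold=2)
flt.record_violation(0x400)
print(flt.may_forward(0x400))  # True, only one violation so far
flt.record_violation(0x400)
print(flt.may_forward(0x400))  # False, the load is now excluded
print(flt.may_forward(0x500))  # True, other loads are unaffected
```

Because a handful of loads account for most violations, excluding just those loads removes most of the penalty while leaving forwarding enabled for all other loads.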

Dynamic Forwarding

[Figure: Normalized IPC Improvement (y-axis, 1 to 1.04) for sjeng and h264ref; legend: early-fwd, dynamic-fwd.]

Performance improved by a further 0.5% and 0.3% for the h264ref and sjeng benchmarks, respectively.

Area and Power Analysis

[Figure: Normalized Power (y-axis, 1 to 1.0016) per benchmark: bzip2, GemsFDTD, tonto, libquantum, mcf, milc, namd, bwaves, soplex, astar, calculix, h264ref, lbm, sjeng, zeusmp, gmean.]

Extra hardware: two Address Generation Units (AGUs), a forwarding table (32 entries) and bypassing logic.

Area and power analysis was done using McPAT for 22nm technology.

Area overhead is 0.1% and power overhead is 0.05%.

Conclusions

A novel architecture using a 2-level RS was proposed.

It is targeted towards improving single-thread performance.

It can be built upon existing state-of-the-art architectures.

Within the pipeline itself, data can be forwarded early from a store to the corresponding aliasing load.

The execution latency of replay loads is decreased.

Performance improvement is up to 7.32% (2.2% on average) compared to the less aggressive baseline architecture.

The power overhead is 0.05% and the area overhead is 0.1%.

Thank You

For Queries, Please Contact

[email protected]

SIMULATION DATA

INTEGER BENCHMARKS

FLOATING POINT BENCHMARKS
