PIPELINING AND PROCESSOR PERFORMANCE - … · PIPELINING AND PROCESSOR PERFORMANCE ... Computer Architecture: A Quantitative Approach”, 5th edition, Chapter 1, John L. Hennessy

PIPELINING AND PROCESSOR PERFORMANCE

Slides by: Pedro Tomás

Additional reading: “Computer Architecture: A Quantitative Approach”, 5th edition, Chapter 1, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

ADVANCED COMPUTER ARCHITECTURES

ARQUITECTURAS AVANÇADAS DE COMPUTADORES (AAC)

Advanced Computer Architectures, 2014

Outline


Revision

Single cycle processor

Multi cycle processor

Processor pipelining

Instruction flow in a pipeline processor

Execution conflicts

Evaluating processor performance


Revision of a RISC architecture

Single cycle processor


Revision of a RISC architecture

Multi-cycle processor

[Figure: multi-cycle processor datapath organised in five stages (IF, ID&OF, EX, MEM, WB): PC with next-PC and jump control logic, instruction memory, decoder, register file (asynchronous read ports, synchronous write port), operand multiplexers, ALU with flags, data memory and output multiplexer; each stage has its own enable signal (EnableIF, EnableID&OF, EnableEX, EnableMEM, EnableWB).]


Instruction flow


Each instruction takes 1 cycle to execute

Clock period limited by the worst case path of the whole processor

Example:

for (i = 0, aux = 0; i < 100; i++) {
    if (V[i] > aux) {
        aux = V[i];
    }
}

          LI   R2,100
          SLL  R2,R2,2
          LI   R1,4
          MOVE R3,R0
LOOP_NXT: LW   R4,100(R1)
          SUB  R5,R4,R3
          BLEZ R5,LOOP_END
          MOVE R3,R4
LOOP_END: ADDI R1,R1,4
          BNE  R2,R1,LOOP_NXT

Some of the instructions used (e.g., LI and MOVE) do not actually exist in the MIPS64 instruction set. Hence, they should be replaced with an equivalent instruction such as OR DR,R0,operand. However, they are left here to make the Assembly code easier to read.


[Figure: timing diagram of the example program on the single cycle processor: each instruction performs IF, ID, EX, MEM and WB within one clock cycle; cycles 1 to 4 are shown.]


Instruction flow

Multi cycle processor

Each instruction takes 5 cycles to execute

Clock period limited by the worst case path among all stages (the slowest stage)

The working frequency is higher, but the instruction throughput is lower than in the single cycle processor

[Figure: timing diagram of the example program on the multi cycle processor: each instruction occupies five consecutive cycles (IF, ID, EX, MEM, WB), so the first three instructions take cycles 1 to 15 and the fourth starts its IF in cycle 16.]


Instruction flow

Pipeline processor

The processor simultaneously executes a part of up to 5 different instructions each clock cycle

The instruction throughput increases by up to 5x

[Figure: pipeline timing diagram of the ten-instruction example program (cycle axis 1 to 16): each instruction goes through IF, ID, EX, MEM and WB, starting one cycle after the previous one; the pipeline contents during clock cycle 7 are highlighted.]
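The cycle counts behind the three timing diagrams can be sketched in a few lines of Python. This is only an idealised model with no conflicts; the function name and the clock periods (5 ns for the single cycle design, 1 ns per stage otherwise) are hypothetical values chosen for illustration, not figures from the slides.

```python
def execution_time_ns(n_instructions, clock_period_ns, design, stages=5):
    """Ideal (conflict-free) execution time of n instructions, in ns."""
    if design == "single":       # one long clock cycle per instruction
        cycles = n_instructions
    elif design == "multi":      # `stages` short clock cycles per instruction
        cycles = n_instructions * stages
    elif design == "pipeline":   # fill the pipeline once, then finish one instruction per cycle
        cycles = n_instructions + stages - 1
    else:
        raise ValueError(f"unknown design: {design}")
    return cycles * clock_period_ns


# Hypothetical clock periods: 5 ns (single cycle), 1 ns (multi cycle and pipeline).
n = 1000
t_single = execution_time_ns(n, 5.0, "single")
t_multi = execution_time_ns(n, 1.0, "multi")
t_pipe = execution_time_ns(n, 1.0, "pipeline")
print(t_single, t_multi, t_pipe)
print("pipeline vs multi cycle:", t_multi / t_pipe)
```

With these assumed numbers the pipeline approaches, but never quite reaches, the 5x bound, because the first instruction still needs five cycles to leave the pipeline.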


Instruction flow

Pipeline processor

The instruction throughput can increase by 5x (potential)

Much higher performance… but…

[Figure: ideal pipeline timing diagram of the ten-instruction example program (cycle axis 1 to 16): one instruction enters IF in each cycle and goes through IF, ID, EX, MEM and WB.]


Instruction flow

Pipeline processor

The instruction throughput can increase by 5x (potential)

Much higher performance… but… it generates conflicts that must be solved to guarantee the correct behaviour

[Figure: pipeline timing diagram of the example program (cycle axis 1 to 16) annotated with register usage: some instructions read R2, R1, R4 and R3, or R5 in their ID stage, while markers show the cycle from which R2, R1, R3, R4 and R5 are valid, i.e., after the producing instruction has completed WB.]


Instruction flow

Solving conflicts from pipelining

The conflicts can be solved by delaying instruction issue whenever necessary

Whenever a conflict is found, the instruction pipeline is stalled

The real throughput increase from pipelining is smaller than 5x

[Figure: pipeline timing diagram of the example program with stalls (cycle axis 1 to 16): an instruction that needs R2, R1 or R4 waits in ID (ID Stall) until the register is valid, and the following instruction is held in IF (IF Stall) in the meantime.]
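The stalling rule can be sketched as a tiny in-order scheduler. This is a simplified model under stated assumptions, not the processor's actual control logic: registers are read in ID, written in WB, there is no forwarding, and a register is assumed readable only from the cycle after its write-back. The function name and the `(destination, sources)` encoding are hypothetical; the fragment is taken from the example program.

```python
def schedule(program, stages=5):
    """Return the write-back (WB) cycle of each instruction.

    program: list of (destination, sources) register names.
    Assumptions: sources are read in stage 2 (ID); a result can be read
    from the cycle after its write-back; instructions issue in order.
    """
    valid_from = {}   # register -> first cycle in which it can be read
    prev_id = 1       # pretend a previous instruction did ID in cycle 1
    wb_cycles = []
    for dest, sources in program:
        ready = max((valid_from.get(r, 0) for r in sources), default=0)
        id_cycle = max(prev_id + 1, ready)   # stall in ID until operands are valid
        wb = id_cycle + (stages - 2)         # ID -> EX -> MEM -> WB
        if dest is not None:
            valid_from[dest] = wb + 1
        wb_cycles.append(wb)
        prev_id = id_cycle
    return wb_cycles


# Fragment of the example program; R0 is never written, so it is always valid.
fragment = [
    ("R1", []),            # LI   R1,4
    ("R3", ["R0"]),        # MOVE R3,R0
    ("R4", ["R1"]),        # LW   R4,100(R1)
    ("R5", ["R4", "R3"]),  # SUB  R5,R4,R3
    (None, ["R5"]),        # BLEZ R5,LOOP_END
]
wb = schedule(fragment)
ideal = len(fragment) + 5 - 1
print("WB cycles:", wb)
print("stall cycles:", wb[-1] - ideal)
```

Under these assumptions the five instructions finish in cycle 17 instead of the ideal cycle 9, which illustrates why the achieved throughput stays well below the 5x bound when dependent instructions follow each other closely.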


Instruction flow

Solving conflicts from pipelining

The additional logic to detect and solve conflicts increases the clock period

The performance increase is even smaller than expected

The clock period must increase to allow for conflict detection and resolution

[Figure: pipeline timing diagram of the example program with stalls (cycle axis 1 to 14): instructions that read R2 or R1 wait in ID (ID Stall) and the following instructions are held in IF (IF Stall).]
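The combined effect of stalls and a longer clock period can be estimated with a simple steady-state model. The sketch below is an approximation under assumptions that are not in the slides: the multi cycle baseline takes `depth` cycles per instruction, the pipeline completes one instruction per cycle plus an average number of stall cycles, pipeline fill time is ignored, and `clock_overhead` is the relative increase in clock period caused by the conflict logic. All names and numbers are hypothetical.

```python
def pipeline_speedup(depth, stalls_per_instruction, clock_overhead=0.0):
    """Steady-state speedup of a pipeline over a multi cycle design.

    depth: number of stages (cycles per instruction in the multi cycle design)
    stalls_per_instruction: average stall cycles added per instruction
    clock_overhead: relative clock period increase (0.1 means 10% slower clock)
    """
    pipeline_cpi = 1.0 + stalls_per_instruction
    return depth / (pipeline_cpi * (1.0 + clock_overhead))


print(pipeline_speedup(5, 0.0))        # ideal case: no stalls, no overhead
print(pipeline_speedup(5, 0.5, 0.1))   # hypothetical: 0.5 stalls/instruction, 10% slower clock
```

In the ideal case the model returns the 5x bound; with the hypothetical stall rate and clock overhead it drops to roughly 3x.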


Processor performance


The processor performance depends on a number of factors:

Clock frequency

Instruction Set Architecture (e.g., RISC vs CISC)

ISA implementation (e.g., single cycle vs multi cycle vs pipeline)

Benchmarks (programs) used

Compiler optimizations

Memory bandwidth and latency

…

What is the best metric to assess processor performance?


Measuring processor performance


Frequency (GHz)

Does not take into account architectural differences (e.g., ISA)

MIPS (million instructions per second)

$$\text{MIPS} = \frac{\#\text{Instructions}}{\text{Execution Time}\ [\mu s]} = \frac{\text{Clock Frequency}}{\text{CPI} \times 10^{6}}$$

Valid only when using the exact same program, compiler and OS

Requires that both processors use the same ISA

1 CISC instruction is equivalent to several RISC instructions, but takes longer to execute

MFLOPS (million floating point operations per second)

Has the same problems as the MIPS metric

Valid only for floating point intensive programs

e.g., does not make sense for H.264 video compression

CPI = Cycles Per Instruction
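A short numerical sketch shows why MIPS can mislead when the ISAs differ. The two machines below are hypothetical: both run at 2 GHz, but machine B needs twice as many (simpler) instructions for the same program. Clock frequency is assumed to be given in Hz.

```python
def mips(clock_hz, cpi):
    """Million instructions per second, from clock frequency (Hz) and CPI."""
    return clock_hz / (cpi * 1e6)


def execution_time_s(instruction_count, clock_hz, cpi):
    """Execution time in seconds: instructions x cycles per instruction / frequency."""
    return instruction_count * cpi / clock_hz


# Hypothetical machines running the same program.
mips_a = mips(2e9, 2.0)
mips_b = mips(2e9, 1.25)
time_a = execution_time_s(1e9, 2e9, 2.0)    # A: 1e9 instructions
time_b = execution_time_s(2e9, 2e9, 1.25)   # B: 2e9 simpler instructions
print(mips_a, mips_b)
print(time_a, time_b)
```

Machine B reports the higher MIPS rating yet takes longer to finish the program, which is the reason the metric is only meaningful for the same ISA, program and compiler.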


Measuring processor performance


Use time to measure processor performance

Requires the implementation (or at least simulation) of the proposed processor

$$\text{Performance}(P) = \frac{1}{\text{Execution Time}(P)}$$

What does it mean to say:

“Processor PA is x times faster than processor PB”

$$x = \frac{\text{Performance}(P_A)}{\text{Performance}(P_B)} = \frac{\text{Execution Time}(P_B)}{\text{Execution Time}(P_A)}$$

x is also called the speedup of processor PA versus processor PB
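As a minimal worked example, the definition can be evaluated for two measured execution times. The times (10 s on PA, 15 s on PB) and the function names are hypothetical.

```python
def performance(execution_time_s):
    """Performance is defined as the inverse of the execution time."""
    return 1.0 / execution_time_s


def speedup(time_pa_s, time_pb_s):
    """How many times faster PA is than PB."""
    return performance(time_pa_s) / performance(time_pb_s)


x = speedup(time_pa_s=10.0, time_pb_s=15.0)
print(x)
```

Here PA is 1.5 times faster than PB; note that the ratio of performances equals the inverse ratio of execution times.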


Measuring processor performance:

Task selection

What are the best benchmarks?

Real applications of interest to the user (more realistic, fit for real systems)

Different users, different benchmarks

Representative programs, e.g., SPEC CPU 2006

Synthetic programs, e.g., Dhrystone (simpler programs, easier to test in simulation)

Different metrics:

Throughput (tasks/second) vs latency (seconds/task)



SPEC CPU 2006 (integer)

Benchmark | Lang. | Application Area | Description
400.Perlbench | C | Programming Language | Derived from Perl V5.8.7. The workload includes SpamAssassin, MHonArc (an email indexer), and specdiff (SPEC's tool that checks benchmark outputs).
401.bzip2 | C | Compression | Julian Seward's bzip2 version 1.0.3, modified to do most work in memory, rather than doing I/O.
403.gcc | C | C Compiler | Based on gcc Version 3.2, generates code for Opteron.
429.mcf | C | Combinatorial Optimization | Vehicle scheduling. Uses a network simplex algorithm (which is also used in commercial products) to schedule public transport.
445.gobmk | C | Artificial Intelligence: Go | Plays the game of Go, a simply described but deeply complex game.
456.hmmer | C | Search Gene Sequence | Protein sequence analysis using profile hidden Markov models (profile HMMs).
458.sjeng | C | Artificial Intelligence: chess | A highly-ranked chess program that also plays several chess variants.
462.libquantum | C | Physics / Quantum Computing | Simulates a quantum computer, running Shor's polynomial-time factorization algorithm.
464.h264ref | C | Video Compression | A reference implementation of H.264/AVC, encodes a videostream using 2 parameter sets. The H.264/AVC standard is expected to replace MPEG2.
471.omnetpp | C++ | Discrete Event Simulation | Uses the OMNet++ discrete event simulator to model a large Ethernet campus network.
473.astar | C++ | Path-finding Algorithms | Pathfinding library for 2D maps, including the well known A* algorithm.
483.xalancbmk | C++ | XML Processing | A modified version of Xalan-C++, which transforms XML documents to other document types.



SPEC CPU 2006 (floating point)

Benchmark | Lang. | Application Area | Description
410.bwaves | Fortran | Fluid Dynamics | Computes 3D transonic transient laminar viscous flow.
416.gamess | Fortran | Quantum Chemistry | Gamess implements a wide range of quantum chemical computations.
433.milc | C | Quantum Chromodynamics | A gauge field generating program for lattice gauge theory programs.
434.zeusmp | Fortran | Physics / CFD | Computational fluid dynamics code for simulating astrophysical phenomena.
435.gromacs | C, Fortran | Biochemistry / Molecular Dynamics | Molecular dynamics, i.e., simulates Newtonian equations of motion for hundreds to millions of particles. The test case simulates protein Lysozyme in a solution.
436.cactusADM | C, Fortran | Physics / General Relativity | Solves the Einstein evolution equations using a staggered-leapfrog method.
437.leslie3d | Fortran | Fluid Dynamics | Large-Eddy Simulations with Linear-Eddy Model in 3D.
444.namd | C++ | Biology / Molecular Dynamics | Simulates large biomolecular systems.
447.dealII | C++ | Finite Element Analysis | Program library targeted at adaptive finite elements and error estimation.
450.soplex | C++ | Linear Programming, Optimization | Solves a linear program using a simplex algorithm and sparse linear algebra.
453.povray | C++ | Image Ray-tracing | Image rendering of a 1280x1024 anti-aliased landscape.
454.calculix | C, Fortran | Structural Mechanics | Finite element code for linear and nonlinear 3D structural applications.
459.GemsFDTD | Fortran | Computational Electromagnetics | Solves the Maxwell equations in 3D using the finite-difference time-domain (FDTD) method.
465.Tonto | Fortran | Quantum Chemistry | An open source quantum chemistry package.
470.lbm | C | Fluid Dynamics | Simulates incompressible fluids in 3D.
481.wrf | C, Fortran | Weather | Weather modeling from scales of meters to thousands of kilometers.
482.sphinx3 | C | Speech recognition | A widely-known speech recognition system from Carnegie Mellon University.



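Averaging performance

Normal: all benchmarks have the same weight

Weighted ($W_i$): benchmarks are weighted by frequency or relevance

Arithmetic mean

Normal: $$\frac{1}{N}\sum_{i=1}^{N} T_i$$

Weighted: $$\frac{\sum_{i=1}^{N} W_i T_i}{\sum_{i=1}^{N} W_i}$$

Harmonic mean (less sensitive to large outliers and increases the influence of small values)

Normal: $$\left(\frac{1}{N}\sum_{i=1}^{N} T_i^{-1}\right)^{-1} = \frac{N}{\sum_{i=1}^{N} \frac{1}{T_i}}$$

Weighted: $$\frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \frac{W_i}{T_i}}$$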

Alternative:

Instead of using the execution time $T_i$, use the speedup relative to a standard reference machine:

$$\text{Speedup}_i = \frac{T_i^{Ref}}{T_i}$$

SPEC uses a SPARCstation (Sun SPARC10) as reference
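The means above can be computed as follows. This is a minimal sketch; the function names and the sample execution times and weights are hypothetical, and equal weights are assumed when none are given.

```python
def arithmetic_mean(times, weights=None):
    """Weighted arithmetic mean: sum(W_i * T_i) / sum(W_i)."""
    weights = weights or [1.0] * len(times)
    return sum(w * t for w, t in zip(weights, times)) / sum(weights)


def harmonic_mean(times, weights=None):
    """Weighted harmonic mean: sum(W_i) / sum(W_i / T_i)."""
    weights = weights or [1.0] * len(times)
    return sum(weights) / sum(w / t for w, t in zip(weights, times))


def speedups(reference_times, times):
    """Per-benchmark speedup relative to a reference machine."""
    return [ref / t for ref, t in zip(reference_times, times)]


times = [1.0, 2.0, 4.0]      # hypothetical execution times, in seconds
weights = [1.0, 1.0, 2.0]    # hypothetical benchmark weights
print(arithmetic_mean(times), harmonic_mean(times))
print(arithmetic_mean(times, weights), harmonic_mean(times, weights))
print(speedups([10.0, 20.0], [5.0, 40.0]))
```

For the same data the harmonic mean is never larger than the arithmetic mean, which reflects the extra influence it gives to small values.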



Amdahl's Law

Consider that we improve processor performance by better designing some part of it

E.g., improve floating point calculations by 3x

What is the actual improvement?

$$\text{Speedup} = \frac{T}{T'} = \frac{T}{T'(FP) + T'(Non\ FP)}$$

The Non-FP instructions have the same execution time: $T'(Non\ FP) = T(Non\ FP)$

FP instructions execution time:

$$T'(FP) = \frac{T(FP)}{\text{Speedup}(FP)} = \frac{1}{3}\,T(FP)$$

$T$: execution time in the original processor; $T'$: execution time in the improved processor



Amdahl's Law

What is the actual improvement?

$$\text{Speedup} = \frac{T}{\frac{T(FP)}{\text{Speedup}(FP)} + T(Non\ FP)} = \frac{1}{\frac{T(FP)/T}{\text{Speedup}(FP)} + T(Non\ FP)/T}$$

Let us consider that, in the original processor, the fraction of time executing floating point instructions is $\alpha(FP) = T(FP)/T = 0.25$

$$\text{Speedup} = \frac{1}{\frac{\alpha(FP)}{\text{Speedup}(FP)} + \left(1 - \alpha(FP)\right)} = \frac{1}{\frac{0.25}{3} + 0.75} = 1.2$$




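Amdahl's Law (corollary)

What is the maximum improvement possible?

$$\text{Speedup} = \frac{1}{\frac{\alpha(FP)}{\text{Speedup}(FP)} + \left(1 - \alpha(FP)\right)} = \frac{1}{\frac{0.25}{3} + 0.75} = 1.2$$

Consider that $\text{Speedup}(FP) \to +\infty$

$$\text{Maximum achievable Speedup} = \frac{1}{1 - \alpha(FP)} = \frac{1}{0.75} \approx 1.333$$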




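Amdahl's Law (summary)

Execution Time:

$$T_{Improved\ machine} = T_{Reference\ machine} \times \left[\left(1 - \text{Fraction}_{Improved}\right) + \frac{\text{Fraction}_{Improved}}{\text{Speedup}_{Improved}}\right]$$

Actual Speedup:

$$\text{Speedup}_{Improved\ machine} = \frac{1}{\left(1 - \text{Fraction}_{Improved}\right) + \frac{\text{Fraction}_{Improved}}{\text{Speedup}_{Improved}}}$$

Maximum achievable Speedup ($\text{Speedup}_{Improved} \to +\infty$):

$$\text{Maximum Speedup}_{Improved\ machine} = \frac{1}{1 - \text{Fraction}_{Improved}}$$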
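The summary formulas translate directly into code. The sketch below uses hypothetical function names and reproduces the floating point example from the previous slides (25% of the time in FP instructions, FP made 3x faster).

```python
def amdahl_speedup(fraction_improved, speedup_improved):
    """Overall speedup when a fraction of the execution time is sped up."""
    return 1.0 / ((1.0 - fraction_improved) + fraction_improved / speedup_improved)


def amdahl_limit(fraction_improved):
    """Maximum achievable speedup as the improved part's speedup tends to infinity."""
    return 1.0 / (1.0 - fraction_improved)


print(amdahl_speedup(0.25, 3.0))   # FP example: approximately 1.2
print(amdahl_limit(0.25))          # approximately 1.333
```

Even an arbitrarily large speedup of the improved part cannot push the overall speedup past the limit set by the unimproved fraction.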

Next lesson

More on processor pipelining

Conflict identification

Solving conflicts
