It’s all about latency


Henk Neefs
Dept. of Electronics and Information Systems (ELIS)
University of Gent

Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions

Out-of-order processor pipeline

[Figure: pipeline block diagram. Blocks: I-cache, fetch, decode, rename, instruction window, execution units (INT, LDST), 'future' register file, in-order retirement, architectural register file]

Branch latency

[Figure: pipeline block diagram (I-cache, fetch, decode, rename, instruction window, execution units INT, LDST and BR, 'future' register file) above a time axis showing instruction groups (ADD, OR, ST, XOR, LD, BR, ...) separated by the branch latency]

Eliminate branch latency

• By prediction: predict outcome of branch => eliminate dependency (with a high probability)

• By predication: convert control dependency to data dependency => eliminate control dependency
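To make the prediction bullet concrete, here is a minimal sketch of a two-bit saturating-counter predictor. This is one common textbook scheme, used here as an assumption: the slides do not say which predictor the processor model uses, and the names `TwoBitPredictor` and `count_mispredictions` are hypothetical.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=2):
        # Starting in "weakly taken" is an arbitrary choice for this sketch.
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor=None):
    """Run a sequence of branch outcomes through the predictor and count misses."""
    if predictor is None:
        predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop-closing branch: taken nine times, then not taken at loop exit.
loop_branch = [True] * 9 + [False]
print(count_mispredictions(loop_branch))  # 1
```

Only the loop exit is mispredicted, so for the other nine iterations the branch dependency is effectively removed.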

Load latency

while (pointer != 0)
    pointer = pointer.next;

Loop: LD  R1, R1(32)
      BNE R1, Loop

load latency = 2 cycles
branch latency = 1 cycle

CPI = 2 cycles/2 instructions = 1 cycle/instruction

[Figure: execution-unit occupancy over cycles; LD and BNE repeat, with each BNE shown alongside the next LD]

With longer load latency

• When L1-cache misses and L2-cache hits:
  load latency = 2+6 cycles
  branch latency = 1 cycle
  CPI = 8 cycles/2 instructions = 4 cycles/instruction

• When L2-cache misses and main memory hits:
  load latency = 2+6+60 cycles
  CPI = 68 cycles/2 instructions = 34 cycles/instruction

[Figure: execution-unit occupancy over cycles; successive LD and BNE instructions are spread further apart by the longer load latency]
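The CPI figures above follow from one observation: each LD needs the result of the previous LD, so one loop iteration (2 instructions) completes per load latency. A small sketch of that arithmetic, assuming the BNE is predicted and overlaps with the next LD; the function name `loop_cpi` is hypothetical.

```python
def loop_cpi(load_latency_cycles, instructions_per_iteration=2):
    """CPI of the LD/BNE loop when each LD must wait for the previous LD."""
    # The chain of dependent loads is the critical path:
    # one iteration finishes every `load_latency_cycles` cycles.
    return load_latency_cycles / instructions_per_iteration


cases = [
    ("L1-cache hit", 2),
    ("L1-cache miss, L2-cache hit", 2 + 6),
    ("L2-cache miss, main memory hit", 2 + 6 + 60),
]
for label, latency in cases:
    print(f"{label}: load latency = {latency} cycles, CPI = {loop_cpi(latency):g}")
```

This prints CPI values of 1, 4 and 34, matching the three cases on the slides.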

Memory hierarchy

[Figure: register file and execution units, L1 cache, L2 cache, main memory, hard drive; storage capacity and latency increase from the register file toward the hard drive]

L1 cache latency

[Figure: IPC (0 to 12) versus instruction window size (0 to 300 instructions), one curve each for latency = 2, latency = 3 and latency = 4]

IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs

Main memory latency

[Figure: IPC (3 to 3.6) versus main memory latency (0 to 100 ns)]

IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs

Performance and latency

Interconnect type                Sensitivity of performance to latency decrease (% per ns)
Processor core/register file     39
Processor/L1-cache               19
L1-cache/L2-cache                3.0
L2-cache/main memory             0.18

performance change = sensitivity * load latency change
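As a worked example of the formula above, the sketch below applies the tabulated sensitivities to a chosen latency change. The dictionary and function names are hypothetical, and the 10 ns and 1 ns inputs are illustrative values, not measurements from the slides.

```python
# Sensitivities from the table above, in % performance per ns of latency decrease.
SENSITIVITY_PCT_PER_NS = {
    "processor core/register file": 39.0,
    "processor/L1-cache": 19.0,
    "L1-cache/L2-cache": 3.0,
    "L2-cache/main memory": 0.18,
}


def performance_change_pct(interconnect, latency_decrease_ns):
    """performance change = sensitivity * load latency change."""
    return SENSITIVITY_PCT_PER_NS[interconnect] * latency_decrease_ns


# Illustrative inputs: 10 ns saved toward main memory versus 1 ns saved toward L1.
print(f"{performance_change_pct('L2-cache/main memory', 10.0):.1f} %")  # 1.8 %
print(f"{performance_change_pct('processor/L1-cache', 1.0):.1f} %")     # 19.0 %
```

Saving 1 ns close to the core is worth roughly ten times as much as saving 10 ns at main memory.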

Increase performance by
• eliminating/reducing load latency:
  – By prefetching: predict the next miss and fetch the data to e.g. L1-cache
  – By address prediction: address known earlier => load executed earlier => data early in register file
• or reducing sensitivity to load latency:
  – by fine-grain multithreading

Some prefetch techniques
• Stride prefetching: search for pattern with constant stride
  e.g. walking through a matrix (row- or column-order)
  example: 20 31 42 53 64 (stride: 11)
• Markov prefetching: recurring patterns of misses
  miss history and prediction: 10 110 15 12 100 … ...
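Both techniques can be sketched in a few lines. The code below is an illustration under stated assumptions, not the prefetchers evaluated in the slides: the stride detector looks only at the last three misses, the Markov table is first-order, and all names (`predict_stride`, `MarkovPredictor`) and the repeated miss sequence are hypothetical.

```python
from collections import Counter, defaultdict


def predict_stride(miss_addresses):
    """Return the next address if the last three misses share one stride, else None."""
    if len(miss_addresses) < 3:
        return None
    a, b, c = miss_addresses[-3:]
    stride = c - b
    if stride != 0 and b - a == stride:
        return c + stride
    return None


class MarkovPredictor:
    """First-order Markov table: for each miss address, count which miss followed it."""

    def __init__(self):
        self.successors = defaultdict(Counter)
        self.previous = None

    def record_miss(self, address):
        if self.previous is not None:
            self.successors[self.previous][address] += 1
        self.previous = address

    def predict(self, address):
        # .get() avoids creating an empty entry for an unseen address.
        followers = self.successors.get(address)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


# Stride example from the slide: 20 31 42 53 64 has stride 11.
print(predict_stride([20, 31, 42, 53, 64]))  # 75

# Illustrative recurring miss pattern: address 10 is followed by 110 both times.
markov = MarkovPredictor()
for address in [10, 110, 15, 12, 100, 10, 110]:
    markov.record_miss(address)
print(markov.predict(10))  # 110
```

The stride detector needs no table but only covers regular access patterns; the Markov table can capture irregular but recurring miss sequences at the cost of storage.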

Stride prefetching

[Figure: IPC (4.9 to 5.2) versus main memory latency (70 to 90 ns), one curve with prefetching and one with no prefetching]

IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress

Prefetching and sensitivity

Factor by which the performance sensitivity to latency increases with stride prefetching:

to L1-prefetching: 1.6 (L1-cache/L2-cache), 4.1 (L2-cache/main memory)
to L2-prefetching: 2.5

Latency is important: generalization to other processor architectures

Consider schedule of program:

[Figure: program schedule along a time axis]

Present in every program execution:
• Latency of instruction execution
• Latency of communication
=> latency important whatever processor architecture

Optical interconnects (OI)
• Mature components:
  – Vertical-Cavity Surface Emitting Lasers (VCSELs)
  – Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?

OI in processor context

• At levels close to processor core, latency is very important
  => latency of OI determines how far OI penetrates in the memory hierarchy

• What is the latency of an optical interconnect?

An optical link

Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency

[Figure: optical link. Blocks: buffer/modulation/bias, LED/VCSEL, fiber or light conductor, receiver diode, transimpedance amplifier]
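The total-latency expression above is a plain sum, and its time-of-flight term can be estimated from the link length. The sketch below assumes a refractive index of 1.5 (typical for glass) for the fiber or light conductor; that value and the function names are assumptions, not figures from the slides.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def time_of_flight_ns(length_m, refractive_index=1.5):
    """Propagation delay through a medium with the given refractive index."""
    # Light travels at c / n in the medium, so the delay is length * n / c.
    return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_S * 1e9


def total_latency_ns(buffer_ns, emitter_ns, tof_ns, receiver_ns):
    """Buffer latency + VCSEL/LED latency + time of flight + receiver latency."""
    return buffer_ns + emitter_ns + tof_ns + receiver_ns


tof = time_of_flight_ns(0.10)
print(f"time of flight over 10 cm: {tof:.2f} ns")  # 0.50 ns
```

At a 1 GHz clock, 0.5 ns is already half a cycle before any buffer, emitter or receiver latency is added.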

VCSEL characteristics

[Figure: optical output (0 to 2 mW) versus current (0 to 3 mA), with curves for optical power and carrier density]

• A small semiconductor laser
• Carrier density should be high enough for lasing action

Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of VCSEL and pads
• Threshold carrier density build up
• From low optical output to final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency

Total optical link latency

[Figure: stacked bars of latency (0 to 7 ns) at 1 mW for four cases: LED in 0.6 µm CMOS, LED in 0.25 µm CMOS, VCSEL in 0.6 µm CMOS, VCSEL in 0.25 µm CMOS; components: buffer, parasitics, threshold, intrinsic, receiver, TOF (10 cm)]

Latency as function of power

[Figure: latency (0 to 8 ns) versus optical output power (0 to 6 mW) for LED (0.6 µm), VCSEL (0.6 µm), LED (0.25 µm) and VCSEL (0.25 µm)]

Conclusions
• When combining performance sensitivity and optical latency we conclude:
  – optical interconnects are feasible to main memory and for multiprocessors
  – for interconnects close to processor core, optical interconnects have too high latency with present (telecom) devices, drivers and receivers

=> but now evolution to lower latency devices, drivers and receivers is taking place...

For more information on the presented results: Henk Neefs, Latentiebeheersing in processors, PhD thesis, Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs
