It’s all about latency Henk Neefs Dept. of Electronics and Information Systems (ELIS) University of Gent


Page 1: It’s all about latency

It’s all about latency

Henk Neefs
Dept. of Electronics and Information Systems (ELIS)
University of Gent

Page 2: It’s all about latency

Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions

Page 3: It’s all about latency

Out-of-order processor pipeline

[Figure: pipeline block diagram with I-cache, fetch, decode, rename, instruction window, execution units (INT, LDST), ‘future’ register file, in-order retirement and architectural register file]

Page 4: It’s all about latency

Branch latency

[Figure: pipeline block diagram (I-cache, fetch, decode, rename, instruction window, execution units INT, LDST and BR, ‘future’ register file) above an instruction timeline (ADD, OR, ST, XOR, LD, BR, ...) on which the branch latency is marked]

Page 5: It’s all about latency

Eliminate branch latency

• By prediction: predict the outcome of the branch => eliminate the dependency (with a high probability)

• By predication: convert the control dependency to a data dependency => eliminate the control dependency
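
To give a feel for "with a high probability", the sketch below simulates a 2-bit saturating counter. This is a common textbook predictor, not one named in the slides; the class name, the starting state and the loop-branch trace are all assumptions made for this illustration.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=2):
        self.state = state  # start in "weakly taken" (an arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def prediction_accuracy(outcomes):
    """Fraction of branch outcomes (True = taken) that are predicted correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


# Hypothetical trace: a loop branch taken 9 times, then not taken once, repeated.
trace = ([True] * 9 + [False]) * 100
accuracy = prediction_accuracy(trace)
print(f"prediction accuracy: {accuracy:.2f}")
```

On this made-up trace only the loop exit is mispredicted, so the branch dependency is removed for nine out of ten branches.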

Page 6: It’s all about latency

Load latency

while (pointer != 0)
    pointer = pointer.next;

Loop: LD  R1, R1(32)
      BNE R1, Loop

load latency = 2 cycles
branch latency = 1 cycle

CPI = 2 cycles/2 instructions = 1 cycle/instruction

[Figure: execution-unit schedule over cycles; each BNE executes in the same cycle as the LD of the next iteration]

Page 7: It’s all about latency

With a longer load latency

• When the L1 cache misses and the L2 cache hits:
  load latency = 2+6 cycles
  branch latency = 1 cycle
  CPI = 8 cycles/2 instructions = 4 cycles/instruction

• When the L2 cache misses and main memory hits:
  load latency = 2+6+60 cycles
  CPI = 34 cycles/instruction

[Figure: execution-unit schedule over cycles; successive LD instructions are separated by the full load latency, each followed by its BNE]
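
The CPI figures for the pointer-chasing loop can be reproduced with a small model. It is a sketch under one assumption: every LD has to wait for the result of the previous LD, and the BNE overlaps with the next LD, so one iteration costs exactly one load latency. The function and constant names are chosen for this example only.

```python
def pointer_chase_cpi(load_latency_cycles, instructions_per_iteration=2):
    """CPI of the LD/BNE loop when the chain of dependent loads is the bottleneck."""
    # One iteration (LD + BNE) completes every load_latency_cycles cycles.
    return load_latency_cycles / instructions_per_iteration


L1_HIT_CYCLES = 2         # load latency on an L1-cache hit
L2_EXTRA_CYCLES = 6       # added when L1 misses and L2 hits
MEMORY_EXTRA_CYCLES = 60  # added when L2 misses and main memory hits

cases = {
    "L1 hit": L1_HIT_CYCLES,
    "L1 miss, L2 hit": L1_HIT_CYCLES + L2_EXTRA_CYCLES,
    "L2 miss, memory hit": L1_HIT_CYCLES + L2_EXTRA_CYCLES + MEMORY_EXTRA_CYCLES,
}

for name, latency in cases.items():
    print(f"{name}: load latency = {latency} cycles, CPI = {pointer_chase_cpi(latency)}")
```

The three cases give a CPI of 1, 4 and 34, matching the values on the slides.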

Page 8: It’s all about latency

Memory hierarchy

[Figure: execution units, register file, L1 cache, L2 cache, main memory, hard drive, ordered by increasing storage capacity and latency]

Page 9: It’s all about latency

L1 cache latency

[Figure: IPC (0 to 12) versus instruction window size (0 to 300 instructions), with curves for latency = 2, latency = 3 and latency = 4]

IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs

Page 10: It’s all about latency

Main memory latency

[Figure: IPC (3 to 3.6) versus main memory latency (0 to 100 ns)]

IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs

Page 11: It’s all about latency

Performance and latency

Interconnect type               Sensitivity of performance to latency decrease (% per ns)
Processor core/register file    39
Processor/L1-cache              19
L1-cache/L2-cache               3.0
L2-cache/main memory            0.18

performance change = sensitivity * load latency change
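
The formula can be applied directly to the sensitivities in the table. The sketch below does so; the dictionary and function names, and the latency changes used in the example calls, are hypothetical and chosen only to illustrate the arithmetic.

```python
# Sensitivity of performance to a latency decrease, in % per ns (from the table).
SENSITIVITY_PERCENT_PER_NS = {
    "processor core/register file": 39.0,
    "processor/L1-cache": 19.0,
    "L1-cache/L2-cache": 3.0,
    "L2-cache/main memory": 0.18,
}


def performance_change_percent(interconnect, latency_decrease_ns):
    """performance change = sensitivity * load latency change."""
    return SENSITIVITY_PERCENT_PER_NS[interconnect] * latency_decrease_ns


# Hypothetical latency decreases, for illustration only.
print(performance_change_percent("L2-cache/main memory", 10.0))  # 10 ns faster memory
print(performance_change_percent("processor/L1-cache", 0.5))     # 0.5 ns faster L1
```

A 10 ns decrease in main memory latency yields about 1.8 %, while only 0.5 ns at the L1 cache yields about 9.5 %, which shows how much more latency matters close to the core.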

Page 12: It’s all about latency

Increase performance by
• eliminating/reducing load latency:
  – By prefetching: predict the next miss and fetch the data to e.g. the L1 cache
  – By address prediction: address known earlier => load executed earlier => data early in register file
• or reducing sensitivity to load latency:
  – by fine-grain multithreading

Page 13: It’s all about latency

Some prefetch techniques
• Stride prefetching: search for a pattern with constant stride
  e.g. walking through a matrix (row- or column-order)
  20 31 42 53 64   stride: 11
• Markov prefetching: recurring patterns of misses
  miss history: 10 110 15 12 100 …   prediction: ...
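
Both techniques can be sketched in a few lines. The code below is a simplified illustration, not the mechanism evaluated in the slides: the stride detector only compares the last three addresses, and the Markov prefetcher only remembers which miss most often followed each miss. All names and the short miss trace are assumptions.

```python
from collections import Counter, defaultdict


def predict_stride(addresses):
    """Return the next address if the last three addresses share one stride, else None."""
    if len(addresses) < 3:
        return None
    stride = addresses[-1] - addresses[-2]
    if stride != 0 and stride == addresses[-2] - addresses[-3]:
        return addresses[-1] + stride
    return None


class MarkovPrefetcher:
    """Remembers, for each miss address, which miss addresses followed it."""

    def __init__(self):
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, address):
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address

    def predict(self):
        """Most frequent successor of the latest miss, or None if it was never seen before."""
        followers = self.successors.get(self.last_miss)
        if not followers:
            return None
        return followers.most_common(1)[0][0]


# Stride example from the slide: 20 31 42 53 64 has a constant stride of 11.
print(predict_stride([20, 31, 42, 53, 64]))

# Hypothetical miss trace in which address 10 recurs.
markov = MarkovPrefetcher()
for miss in [10, 110, 15, 10]:
    markov.record_miss(miss)
print(markov.predict())
```

The stride detector proposes address 75 as the next prefetch; the Markov prefetcher proposes 110, because 110 followed the earlier miss on address 10.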

Page 14: It’s all about latency

Stride prefetching

[Figure: IPC (4.9 to 5.2) versus main memory latency (70 to 90 ns), with prefetching and without prefetching]

IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress

Page 15: It’s all about latency

Prefetching and sensitivity

Factors by which the “performance sensitivity to latency” increases with stride prefetching:

Interconnects: L1-cache/L2-cache, L2-cache/main memory
to L1-prefetching: 1.6, 4.1
to L2-prefetching: 2.5

Page 16: It’s all about latency

Latency is important: generalization to other processor architectures

Consider the schedule of a program:

[Figure: program schedule along a time axis]

Present in every program execution:
• Latency of instruction execution
• Latency of communication
=> latency is important whatever the processor architecture

Page 17: It’s all about latency

Optical interconnects (OI)
• Mature components:
  – Vertical-Cavity Surface-Emitting Lasers (VCSELs)
  – Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?

Page 18: It’s all about latency

OI in processor context

• At levels close to the processor core, latency is very important => the latency of OI determines how far OI penetrates into the memory hierarchy

• What is the latency of an optical interconnect?

Page 19: It’s all about latency

An optical link

[Figure: optical link consisting of buffer/modulation/bias, LED/VCSEL, fiber or light conductor, receiver diode and transimpedance amplifier]

Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency
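
The total-latency sum of this slide can be written out as a short calculation. In the sketch below only the time of flight is derived from physics; the refractive index of 1.5 (typical for glass fiber) and the 1 ns placeholder values for the buffer, emitter and receiver are assumptions, not measurements from the presentation.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def time_of_flight_ns(length_m, refractive_index=1.5):
    """Propagation delay through a light conductor of the given length."""
    # Light travels at c / n inside the medium; 1.5 is an assumed index.
    return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_S * 1e9


def total_link_latency_ns(buffer_ns, emitter_ns, length_m, receiver_ns,
                          refractive_index=1.5):
    """buffer latency + VCSEL/LED latency + time of flight + receiver latency."""
    return (buffer_ns + emitter_ns
            + time_of_flight_ns(length_m, refractive_index) + receiver_ns)


# A 10 cm link with placeholder (hypothetical) component latencies of 1 ns each.
print(f"time of flight: {time_of_flight_ns(0.10):.2f} ns")
print(f"total latency:  {total_link_latency_ns(1.0, 1.0, 0.10, 1.0):.2f} ns")
```

Under these assumptions a 10 cm link has a time of flight of about 0.5 ns, so for short links the electronic and device latencies are likely to dominate the total.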

Page 20: It’s all about latency

VCSEL characteristics

[Figure: optical output (0 to 2 mW) versus current (0 to 3 mA), with curves for optical power and carrier density]

• A small semiconductor laser
• Carrier density should be high enough for lasing action

Page 21: It’s all about latency

Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of VCSEL and pads
• Threshold carrier density build up
• From low optical output to final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency

Page 22: It’s all about latency

Total optical link latency

[Figure: bar chart of latency (0 to 7 ns) at 1 mW for LED and VCSEL links, each in 0.6 µm and 0.25 µm CMOS, broken down into buffer, parasitics, threshold, intrinsic, receiver and TOF (10 cm)]

Page 23: It’s all about latency

Latency as function of power

[Figure: latency (0 to 8 ns) versus optical output power (0 to 6 mW) for LED (0.6 µm), VCSEL (0.6 µm), LED (0.25 µm) and VCSEL (0.25 µm)]

Page 24: It’s all about latency

Conclusions
• When combining performance sensitivity and optical latency we conclude:
  – optical interconnects are feasible to main memory and for multiprocessors
  – for interconnects close to the processor core, optical interconnects have too high a latency with present (telecom) devices, drivers and receivers
    => but an evolution to lower-latency devices, drivers and receivers is now taking place...

For more information on the presented results: Henk Neefs, Latentiebeheersing in processors, PhD Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs
