
Page 1:

GP-GPUs

Christos Kozyrakis

http://cs316.stanford.edu

CS316 – Fall 2014 – Lecture 12

Page 2:

Announcements

- Recommended reading
  - J. Hennessy & D. Patterson, Computer Architecture, chapter 4
  - D. Patterson & J. Hennessy, Computer Organization, appendix A
    - Written by J. Nickolls & D. Kirk from Nvidia
    - Watch out for institutional bias
  - Both available in the engineering library
- Credits: Tor Aamodt, UBC
  - Some slides from his tutorial on GPU Architectures
- Reminders
  - HW2 and project

Page 3:

Reminder: Advantages of Vector ISAs

- Compact: single instruction defines N operations
  - Amortizes the cost of instruction fetch/decode/issue
  - Also reduces the frequency of branches
- Parallel: N operations are (data) parallel
  - No dependencies
  - No need for complex hardware to detect parallelism (similar to VLIW)
  - Can execute in parallel assuming N parallel datapaths
- Expressive: memory operations describe patterns
  - Continuous or regular memory access pattern
  - Can prefetch or accelerate using wide/multi-banked memory
  - Can amortize high latency for 1st element over a large sequential pattern
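The amortization argument can be made concrete with a toy cost model (illustrative only; the function names and the "one issue per instruction" accounting are assumptions, not from the slides):

```python
def scalar_add(a, b):
    """Scalar loop: one fetch/decode/issue per element operation."""
    out, issued = [], 0
    for x, y in zip(a, b):
        out.append(x + y)
        issued += 1          # one scalar instruction per element
    return out, issued

def vector_add(a, b):
    """One vector instruction defines N element operations."""
    issued = 1               # a single fetch/decode/issue for all lanes
    return [x + y for x, y in zip(a, b)], issued

a, b = list(range(8)), list(range(8))
s_out, s_issued = scalar_add(a, b)
v_out, v_issued = vector_add(a, b)
assert s_out == v_out and s_issued == 8 and v_issued == 1
```

Same result, one-eighth the front-end work, which is the "compact" advantage in miniature.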

Page 4:

Intel Xeon Phi (aka Knights Corner)

- Vector (512b, 4 lanes) + multi-threaded (4x) + multi-core (>60)
  - But in-order, 2-way issue, and 1.1GHz
  - Why?

[Core block diagram: 4 in-order threads (T0–T3 IPs), 16B/cycle decode (2 IPC) feeding two pipes: pipe 0 with x87 and scalar register files and ALUs 0/1, pipe 1 with the 512b SIMD VPU and its register file; 32KB L1 code and data caches with L1/L2 TLBs and miss handling, a hardware prefetcher (HWP), a 512KB L2 cache, and a link to the on-die interconnect]

Page 5:

Vector Unit Design

- Vector ISA
  - 32 vector registers (512b), 8 mask registers, scatter/gather
- Microarchitecture features
  - Fast read from L1, numeric type conversion on register read, …

[Pipeline diagram: PPF PF D0 D1 D2 E WB scalar stages alongside vector stages VC1 VC2 V1–V4 WB; decode feeds a VPU register file (3R, 1W), a mask RF, scatter/gather and load/store units, an EMU, and vector ALUs (16-wide × 32-bit or 8-wide × 64-bit) with fused multiply-add]

Page 6:

Vector Functional Units

- 16-wide SP SIMD, 8-wide DP SIMD

[Lane diagram: a shared multiplier circuit serves SP and DP across register files RF0–RF3; each DP lane pairs two SP lanes (SP 0/SP 1 → DP 0 through SP 14/SP 15 → DP 7)]

Page 7:

Gather/Scatter

- Gather/scatter takes advantage of cache locality

gather-prime; loop: gather-step; jump-mask-not-zero loop

[Gather instruction loop diagram: a scalar base-address register is added to each index from a vector register (Index0–Index7) to form Addr0–Addr7; a mask register marks pending elements. Each step finds the first set mask bit, issues that element's address to the TLB/D-cache, compares the remaining pending addresses against the same cache line, and clears the mask bits of every element satisfied by that access. The gather/scatter machine takes advantage of cache-line locality.]
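The gather loop above can be modeled in software (line size, names, and data below are illustrative assumptions): each gather-step services every pending element that falls on one cache line, clears those mask bits, and the loop repeats while the mask is non-zero.

```python
LINE = 64  # assumed cache-line size in bytes

def gather(base, indices, mem, elem_size=4):
    """Gather mem[base + index*elem_size], servicing one cache line per step."""
    addrs = [base + i * elem_size for i in indices]
    mask = [True] * len(indices)          # pending elements
    result = [None] * len(indices)
    steps = 0
    while any(mask):                      # jump-mask-not-zero loop
        steps += 1                        # one gather-step
        first = mask.index(True)          # find first pending element
        line = addrs[first] // LINE       # access its cache line once
        for j, a in enumerate(addrs):
            if mask[j] and a // LINE == line:
                result[j] = mem[a]        # satisfy all elements on that line
                mask[j] = False           # clear their mask bits
    return result, steps

mem = {i: i for i in range(0, 4096, 4)}   # toy word-granular memory
# indices 0..3 and 16..19 touch only two 64B lines -> two steps, not eight
res, steps = gather(0, [0, 1, 2, 3, 16, 17, 18, 19], mem)
assert steps == 2 and res == [0, 4, 8, 12, 64, 68, 72, 76]
```

With cache-line locality in the indices, the number of steps is the number of distinct lines touched rather than the number of elements.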

Page 8:

Graphics Processors (GPUs)

Page 9:

GPUs Timeline

- Till mid 90s
  - VGA controllers used to accelerate some display functions
- Mid 90s to mid 00s
  - Fixed-function graphics accelerators for the OpenGL and DirectX APIs
  - Some GP-GPU capabilities built on top of the interfaces
  - 3D graphics: triangle setup & rasterization, texture mapping & shading
- Modern GPUs
  - Programmable multiprocessors optimized for data-parallel ops
  - OpenGL/DirectX and general-purpose languages (CUDA, OpenCL, …)
  - Some fixed-function hardware (texture, raster ops, …)
  - Often integrated in the same chip with a multi-core CPU (why?)
  - Otherwise as a PCIe-based accelerator

Page 10:

Our Focus Today

- GPUs as programmable multi-core chips
  - Hardware architecture and software model
- A good way to think of GPUs
  - Multi-core chips, where every core is a threaded SIMD/vector core
  - Not 100% accurate but good enough as a model for SW developers
  - For the graphics view of the world, refer to the graphics courses
- Nvidia-biased lecture
  - They tend to be more open about their architecture
  - Some notes on ATI/AMD towards the end

Page 11:

GPU Thread Model: Software View

- Single instruction, multiple threads (SIMT)
- Each thread has local memory
- Parallel threads packed in blocks
  - Access to per-block shared memory
  - Can synchronize with barrier
- Grids include independent groups
  - May execute concurrently

Page 12:

Code Example: SAXPY (C Code vs. CUDA Code)

- CUDA code launches 256 threads per block
  - Thread = 1 iteration of scalar loop (1 element op in vector code)
  - Block = body of vectorized loop (with VL=256 in this example)
  - Grid = vectorizable loop (multiple iterations of vectorized loop body)
- Moves parallelization from compiler to programmer
  - Hopefully program written once but scales to many chips
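The slide's C and CUDA listings are not reproduced here, but the mapping it describes (thread = one loop iteration, 256 threads per block) can be sketched in plain Python (function and variable names are assumptions):

```python
def saxpy_launch(n, a, x, y, threads_per_block=256):
    """Model of a CUDA-style launch: a grid of blocks, each block of
    threads, each thread handling one element of y = a*x + y."""
    num_blocks = (n + threads_per_block - 1) // threads_per_block
    for block_idx in range(num_blocks):              # grid = vectorizable loop
        for thread_idx in range(threads_per_block):  # block = VL=256 loop body
            i = block_idx * threads_per_block + thread_idx
            if i < n:                    # guard for the partial last block
                y[i] = a * x[i] + y[i]   # thread = one scalar-loop iteration
    return y

x = [1.0] * 1000
y = [2.0] * 1000
saxpy_launch(1000, 3.0, x, y)
assert all(v == 5.0 for v in y)
```

On a GPU the two nested loops run in parallel across cores and lanes; the `if i < n` guard is why launches with n not a multiple of 256 still work.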

Page 13:

GPU Microarchitecture (10,000 feet)

Single-instruction, multiple-threads (SIMT)

[Block diagram: SIMT core clusters, each containing several SIMT cores, connected through an interconnection network to memory partitions backed by off-chip GDDR3/GDDR5 DRAM]

Page 14:

Example GPU Architecture: NVIDIA Tesla

[Diagram: Tesla streaming multiprocessor containing 8 × streaming processors]

Page 15:

Example GPU Architecture: Nvidia Kepler GK110

- 15 SMX processors
- Shared L2 cache
- 6 memory controllers
- 1 TFLOPS double-precision
- HW-based thread scheduling

An Overview of the GK110 Kepler Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.

A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use different configurations of GK110. For example, some products may deploy 13 or 14 SMXs.

Key features of the architecture that will be discussed below in more depth include:
- The new SMX processor architecture
- An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
- Hardware support throughout the design to enable new programming model capabilities

[Kepler GK110 full-chip block diagram]

Page 16:

Streaming Multiprocessor (SMX)

- The core
  - Multithreaded
  - Data parallel
- Capabilities
  - 64K registers
  - 192 simple cores (int and SP FPU)
  - 64 DP FPUs
  - 32 LSUs, 32 SFUs
- Scheduling
  - 4 warp schedulers
  - 2-way dispatch per warp

Streaming Multiprocessor (SMX) Architecture

Kepler GK110's new SMX introduces several architectural innovations that make it not only the most powerful multiprocessor we've built, but also the most programmable and power-efficient.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).

Page 17:

SIMT Execution Model

- Programmer sees MIMD threads (scalar)
- GPU HW bundles threads into warps and runs them in lockstep on vector-like hardware (SIMD)

A: v = foo[tid.x];
B: if (v < 10)
C:     v = 0;
   else
D:     v = 10;
E: w = bar[tid.x] + v;

With foo[] = {4,8,12,16}, execution over time (active threads per block):
A: T1 T2 T3 T4
B: T1 T2 T3 T4
C: T1 T2
D:       T3 T4
E: T1 T2 T3 T4
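A lockstep simulation of that warp makes the mask behavior explicit (a minimal sketch; the slide does not give bar[], so zeros are assumed):

```python
foo = [4, 8, 12, 16]
bar = [0, 0, 0, 0]   # assumed values; the slide only specifies foo[]

def run_warp():
    """Lockstep execution: each basic block runs once for the whole warp,
    with an active mask selecting which threads commit results."""
    n = 4
    v, w = [None] * n, [None] * n
    full = [True] * n
    trace = []
    trace.append(('A', full))                 # A: v = foo[tid.x]
    for t in range(n):
        v[t] = foo[t]
    trace.append(('B', full))                 # B: if (v < 10)
    mask_c = [v[t] < 10 for t in range(n)]    # threads taking C
    mask_d = [not b for b in mask_c]          # threads taking D
    trace.append(('C', mask_c))               # C: v = 0
    for t in range(n):
        if mask_c[t]:
            v[t] = 0
    trace.append(('D', mask_d))               # D: v = 10
    for t in range(n):
        if mask_d[t]:
            v[t] = 10
    trace.append(('E', full))                 # E: reconverged, full mask
    for t in range(n):
        w[t] = bar[t] + v[t]
    return w, trace

w, trace = run_warp()
assert w == [0, 0, 10, 10]                        # T1/T2 took C, T3/T4 took D
assert trace[2] == ('C', [True, True, False, False])
```

Blocks C and D each occupy the whole warp for a step even though only half the threads are active, which is exactly the divergence cost the timeline shows.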

Page 18:

Instruction & Thread Scheduling: Where Threads Meet Data Parallelism

- In theory, all threads can be independent
  - HW implements zero-overhead switching
- For efficiency, 32 threads are packed in warps
  - Warp: set of parallel threads that execute the same instruction
  - Warp = a thread of vector instructions
  - Warps introduce data parallelism
  - 1 warp instruction keeps cores busy for multiple cycles
- Individual threads may be inactive
  - Because they branched differently or due to predication
  - This is the equivalent of conditional execution
  - Loss of efficiency if not data parallel
- SW thread blocks mapped to warps
  - When HW resources are available

Page 19:

Inside a SIMT Core

- SIMT front end / SIMD backend
- Fine-grained multithreading
  - Interleave warp execution to hide latency
  - Register values of all threads stay in the core

[Diagram: SIMT front end (fetch, decode, schedule, branch) feeding a SIMD datapath with a register file; memory subsystem with shared memory, L1 D$, texture $, constant $, and interconnect network]

Page 20:

Inside an "NVIDIA-style" SIMT Core

[Diagram: SIMT front end with fetch, I-cache, decode, I-buffer, scoreboard, SIMT stack, and issue/operand collector, driving a SIMD datapath of ALUs and a MEM unit; the SIMT stack receives done (warp ID), valid[1:N], branch target PC, and predicate/active mask]

- Three decoupled warp schedulers
- Scoreboard
- Large register file
- Multiple SIMD functional units


Page 21:

Fetch + Decode

- Arbitrate the I-cache among warps
  - A cache miss is handled by fetching again later
- Fetched instruction is decoded and then stored in the I-buffer
  - 1 or more entries per warp
  - Only warps with vacant entries are considered in fetch

[Diagram: per-warp PCs (PC 1–PC 3) feed an arbiter that selects a warp to fetch from the I-cache; decoded instructions fill per-warp I-buffer entries (Inst. W1–W3) with valid/ready bits, checked by the scoreboard before the issue arbiter]

Page 22:

Instruction Issue

- Select a warp and issue an instruction from its I-buffer for execution
  - Scheduling: greedy-then-oldest (GTO)
  - GT200 / later Fermi / Kepler: allow dual issue (superscalar)
  - Fermi: odd/even scheduler
- To avoid stalling the pipeline, might keep an instruction in the I-buffer until it is known it can complete (replay)


Page 23:

In-Order Scoreboard

- Check for RAW and WAW hazards
  - Instructions reserve registers at issue
  - Release them at writeback
  - Implementation?
- Flag instructions with hazards as not ready in the I-buffer so they are not considered by the scheduler
- Track up to 6 registers per warp (out of 128)
  - I-buffer 6-entry bitvector: 1 bit per register dependency
  - Look up source operands and set the bitvector in the I-buffer; as results are written per warp, clear the corresponding bit
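A minimal model of that per-warp scoreboard (the 6-entry cap follows the slide; the class interface and register numbering are assumptions):

```python
class Scoreboard:
    """In-order scoreboard: destination registers are reserved at issue and
    released at writeback; an instruction whose sources or destination match
    a reserved register is flagged not ready."""
    MAX_ENTRIES = 6                      # up to 6 tracked registers per warp

    def __init__(self):
        self.reserved = set()            # registers with pending writes

    def ready(self, srcs, dst):
        """RAW: a source is still pending; WAW: the destination is pending."""
        return not ((set(srcs) | {dst}) & self.reserved)

    def issue(self, dst):
        assert len(self.reserved) < self.MAX_ENTRIES
        self.reserved.add(dst)           # set the dependency bit

    def writeback(self, dst):
        self.reserved.discard(dst)       # clear the bit as the result lands

sb = Scoreboard()
sb.issue(1)                              # r1 = ... is in flight
assert not sb.ready(srcs=[1, 2], dst=3)  # RAW on r1
assert not sb.ready(srcs=[4], dst=1)     # WAW on r1
assert sb.ready(srcs=[2], dst=3)         # independent instruction proceeds
sb.writeback(1)
assert sb.ready(srcs=[1, 2], dst=3)      # hazard cleared at writeback
```

One scoreboard instance per warp suffices, since warps never share registers.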

Page 24:

SIMT & Branches

[Diagram: a warp of 4 threads with a common PC executes the control-flow graph A/1111 → B/1111 → {C/1001, D/0110} → E/1111 → G/1111 (block/active-mask). A per-warp stack of (reconvergence PC, next PC, active mask) entries tracks divergence: the branch at the end of B pushes (E, D, 0110) and (E, C, 1001); each path runs under its mask until it reaches the reconvergence point E, where the stack is popped (TOS) and the full 1111 mask is restored. Execution order over time: A, B, C, D, E, G.]

Page 25:

Tracking Branch Divergence

- Similar to vector processors, but masks are handled internally
  - No explicit mask register
  - Per-warp stack stores PCs and masks for "not taken" paths
- On a conditional branch
  - Push the current mask onto the stack
  - Push the mask and PC for the "not taken" path
  - Set the mask for the "taken" path and execute
- At the end of the "taken" path
  - Pop the mask and PC for the "not taken" path and execute
- At the end of the "not taken" path
  - Pop the original mask from before the branch instruction
- If a mask is all zeros, the block is skipped
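The push/pop discipline above can be sketched directly for a single if/else (a simplified model; block labels and the trace format are illustrative):

```python
def run_if_else(mask, cond, taken_blk, nottaken_blk):
    """Per-warp divergence handling for one if/else, following the slide:
    push the pre-branch mask, push the not-taken (mask, PC), run the taken
    path, pop and run the not-taken path, pop the original mask at
    reconvergence."""
    stack = []
    stack.append(('reconv', mask))                     # original mask
    taken = [m and c for m, c in zip(mask, cond)]
    not_taken = [m and not c for m, c in zip(mask, cond)]
    stack.append(('else', not_taken))                  # not-taken path entry
    trace = []
    if any(taken):                                     # all-zero mask: skip
        trace.append((taken_blk, taken))
    _, not_taken = stack.pop()                         # end of taken path
    if any(not_taken):
        trace.append((nottaken_blk, not_taken))
    _, mask = stack.pop()                              # reconvergence point
    return trace, mask

trace, mask = run_if_else([True] * 4, [True, True, False, False], 'C', 'D')
assert trace == [('C', [True, True, False, False]),
                 ('D', [False, False, True, True])]
assert mask == [True] * 4                              # full mask restored
```

Nested branches work the same way: each level of divergence adds its own pair of stack entries, which is why a stack rather than a single saved mask is needed.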

Page 26:

Register File

- 32 warps, 32 threads per warp, 16 × 32-bit registers per thread = 64KB register file
  - Needing "4 ports" (e.g., for FMA) greatly increases area
- Alternative: banked, single-ported register file
  - Conflicts avoided using an arbitrator + operand collector (a small issue window)
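The 64KB arithmetic, and the idea of spreading registers across single-ported banks, can be checked in a few lines (the bank count and mapping function are illustrative assumptions, not the actual hardware mapping):

```python
WARPS, THREADS, REGS, BYTES = 32, 32, 16, 4
assert WARPS * THREADS * REGS * BYTES == 64 * 1024   # 64KB register file

NBANKS = 16  # assumed bank count for illustration

def bank(warp, reg):
    """Toy swizzled mapping: stagger registers across banks per warp so
    different warps reading the same register name hit different banks."""
    return (reg + warp) % NBANKS

# Two warps reading r0 in the same cycle land in different banks, so a
# single read port per bank still serves both.
assert bank(0, 0) != bank(1, 0)
# One warp's FMA reading r0, r1, r2 also touches 3 distinct banks.
assert len({bank(0, r) for r in (0, 1, 2)}) == 3
```

When a mapping like this cannot avoid a conflict, the arbitrator serializes the reads and the operand collector buffers partial operand sets until an instruction has all of them.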

Page 27:

Lost in Translation: Vector vs. GPU

Page 28:

Lost in Translation: GPU → Vector

- From Computer Architecture, 4th edition, by J. Hennessy and D. Patterson

Page 29:

Memory Hierarchy

- Each SMX has 64KB of memory
  - Split between shared memory and L1 cache: 16/48, 32/32, or 48/16
  - 256B per access
- 48KB read-only data cache
  - Unified address
- 1.5MB shared L2
  - Supports synchronization operations (atomicCAS, atomicAdd, …)
- R/W memories use ECC
- RO memories use parity

Kepler Memory Subsystem – L1, L2, ECC

Kepler's memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Kepler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below.

64 KB Configurable Shared Memory and L1 Cache

In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Kepler now allows for additional flexibility in configuring the allocation of shared memory and L1 cache by permitting a 32 KB / 32 KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock.

48KB Read-Only Data Cache

In addition to the L1 cache, Kepler introduces a 48KB cache for data that is known to be read-only for the duration of the function. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advantageous to load data through this path explicitly by mapping their data as textures, but this approach had many limitations.

Page 30:

Thread Synchronization

- Barrier synchronization within a thread block
  - Tracking simplified by grouping threads into warps
  - A counter tracks the number of threads that have arrived at the barrier
- Atomic operations to L2/global memory
  - Atomic read-modify-write (add, min, max, and, or, xor)
  - Atomic exchange or compare-and-swap
  - They are tied to L2 latency
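Compare-and-swap is the most general of these primitives: the other read-modify-write atomics can be built from a CAS retry loop, as in this single-threaded model (the memory dict and function names are assumptions; real GPU atomics are performed by the L2/memory hardware):

```python
mem = {}

def atomic_cas(addr, expected, new):
    """Compare-and-swap: write only if the current value equals expected;
    always return the old value (as CUDA's atomicCAS does)."""
    old = mem.get(addr, 0)
    if old == expected:
        mem[addr] = new
    return old

def atomic_add(addr, val):
    """atomicAdd built from a CAS retry loop."""
    while True:
        old = mem.get(addr, 0)
        if atomic_cas(addr, old, old + val) == old:
            return old            # success: return the value before the add
        # another thread won the race: reread and retry

mem[0] = 10
assert atomic_add(0, 5) == 10
assert mem[0] == 15
```

In real code the retry loop matters under contention, which is one reason these operations are tied to L2 latency.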

Page 31:

Hardware Multi-core Scheduling

- HW unit schedules grids on SMXs
  - Priority-based scheduling
  - 32 active grids; more queued/paused
- Grids launched by CPU or GPU
  - Work from multiple CPU cores

The redesigned Kepler host-to-GPU workflow shows the new Grid Management Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.

NVIDIA GPUDirect™

When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allowing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect provides the following important features:
- Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering
- Significantly improved MPISend/MPIRecv efficiency between the GPU and other nodes in a network
- Eliminates CPU bandwidth and latency bottlenecks
- Works with a variety of 3rd-party network, capture, and storage devices

Page 32:

Discussion

- How do we get data in & out of a GPU?
  - Challenges? Solutions?
- How would you connect two GPUs? How would you connect 10 GPUs?
- Do GPUs need caches?

Page 33:

AMD/ATI GPUs

- Source: 2012 Hot Chips talk on the Radeon HD 7970
  - Available at hotchips.org

[Hot Chips slide: AMD Radeon HD 7970 architecture. Graphic Core Next (GCN): 4.3 billion 28nm transistors]

Page 34:

AMD/ATI GPUs: Graphic Core Next

- Memory system
  - 16KB I$ per 4 CUs
  - 16KB R/W D$ per CU
  - 32KB scalar D$ per 4 CUs
  - 768KB R/W shared L2
  - 64KB shared memory for synchronization
- 6 GDDR5 interfaces
  - 264GB/sec
  - ECC protection

[Hot Chips slide: 384-bit GDDR5 at 264GB/s; unified R/W cache hierarchy; 768KB R/W L2 cache; 16KB R/W L1 per CU; 16KB instruction cache (I$) per 4 CUs; 32KB scalar data cache (K$) per 4 CUs]

Page 35:

AMD/ATI GPUs: GCN Compute

- Multithreaded (multiple kernels)
- Vector + wide issue
  - 4-way issue
  - 16-element vectors

[Hot Chips slide: GCN Compute Unit. Basic GPU building block of the unified shader system, with a new instruction set architecture: non-VLIW; vector unit + scalar co-processor; distributed programmable scheduler; unstructured flow control, function calls, recursion, exception support; un-typed, typed, and image memory operations. Each compute unit can execute instructions from multiple kernels simultaneously. Designed for programming simplicity, high utilization, and high throughput, with multi-tasking. CU components: branch & message unit, scalar unit, vector units (4× SIMD-16), vector registers (4× 64KB), scalar registers (4KB), texture filter units (4), load/store units (16), local data share (64KB), L1 cache (16KB), scheduler, texture fetch.]

Page 36:

AMD/ATI GPUs: GCN Compute

- Multithreaded (multiple kernels)
- Vector + wide issue
  - 4-way issue, 16-element vectors

[Hot Chips slide: GCN Compute Unit (CU) architecture. Instruction fetch arbitration from a 32KB instruction L1 shared per 4 CUs (backed by the R/W L2); per-SIMD PC & instruction buffers (SIMD0–SIMD3) feed instruction arbitration into scalar, vector, vector-memory, LDS, and export/GDS decode. The scalar unit has an integer ALU and 8KB of registers, backed by a 16KB scalar read-only L1 shared per 4 CUs. Each of the four SIMDs has an MP vector ALU and 64KB of registers. Also shown: 64KB LDS memory, 16KB R/W data L1 (L2-backed), branch & message unit, message bus, export bus. Input data: PC/state/vector register/scalar register.]

http://developer.amd.com/afds/assets/presentations/2620_final.pdf

Page 37:

AMD/ATI GPUs: Local Data Share Access

- 32-bank software-managed structure
- High bandwidth for sequential and indexed patterns
- Support for synchronization (barriers)

[Hot Chips slide: Local Data Share memory architecture. 64KB, 32-bank shared memory. Direct mode: interpolation at rate, or one broadcast read of 32/16/8 bits. Index mode: 64 dwords per 2 clocks; services 2 waves per 4 clocks. Advantages: low latency and a bandwidth amplifier for lower power; software-managed cache; software consistency/coherency within a thread group via hardware barrier.]
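Why sequential and indexed patterns get high bandwidth: with 32 banks interleaved at word granularity, 32 consecutive words map to 32 distinct banks, while accesses that share a bank serialize. A quick model (the word-interleaved bank mapping is the standard textbook assumption, not taken from the slides):

```python
NBANKS = 32

def conflict_degree(word_addrs):
    """Map word addresses to banks and return the worst-case number of
    accesses hitting one bank (accesses to the same bank serialize)."""
    hits = {}
    for a in word_addrs:
        b = a % NBANKS            # word-interleaved bank mapping
        hits[b] = hits.get(b, 0) + 1
    return max(hits.values())     # serialized cycles for this access group

assert conflict_degree(range(32)) == 1                 # sequential: conflict-free
assert conflict_degree(range(0, 64, 2)) == 2           # stride 2: 2-way conflicts
assert conflict_degree([i * 32 for i in range(32)]) == 32  # stride 32: fully serialized
```

Indexed (gather-style) patterns also run at full rate as long as the indices happen to spread across banks; pathological strides are what the software managing the LDS has to avoid.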

Page 38:

AMD/ATI GPUs: Cache Hierarchy

- L2 is coherent
- Relaxed consistency model

[Hot Chips slide: R/W cache hierarchy. L1: 16KB read/write, write-through caches, 64 bytes/CU/clock. L2: read/write cache partitions (64KB/128KB), write-back, 64 bytes/partition/clock, one per 64b dual-channel memory controller. Each CU has 256KB of registers and a 64KB local data share. A 16KB instruction cache (I$) and a 32KB scalar data cache (K$) are shared per 4 CUs with L2 backing. A global data share (GDS) facilitates synchronization between CUs.]

Page 39:

Summary

- GPUs
  - Massively parallel processors
  - Data parallelism, threading, multi-core
  - Getting more general-purpose every day
  - The driving force for gaming and HPC