
HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing

Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer

Page 2: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Simulating Multicores

Simulating an N-core multicore target
• Fundamentally N times the work
• Plus on-chip network

Duplicating cores will quickly fill the FPGA
Multi-FPGA will slow simulation

[Figure: grid of CPU tiles connected by a Network]

Page 3: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Trading Time for Space

Can leverage separation of model clock and FPGA clock to save space
• Two techniques: serialization and time-multiplexing

But doesn’t this just slow down our simulator?

The tradeoff is a good idea if we can:
• Save a lot of space
• Improve FPGA critical path
• Improve utilization
• Slow down rare events, keep common events fast

LI approach enables a wide range of tradeoff options

Page 4: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Serialization: A First Tradeoff

Page 5: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Example Tradeoff: Multi-Port Register File

2 Read Ports, 2 Write Ports
• 5-bit index, 32-bit data
• Reads take zero clock cycles

Virtex 2Pro FPGA: 9242 (>25%) slices, 104 MHz

[Figure: 2R/2W Register File with inputs rd addr 1, rd addr 2, wr addr 1 / wr val 1, wr addr 2 / wr val 2 and outputs rd val 1, rd val 2]

Page 6: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Trading Time for Space

Simulate the circuit sequentially using BlockRAM
• 94 slices (<1%), 1 BlockRAM, 224 MHz (2.2x)
• Simulation rate is 224 / 3 ≈ 75 MHz

[Figure: an FSM serializes the same ports (rd addr 1, rd addr 2, wr addr 1 / wr val 1, wr addr 2 / wr val 2, rd val 1, rd val 2) onto a 1R/1W BlockRAM]

FPGA-cycle to Model Cycle Ratio (FMR)
• Each module may have a different FMR
• A-Ports allow us to connect many such modules together
• Maintain a consistent notion of model time
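The serialization tradeoff can be sketched in software. The following is a minimal illustration, not the HAsim implementation: the class name, the method names, and the particular three-step schedule (read 1, then read 2 alongside write 1, then write 2) are assumptions chosen so that one model cycle costs three FPGA cycles, matching the FMR of 3 on the slide.

```python
class SerializedRegFile:
    """2R/2W register file emulated on a 1R/1W memory (hypothetical schedule)."""

    FPGA_CYCLES_PER_MODEL_CYCLE = 3  # the FMR quoted on the slide

    def __init__(self, num_regs=32):  # 5-bit index -> 32 entries
        self.mem = [0] * num_regs
        self.fpga_cycles = 0
        self.model_cycles = 0

    def model_cycle(self, rd_addr1, rd_addr2, writes=()):
        """Simulate one model cycle; `writes` holds up to two (addr, value) pairs."""
        assert len(writes) <= 2
        reads = [rd_addr1, rd_addr2]
        pending = list(writes)
        results = []
        for step in range(self.FPGA_CYCLES_PER_MODEL_CYCLE):
            # At most one read per FPGA cycle (steps 0 and 1).
            if step < 2:
                results.append(self.mem[reads[step]])
            # At most one write per FPGA cycle (steps 1 and 2). The read in
            # step 1 is taken before the write, so both reads see old state.
            if step >= 1 and pending:
                addr, value = pending.pop(0)
                self.mem[addr] = value
            self.fpga_cycles += 1
        self.model_cycles += 1
        return tuple(results)

    def fmr(self):
        """FPGA-cycle to Model-cycle Ratio observed so far."""
        return self.fpga_cycles / self.model_cycles


def simulation_rate_mhz(fpga_clock_mhz, fmr):
    """Model cycles simulated per microsecond of wall-clock time."""
    return fpga_clock_mhz / fmr


rf = SerializedRegFile()
rf.model_cycle(0, 1, [(0, 5), (1, 7)])   # reads return the old contents
print(rf.model_cycle(0, 1))              # reads now observe both writes
print(rf.fmr(), simulation_rate_mhz(224, rf.fmr()))
```

The point of the sketch is that the target still sees a zero-latency 2R/2W register file in model time; only the wall-clock cost per model cycle changes.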

Page 7: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Example: Inorder Front End

[Figure: in-order front end built from the modules FET, Branch Pred, Line Pred, IMEM, PC Resolve, Inst Q, I$, and ITLB. Ports between modules are annotated with latencies of 0, 1, or 2 and carry labels including first, deq, slot, enq or drop, fault, mispred, training, pred, rspImm, rspDel, redirect, vaddr, paddr, and inst or fault; some inputs arrive from the Back End. Legend: Ready to simulate? Yes / No / Part]

• Modules may simulate at any wall-clock rate
• Corollary: adjacent modules may not be simulating the same model cycle

Page 8: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Simulator “Slip”

Adjacent modules simulating different cycles!
• In paper: distributed resynchronization scheme

This can speed up simulation
• Case study: Achieved 17% better performance than centralized controller
• Can get performance = dynamic average

[Figure: FET and DEC connected by a latency-1 port, shown in two configurations ("vs")]

Let’s see how...

Page 9: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

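Traditional Software Simulation

[Figure: timeline of wall-clock time against the pipeline stages FET, DEC, EXE, MEM, WB, with one stage simulated per time step. Entries for steps 0 to 10, in order: A, A, NOP, NOP, NOP, NOP, B, B, A, A, NOP. Legend: [marker] = model cycle]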

Page 10: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Global Controller “Barrier” Synchronization

Columns in the original: FET DEC EXE MEM WB. Only the stages active on each FPGA clock cycle are listed.

FPGA CC   Active stage contents
0         A NOP NOP NOP NOP
1         A
2         A
3         B A NOP NOP NOP
4         B A
5         A
6         C B A NOP NOP
7         B
8         D C B A NOP
9         D
10        D

Legend: [marker] = model cycle

Page 11: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

A-Ports Distributed Synchronization

Columns in the original: FET DEC EXE MEM WB. Only the stages active on each FPGA clock cycle are listed.

FPGA CC   Active stage contents
0         A NOP NOP NOP NOP
1         B A NOP NOP NOP
2         C B A NOP NOP
3         D B A NOP
4         E (full) B A
5         B A
6         B A
7         C B A
8         F D C B A
9         G (full) D C B
10        D C
11        D
12        D

• Long-running ops can overlap even if on different CC
• Run-ahead in time until buffering fills

Takeaway: LI makes serialization tradeoffs more appealing
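The difference between the two synchronization schemes can be illustrated with a small timing model. This is a sketch under stated assumptions, not the A-Ports implementation: the function names, the linear chain of modules, the port latency and capacity defaults, and the per-cycle costs in the example are all hypothetical.

```python
def barrier_fpga_cycles(costs):
    """Global controller: each model cycle lasts as long as its slowest module.

    costs[m][c] is the number of FPGA cycles module m needs for model cycle c.
    """
    return sum(max(per_module) for per_module in zip(*costs))


def distributed_fpga_cycles(costs, latency=1, capacity=4):
    """Port-based scheme for a linear chain of modules (module m feeds m + 1).

    A module starts model cycle c once it has finished cycle c - 1, its
    producer has finished cycle c - latency (earlier tokens are present at
    reset), and its consumer has drained enough of the port that there is
    room for the token it is about to produce.
    """
    assert capacity > latency >= 1
    n_mod, n_cyc = len(costs), len(costs[0])
    finish = [[0] * n_cyc for _ in range(n_mod)]
    for c in range(n_cyc):
        for m in range(n_mod):
            start = finish[m][c - 1] if c > 0 else 0
            if m > 0 and c >= latency:
                start = max(start, finish[m - 1][c - latency])
            freed = c + latency - capacity  # consumer cycle that frees a slot
            if m + 1 < n_mod and freed >= 0:
                start = max(start, finish[m + 1][freed])
            finish[m][c] = start + costs[m][c]
    return max(row[-1] for row in finish)


# Two modules whose long-running operations fall in different model cycles.
costs = [[1, 3, 1, 1],   # e.g. FET
         [3, 1, 1, 1]]   # e.g. DEC
print(barrier_fpga_cycles(costs), distributed_fpga_cycles(costs))
```

In this example the barrier scheme pays for both long operations back to back, while the port-based scheme lets them overlap because the upstream module runs ahead until the buffering fills.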

Page 12: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

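Leveraging Latency-Insensitivity

Expensive instructions
[Figure: on the FPGA, the CPU's EXE stage connects through latency-1 ports to an FPU, with RRR linking to the LEAP Instruction Emulator (M5). With Parashar, Adler]

Modeling large caches
[Figure: on the FPGA, the L2$ Cache Controller connects through latency-1 ports to a 256 KB RAM backed by a LEAP Scratchpad: BRAM (KBs, 1 CC), SRAM (MBs, 10s CCs), System Memory (GBs, 100s CCs)]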

Page 13: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Time-Multiplexing: A Tradeoff to Scale Multicores

(resume at 3:45)

Page 14: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Multicores Revisited

What if we duplicate the cores?

Benefits:
• Simple to describe
• Maximum parallelism

Drawbacks:
• Probably won’t fit
• Low utilization of functional units

[Figure: CORE 0, CORE 1, CORE 2, each with its own state]

Page 15: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Module Utilization

[Figure: FET and DEC connected by a latency-1 port, shown twice]

A module is unutilized on an FPGA cycle if:
• Waiting for all input ports to be non-empty or
• Waiting for all output ports to be non-full

Case Study: In-order functional units were utilized 13% of FPGA cycles on average
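The readiness condition behind this definition can be written down directly. This is a minimal sketch: the `Port` class, its fields, and the helper names are hypothetical and only model FIFO occupancy, not the A-Ports API.

```python
from collections import deque


class Port:
    """A bounded FIFO connecting two modules (hypothetical helper)."""

    def __init__(self, capacity, initial_tokens=0):
        self.capacity = capacity
        self.fifo = deque([None] * initial_tokens)

    def is_empty(self):
        return len(self.fifo) == 0

    def is_full(self):
        return len(self.fifo) >= self.capacity


def can_simulate(inputs, outputs):
    """A module may simulate its next model cycle only when every input
    port holds a token and every output port has room for one."""
    return (all(not p.is_empty() for p in inputs)
            and all(not p.is_full() for p in outputs))


def utilization(fired):
    """Fraction of FPGA cycles on which the module actually simulated."""
    return sum(fired) / len(fired)


fet_to_dec = Port(capacity=2, initial_tokens=1)
dec_to_exe = Port(capacity=2)
print(can_simulate([fet_to_dec], [dec_to_exe]))  # input ready, output has room
print(can_simulate([dec_to_exe], [fet_to_dec]))  # input empty, so the module waits
```

Every FPGA cycle on which `can_simulate` is false counts against utilization, which is the idle capacity that time-multiplexing tries to reclaim.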

Page 16: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Time-Multiplexing: First Approach

Duplicate state, sequentially share logic

Benefits:
• Better unit utilization

Drawbacks:
• More expensive than duplication(!)

[Figure: several virtual instances, each with its own state, multiplexed onto one physical pipeline]

Page 17: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Round-Robin Time Multiplexing

Fix ordering, remove multiplexors

Benefits:
• Much better area
• Good unit utilization

Drawbacks:
• Head-of-line blocking may limit performance

[Figure: per-instance state feeding a single physical pipeline]

• Need to limit impact of slow events
• Pipeline at a fine granularity
• Need a distributed, controller-free mechanism to coordinate...

Page 18: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Port-Based Time-Multiplexing

• Duplicate local state in each module
• Change port implementation:
  • Minimum buffering: N * latency + 1
  • Initialize each FIFO with: # of tokens = N * latency
• Result: Adjacent modules can be simultaneously simulating different virtual instances
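The buffering rule can be checked with a small software model. This is a sketch, assuming a single producer and a single consumer that both visit the N virtual instances in the same round-robin order; `MultiplexedPort` and `run` are hypothetical names, and `None` stands in for an initial "no message" token.

```python
from collections import deque


class MultiplexedPort:
    """One physical FIFO shared by N virtual instances of a port."""

    def __init__(self, n_instances, latency):
        self.capacity = n_instances * latency + 1            # minimum buffering
        self.fifo = deque([None] * (n_instances * latency))  # initial tokens

    def send(self, msg):
        assert len(self.fifo) < self.capacity, "port full"
        self.fifo.append(msg)

    def receive(self):
        assert self.fifo, "port empty"
        return self.fifo.popleft()


def run(n_instances, latency, model_cycles):
    """Producer and consumer both step through instances 0..N-1 in order."""
    port = MultiplexedPort(n_instances, latency)
    received = []
    for cycle in range(model_cycles):
        for inst in range(n_instances):
            port.send((inst, cycle))         # producer simulates (inst, cycle)
            received.append(port.receive())  # consumer simulates (inst, cycle)
    return received


# With N = 3 and latency 1, instance i receives in model cycle 1 exactly
# what instance i sent in model cycle 0.
print(run(n_instances=3, latency=1, model_cycles=2))
```

Because the producer sends before the consumer drains, occupancy briefly reaches N * latency + 1, which is why the extra slot is part of the minimum buffering.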

Page 19: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

The Front End Multiplexed

[Figure: the in-order front end (FET, Branch Pred, Line Pred, IMEM, PC Resolve, Inst Q, I$, ITLB, with the same port labels and latencies as the earlier Inorder Front End example), now time-multiplexed between virtual CPUs. Legend: Ready to simulate? CPU 1 / CPU 2 / No]

Page 20: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

On-Chip Networks in a Time-Multiplexed World

Page 21: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Problem: On-Chip Network

[Figure: CPU 0, CPU 1, and CPU 2 (each with L1/L2 $) and Memory Control, each attached to a router (r); routers exchange msg and credit ports. The time-multiplexed CPU with L1/L2 $ carries instances [0 1 2]]

• Problem: routing wires to/from each router
• Similar to the “global controller” scheme
• Also utilization is low

Page 22: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Multiplexing On-Chip Network Routers

[Figure: Router 0, Router 1, Router 2, and Router 3 multiplexed onto a single physical Router 0..3. A table lists, for each current router (cur = 0, 1, 2, 3), the ports "to 1", "to 2", "to 3", "fr 1", "fr 2", "fr 3". Each output port feeds a reorder stage that applies one permutation:]

σ(x) = (x + 1) mod 4
σ(x) = (x + 2) mod 4
σ(x) = (x + 3) mod 4

Simulate the network without a network

Page 23: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

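Ring/Double Ring Topology Multiplexed

[Figure: ring of Router 0, Router 1, Router 2, and Router 3 multiplexed onto a single physical Router 0..3 whose “to next” port is connected to its own “from prev” port. A table lists "cur", "to N", and "fr P" for routers 0 to 3]

σ(x) = (x + 1) mod 4

Opposite direction: flip to/from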

Page 24: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Implementing Permutations on FPGAs Efficiently

Side Buffer
• Fits networks like ring/torus (e.g. x+1 mod N)

Indirection Table
• More general, but more expensive

[Figure: an FSM with a Perm Table and a RAM Buffer; example σ(x) = (x + 1) mod 4. Reordering patterns shown: move first to Nth, move Nth to first, move every K to N-K]
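The two options can be compared on one group of N messages, one per virtual router in simulation order. This is a sketch of the reordering only, under the assumption that a whole model cycle's group is available at once; the function names are hypothetical and no streaming or FPGA resource behavior is modeled.

```python
from collections import deque


def side_buffer_rotate(group):
    """Reorder for σ(x) = (x + 1) mod N: move the Nth message to the front.

    Only one message is held aside, so the cost is independent of N.
    Expects a non-empty group.
    """
    buf = deque(group)
    buf.appendleft(buf.pop())
    return list(buf)


def indirection_table_permute(group, table):
    """Reorder for an arbitrary permutation given as table[x] = σ(x).

    More general, but it needs a table entry per router and a buffer
    large enough to hold the whole group.
    """
    out = [None] * len(group)
    for sender, msg in enumerate(group):
        out[table[sender]] = msg
    return out


# Messages sent on the "to next" port by routers 0..3 of a 4-node ring.
sent = ["from0", "from1", "from2", "from3"]
ring_table = [(x + 1) % 4 for x in range(4)]
print(side_buffer_rotate(sent))
print(indirection_table_permute(sent, ring_table))
```

Both calls produce the same order for the ring permutation; the table version also handles permutations that are not simple rotations.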

Page 25: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Torus/Mesh Topology Multiplexed

Mesh: Don’t transmit on non-existent links

Page 26: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Dealing with Heterogeneous Networks

Compose “Mux Ports” with Permutation Ports
In paper: generalize to any topology

Page 27: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Putting It All Together

Page 28: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Typical HAsim Model Leveraging these Techniques

• 16-core chip multiprocessor
• 10-stage pipeline (speculative, bypassed)
• 64-bit Alpha ISA, floating point
• 8 KB lockup-free L1 caches
• 256 KB 4-way set associative L2 cache
• Network: 2 v. channels, 4 slots, x-y wormhole

[Figure: pipeline stages F, BP1, BP2, PCC, IQ, D, X, DM, CQ, C, with ITLB, I$, DTLB, D$, L/S Q, L2$, and Route]

• Single detailed pipeline, 16-way time-multiplexed
• 64-bit Alpha functional partition, floating point
• Caches modeled with different cache hierarchy
• Single router, multiplexed, 4 permutations

[Figure: Synthesis Results, percentage of Xilinx V5 330T. Bars for Regs, LUTs, and BRAM on a 0% to 100% axis, broken down into LEAP, Func, OCN, L1/L2, and Core]

Page 29: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Time-Multiplexed Multicore Simulation Rate Scaling

               Best   Worst   Avg
FMR            15.7   27.1    18.4

Page 30: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Time-Multiplexed Multicore Simulation Rate Scaling

               Best   Worst   Avg
FMR Per-Core   5.4    14.4    8.95

Page 31: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Time-Multiplexed Multicore Simulation Rate Scaling

               Best   Worst   Avg
FMR Per-Core   8.5    13.5    11.6

Page 32: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Time-Multiplexed Multicore Simulation Rate Scaling

               Best   Worst   Avg
FMR Per-Core   8.45   19.8    11.5

Page 33: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Takeaways

The Latency-Insensitive approach provides a unified approach to interesting tradeoffs

Serialization: Leverage FPGA-efficient circuits at the cost of FMR
• A-Port-based synchronization can amortize cost by giving dynamic average
• Especially if long events are rare

Time-Multiplexing: Reuse datapaths and only duplicate state
• A-Port based approach means not all modules are fully utilized
• Increased utilization means that performance degradation is sublinear
• Time-multiplexing the on-chip network requires permutations

Page 34: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Next Steps

Here we were able to push one FPGA to its limits

What if we want to scale farther?

Next, we’ll explore how latency-insensitivity can help us scale to multiple FPGAs with better performance than traditional techniques

Also how we can increase designer productivity by abstracting the platform

Page 36: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Resynchronizing Ports

Modules follow modified scheme:
• If any incoming port is heavy, or any outgoing port is light, simulate next cycle (when ready)
• Result: balanced w/o centralized coordination

Argument:
• Modules farthest ahead in time will never proceed
• Ports in (out) of this set will be light (resp. heavy)
  – Therefore those modules will try to proceed, but may not be able to
• There’s also a set farthest behind in time
  – Always able to proceed
  – Since graph is connected, simulating only enables modules, makes progress towards quiescence

Page 37: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Other Topologies

Tree
[Figure: tree of Router 0 to Router 6 multiplexed onto one Physical Router. Ports are annotated with seven-entry vectors: [1, 1, 1, 0, 0, 0, 0], [0, 0, 0, 1, 1, 1, 1], [2, 0, 1, 0, 1, 0, 1], and [0, 1, 2, 1, 2, 1, 2]]

Butterfly
[Figure: butterfly network connecting From Core 0..7 through Router 0 to Router 11 to To Core 0..7, multiplexed onto one Physical Router between "From Physical Core" and "To Physical Core". Ports are annotated with the vectors [0, 0, 1, 1, 0, 1, 0, 1, 2, 2, 2, 2], [1, 1, 2, 2, 1, 2, 1, 2, 0, 0, 0, 0], [2, 2, 2, 2, 0, 0, 1, 1, 0, 1, 0, 1], and [0, 1, 0, 1, 0, 1, 0, 1]]

Page 38: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Generalizing OCN Permutations

• Represent model as Directed Graph G = (M, P)
• Label modules M with simulation order: 0..(N-1)
• Partition ports into sets P0..Pm where:
  – No two ports in a set Pm share a source
  – No two ports in a set Pm share a destination
• Transform each Pm into a permutation σm
  – Forall {s, d} in Pm, σm(s) = d
  – Holes in range represent “don’t cares”
  – Always send NoMessage on those steps
• Time-Multiplex module as usual
  – Associate each σm with a physical port

Page 39: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

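Example: Arbitrary Network

[Figure: an arbitrary network of six nodes (0 to 5), with panels labeled A, B, and C. Its ports are partitioned into the sets P0, P1, and P2. Port pairs shown: (1, 0), (3, 1), (5, 1), (1, 2), (2, 3), (4, 0), (0, 4), (4, 1)]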
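The partitioning steps from the Generalizing OCN Permutations slide can be sketched as follows. This is an illustration only: the greedy first-fit strategy and the function names are assumptions, and although the port pairs are the ones listed in this example, the grouping produced here is not necessarily the P0/P1/P2 grouping drawn on the slide.

```python
def partition_ports(edges):
    """Greedily split (source, destination) ports into sets in which no two
    ports share a source and no two ports share a destination."""
    sets = []
    for s, d in edges:
        for group in sets:
            if all(s != s2 and d != d2 for s2, d2 in group):
                group.append((s, d))
                break
        else:
            sets.append([(s, d)])  # no existing set fits: open a new one
    return sets


def to_permutation(group, n_modules):
    """Turn one set into σ, stored as a list with σ[s] = d.

    Entries left as None are the "don't care" holes; a multiplexed module
    would send NoMessage on those steps.
    """
    sigma = [None] * n_modules
    for s, d in group:
        sigma[s] = d
    return sigma


ports = [(1, 0), (3, 1), (5, 1), (1, 2), (2, 3), (4, 0), (0, 4), (4, 1)]
for index, group in enumerate(partition_ports(ports)):
    print(f"P{index}: {group} -> sigma = {to_permutation(group, 6)}")
```

Each resulting σ would be associated with one physical port of the time-multiplexed module, as the last bullet of the generalization describes.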

Page 40: HAsim FPGA-Based Processor Models:    Multicore Models and     Time-Multiplexing

Results: Multicore Simulation Rate

            FMR                  Simulation Rate
            Min    Max    Avg    Min        Max       Avg
Overall     16     218    80     160 kHz    3.2 MHz   625 kHz
Per-Core    5      27     11     1.84 MHz   9.5 MHz   4.54 MHz

• Must simulate multiple cores to get full benefit of time-multiplexed pipelines
• Functional cache pressure is the rate-limiting factor