27
Computer Architecture Lab at ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi, Ken Mai PROTOFLEX Our work in this area has been supported in part by NSF, IBM, Intel, Sun, and Xilinx.

ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

Computer Architecture Lab at

ProtoFlex Tutorial:Full-System MP Simulations Using FPGAs

Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi,James C. Hoe, Babak Falsafi, Ken Mai

PROTOFLEX

Our work in this area has been supported in part by NSF, IBM, Intel, Sun, and Xilinx.

Page 2: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 2

Overview

• Part I: Basic ProtoFlex Concepts

– FPGA Virtualization techniques

– BlueSPARC FPGA-accelerated simulator

• Part II: Hands-on Technology Preview

– Access CMU’s remote FPGA infrastructure

– Compile and run code within an FPGA-accelerated simulation of 16-CPU UltraSPARC III server

– Run real-time CMP cache simulations

Page 3: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 3

Part I: Basic Concepts

Page 4: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 4

Full-system Functional Simulators

• Convenient exploration of real (or future) HW

– Can boot OS, run commercial apps

• But too slow for large-scale (>100-way) MP studies

– Existing functional simulators single-threaded

– High instrumentation overhead

Page 5: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 5

FPGA-based simulation

• Advantages

– Fast and flexible

– Low HW instrumentation performance overhead

Page 6: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 6

FPGA-based simulation

• Caveat: simulation w/ FPGAs more complex than SW

– Large-MP configurations (>100) expensive to scale

– Full-system support need device models + full ISA

Page 7: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 7

Reducing Complexity

Hybrid Full-System Simulation

Multiprocessor Host Interleaving

21

CPUP

Memory Devices

P

Common-case behaviors

Uncommon behaviors Memory

4-way P 4-way P

PP

PP

PP

PP

Page 8: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 8

Outline

• Hybrid Full-System Simulation

• Multiprocessor Host Interleaving

• BlueSPARC Implementation

• Conclusion

Page 9: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 9

3P

Hybrid Full-System Simulation

• 3 ways to map target components

FPGA-only Simulation-only Transplantable

• CPUs can fallback to SW by “transplanting” between hosts

– Only common-case instructions/behaviors implemented in FPGA

– Complex, rare behaviors relegated to software

1 2 3

P P

Memory

CD FIBRE

GFX NIC PCI

TERM

SCSI

PC Host Simulator

FPGA host

1

2

I/O instr

PP

transplant

Transplants reduce full-system complexity

Page 10: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 10

Multiprocessor Host Interleaving

# processors to model in functional simulator

1-to-1 mapping between target and host CPUs

# soft cores implemented in FPGA

1-to-1

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

4-to-1 host interleaving

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

P

Advantages:

• Trade away FPGA throughput for smaller implementation

• Decouple logical simulated size from FPGA host size

• Host processor in FPGA can be made very simple

Page 11: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 11

FPGA Host Processor

• Architecturally executes multiple contexts

– Existing multithreaded micro-architectures are good candidates

• Our design: Instruction-Interleaved Pipeline

– Switch in new processor context on each cycle

– Simple, efficient design w/ no stalling or forwarding

– Long-latency tolerance (e.g., cache miss, transplants)

– Coherence is “free” between contexts mapped to same pipeline

Page 12: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 12

Outline

• Hybrid Full-System Simulation

• Host MP Interleaving

• BlueSPARC Implementation

• Conclusion

Page 13: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 13

BlueSPARC Simulator

• Functionally models 16-CPU UltraSPARC III Server

– HW components are hosted on BEE2 FPGA platform

• Virtutech Simics used as full-system I/O simulator

• 5 Virtex-II Pro 70s (~66 KLUTs)• 2 embedded PowerPCs per FPGA

• Only use 2 out of 5 FPGAs (pipeline + DDR controllers)

Berkeley Emulation Engine 2

Page 14: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 14

BlueSPARC Simulator

Virtex II Pro 70 Core 2 Duo

16-wayPipeline

PowerPC(Hard core)

EthernetInterface to 2ndFPGA (memory)

Processor Bus

Simulated I/O Devices (using

full-system simulator)

Simics simulates devices, memory-

mapped I/O and DMA (~500-1000 µsec)

PowerPC simulates unimplemented

SPARC instructions (~10-50 µsec)

Page 15: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 15

Hybrid host partitioning choices

BlueSPARC PowerPC405 Simics on PC

• Integer ALU• Register Windows• Traps/Interrupts• I-/D-MMU• Memory + Atomics• 36% special SPARC insns

• Multimedia• Floating Point• Rare TLB operations• 64% special SPARC insns

•PCI bus•ISP2200 Fibre Channel•I21152 PCI bridge•IRQ bus•Text Console•SBBC PCI device•Serengeti I/O PROM•Cheerio-hme NIC•SCSI bus

Implementation = 60% of basic ISA,36% of special SPARC insns, 0% devices

Page 16: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 16

BlueSPARC host microarchitecture

TransplantUnit

TransplantUnit

1 2, 3 4,5 6 87 9,10,11 12,13 14

PowerPC405 (transplant service processor)

64KBI-cache64KB

I-cache

I-TLB16-entry(direct-

mapped)x16

I-TLB16-entry(direct-

mapped)x16

I-TLB128-entry

(2-way)x16

I-TLB128-entry

(2-way)x16

ALU1ALU1ALU2ALU2

64KBD-cache64KB

D-cache

TrapUnitTrapUnit

WritebackUnit

WritebackUnit

D-TLB512-entry

(2-way)x16

D-TLB512-entry

(2-way)x16

D-TLB16-entry

(fully-assoc)

x16

D-TLB16-entry

(fully-assoc)

x16

RegFileRegFile

DecodeDecodePC, statex16

PC, statex16

ContextSelectorContextSelector

AssistUnit

AssistUnit

Normal pipeline stageNormal pipeline stage Multi-context stateMulti-context state Transplant support unitTransplant support unit64-bit ISA, SW-visible MMU, complex memory

high # of pipeline stages

Page 17: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 17

BlueSPARC Simulator (continued)

Processing Nodes 16 64-bit UltraSPARC III contexts14-stage instruction-interleaved pipeline

L1 caches Split I/D, 64KB, 64B, direct-mapped, writebackNon-blocking loads/stores16-entry MSHR, 4-entry store buffer

Clock frequency 90MHz on Xilinx V2P70

Main memory 4GB total

Resources (Xilinx V2P70)

33.5 KLUTs (50%), 222 BRAMs (67%) w/o stats+debug43.2 KLUTs (65%), 238 BRAMs (72%)

Instrumentation All internal state fully traceableAttachable to FPGA-based CMP cache simulator

Statistics 25K lines Bluespec HDL, 511 rules, 89 module types

Software Runs unmodified Solaris + closed-source binaries

Page 18: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 18

Performance

39x speedup on average over Simics-trace

010203040506070

orac

le

bzip

2

craf

ty gcc

gzip

pars

ervo

rtex

aver

age

MIP

S

BlueSPARC (90mhz)Simics-fast (2.0GHz C2Duo)Simics-trace

1.18

Page 19: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 19

One more thing…

• Our latest application w/ BlueSPARC: Trace CMP

L1 Caches

Cache

Contents

InstructionCaches

DataCaches

Statistics

L2 Cache

Statistics

8-way Pseudo-LRU

16 64KB 2-way split I&D L1 caches

8 ways

Memory

References

Statistics

4MB 8-way L2 cache

Page 20: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 20

BlueSPARC Demo on BEE2

20

• Demo application

– On-Line Transaction Processing benchmark (TPC-C) in Oracle

– Runs in Solaris 8 (unmodified binary)

– FPGA + Memory directly loaded from Simics checkpoint

4 DDR2 Controllers + 4 GB memory

Ethernet (to Simics

on PC)

Virtex-II Pro 70 (PowerPC & BlueSPARC) RS232 (Debugging)

BEE2 Platform

Page 21: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 21

Summary

• Two techniques for reducing complexity

– Hybrid full-system simulation

– MP host interleaving

• Future work

– Timing extensions

– Larger-scale implementation (hundreds of CPUs)

– Run-time instrumentation tools

Page 22: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 22

Part II: Hands-on Tutorial

Page 23: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 23

Page 24: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 24

Page 25: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 25

Setup Procedures

• Divide up into FOUR groups

• Each group needs 1 laptop with:

– Remote Desktop Connection for Windows XP

• Instructions:

– http://www.ece.cmu.edu/~protoflex/wiki

– user/login: ramp/ramp

Page 26: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 26

Live CMP Cache Statistics

• Statistics updated in real-time on webpage:

– http://www.ece.cmu.edu/~protoflex/asplos

• Click on tabs to watch statistics for any of the four BEE2 boards as they run simulations

Page 27: ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs · 2008. 8. 28. · ProtoFlex Tutorial: Full-System MP Simulations Using FPGAs Eric S. Chung, Michael Papamichael, ... simulation

2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 27

Thanks! Any [email protected]://www.ece.cmu.edu/~protoflex

AcknowledgementsWe would like to thank our colleagues inthe RAMP and TRUSS projects.