Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Computer Architecture Lab at
ProtoFlex Tutorial:Full-System MP Simulations Using FPGAs
Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi,James C. Hoe, Babak Falsafi, Ken Mai
PROTOFLEX
Our work in this area has been supported in part by NSF, IBM, Intel, Sun, and Xilinx.
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 2
Overview
• Part I: Basic ProtoFlex Concepts
– FPGA Virtualization techniques
– BlueSPARC FPGA-accelerated simulator
• Part II: Hands-on Technology Preview
– Access CMU’s remote FPGA infrastructure
– Compile and run code within an FPGA-accelerated simulation of 16-CPU UltraSPARC III server
– Run real-time CMP cache simulations
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 3
Part I: Basic Concepts
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 4
Full-system Functional Simulators
• Convenient exploration of real (or future) HW
– Can boot OS, run commercial apps
• But too slow for large-scale (>100-way) MP studies
– Existing functional simulators single-threaded
– High instrumentation overhead
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 5
FPGA-based simulation
• Advantages
– Fast and flexible
– Low HW instrumentation performance overhead
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 6
FPGA-based simulation
• Caveat: simulation w/ FPGAs more complex than SW
– Large-MP configurations (>100) expensive to scale
– Full-system support need device models + full ISA
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 7
Reducing Complexity
Hybrid Full-System Simulation
Multiprocessor Host Interleaving
21
CPUP
Memory Devices
P
Common-case behaviors
Uncommon behaviors Memory
4-way P 4-way P
PP
PP
PP
PP
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 8
Outline
• Hybrid Full-System Simulation
• Multiprocessor Host Interleaving
• BlueSPARC Implementation
• Conclusion
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 9
3P
Hybrid Full-System Simulation
• 3 ways to map target components
FPGA-only Simulation-only Transplantable
• CPUs can fallback to SW by “transplanting” between hosts
– Only common-case instructions/behaviors implemented in FPGA
– Complex, rare behaviors relegated to software
1 2 3
P P
Memory
CD FIBRE
GFX NIC PCI
TERM
SCSI
PC Host Simulator
FPGA host
1
2
I/O instr
PP
transplant
Transplants reduce full-system complexity
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 10
Multiprocessor Host Interleaving
# processors to model in functional simulator
1-to-1 mapping between target and host CPUs
# soft cores implemented in FPGA
1-to-1
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
4-to-1 host interleaving
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
P
Advantages:
• Trade away FPGA throughput for smaller implementation
• Decouple logical simulated size from FPGA host size
• Host processor in FPGA can be made very simple
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 11
FPGA Host Processor
• Architecturally executes multiple contexts
– Existing multithreaded micro-architectures are good candidates
• Our design: Instruction-Interleaved Pipeline
– Switch in new processor context on each cycle
– Simple, efficient design w/ no stalling or forwarding
– Long-latency tolerance (e.g., cache miss, transplants)
– Coherence is “free” between contexts mapped to same pipeline
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 12
Outline
• Hybrid Full-System Simulation
• Host MP Interleaving
• BlueSPARC Implementation
• Conclusion
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 13
BlueSPARC Simulator
• Functionally models 16-CPU UltraSPARC III Server
– HW components are hosted on BEE2 FPGA platform
• Virtutech Simics used as full-system I/O simulator
• 5 Virtex-II Pro 70s (~66 KLUTs)• 2 embedded PowerPCs per FPGA
• Only use 2 out of 5 FPGAs (pipeline + DDR controllers)
Berkeley Emulation Engine 2
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 14
BlueSPARC Simulator
Virtex II Pro 70 Core 2 Duo
16-wayPipeline
PowerPC(Hard core)
EthernetInterface to 2ndFPGA (memory)
Processor Bus
Simulated I/O Devices (using
full-system simulator)
Simics simulates devices, memory-
mapped I/O and DMA (~500-1000 µsec)
PowerPC simulates unimplemented
SPARC instructions (~10-50 µsec)
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 15
Hybrid host partitioning choices
BlueSPARC PowerPC405 Simics on PC
• Integer ALU• Register Windows• Traps/Interrupts• I-/D-MMU• Memory + Atomics• 36% special SPARC insns
• Multimedia• Floating Point• Rare TLB operations• 64% special SPARC insns
•PCI bus•ISP2200 Fibre Channel•I21152 PCI bridge•IRQ bus•Text Console•SBBC PCI device•Serengeti I/O PROM•Cheerio-hme NIC•SCSI bus
Implementation = 60% of basic ISA,36% of special SPARC insns, 0% devices
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 16
BlueSPARC host microarchitecture
TransplantUnit
TransplantUnit
1 2, 3 4,5 6 87 9,10,11 12,13 14
PowerPC405 (transplant service processor)
64KBI-cache64KB
I-cache
I-TLB16-entry(direct-
mapped)x16
I-TLB16-entry(direct-
mapped)x16
I-TLB128-entry
(2-way)x16
I-TLB128-entry
(2-way)x16
ALU1ALU1ALU2ALU2
64KBD-cache64KB
D-cache
TrapUnitTrapUnit
WritebackUnit
WritebackUnit
D-TLB512-entry
(2-way)x16
D-TLB512-entry
(2-way)x16
D-TLB16-entry
(fully-assoc)
x16
D-TLB16-entry
(fully-assoc)
x16
RegFileRegFile
DecodeDecodePC, statex16
PC, statex16
ContextSelectorContextSelector
AssistUnit
AssistUnit
Normal pipeline stageNormal pipeline stage Multi-context stateMulti-context state Transplant support unitTransplant support unit64-bit ISA, SW-visible MMU, complex memory
high # of pipeline stages
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 17
BlueSPARC Simulator (continued)
Processing Nodes 16 64-bit UltraSPARC III contexts14-stage instruction-interleaved pipeline
L1 caches Split I/D, 64KB, 64B, direct-mapped, writebackNon-blocking loads/stores16-entry MSHR, 4-entry store buffer
Clock frequency 90MHz on Xilinx V2P70
Main memory 4GB total
Resources (Xilinx V2P70)
33.5 KLUTs (50%), 222 BRAMs (67%) w/o stats+debug43.2 KLUTs (65%), 238 BRAMs (72%)
Instrumentation All internal state fully traceableAttachable to FPGA-based CMP cache simulator
Statistics 25K lines Bluespec HDL, 511 rules, 89 module types
Software Runs unmodified Solaris + closed-source binaries
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 18
Performance
39x speedup on average over Simics-trace
010203040506070
orac
le
bzip
2
craf
ty gcc
gzip
pars
ervo
rtex
aver
age
MIP
S
BlueSPARC (90mhz)Simics-fast (2.0GHz C2Duo)Simics-trace
1.18
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 19
One more thing…
• Our latest application w/ BlueSPARC: Trace CMP
L1 Caches
Cache
Contents
InstructionCaches
DataCaches
Statistics
L2 Cache
Statistics
8-way Pseudo-LRU
…
16 64KB 2-way split I&D L1 caches
8 ways
Memory
References
Statistics
…
4MB 8-way L2 cache
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 20
BlueSPARC Demo on BEE2
20
• Demo application
– On-Line Transaction Processing benchmark (TPC-C) in Oracle
– Runs in Solaris 8 (unmodified binary)
– FPGA + Memory directly loaded from Simics checkpoint
4 DDR2 Controllers + 4 GB memory
Ethernet (to Simics
on PC)
Virtex-II Pro 70 (PowerPC & BlueSPARC) RS232 (Debugging)
BEE2 Platform
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 21
Summary
• Two techniques for reducing complexity
– Hybrid full-system simulation
– MP host interleaving
• Future work
– Timing extensions
– Larger-scale implementation (hundreds of CPUs)
– Run-time instrumentation tools
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 22
Part II: Hands-on Tutorial
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 23
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 24
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 25
Setup Procedures
• Divide up into FOUR groups
• Each group needs 1 laptop with:
– Remote Desktop Connection for Windows XP
• Instructions:
– http://www.ece.cmu.edu/~protoflex/wiki
– user/login: ramp/ramp
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 26
Live CMP Cache Statistics
• Statistics updated in real-time on webpage:
– http://www.ece.cmu.edu/~protoflex/asplos
• Click on tabs to watch statistics for any of the four BEE2 boards as they run simulations
2-Mar-2008 RAMP 2008 Tutorial / Eric S. Chung and Michael Papamichael 27
Thanks! Any [email protected]://www.ece.cmu.edu/~protoflex
AcknowledgementsWe would like to thank our colleagues inthe RAMP and TRUSS projects.