19
RAMP Gold RAMPants {rimas,waterman,yunsup}@cs Parallel Computing Laboratory University of California, Berkeley

RAMP Gold

  • Upload
    alcina

  • View
    30

  • Download
    0

Embed Size (px)

DESCRIPTION

RAMP Gold. RAMPants {rimas,waterman,yunsup}@cs. Parallel Computing Laboratory University of California, Berkeley. A Survey of μArch Simulation Trends. Typical ISCA 2008 papers simulated about about twice as many instructions as those in 1998. So what?. A Survey of μArch Simulation Trends. - PowerPoint PPT Presentation

Citation preview

Page 1: RAMP  Gold

RAMP Gold

RAMPants{rimas,waterman,yunsup}@cs

Parallel Computing LaboratoryUniversity of California, Berkeley

Page 2: RAMP  Gold

A Survey of μArch Simulation Trends

• Typical ISCA 2008 papers simulated about about twice as many instructions as those in 1998. So what?

Page 3: RAMP  Gold

A Survey of μArch Simulation Trends

• Something seems broken here…

Page 4: RAMP  Gold

A Survey of μArch Simulation Trends

• Something is clearly broken here.

Page 5: RAMP  Gold

Something is Rotten in theState of California

• A median ISCA ‘08 paper’s simulations run for fewer than four OS scheduling quanta!– We run yesterday’s apps at yesteryear’s

timescales– And attempt to model N communicating cores

with O(1/N) instructions per core?!

• The problem is that simulators are too slow– Irony: since performance scales as

sqrt(complexity), simulated instructions per wall-clock second falls as processors get faster

Page 6: RAMP  Gold

RAMP Gold: Our Solution

• RAMP Gold is an FPGA-based, 100 MIPS manycore simulator– Only 100x slower than real-time– Economical: RTL is BSD-licensed;

commodity HW

Cost Performance(MIPS) Simulations per day

SoftwareSimulator $2,000 0.1 - 1 1

RAMP Gold $2,000 + $750 50 - 100 100

Page 7: RAMP  Gold

Our Target Machine

SPARC V8CORE

SPARC V8CORE

I$I$ D$D$

DRAMDRAM

Shared L2$ / InterconnectShared L2$ / Interconnect

SPARC V8CORE

SPARC V8CORE

I$I$ D$D$

SPARC V8CORE

SPARC V8CORE

I$I$ D$D$

SPARC V8CORE

SPARC V8CORE

I$I$ D$D$

64 cores

Page 8: RAMP  Gold

RAMP Gold Architecture

• Mapping the target machine directly to an FPGA is inefficient

• Solution: split timing and functionality– The timing logic decides how many target

cycles an instruction sequence should take– Simulating the functionality of an

instruction might take multiple host cycles– Target time and host time are orthogonal

Page 9: RAMP  Gold

Function/Timing Split Advantages

• Flexibility– Can configure target at runtime– Synthesize design once, change target

model parameters at will

• Efficient FPGA resource usage– Example 1: model a 2-cycle FPU in 10

host cycles– Example 2: model a 16MB L2$ using

only 256KB host BRAM to store tags/metadata

Page 10: RAMP  Gold

Host Multithreading

• How are we going to model 64 cores?

FF DD XX MM WW

FF DD XX MM WW

FF DD XX MM WW

FF DD XX MM WW

…64 pipelines

time

FF DD XX MM WW

FF DD XX MM WW

FF DD XX MM WW

time

Build 64 pipelines Time-multiplex one pipeline

• Single hardware pipeline with multiple copies of CPU state• No bypass path required• Not multithreaded target!

FF DD XX MM WW

FF DD XX MM WW

Page 11: RAMP  Gold

Cache Modeling

• The cache model maintains tag, state, protocol bits internally

• Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall

tagtag indexindex offsetoffset

tag, statetag, state tag, statetag, state tag, statetag, state

==== ==

hit : don’t stallmiss : stall arbitrary cycles

maxassociativity

Page 12: RAMP  Gold

Putting it all together

instruction cache

ifetch stage

decode stage

register access stage

memory stage

data cache

exception stage

memory controller

cache model & performance counters

Resource Utilization (XC5VLX110T)• LUTs – 14%, BRAM – 23%We can fit 3 pipelines on one FPGA!

Page 13: RAMP  Gold

Infrastructure

ELF to BRAM Loader

Standard ELF Loader

Frontend Test Server

ELF to DRAM Loader

SPARC V8 Verification Suite

(.S or .C)

GNU SPARC v8 Compiler/ Linker(sparc-linux-gcc, sparc-linux-as, sparc-linux-ld)

Customized Linker Script

(.lds)

RAMP Gold Systemverilog Source

Files / netlist(.sv, .v)

ELF big-endian Binaries

Modelsim SE/ Questasim 6.4a

Host dynamic simulation libraries

(.so)

GNU libbfd library(from GNU binutil)

SPARC v8 disassembler C implementationFrontend Links

Xilinx Unisim Library Systemverilog DPI interface

Simulation log/Console output

Checker

C Functional Simulator

Frontend Links

Reference Result

FPGA TargetXilinx ML505, BEE3

Frontend Links

Memory/Register dumps

Solaris/Linux Machine(Handle Syscall)

Page 14: RAMP  Gold

Our accomplishments this semester

Jan 2009 Last Night

Single Threaded 64 Way Host Multithreaded

0.000032GB BRAM 2GB DDR2 SDRAM

Hello World works (sometimes) ParLab Damascene CBIR App,SPLASH2 + SPEC CPU2000

No Timing Model or Introspection

Runtime Configurable Cache Model, Performance Counters

No Floating Point Hardware FPU Multiply/Add + Software Emulation

Page 15: RAMP  Gold

HARDware ain’t no joke

500+ Man Hours DDR2 Memory Controller Debugging

100+ Man Hours DMA Engine/Pipeline Issues

150+ Man Hours CAD Tool Issues

0 Pipeline Corner Cases

Page 16: RAMP  Gold

Sample Use Case: L1 D$ Tradeoffs

• Assume we have a 64-core CMP with private 16KB direct-mapped L1 D$

• In the next tech gen, we can fit either of these improved configurations in a clock cycle:– 32KB direct-mapped L1– 16KB 4-way set-associative L1

• Which should we choose?

Page 17: RAMP  Gold

Sample Use Case: L1 D$ Tradeoffs

• Evidently, the associative cache is superior• It took longer to make these slides than to

run these 10+ billion instruction simulations

Page 18: RAMP  Gold

Future Directions

• RAMP Gold closes two critical feedback loops– Expedient HW/SW co-tuning is within our

grasp– Simulations can now be run on a

thermal timescale, enabling the exploration of temperature-aware scheduling policies

• We intend to explore both avenues!

Page 19: RAMP  Gold

DEMO: Damascene

ImageImage

Convert Colorspace

Convert Colorspace

Textons:K-meansTextons:K-means

Texture GradientTexture Gradient

CombineCombine

Non-maxsuppression

Non-maxsuppression

Intervening ContourIntervening Contour

Generalized EigensolverGeneralized Eigensolver

Oriented Energy Combination

Oriented Energy Combination

Combine, NormalizeCombine, Normalize

ContoursContours

BgBg CgaCga CgbCgb