The MachSuite Benchmark

Brandon Reagen, Robert Adolf, Yakun Sophia Shao, Sam Xi, Gu-Yeon Wei, David Brooks


Who Cares about Accelerators

Architecture
- Cause: Transistor scaling
- Effect: Specialization & SoCs

CAD
- Cause: RTL design costs
- Effect: C-to-RTL tools

ASICs
- Cause: Performance needs
- Effect: Build tuned IC

What’s Next

Architecture
- System Integration
- Composability
- Flexibility

CAD
- Faster Turn Around
- Larger App Space
- Complex Designs

ASICs
- Not much change
- Need high perf ICs
- H.266

What’s Missing

Architecture
- System Integration
- Composability
- Flexibility

CAD
- Faster Turn Around
- Larger App Space
- Complex Designs

ASICs
- Not much change
- Need high perf ICs
- H.266

Well defined specs

Workload definition, common baseline

Tower of Babel Effect

Big Problem.

MachSuite is/has

• 19 application-specific accelerator workloads

• HLS and Aladdin compatible

• Workloads researchers are using today

• Diverse workloads for app space coverage

• Establishes standards without stifling creativity

Why MachSuite

• Existing Benchmarks are not applicable/sufficient

• Works with Accelerator Simulators and CAD tools

• Representative applications covering wide space

• Kernel Selection

• Algorithm Choice

• Implementation Details

WHY MACHSUITE: COMPARING BENCHMARKS

Existing Benchmarks are Insufficient

High-Level Synthesis

Is good at
- Scientific Codes { GEMM, FFT }
- Crypto { AES, DES, SHA }
- Image/Multimedia { Stencils, JPEG, SAD }
- 3 of 13 Berkeley Dwarves [CHStone, ISCAS]

Needs Improvement
- Irregular Behavior { BFS, SPMV CRS }
- Complex App Codes { BackProp, MD }
- Application Space Coverage
- 12 of 13 Berkeley Dwarves [MachSuite, IISWC/BARC]
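To make "Irregular Behavior" concrete, here is a minimal sketch of sparse matrix-vector multiply over compressed row storage (the "SPMV CRS" item above). The function name, argument names, and array layout are illustrative assumptions and are not copied from the MachSuite sources.

```c
#include <assert.h>
#include <stddef.h>

/* Sparse matrix-vector multiply, y = A * x, with A in compressed row storage.
 * All names are hypothetical, chosen for this sketch:
 *   val[k]      value of the k-th stored nonzero
 *   col[k]      column index of the k-th stored nonzero
 *   row_ptr[i]  index into val/col where row i starts (n_rows + 1 entries)
 */
void spmv_crs(size_t n_rows, const double *val, const int *col,
              const int *row_ptr, const double *x, double *y)
{
    assert(val && col && row_ptr && x && y);
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        /* The inner trip count differs per row: it depends on the data. */
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++) {
            /* x is read through col[k], an address known only at run time. */
            sum += val[k] * x[col[k]];
        }
        y[i] = sum;
    }
}
```

The data-dependent loop bound and the indirect read `x[col[k]]` are the two features that typically make this kind of kernel harder to schedule statically than GEMM or FFT.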

Existing Benchmarks not Applicable

• Many Existing GPU Benchmarks
– Rodinia, Parboil, SHOC..

• GPU and Accelerator design spaces differ
– Tuned for GPU architecture
– Implemented in CUDA/OpenCL
– GPU workloads subset of accelerators

WHY MACHSUITE: SIMULATOR/HLS FRIENDLY

Works with Accelerator CAD Tools

High-Level Synthesis (Vivado HLS): C Code + Directives to RTL (Hardware Description Language)

Directives
- Function Units
- Resource Sharing
- Loop Pipelining
- Memory Bandwidth
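As a rough illustration of "C Code + Directives", the sketch below annotates a small loop with two Vivado HLS pragmas: `pipeline` for loop pipelining and `array_partition` for memory bandwidth. The kernel `dot`, the length `N`, and the chosen factors are assumptions made for this example, not MachSuite code. The pragmas are wrapped in `#ifdef __SYNTHESIS__` only so that an ordinary C compiler skips them quietly; this assumes the tool defines `__SYNTHESIS__` during synthesis, as the Vivado HLS documentation describes.

```c
#include <assert.h>

#define N 64  /* hypothetical array length, chosen for illustration */

/* Dot product of two fixed-size arrays, written as an HLS tool would see it.
 * The directives request hardware structure; they do not change the result. */
int dot(const int a[N], const int b[N])
{
#ifdef __SYNTHESIS__
/* Memory bandwidth: split each array into 8 banks so that several
 * elements can be read in the same cycle. */
#pragma HLS array_partition variable=a cyclic factor=8
#pragma HLS array_partition variable=b cyclic factor=8
#endif
    assert(a && b);
    int sum = 0;
    for (int i = 0; i < N; i++) {
#ifdef __SYNTHESIS__
/* Loop pipelining: ask for a new iteration to start every cycle. */
#pragma HLS pipeline II=1
#endif
        sum += a[i] * b[i];
    }
    return sum;
}
```

Compiled with a normal C compiler, `dot` is plain software; the same source with different directive values would describe different points in the power/performance design space.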

Works with Simulators

MachSuite

Directives
- Function Unit Selection
- Loop Pipelining
- Memory Bandwidth

Trade-off Power/Performance

WHY MACHSUITE: WORKLOAD DIVERSITY AND COVERAGE

Incorporates Applications of Interest

Covers Application Space

FFT, GEMM, STENCIL

12 of 13 Dwarves

MachSuite Design

• Existing Benchmarks are not applicable/sufficient

• Works with Accelerator Simulators and CAD tools

• Representative applications covering wide space

• Kernel Selection

• Algorithm Choice

• Implementation Details

MACHSUITE DESIGN: KERNEL SELECTION

Kernel Selection

• Kernel = A specific problem
– E.g: SORT

• The Problem
– Not all using the same kernels
– Comparing similar sounding kernels doesn’t work

Let’s just pick one

MACHSUITE DESIGN: ALGORITHM CHOICE

Algorithm Choice

• Algorithm = A specific solution
– A type of kernel
– E.g: Merge or Radix SORT

• The problem
– Reporting kernel too high level
– Ideal algorithms different across SoCs

Standardization without limitation
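The kernel/algorithm distinction can be sketched in code: one kernel (SORT), two algorithms (merge and radix) that return the same answer through very different computation and memory access patterns. The function names and the 8-bit digit width are assumptions made for this example and are not taken from the MachSuite sources.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Algorithm 1: top-down merge sort (comparison based). Sorts a[lo, hi). */
static void merge_rec(uint32_t *a, uint32_t *tmp, size_t lo, size_t hi)
{
    if (hi - lo < 2) return;              /* 0 or 1 element: already sorted */
    size_t mid = lo + (hi - lo) / 2;
    merge_rec(a, tmp, lo, mid);
    merge_rec(a, tmp, mid, hi);
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Returns 0 on success, -1 if scratch memory could not be allocated. */
int sort_merge(uint32_t *a, size_t n)
{
    if (n < 2) return 0;
    assert(a != NULL);
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;
    merge_rec(a, tmp, 0, n);
    free(tmp);
    return 0;
}

/* Algorithm 2: least-significant-digit radix sort, 8 bits per pass. */
int sort_radix(uint32_t *a, size_t n)
{
    if (n < 2) return 0;
    assert(a != NULL);
    uint32_t *tmp = malloc(n * sizeof *tmp);
    if (!tmp) return -1;
    for (unsigned shift = 0; shift < 32; shift += 8) {
        size_t count[257] = {0};
        /* Histogram: count[d + 1] holds how many keys have digit d. */
        for (size_t i = 0; i < n; i++) count[((a[i] >> shift) & 0xFF) + 1]++;
        /* Prefix sum (scan): count[d] becomes the first output slot for digit d. */
        for (size_t d = 0; d < 256; d++) count[d + 1] += count[d];
        /* Stable scatter into the scratch buffer. */
        for (size_t i = 0; i < n; i++) tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
    return 0;
}
```

Calling `sort_merge` and `sort_radix` on copies of the same array yields identical output, which is why reporting only "SORT" says little about the hardware being measured: the merge version is dominated by compares and branches, the radix version by histogram, scan, and scatter passes.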

MACHSUITE DESIGN: IMPLEMENTATION DETAILS

Implementation Details

• Implementation = Specific code for algorithm
– E.g: Stencil in Rodinia vs Parboil

• The problem
– Can cause misleading results
– Performance depends on tuning

Separate signal from noise

Performance Variance due to Implementation Details

1 Kernel, 1 Algorithm, 1 Implementation

1 Kernel, 1 Algorithm, 2 Implementations: ~10x Performance, same power

Root Causing Inefficiency

Same directives:
- Single port SRAMs
- 8 way partition
- Same loops pipelined

Different Implementations for parallel SCAN

What Happened

• “Unoptimized C Code”
– Pipelining result: Target II: 1, Final II: 30

• “Optimized C Code”
– Pipelining result: Target II: 1, Final II: 8
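For readers unfamiliar with the term, II is the initiation interval: the number of cycles between the starts of consecutive loop iterations in a pipelined loop. A common textbook estimate of a pipelined loop's cycle count is sketched below. The function name is hypothetical, and any trip count or pipeline depth plugged into it is an assumption, since the slides give only the II values.

```c
#include <assert.h>

/* Textbook estimate for one pipelined loop:
 *   cycles = depth + II * (trip_count - 1)
 * depth      latency of a single iteration, in cycles
 * ii         initiation interval achieved by the tool
 * trip_count number of loop iterations
 * This models the loop alone; it ignores everything around it. */
long pipelined_cycles(long trip_count, long ii, long depth)
{
    assert(trip_count >= 1 && ii >= 1 && depth >= 1);
    return depth + ii * (trip_count - 1);
}
```

With made-up values of 128 iterations and a depth of 4, the estimate gives 3814 cycles at II 30, 1020 cycles at II 8, and 131 cycles at the target II of 1. These numbers only show how strongly the final II drives loop latency; they are not measurements from the talk.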

What Happened: Unoptimized C Code

for i = 1 : Block
    for radixID : Radix
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];

What Happened: Optimized C Code

for radixID : Radix
    for i = 1 : Block
        bucket[i*Block + radixID] += bucket[i*Block + radixID - 1];
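The two loop orders can be written as compilable C to show that they are a loop interchange of each other. This is a sketch under stated assumptions: the sizes `BLOCK` and `RADIX` are hypothetical, `i` is taken to run from 1 to `BLOCK - 1`, and `radixID` is assumed to start at 1 so that `radixID - 1` never reaches into the previous row. The slides do not give these bounds.

```c
#include <assert.h>

#define BLOCK 16   /* hypothetical row length, also used as the row stride */
#define RADIX 16   /* hypothetical bound on radixID; must not exceed BLOCK */

/* "Unoptimized" order from the slide: radixID is the inner loop, so every
 * inner iteration reads the element the previous inner iteration just wrote. */
void scan_unoptimized(int bucket[BLOCK * BLOCK])
{
    assert(RADIX <= BLOCK);
    for (int i = 1; i < BLOCK; i++)
        for (int radixID = 1; radixID < RADIX; radixID++)
            bucket[i * BLOCK + radixID] += bucket[i * BLOCK + radixID - 1];
}

/* "Optimized" order from the slide: i is the inner loop, so consecutive
 * inner iterations touch addresses BLOCK apart and do not depend on each
 * other; the dependence is carried by the outer loop instead. */
void scan_optimized(int bucket[BLOCK * BLOCK])
{
    assert(RADIX <= BLOCK);
    for (int radixID = 1; radixID < RADIX; radixID++)
        for (int i = 1; i < BLOCK; i++)
            bucket[i * BLOCK + radixID] += bucket[i * BLOCK + radixID - 1];
}
```

Under these assumptions both functions produce the same array, because each element still receives the finished value of its left neighbor. One plausible reading of the II results is that moving the read-after-write dependence out of the pipelined inner loop is what lets the tool reach a lower final II.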

Solution

[Figure, shown in three builds: two SCAN Accelerator blocks and two MEMORY blocks]

MachSuite

• 19 application-specific accelerator workloads

• Benchmarks work with HLS and Aladdin

• Represents workloads researchers are using

• Diverse workloads, broad application space

• Standards with limited restrictions

MachSuite Available on GitHub

http://breagen.github.io/MachSuite/

Publications

Aladdin: [ISCA’14]

MachSuite: [IISWC’14]

Quantifying Acceleration: [ISLPED’13]