University of Michigan, Electrical Engineering and Computer Science
Amir Hormati, Mehrzad Samadi, Mark Woh, Trevor Mudge, and Scott Mahlke

Sponge: Portable Stream Programming on Graphics Engines


Page 1: Sponge: Portable Stream Programming on Graphics Engines


Page 2: Sponge: Portable Stream Programming on Graphics Engines


Why GPUs?
• Every mobile and desktop system will have one
• Affordable and high performance
• Over-provisioned
• Programmable
(Image: Sony PlayStation Phone)
(Chart: theoretical GFLOPS, 2002 to 2011: NVIDIA GPUs (GeForce 6800 Ultra, 7800 GTX, 8800 GTX, GTX 280, GTX 480) pull away from Intel CPUs, on an axis reaching 1500 GFLOPS.)

Page 3: Sponge: Portable Stream Programming on Graphics Engines


GPU Architecture
(Diagram: the CPU launches Kernel 1 and then Kernel 2 over time onto streaming multiprocessors SM 0 through SM 29; each SM has 8 cores plus registers and shared memory, and the SMs reach global (device) memory through an interconnection network.)

Page 4: Sponge: Portable Stream Programming on Graphics Engines


GPU Programming Model
(Diagram: `int RegisterVar` maps to a per-thread register; `int LocalVarArray[10]` to per-thread local memory; `__shared__ int SharedVar` to per-block shared memory; per-application device global memory sits below; threads form blocks, blocks form grids, and grids (Grid 0, Grid 1) execute in sequence.)
• Threads form blocks, blocks form a grid
• All the threads run one kernel
• Registers are private to each thread
• Registers spill to local memory
• Shared memory is shared between the threads of a block
• Global memory is shared between all blocks

Page 5: Sponge: Portable Stream Programming on Graphics Engines


GPU Execution Model
(Diagram: the blocks of Grid 1 are distributed across the streaming multiprocessors (SM 0, SM 1, SM 2, SM 3, ...); each SM has 8 cores plus shared memory and registers.)

Page 6: Sponge: Portable Stream Programming on Graphics Engines


GPU Execution Model
(Diagram: Blocks 0 through 3 are assigned to SM 0; within a block, threads are grouped into warps of 32: Warp 0 holds thread IDs 0 through 31, Warp 1 holds 32 through 63.)

Page 7: Sponge: Portable Stream Programming on Graphics Engines


GPU Programming Challenges
(Chart: execution time in ms versus registers per thread (8 to 64) for a kernel optimized for a high-performance desktop GPU (GeForce GTX 285) and for a mobile GPU (GeForce 8400 GS); the best configuration differs between the two cards.)
• Restructuring data efficiently for a complex memory hierarchy: global memory, shared memory, registers
• Partitioning work between the CPU and the GPU
• Lack of portability between different generations of GPUs: registers, active warps, size of global memory, size of shared memory
• Will vary even more: newer high-performance cards (e.g., NVIDIA's Fermi) and mobile GPUs with fewer resources

Page 8: Sponge: Portable Stream Programming on Graphics Engines


Nonlinear Optimization Space
(Figure: the SAD optimization space, 908 configurations [Ryoo et al., CGO '08].)
We need a higher level of abstraction!

Page 9: Sponge: Portable Stream Programming on Graphics Engines


Goals
• Write-once parallel software
• Free the programmer from low-level details
(Diagram: one parallel specification maps to shared-memory processors (C + Pthreads), SIMD engines (C + intrinsics), FPGAs (Verilog/VHDL), and GPUs (CUDA/OpenCL).)

Page 10: Sponge: Portable Stream Programming on Graphics Engines


Streaming
• Higher level of abstraction
• Decouples computation and memory accesses
• Coarse-grained exposed parallelism, exposed communication
• Programmers can focus on the algorithm instead of low-level details
• Streaming actors use buffers to communicate
• Much recent work on extending the portability of streaming applications
(Diagram: a stream graph of six actors, with a splitter and joiner expressing task-level parallelism.)

Page 11: Sponge: Portable Stream Programming on Graphics Engines


Sponge
– Generates optimized CUDA for a wide variety of GPU targets
– Performs an array of optimizations on stream graphs
– Optimizes and ports across different GPU generations
– Utilizes the memory hierarchy (registers, shared memory, coalescing)
– Efficiently utilizes the streaming cores
(Diagram: Sponge's optimization passes: reorganization and classification, memory layout, graph restructuring, register optimization, shared/global memory selection, helper threads, bank-conflict resolution, loop unrolling, software prefetching.)

Page 12: Sponge: Portable Stream Programming on Graphics Engines


GPU Performance Model
• Memory-bound kernels: memory instructions dominate, so total time ≈ memory time.
  (Timeline: memory instructions M0 through M7 back to back, with computation C0 through C7 hidden under them.)
• Computation-bound kernels: computation instructions dominate, so total time ≈ computation time.
  (Timeline: computation instructions C0 through C7 back to back, with memory accesses M0 through M7 hidden under them.)
(M = memory instructions, C = computation instructions)
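The two cases amount to a first-order model: with memory and computation overlapped, kernel time is whichever component is larger. A minimal Python sketch of this idea (the per-instruction cycle costs are illustrative assumptions, not Sponge's actual cost model):

```python
# First-order kernel-time model: with memory and computation fully
# overlapped, total time is dominated by the slower instruction class.
# The cycle costs below are illustrative assumptions.
def kernel_time(mem_insts, comp_insts, mem_cost=4, comp_cost=1):
    memory_time = mem_insts * mem_cost
    compute_time = comp_insts * comp_cost
    return max(memory_time, compute_time)

def is_memory_bound(mem_insts, comp_insts, mem_cost=4, comp_cost=1):
    # Memory-bound kernels: total time is approximately the memory time.
    return mem_insts * mem_cost >= comp_insts * comp_cost
```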

Page 13: Sponge: Portable Stream Programming on Graphics Engines


Actor Classification
• High-Traffic actors (HiT)
  – Large number of memory accesses per actor
  – Fewer threads when shared memory is used
  – Using shared memory underutilizes the processors
• Low-Traffic actors (LoT)
  – Fewer memory accesses per actor
  – More threads
  – Using shared memory increases performance
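The HiT/LoT split can be sketched as an occupancy check: shared-memory staging caps the number of resident threads, and when the cap falls below a useful level the actor is better treated as HiT. All constants below (shared-memory size, element size, thread threshold) are hypothetical, not Sponge's actual heuristics:

```python
def max_threads_with_staging(pops, pushes, shared_bytes=16 * 1024, elem_bytes=4):
    # Each thread stages its pops and pushes in per-block shared memory.
    per_thread_bytes = (pops + pushes) * elem_bytes
    return shared_bytes // per_thread_bytes

def classify_actor(pops, pushes, min_useful_threads=128):
    # HiT: heavy per-actor traffic leaves too few threads, so shared-memory
    # staging would underutilize the SM. LoT: enough threads remain.
    if max_threads_with_staging(pops, pushes) < min_useful_threads:
        return "HiT"
    return "LoT"
```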

Page 14: Sponge: Portable Stream Programming on Graphics Engines


Global Memory Accesses
(Diagram: four threads each run actor A[4,4] directly out of global memory; thread t reads its own contiguous elements 4t through 4t+3, so at each step the warp touches stride-4 addresses such as 0, 4, 8, 12.)
• Large access latency
• The words are not accessed in sequence
• No coalescing
(A[i, j] denotes an actor A with i pops and j pushes.)

Page 15: Sponge: Portable Stream Programming on Graphics Engines


(Diagram: the same four A[4,4] threads with shared-memory staging: the threads first cooperatively copy global memory to shared memory, touching consecutive addresses at each step (0, 1, 2, 3, then 4, 5, 6, 7, and so on); each thread then works out of shared memory, and results are copied back shared-to-global the same way.)
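The two access patterns can be made concrete by listing which global addresses the threads touch at each step: T threads, each consuming N contiguous elements (an illustrative sketch, not generated code):

```python
def direct_pattern(num_threads, elems_per_thread):
    # Without staging: thread t walks its own chunk, so at step k the warp
    # touches stride-N addresses (k, N+k, 2N+k, ...): not coalesced.
    return [[t * elems_per_thread + k for t in range(num_threads)]
            for k in range(elems_per_thread)]

def staged_pattern(num_threads, elems_per_thread):
    # With a cooperative copy: at step k the warp touches consecutive
    # addresses (kT, kT+1, ..., kT+T-1): fully coalesced.
    return [[k * num_threads + t for t in range(num_threads)]
            for k in range(elems_per_thread)]
```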

Page 16: Sponge: Portable Stream Programming on Graphics Engines


Using Shared Memory
• Shared memory is about 100x faster than global memory
• Coalesces all global memory accesses
• The number of threads is limited by the size of the shared memory

Unoptimized kernel:

    Begin Kernel <<<Blocks, Threads>>>:
        For number of iterations
            Work
    End Kernel

With shared-memory staging:

    Begin Kernel <<<Blocks, Threads>>>:
        For number of iterations
            For number of pops
                Shared ← Global
            syncthreads
            Work
            syncthreads
            For number of pushes
                Global ← Shared
    End Kernel
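The staged kernel's data flow can be mirrored in plain sequential Python (the actor's `work` function and the sizes are hypothetical; a real kernel runs the thread loops in parallel, with barriers where `syncthreads` appears):

```python
def run_staged_block(global_in, work, num_threads, pops, pushes):
    shared = [0] * (num_threads * pops)
    global_out = [0] * (num_threads * pushes)
    # Phase 1: cooperative global-to-shared copy (barrier follows on a GPU).
    for t in range(num_threads):
        for k in range(pops):
            shared[t * pops + k] = global_in[t * pops + k]
    # Phase 2: each thread runs the actor on its pops from shared memory,
    # then its pushes are written back out (second barrier in between on a GPU).
    for t in range(num_threads):
        outputs = work(shared[t * pops:(t + 1) * pops])
        global_out[t * pushes:(t + 1) * pushes] = outputs
    return global_out
```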

Page 17: Sponge: Portable Stream Programming on Graphics Engines


Helper Threads
• Shared memory limits the number of threads
• Underutilized processors can fetch data
• All the helper threads are in one warp (no control-flow divergence)

    Begin Kernel <<<Blocks, Threads + Helpers>>>:
        For number of iterations
            If helper thread
                Shared ← Global
            syncthreads
            If worker thread
                Work
            syncthreads
            If helper thread
                Global ← Shared
    End Kernel

Page 18: Sponge: Portable Stream Programming on Graphics Engines


Data Prefetch
• Better register utilization
• Data for iteration i+1 is moved to registers
• Data for iteration i is moved from registers to shared memory
• Allows the GPU to overlap instructions

    Begin Kernel <<<Blocks, Threads>>>:
        For number of pops
            Regs ← Global
        For number of iterations
            For number of pops
                Shared ← Regs
            If not the last iteration
                For number of pops
                    Regs ← Global
            syncthreads
            Work
            syncthreads
            For number of pushes
                Global ← Shared
    End Kernel
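The register/shared double buffering can be sketched sequentially (on a real GPU, the work for iteration i overlaps the global load for iteration i+1; the names are illustrative):

```python
def prefetch_pipeline(chunks, work):
    # chunks[i] holds the pop data for iteration i; `regs` models the
    # register buffer and `shared` the shared-memory buffer.
    results = []
    regs = chunks[0]                  # prologue: Regs <- Global
    for i in range(len(chunks)):
        shared = regs                 # Shared <- Regs
        if i + 1 < len(chunks):       # if not the last iteration:
            regs = chunks[i + 1]      #     Regs <- Global (prefetch)
        results.append(work(shared))  # Work; overlaps the prefetch on a GPU
    return results
```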

Page 19: Sponge: Portable Stream Programming on Graphics Engines


Loop Unrolling
• Similar to traditional unrolling
• Allows the GPU to overlap instructions
• Better register utilization
• Less loop-control overhead
• Can also be applied to the memory-transfer loops

    Begin Kernel <<<Blocks, Threads>>>:
        For number of iterations/2
            For number of pops
                Shared ← Global
            syncthreads
            Work
            syncthreads
            For number of pushes
                Global ← Shared
            For number of pops
                Shared ← Global
            syncthreads
            Work
            syncthreads
            For number of pushes
                Global ← Shared
    End Kernel

Page 20: Sponge: Portable Stream Programming on Graphics Engines


Methodology
• Set of benchmarks from the StreamIt suite
• 3 GHz Intel Core 2 Duo CPU with 6 GB RAM
• NVIDIA GeForce GTX 285

    Stream Processors | Processor Clock | Memory Configuration | Memory Bandwidth
    240               | 1476 MHz        | 2 GB DDR3            | 159.0 GB/s

Page 21: Sponge: Portable Stream Programming on Graphics Engines


Result (Baseline CPU)
(Chart: speedup over the CPU baseline, with and without CPU-GPU transfer time, for DCT, FFT, Matrix Multiply, Matrix Multiply Block, Bitonic, Batcher, Radix, Merge Sort, Comparison Counting, Vector Add, and Histogram; the annotated averages are 10x with transfer and 24x without.)

Page 22: Sponge: Portable Stream Programming on Graphics Engines


Result (Baseline GPU)
(Chart: speedup over an unoptimized CUDA baseline for the same benchmarks, broken down by optimization: shared/global memory selection, prefetch/unrolling, helper threads, and graph restructuring, with annotated contributions of 64%, 3%, 16%, and 16%.)

Page 23: Sponge: Portable Stream Programming on Graphics Engines


Conclusion
• Future systems will be heterogeneous
• GPUs are an important part of such systems
• Programming complexity is a significant challenge
• Sponge automatically creates optimized CUDA code for a wide variety of GPU targets
• It provides portability by performing an array of optimizations on stream graphs

Page 24: Sponge: Portable Stream Programming on Graphics Engines


Questions

Page 25: Sponge: Portable Stream Programming on Graphics Engines


Spatial Intermediate Representation
• StreamIt
• Main constructs:
  – Filter: encapsulates computation
  – Pipeline: expresses pipeline parallelism
  – Splitjoin: expresses task-level parallelism
  – Other constructs not relevant here
• Exposes different types of parallelism
  – Composable, hierarchical
• Stateful and stateless filters
(Diagram: a pipeline containing filters and a splitjoin.)


Page 27: Sponge: Portable Stream Programming on Graphics Engines


Bank Conflict
(Diagram: three threads running A[8,8] access shared memory with stride s = 8; at the first step they touch addresses 0, 8, 16, which with 16 banks fall in banks 0, 8, 0, so threads 0 and 2 conflict, and the pattern repeats at every step: 1, 9, 1; 2, 10, 2; and so on.)

    data = buffer[BaseAddress + s * ThreadId]

Page 28: Sponge: Portable Stream Programming on Graphics Engines


Removing Bank Conflict
(Diagram: with stride s = 9, the three A[8,8] threads touch addresses 0, 9, 18 at the first step, which map to distinct banks (0, 9, 2), and they stay in distinct banks at every subsequent step, so no conflict occurs.)

    data = buffer[BaseAddress + s * ThreadId]

If GCD(# of banks, s) is 1 there will be no bank conflict; since the number of banks is a power of two, s must be odd.
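The GCD rule can be checked directly; 16 banks matches GPUs of this generation, and the helper names are ours:

```python
from math import gcd

def banks_touched(stride, num_threads=16, num_banks=16):
    # Thread t reads buffer[base + stride * t]; its bank is the
    # address modulo the bank count.
    return {(stride * t) % num_banks for t in range(num_threads)}

def conflict_free(stride, num_banks=16):
    # All threads land in distinct banks iff gcd(num_banks, stride) == 1;
    # with a power-of-two bank count, that means the stride must be odd.
    return gcd(num_banks, stride) == 1
```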