
Page 1: Parallel Computing With CUDA

May 25, 2011

Ryan Albright, Oregon State University

Page 2: Outline

1 Introduction to CUDA

2 Hardware

3 Software

4 Research

Page 3: Outline (current section: Introduction to CUDA)

Page 4: About CUDA

Compute Unified Device Architecture

CUDA is both an architecture and a programming model

Performance scales with the number of "cores"

Page 5: Outline (current section: Hardware)

Page 6: SIMD

Single Instruction, Multiple Data

Optimized for dense, repetitive tasks: the same operation is applied to many data elements at once (a minimal kernel sketch follows)
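A minimal sketch (ours, not from the slides) of what this looks like in CUDA, where NVIDIA's flavor of SIMD execution is usually called SIMT: every thread runs the same kernel code, each on its own element of the data.

// Hypothetical example: each thread scales one element of an array.
// The instruction stream is identical for all threads; only the index differs.
__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  /* guard threads past the end of the array */
        data[i] = alpha * data[i];
}

/* Host side: launch enough 256-thread blocks to cover n elements.      */
/* scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);                     */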

Page 7: GF100 Architecture

Graphics Processing Clusters

Each cluster is divided into Streaming Multiprocessors (SMs)

Page 8: What is inside a CUDA Core?

Each core also has 1 Shader Module

Page 9: Some Example Cards

We are using the GTS 450

Page 10: Compare with ATI

How is this different from ATI's cards?

ATI typically has more "processors" (really just ALUs)

NVIDIA has fewer, but more independent, "cores" (with shaders)

Often the overall performance is very similar

Page 11: ATI vs NVIDIA

(Benchmark chart shown on the slide)

Page 12: Outline (current section: Software)

Page 13: Software

Languages supported: C, C++, Java, Perl, Python, Ruby, MATLAB, Mathematica, Fortran, Haskell, Lua, IDL, .NET

Limited to NVIDIA architectures

Really, really easy to set up...

Many usable standard libraries

Page 14: Setting Up CUDA

Go to: http://developer.nvidia.com/cuda-toolkit-sdk

Download the drivers, CUDA Toolkit, SDK & examples, and the "Getting Started" manual

Install the drivers, then unpack the rest (it automatically puts everything in the right place)

Run "make" in the C directory to set up the libraries

Run "make" in any of the example project directories to build that project

If this works, CUDA is set up properly

Page 15: Basic Steps To Use CUDA

Allocate Local Memory

Allocate GPU Memory

Store Values in Local Memory

Pass Values to GPU Memory

Give the GPU an instruction (launch a kernel)

Pass Results Back

Clean Up GPU Memory (a minimal sketch of all of these steps follows below)
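As a bridge between this list and the CUBLAS example on the following slides, here is a minimal sketch of the same steps using the plain CUDA runtime API. This example is ours, not part of the sgemm sample; the kernel name and sizes are arbitrary. Each numbered comment corresponds to a step above; compile with nvcc.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void double_elements(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);                  /* 1. allocate local (host) memory */
    float *d_x = NULL;
    cudaMalloc((void **)&d_x, bytes);                      /* 2. allocate GPU memory          */

    for (int i = 0; i < n; i++) h_x[i] = (float)i;         /* 3. store values in host memory  */

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   /* 4. pass values to GPU memory    */

    double_elements<<<(n + 255) / 256, 256>>>(d_x, n);     /* 5. give the GPU an instruction  */

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   /* 6. pass results back            */

    cudaFree(d_x);                                          /* 7. clean up GPU memory          */

    printf("h_x[10] = %f\n", h_x[10]);                      /* should print 20.0               */
    free(h_x);
    return 0;
}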

Page 16: sgemm Example

Stands for Single-Precision General Matrix Multiply

Also called dense matrix multiply

Requires multiplying two N x N matrices

Many simple multiplies work well on highly parallel processors (the operation is written out below)
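For reference, the cublasSgemm call shown on a later slide performs the standard BLAS GEMM update (the scalars alpha and beta come from the host code):

    C ← α·A·B + β·C

where A, B, and C are N x N single-precision matrices.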

Page 17: Memory Allocation: Host

/* Allocate host memory for the matrices */
h_A = (float *)malloc(n2 * sizeof(h_A[0]));
if (h_A == 0) {
    fprintf(stderr, "!!!! host memory allocation error (A)\n");
    return EXIT_FAILURE;
}
h_B = (float *)malloc(n2 * sizeof(h_B[0]));
if (h_B == 0) {
    fprintf(stderr, "!!!! host memory allocation error (B)\n");
    return EXIT_FAILURE;
}
h_C = (float *)malloc(n2 * sizeof(h_C[0]));
if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Fill the matrices with test data */
for (i = 0; i < n2; i++) {
    h_A[i] = rand() / (float)RAND_MAX;
    h_B[i] = rand() / (float)RAND_MAX;
    h_C[i] = rand() / (float)RAND_MAX;
}

Page 18: Memory Allocation: Device

/* Allocate device memory for the matrices */
status = cublasAlloc(n2, sizeof(d_A[0]), (void **)&d_A);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (A)\n");
    return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_B[0]), (void **)&d_B);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (B)\n");
    return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_C[0]), (void **)&d_C);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Initialize the device matrices with the host matrices */
status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write A)\n");
    return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write B)\n");
    return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write C)\n");
    return EXIT_FAILURE;
}

Page 19: Multiply and Return Values

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}

/* Allocate host memory for reading back the result from device memory */
h_C = (float *)malloc(n2 * sizeof(h_C[0]));
if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Read the result back */
status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    return EXIT_FAILURE;
}

Page 20: Speedup: Timing Logs

sgemm test running..
Standard C version took: 35785 us
CUBLAS SGEMM version took: 628 us
PASSED

Press ENTER to exit...

Using a GTS 450 (192 cores)

Matrices of size 256 x 256 were used

Why isn't this 192x faster? (The measured speedup here is about 57x.)

Page 21: Outline (current section: Research)

Page 22: Energy Efficient Computing

What does this mean?

Typically measured in performance per watt (a rough example follows below)

How do systems do this today?

  Embedded systems: "Go, go, go ... sleep"
  Typical PC: "Uh... we don't need that core right now"
  Supercomputer: "Umm... I don't care! More data please!"

How can we improve this?
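As a rough, purely illustrative example (our numbers, not measurements from this work): a card sustaining 500 GFLOP/s while drawing 100 W delivers

    500 GFLOP/s ÷ 100 W = 5 GFLOP/s per watt,

so either raising throughput or cutting power draw improves the metric.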

Page 23: Power Saving Techniques

Power gating

  Turn off sections that aren't being used

Lower supply voltage

  Usually requires slowing down the processor
  Dynamic power is related to the square of the voltage: P = C·V²·f (a worked example follows below)

Clock gating

  Works in synchronous systems
  Disables portions so less of the circuit switches states
  Can lose speed and performance, and induce errors (less circuitry to check)

How can we save power without losing performance?

We believe this should be done by optimizing the system for the job it is performing at any given time.
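A quick worked example of that V² dependence (illustrative numbers only): holding C and f fixed, dropping the supply voltage from 1.0 V to 0.8 V scales dynamic power by

    (0.8 / 1.0)² = 0.64,

a 36% reduction, which is why voltage scaling is so attractive even though it usually forces a lower clock frequency as well.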

Page 24: Optimize Hardware Settings for Algorithms

We are looking into optimizing hardware settings based on the algorithms being run. These include:

Memory Bandwidth

Main Clock Frequency

Supply Voltage

Number of Cores

Whatever else we can get our hands on...

A good example is comparing computation-limited vs. memory-bandwidth-limited algorithms.

Page 25: Why a GPU?

Massively parallel execution (the NVIDIA GTS 450 has 192 cores)

Overclockable and voltage scalable

Supply easily interrupted for Power measurements

CUDA Environment and Libraries

Good Relationship with NVIDIA

Page 26: Algorithms We Are Looking At

Dense matrix multiply

  Brute-force matrix multiply
  Memory intensive
  Computationally simple for a highly parallel architecture

Sparse matrix multiply

  Matrices typically with lots of zeros
  Typically can be compressed (see the storage sketch below)
  Less memory intensive
  Computationally more complicated

These should allow for fairly straightforward comparisons between two algorithms that are known to have these limitations.
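To make "compressed" concrete, here is a small host-side C sketch (ours; the slides do not commit to a particular format) using Compressed Sparse Row (CSR), one common choice. Instead of storing all N*N entries, only the nonzeros plus their column indices and row offsets are kept.

/* Dense storage would need n*n floats, even if most of them are zero. */

/* Compressed Sparse Row (CSR): only the nnz nonzero entries are kept. */
typedef struct {
    int    n;        /* matrix dimension                                */
    int    nnz;      /* number of nonzero entries                       */
    float *val;      /* nonzero values, length nnz                      */
    int   *col;      /* column index of each value, length nnz          */
    int   *row_ptr;  /* start of each row in val/col, length n + 1      */
} csr_matrix;

/* y = A*x for a CSR matrix: far less memory traffic than dense, but the
   indirect x[col[k]] accesses are irregular and harder to parallelize. */
void csr_matvec(const csr_matrix *A, const float *x, float *y)
{
    for (int i = 0; i < A->n; i++) {
        float sum = 0.0f;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}

Those indirect accesses are exactly what makes the sparse case computationally more complicated on a GPU, even though it moves much less data than the dense case.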

Page 27: Conclusion/Recap

1 Introduction to CUDA

2 Hardware

3 Software

4 Research

Page 28: Questions

Questions?