
Page 1: Parallel Computing With CUDA

May 25, 2011

Ryan Albright, Oregon State University

Page 2: Outline

1 Introduction to CUDA

2 Hardware

3 Software

4 Research

Page 3: Outline (current section: Introduction to CUDA)

Page 4: About CUDA

Compute Unified Device Architecture

CUDA is both an architecture and a programming model

Performance scales with the number of "cores"

Page 5: Outline (current section: Hardware)

Page 6: SIMD

Single Instruction, Multiple Data

Optimized for dense, repetitive tasks: the same operation is applied to many data elements at once (a minimal kernel sketch follows)
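A minimal sketch (ours, not from the slides) of what this looks like in CUDA, where NVIDIA's flavor of SIMD execution is usually called SIMT: every thread runs the same kernel code, each on its own element of the data.

// Hypothetical example: each thread scales one element of an array.
// The instruction stream is identical for all threads; only the index differs.
__global__ void scale(float *data, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  /* guard threads past the end of the array */
        data[i] = alpha * data[i];
}

/* Host side: launch enough 256-thread blocks to cover n elements.      */
/* scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);                     */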

Page 7: GF100 Architecture

Graphics Processing Clusters

Each cluster is divided into Streaming Multiprocessors (SMs)

Page 8: What is inside a CUDA Core?

Each core also has 1 Shader Module

Page 9: Some Example Cards

We are using the GTS 450

Page 10: Compare with ATI

How is this different from ATI's cards?

ATI typically has more "processors" (really just ALUs)

NVIDIA has fewer, but more independent, "cores" (with shaders)

Often the overall performance is very similar

Page 11: ATI vs NVIDIA

(Benchmark chart shown on the slide)

Page 12: Outline (current section: Software)

Page 13: Software

Languages supported: C, C++, Java, Perl, Python, Ruby, MATLAB, Mathematica, Fortran, Haskell, Lua, IDL, .NET

Limited to NVIDIA architectures

Really, really easy to set up...

Many usable standard libraries

Page 14: Setting Up CUDA

Go to: http://developer.nvidia.com/cuda-toolkit-sdk

Download the drivers, CUDA Toolkit, SDK & examples, and the "Getting Started" manual

Install the drivers, then unpack the rest (it automatically puts everything in the right place)

Run "make" in the C directory to set up the libraries

Run "make" in any of the example project directories to build that project

If this works, CUDA is set up properly

Page 15: Basic Steps To Use CUDA

Allocate Local Memory

Allocate GPU Memory

Store Values in Local Memory

Pass Values to GPU Memory

Give the GPU an instruction (launch a kernel)

Pass Results Back

Clean Up GPU Memory (a minimal sketch of all of these steps follows below)
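As a bridge between this list and the CUBLAS example on the following slides, here is a minimal sketch of the same steps using the plain CUDA runtime API. This example is ours, not part of the sgemm sample; the kernel name and sizes are arbitrary. Each numbered comment corresponds to a step above; compile with nvcc.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void double_elements(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);                  /* 1. allocate local (host) memory */
    float *d_x = NULL;
    cudaMalloc((void **)&d_x, bytes);                      /* 2. allocate GPU memory          */

    for (int i = 0; i < n; i++) h_x[i] = (float)i;         /* 3. store values in host memory  */

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   /* 4. pass values to GPU memory    */

    double_elements<<<(n + 255) / 256, 256>>>(d_x, n);     /* 5. give the GPU an instruction  */

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   /* 6. pass results back            */

    cudaFree(d_x);                                          /* 7. clean up GPU memory          */

    printf("h_x[10] = %f\n", h_x[10]);                      /* should print 20.0               */
    free(h_x);
    return 0;
}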

Page 16: sgemm Example

Stands for Single-Precision General Matrix Multiply

Also called dense matrix multiply

Requires multiplying two N x N matrices

Many simple multiplies work well on highly parallel processors (the operation is written out below)
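For reference, the cublasSgemm call shown on a later slide performs the standard BLAS GEMM update (the scalars alpha and beta come from the host code):

    C ← α·A·B + β·C

where A, B, and C are N x N single-precision matrices.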

Page 17: Memory Allocation: Host

/* Allocate host memory for the matrices */
h_A = (float *)malloc(n2 * sizeof(h_A[0]));
if (h_A == 0) {
    fprintf(stderr, "!!!! host memory allocation error (A)\n");
    return EXIT_FAILURE;
}
h_B = (float *)malloc(n2 * sizeof(h_B[0]));
if (h_B == 0) {
    fprintf(stderr, "!!!! host memory allocation error (B)\n");
    return EXIT_FAILURE;
}
h_C = (float *)malloc(n2 * sizeof(h_C[0]));
if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Fill the matrices with test data */
for (i = 0; i < n2; i++) {
    h_A[i] = rand() / (float)RAND_MAX;
    h_B[i] = rand() / (float)RAND_MAX;
    h_C[i] = rand() / (float)RAND_MAX;
}

Page 18: Memory Allocation: Device

/* Allocate device memory for the matrices */
status = cublasAlloc(n2, sizeof(d_A[0]), (void **)&d_A);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (A)\n");
    return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_B[0]), (void **)&d_B);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (B)\n");
    return EXIT_FAILURE;
}
status = cublasAlloc(n2, sizeof(d_C[0]), (void **)&d_C);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Initialize the device matrices with the host matrices */
status = cublasSetVector(n2, sizeof(h_A[0]), h_A, 1, d_A, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write A)\n");
    return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_B[0]), h_B, 1, d_B, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write B)\n");
    return EXIT_FAILURE;
}
status = cublasSetVector(n2, sizeof(h_C[0]), h_C, 1, d_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! device access error (write C)\n");
    return EXIT_FAILURE;
}

Page 19: Multiply and Return Values

cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
status = cublasGetError();
if (status != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "!!!! kernel execution error.\n");
    return EXIT_FAILURE;
}

/* Allocate host memory for reading back the result from device memory */
h_C = (float *)malloc(n2 * sizeof(h_C[0]));
if (h_C == 0) {
    fprintf(stderr, "!!!! host memory allocation error (C)\n");
    return EXIT_FAILURE;
}

/* Read the result back */
status = cublasGetVector(n2, sizeof(h_C[0]), d_C, 1, h_C, 1);
if (status != CUBLAS_STATUS_SUCCESS) {
    return EXIT_FAILURE;
}

Page 20: Speedup: Timing Logs

sgemm test running..
Standard C version took: 35785 us
CUBLAS SGEMM version took: 628 us
PASSED

Press ENTER to exit...

Using a GTS 450 (192 cores)

Matrices of size 256 x 256 were used

Why isn't this 192x faster? (The measured speedup here is about 57x.)

Page 21: Outline (current section: Research)

Page 22: Energy Efficient Computing

What does this mean?

Typically measured in performance per watt (a rough example follows below)

How do systems do this today?

  Embedded systems: "Go, go, go ... sleep"
  Typical PC: "Uh... we don't need that core right now"
  Supercomputer: "Umm... I don't care! More data please!"

How can we improve this?
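As a rough, purely illustrative example (our numbers, not measurements from this work): a card sustaining 500 GFLOP/s while drawing 100 W delivers

    500 GFLOP/s ÷ 100 W = 5 GFLOP/s per watt,

so either raising throughput or cutting power draw improves the metric.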

Page 23: Power Saving Techniques

Power gating

  Turn off sections that aren't being used

Lower supply voltage

  Usually requires slowing down the processor
  Dynamic power is related to the square of the voltage: P = C·V²·f (a worked example follows below)

Clock gating

  Works in synchronous systems
  Disables portions so less of the circuit switches states
  Can lose speed and performance, and induce errors (less circuitry to check)

How can we save power without losing performance?

We believe this should be done by optimizing the system for the job it is performing at any given time.
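A quick worked example of that V² dependence (illustrative numbers only): holding C and f fixed, dropping the supply voltage from 1.0 V to 0.8 V scales dynamic power by

    (0.8 / 1.0)² = 0.64,

a 36% reduction, which is why voltage scaling is so attractive even though it usually forces a lower clock frequency as well.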

Page 24: Optimize Hardware Settings for Algorithms

We are looking into optimizing hardware settings based on the algorithms being run. These include:

Memory Bandwidth

Main Clock Frequency

Supply Voltage

Number of Cores

Whatever else we can get our hands on...

A good example is comparing computation-limited vs. memory-bandwidth-limited algorithms.

Page 25: Why a GPU?

Massively parallel execution (the NVIDIA GTS 450 has 192 cores)

Overclockable and voltage scalable

Supply easily interrupted for Power measurements

CUDA Environment and Libraries

Good Relationship with NVIDIA

Page 26: Algorithms We Are Looking At

Dense matrix multiply

  Brute-force matrix multiply
  Memory intensive
  Computationally simple for a highly parallel architecture

Sparse matrix multiply

  Matrices typically with lots of zeros
  Typically can be compressed (see the storage sketch below)
  Less memory intensive
  Computationally more complicated

These should allow for fairly straightforward comparisons between two algorithms that are known to have these limitations.
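To make "compressed" concrete, here is a small host-side C sketch (ours; the slides do not commit to a particular format) using Compressed Sparse Row (CSR), one common choice. Instead of storing all N*N entries, only the nonzeros plus their column indices and row offsets are kept.

/* Dense storage would need n*n floats, even if most of them are zero. */

/* Compressed Sparse Row (CSR): only the nnz nonzero entries are kept. */
typedef struct {
    int    n;        /* matrix dimension                                */
    int    nnz;      /* number of nonzero entries                       */
    float *val;      /* nonzero values, length nnz                      */
    int   *col;      /* column index of each value, length nnz          */
    int   *row_ptr;  /* start of each row in val/col, length n + 1      */
} csr_matrix;

/* y = A*x for a CSR matrix: far less memory traffic than dense, but the
   indirect x[col[k]] accesses are irregular and harder to parallelize. */
void csr_matvec(const csr_matrix *A, const float *x, float *y)
{
    for (int i = 0; i < A->n; i++) {
        float sum = 0.0f;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; k++)
            sum += A->val[k] * x[A->col[k]];
        y[i] = sum;
    }
}

Those indirect accesses are exactly what makes the sparse case computationally more complicated on a GPU, even though it moves much less data than the dense case.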

Page 27: Conclusion/Recap

1 Introduction to CUDA

2 Hardware

3 Software

4 Research

Page 28: Questions

Questions?