© 2013 The MathWorks, Inc.

Accelerating System Simulations

김용정 (Department Manager), Senior Applications Engineer

Why simulation acceleration?

From algorithm exploration to system design

– Size and complexity of models increase

– Time needed for a single simulation increases

– Number of test cases increases

– Test cases become larger

Need to reduce

– simulation time during design

– simulation time for large-scale testing during prototyping

MATLAB is quite fast

Optimized and widely-used libraries

– BLAS: Basic Linear Algebra Subprograms (multithreaded)

– LAPACK: Linear Algebra Package

JIT (Just In Time) Acceleration

– On-the-fly multithreaded code generation for increased speed

Built-in support for vector and matrix operations

Application

LTE Physical Downlink Control Channel (PDCCH)

Workflow

Start with a baseline algorithm

Profile it to establish a performance yardstick

Introduce the following optimizations:

– Better MATLAB serial programming techniques

– Using System objects

– MATLAB to C code generation (MEX)

– Parallel Computing

– GPU-optimized System objects

– Rapid Accelerator mode of simulation in Simulink

Simulation acceleration options in MATLAB

[Diagram: User's Code and the acceleration options that apply to it: Better MATLAB code, System objects, MATLAB to C, Parallel Computing, GPU processing]

Profiling MATLAB algorithms

Profiler summarizes MATLAB code execution

– total time spent within each function

– which lines of code use the most processing time

Helps identify algorithm bottlenecks
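
A minimal sketch of collecting profiler data from the command line, assuming a stand-in workload (the calls to `magic` and `sortrows` are only placeholders for the algorithm under test). It uses the built-in `profile` command and lists the functions with the largest total time.

```matlab
% Placeholder workload: replace with the baseline algorithm to be measured.
profile on
for n = 1:200
    M = magic(50);          % stand-in computation
    S = sortrows(M);        % stand-in computation
end
profile off

% Retrieve the collected statistics as a structure.
stats = profile('info');

% Rank profiled functions by total time spent in them.
[~, order] = sort([stats.FunctionTable.TotalTime], 'descend');
for k = order(1:min(3, numel(order)))
    fprintf('%-30s %8.4f s\n', ...
        stats.FunctionTable(k).FunctionName, ...
        stats.FunctionTable(k).TotalTime);
end

% profile viewer   % uncomment to open the interactive line-by-line report
```

The line-by-line view in the interactive report is what shows which statements dominate the run time.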

Effective MATLAB programming techniques

Pre-allocation

– Initialize an array using its final size

– Helps avoid dynamically resizing arrays in a loop

Vectorization

– Convert code from using scalar loops to using matrix/vector operations

– Helps MATLAB leverage processor-optimized libraries for vector processing

Example of pre-allocation

Before (array grows inside the loop):

y=[];
for n=1:LEN/Tx
    G=[u(idx1(n)) u(idx2(n));...
       -conj(u(idx2(n))) conj(u(idx1(n)))];
    y=[y;G];
end

After (pre-allocated and vectorized):

y=complex(zeros(LEN,Tx));
y(idx1,1)=u(idx1);
y(idx1,2)=u(idx2);
y(idx2,1)=-conj(u(idx2));
y(idx2,2)=conj(u(idx1));
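
The slide does not define `u`, `idx1`, or `idx2`. The sketch below assumes `Tx = 2`, a random complex input, and odd/even index vectors, then checks that the loop and the pre-allocated, vectorized form give the same output. Treat these definitions as illustrative assumptions rather than the original test bench.

```matlab
% Assumed setup (not on the slide): two transmit antennas, random symbols.
LEN  = 1000;
Tx   = 2;
u    = complex(randn(LEN,1), randn(LEN,1));
idx1 = 1:Tx:LEN;            % assumed: odd-numbered samples
idx2 = 2:Tx:LEN;            % assumed: even-numbered samples

% Baseline: the output array is resized on every iteration.
yLoop = [];
for n = 1:LEN/Tx
    G = [u(idx1(n)) u(idx2(n)); ...
        -conj(u(idx2(n))) conj(u(idx1(n)))];
    yLoop = [yLoop; G];
end

% Optimized: allocate the final size once, then fill with vector operations.
yVec = complex(zeros(LEN,Tx));
yVec(idx1,1) = u(idx1);
yVec(idx1,2) = u(idx2);
yVec(idx2,1) = -conj(u(idx2));
yVec(idx2,2) = conj(u(idx1));

sameResult = isequal(yLoop, yVec);
```

Wrapping each variant in `tic`/`toc` gives a quick timing comparison on a given machine.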

Using System objects of DSP & Communications System Toolboxes

System objects facilitate stream processing

Can accelerate simulation because they

– Decouple declaration from the execution of the algorithms

– Reduce overhead of parameter handling in the loop

– Are mostly implemented as MATLAB executables (MEX)

Example of System objects

function s = Alamouti_DecoderS(u,H) %#codegen
% STBC Combiner
persistent hTDDec
if isempty(hTDDec)
    hTDDec = comm.OSTBCCombiner(...
        'NumTransmitAntennas',2,'NumReceiveAntennas',2);
end
s = step(hTDDec, u, H);

MATLAB to C code generation

MATLAB Coder

Automatically generate a MEX function

Call the generated MEX file within testbench

Verify same numerical results

Assess the baseline function and the generated MEX function for speed
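
The four steps above can be sketched as follows, assuming MATLAB Coder and a supported C compiler are installed. The entry-point function `scale_and_sum` is hypothetical and is written to a temporary folder only to keep the example self-contained. `codegen` appends `_mex` to the function name by default.

```matlab
% Hypothetical entry-point function, created in a temporary folder for the demo.
workDir = tempname;
mkdir(workDir);
oldDir = cd(workDir);
fid = fopen('scale_and_sum.m', 'w');
fprintf(fid, 'function y = scale_and_sum(x) %%#codegen\n');
fprintf(fid, 'y = sum(2 * x);\n');
fclose(fid);

% 1) Generate a MEX function; the example input fixes size and type.
x = rand(1e5, 1);
codegen scale_and_sum -args {x}

% 2) Call both versions from the test bench.
yRef = scale_and_sum(x);        % baseline MATLAB function
yMex = scale_and_sum_mex(x);    % generated MEX function

% 3) Verify that the numerical results agree.
maxErr = abs(yRef - yMex);

% 4) Compare speed.
tRef = timeit(@() scale_and_sum(x));
tMex = timeit(@() scale_and_sum_mex(x));
fprintf('error %.3g, baseline %.3g s, MEX %.3g s\n', maxErr, tRef, tMex);

cd(oldDir);
```

For a function this small the MEX version may not be faster. The gain usually shows up with loop-heavy code that does not vectorize well.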

Parallel Simulation Runs

[Figure: two timelines of Task 1 to Task 4 (axis: Time), alongside a diagram of toolboxes and blocksets connected to four workers]

>> Demo

Summary

Use matlabpool to open a pool of available workers

No modification of algorithm

Use parfor loop instead of for loop

Parallel computation or simulation leads to further acceleration

More cores = more speed
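
A minimal sketch of the `parfor` pattern, assuming an illustrative uncoded BPSK Monte Carlo run over a few SNR points (this is not the PDCCH model from the talk). Each iteration is independent, so the only change from a serial loop is the keyword. In MATLAB releases after this talk, `parpool` replaces `matlabpool`. Without Parallel Computing Toolbox, `parfor` runs as an ordinary serial loop.

```matlab
snrdB   = 0:2:8;                 % assumed SNR points in dB
numBits = 1e5;                   % bits simulated per SNR point
ber     = zeros(size(snrdB));    % pre-allocated result, one slot per iteration

parfor k = 1:numel(snrdB)
    bits     = rand(numBits, 1) > 0.5;          % random data bits
    tx       = 1 - 2*bits;                      % bipolar mapping: 0 -> +1, 1 -> -1
    noiseVar = 10^(-snrdB(k)/10);               % unit signal power assumed
    rx       = tx + sqrt(noiseVar)*randn(numBits, 1);
    ber(k)   = mean((rx < 0) ~= bits);          % hard decision and error count
end

disp(ber)
```

The speed-up depends on how many workers are available and on how much work each iteration contains relative to the scheduling overhead.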

Simulation acceleration options in MATLAB

[Diagram: User's Code and the acceleration options that apply to it: Better MATLAB code, System objects, MATLAB to C, Parallel Computing, GPU processing]

What is a Graphics Processing Unit (GPU)?

Originally for graphics acceleration, now also used for scientific calculations

Massively parallel array of integer and floating point processors

– Typically hundreds of processors per card

– GPU cores complement CPU cores

Dedicated high-speed memory

Why would you want to use a GPU?

Speed up execution of computationally intensive simulations

For example:

– Performance: A\b with Double Precision

Options for Targeting GPUs

1) Use GPU with MATLAB built-in functions

2) Execute MATLAB functions elementwise on the GPU

3) Create kernels from existing CUDA code and PTX files

Ease of use is greatest for option 1; control is greatest for option 3.

Data Transfer between MATLAB and GPU

% Push data from CPU to GPU memory

Agpu = gpuArray(A)

% Bring results from GPU memory back to CPU

B = gather(Bgpu)
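
A short round trip built on the two calls above, assuming an illustrative matrix product as the workload. The `canUseGPU` guard (available in newer MATLAB releases than this talk) is an addition so that the sketch also runs on machines without a supported GPU or Parallel Computing Toolbox.

```matlab
A = rand(500);                   % example data in CPU memory

if canUseGPU()
    Agpu = gpuArray(A);          % push data from CPU to GPU memory
    Bgpu = Agpu * Agpu;          % built-in operation executes on the GPU
    B    = gather(Bgpu);         % bring the result back to CPU memory
else
    B = A * A;                   % CPU fallback for machines without a GPU
end

maxDiff = max(max(abs(B - A*A)));   % compare against the CPU result
```

Keeping intermediate results such as `Bgpu` on the device and calling `gather` only once at the end avoids repeated transfers.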

GPU Processing with Communications System Toolbox

Alternative implementations of many System objects take advantage of GPU processing

Use Parallel Computing Toolbox to execute many communications algorithms directly on the GPU

Easy-to-use syntax

Dramatically accelerate simulations

GPU System objects

comm.gpu.TurboDecoder

comm.gpu.ViterbiDecoder

comm.gpu.LDPCDecoder

comm.gpu.PSKDemodulator

comm.gpu.AWGNChannel

Example: Turbo Coding

Impressive coding gain

High computational complexity

Bit-error rate performance as a function of number of iterations

= comm.TurboDecoder(...
'NumIterations', numIter,...

Acceleration with GPU System objects

Version              Elapsed time   Acceleration
CPU                  8 hours        1.0
1 GPU                40 minutes     12.0
Cluster of 4 GPUs    11 minutes     43.0

Same numerical results

CPU: = comm.TurboDecoder(... 'NumIterations', N,...
GPU: = comm.gpu.TurboDecoder(... 'NumIterations', N,...

CPU: = comm.AWGNChannel(...
GPU: = comm.gpu.AWGNChannel(...
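
As a quick check of the acceleration column, the factors can be recomputed from the elapsed times quoted on the slide. The variable names are illustrative.

```matlab
% Elapsed times from the slide, in minutes: CPU, 1 GPU, cluster of 4 GPUs.
elapsedMin = [8*60, 40, 11];

% Acceleration relative to the CPU run.
speedup = elapsedMin(1) ./ elapsedMin;

fprintf('%.1f\n', speedup);
```

The ratio for the four-GPU cluster comes out near 43.6 rather than exactly 43.0, presumably because the quoted elapsed times are rounded.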

Key Operations in Turbo Coding Function

CPU version

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder
hTDec = comm.TurboDecoder(...
    'TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
ber = zeros(3,1); %initialize BER output
%% Processing loop
% ber = [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = step(hTEnc, data);
    %Modulate, Add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, decData);
end

GPU Version 1 (only the decoder changes)

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder
hTDec = comm.gpu.TurboDecoder(...
    'TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
ber = zeros(3,1); %initialize BER output
%% Processing loop
% ber = [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = step(hTEnc, data);
    %Modulate, Add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, decData);
end

Profile results in Turbo Coding Function

CPU version (profiler time in the left column)

        % Turbo Encoder
<0.01   hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise
<0.01   hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
<0.01   hBER = comm.ErrorRate;
        % Turbo Decoder
<0.01   hTDec = comm.TurboDecoder(...
            'TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
<0.01   ber = zeros(3,1); %initialize BER output
        %% Processing loop
        % ber = [error rate; number of errors; number of bits compared]
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.30        data = randn(blkLength, 1)>0.5;
            % Encode random data bits
2.33        yEnc = step(hTEnc, data);
            %Modulate, Add noise to real bipolar data
0.05        modout = 1-2*yEnc;
1.50        rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
0.03        llrData = (-2/noiseVar).*rData;
            % Turbo Decode
330.54      decData = step(hTDec, llrData);
            % Calculate errors
0.17        ber = step(hBER, data, decData);
        end

GPU Version 1 (profiler time in the left column)

        % Turbo Encoder
<0.01   hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise
<0.01   hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
<0.01   hBER = comm.ErrorRate;
        % Turbo Decoder
0.02    hTDec = comm.gpu.TurboDecoder(...
            'TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
<0.01   ber = zeros(3,1); %initialize BER output
        %% Processing loop
        % ber = [error rate; number of errors; number of bits compared]
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.28        data = randn(blkLength, 1)>0.5;
            % Encode random data bits
2.38        yEnc = step(hTEnc, data);
            %Modulate, Add noise to real bipolar data
0.05        modout = 1-2*yEnc;
1.45        rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
0.04        llrData = (-2/noiseVar).*rData;
            % Turbo Decode
98.18       decData = step(hTDec, llrData);
            % Calculate errors
0.17        ber = step(hBER, data, decData);
        end

Key Operations in Turbo Coding Function

CPU version

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder
hTDec = comm.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
%% Processing loop
% ber = [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = step(hTEnc, data);
    %Modulate, Add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, decData);
end

GPU Version 2 (GPU noise channel, multi-frame decoding, data kept on the GPU until the error count)

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder - setup for Multi-frame or Multi-user processing
numFrames = 30;
hTDec = comm.gpu.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations',numIter,...
    'NumFrames',numFrames);
%% Processing loop
% ber = [error rate; number of errors; number of bits compared]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(numFrames*blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
    %Modulate, Add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, gather(decData));
end

Profile results in Turbo Coding Function

CPU version (profiler time in the left column)

        % Turbo Encoder
<0.01   hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise
<0.01   hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
<0.01   hBER = comm.ErrorRate;
        % Turbo Decoder
<0.01   hTDec = comm.TurboDecoder(...
            'TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
        %% Processing loop
        % ber = [error rate; number of errors; number of bits compared]
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.30        data = randn(blkLength, 1)>0.5;
            % Encode random data bits
2.33        yEnc = step(hTEnc, data);
            %Modulate, Add noise to real bipolar data
0.05        modout = 1-2*yEnc;
1.50        rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
0.03        llrData = (-2/noiseVar).*rData;
            % Turbo Decode
330.54      decData = step(hTDec, llrData);
            % Calculate errors
0.17        ber = step(hBER, data, decData);
        end

GPU Version 2 (profiler time in the left column)

        % Turbo Encoder
<0.01   hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise
0.03    hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
<0.01   hBER = comm.ErrorRate;
        % Turbo Decoder - setup for Multi-frame or Multi-user processing
0.01    numFrames = 30;
0.01    hTDec = comm.gpu.TurboDecoder('TrellisStructure',...
            poly2trellis(4, [13 15], 13),'InterleaverIndices', intrlvrIndices,...
            'NumIterations',numIter, 'NumFrames',numFrames);
        %% Processing loop
        % ber = [error rate; number of errors; number of bits compared]
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
0.22        data = randn(numFrames*blkLength, 1)>0.5;
            % Encode random data bits
2.45        yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
            %Modulate, Add noise to real bipolar data
0.02        modout = 1-2*yEnc;
0.31        rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
0.01        llrData = (-2/noiseVar).*rData;
            % Turbo Decode
20.89       decData = step(hTDec, llrData);
            % Calculate errors
0.09        ber = step(hBER, data, gather(decData));
        end

Things to note when targeting GPU

Minimize data transfer between CPU and GPU.

Using GPU only makes sense if data size is large.

Some functions in MATLAB are optimized and can be faster than the GPU equivalent (e.g. FFT).

Use arrayfun to explicitly specify elementwise operations.
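
A minimal sketch of the `arrayfun` note, assuming an illustrative scalar function `f` and input vectors. When the inputs are `gpuArray` objects, `arrayfun` applies the function elementwise on the GPU. The `canUseGPU` guard is an addition so that the sketch also runs on a CPU-only machine.

```matlab
x = linspace(-1, 1, 1e4);        % example inputs (assumed)
y = linspace(0, 2, 1e4);
f = @(a, b) a*a + 3*b;           % scalar function applied to each element pair

if canUseGPU()
    % Transfer once, compute elementwise on the GPU, gather once.
    z = gather(arrayfun(f, gpuArray(x), gpuArray(y)));
else
    z = arrayfun(f, x, y);       % same call on the CPU
end

maxDiff = max(abs(z - (x.^2 + 3*y)));   % compare with the vectorized CPU result
```

Both transfers happen once, outside the elementwise work, which is consistent with the first note on minimizing CPU/GPU traffic.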

Summary

Acceleration methodologies in MATLAB & Simulink, with the corresponding technology / product

1. Best Practices in Programming
  • Vectorization & pre-allocation
  • Environment tools (e.g. Profiler, Code Analyzer)
  Technology / Product: MATLAB, Toolboxes, System Toolboxes

2. Better Algorithms
  • Ideal environment for algorithm exploration
  • Rich set of functionality (e.g. System objects)
  Technology / Product: MATLAB, Toolboxes, System Toolboxes

3. More Processors or Cores
  • High level parallel constructs (e.g. parfor, matlabpool)
  • Utilize cluster, clouds, and grids
  Technology / Product: Parallel Computing Toolbox, MATLAB Distributed Computing Server

4. Refactoring the Implementation
  • Compiled code (MEX)
  • GPUs, FPGA-in-the-Loop
  Technology / Product: MATLAB, MATLAB Coder, Parallel Computing Toolbox

Q & A

Thank You