1 © 2013 The MathWorks, Inc.

Accelerating System Simulations

김용정 (General Manager), Senior Applications Engineer

2

Why simulation acceleration?

From algorithm exploration to system design

– Size and complexity of models increase

– Time needed for a single simulation increases

– Number of test cases increases

– Test cases become larger

Need to reduce

– simulation time during design

– simulation time for large scale testing during prototyping

3

MATLAB is quite fast

Optimized and widely-used libraries

– BLAS: Basic Linear Algebra Subprograms (multithreaded)

– LAPACK: Linear Algebra Package

JIT (Just In Time) Acceleration

– On-the-fly multithreaded code generation for increased speed

Built-in support for vector and matrix operations

4

Application

LTE Physical Downlink Control Channel (PDCCH)

5

Workflow

Start with a baseline algorithm

Profile it to introduce a performance yardstick

Introduce the following optimizations:

– Better MATLAB serial programming techniques

– Using System objects

– MATLAB to C code generation (MEX)

– Parallel Computing

– GPU-optimized System objects

– Rapid Accelerator mode of simulation in Simulink

6

Simulation acceleration options in MATLAB

[Diagram: User's Code and the acceleration options Better MATLAB code, System objects, MATLAB to C, Parallel Computing, GPU processing]

7

Profiling MATLAB algorithms

Profiler summarizes MATLAB code execution

– total time spent within each function

– which lines of code use the most processing time

Helps identify algorithm bottlenecks
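As a rough illustration of the profiling step, the sketch below profiles a stand-in workload and prints the functions with the largest total time. The workload (calls to magic and pinv) is an assumption chosen only so the script runs on its own; it is not the PDCCH algorithm from the slides.

```matlab
% Stand-in workload (hypothetical, not from the slides).
profile clear
profile on
acc = 0;
for n = 1:20
    M = magic(100 + n);                  % ordinary MATLAB functions show up in the report
    acc = acc + norm(pinv(M), 'fro');
end
profile off

% Collect the results programmatically; "profile viewer" shows the same data in the GUI.
stats = profile('info');
[~, order] = sort([stats.FunctionTable.TotalTime], 'descend');
for k = order(1:min(5, numel(order)))
    fprintf('%-40s %8.3f s\n', stats.FunctionTable(k).FunctionName, ...
        stats.FunctionTable(k).TotalTime);
end
```

The same pattern applies to a real testbench: wrap the call to the baseline algorithm between profile on and profile off, then use the listing as the performance yardstick.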

8

Effective MATLAB programming techniques

Pre-allocation

– Initialize an array using its final size

– Helps avoid dynamically resizing arrays in a loop

Vectorization

– Convert code from using scalar loops to using matrix/vector operations

– Helps MATLAB leverage processor-optimized libraries for vector processing

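Example of pre-allocation and vectorization

% Before: y is resized on every loop iteration
y = [];
for n = 1:LEN/Tx
    G = [ u(idx1(n)),        u(idx2(n)); ...
         -conj(u(idx2(n))),  conj(u(idx1(n)))];
    y = [y; G];
end

% After: y is pre-allocated and filled with vectorized assignments
y = complex(zeros(LEN,Tx));
y(idx1,1) = u(idx1);
y(idx1,2) = u(idx2);
y(idx2,1) = -conj(u(idx2));
y(idx2,2) = conj(u(idx1));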
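To check that the two forms agree, the following self-contained sketch runs both on random input and compares the results. The values of Tx, LEN, u, idx1 and idx2 are assumptions (two transmit antennas, odd and even symbol indices); the slides do not define them.

```matlab
% Assumed setup (not from the slides): Tx = 2, odd/even symbol indices.
Tx   = 2;
LEN  = 2000;
u    = complex(randn(LEN,1), randn(LEN,1));
idx1 = 1:Tx:LEN;
idx2 = 2:Tx:LEN;

% Baseline: grow the output inside the loop.
tic;
yLoop = [];
for n = 1:LEN/Tx
    G = [ u(idx1(n)),        u(idx2(n)); ...
         -conj(u(idx2(n))),  conj(u(idx1(n)))];
    yLoop = [yLoop; G]; %#ok<AGROW>
end
tLoop = toc;

% Optimized: pre-allocate once, then assign whole index sets at a time.
tic;
yVec = complex(zeros(LEN,Tx));
yVec(idx1,1) = u(idx1);
yVec(idx1,2) = u(idx2);
yVec(idx2,1) = -conj(u(idx2));
yVec(idx2,2) = conj(u(idx1));
tVec = toc;

assert(isequal(yLoop, yVec));   % both forms produce the same matrix
fprintf('Loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
```

Timings depend on the machine and MATLAB release, so the sketch prints them rather than assuming a particular speedup.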

9

Using System objects of DSP & Communications System Toolboxes

System objects facilitate stream processing

Can accelerate simulation because they

– Decouple declaration from the execution of the algorithms

– Reduce overhead of parameter handling in the loop

– Are mostly implemented as MATLAB executables (MEX)

Example of System objects

function s = Alamouti_DecoderS(u,H) %#codegen
% STBC Combiner
persistent hTDDec
if isempty(hTDDec)
    hTDDec = comm.OSTBCCombiner(...
        'NumTransmitAntennas',2,'NumReceiveAntennas',2);
end
s = step(hTDDec, u, H);

10

MATLAB to C code generation

MATLAB Coder

Automatically generate a MEX function

Call the generated MEX file within testbench

Verify same numerical results

Assess the baseline function and the generated MEX function for speed
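The four steps above can be sketched as follows, assuming MATLAB Coder and a supported C compiler are installed. The function scaleAndSum is a hypothetical stand-in written to a temporary folder so the example is self-contained; it is not one of the functions from the slides.

```matlab
% Write a small hypothetical function to disk so codegen has something to compile.
workDir = tempname;
mkdir(workDir);
oldDir = cd(workDir);
fid = fopen('scaleAndSum.m', 'w');
fprintf(fid, 'function y = scaleAndSum(x) %%#codegen\n');
fprintf(fid, 'y = sum(2*x.^2);\n');
fclose(fid);

x = randn(1e5, 1);

% 1) Generate a MEX function; -args fixes the input type and size from x.
codegen scaleAndSum -args {x}

% 2) Call the baseline and the generated MEX file from the same testbench.
yRef = scaleAndSum(x);
yMex = scaleAndSum_mex(x);

% 3) Verify the numerical results (small tolerance for summation order).
assert(abs(yRef - yMex) <= 1e-9 * abs(yRef));

% 4) Assess both versions for speed.
tRef = timeit(@() scaleAndSum(x));
tMex = timeit(@() scaleAndSum_mex(x));
fprintf('MATLAB: %.3g s, MEX: %.3g s\n', tRef, tMex);

cd(oldDir);
```

For a function this simple the MEX version may not be faster, since the MATLAB version already runs in optimized built-ins; generated code tends to pay off for loop-heavy scalar code.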

11

Parallel Simulation Runs

[Figure: Tasks 1 to 4 plotted against time in two panels; MATLAB toolboxes and blocksets with four workers]

>> Demo

12

Summary

matlabpool available workers

No modification of algorithm

Use parfor loop instead of for loop

Parallel computation or simulation leads to further acceleration

More cores = more speed
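A minimal sketch of the for-to-parfor change, using a stand-in workload that is an assumption and not the simulation from the slides. In the releases current when these slides were written the pool was opened with matlabpool; from R2013b onward parpool replaces it. Without an open pool or without Parallel Computing Toolbox, the parfor loop simply runs like an ordinary loop.

```matlab
numRuns = 8;                       % number of independent simulation runs (assumed)
result  = zeros(numRuns, 1);       % sliced output: one element per iteration

parfor k = 1:numRuns
    x = sin((1:1e5)' * k);         % stand-in for one simulation run
    result(k) = sum(x.^2);
end

disp(result.');
```

The loop body is unchanged from the serial version; the iterations only need to be independent of each other, with outputs indexed by the loop variable.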

13

Simulation acceleration options in MATLAB

[Diagram: User's Code and the acceleration options Better MATLAB code, System objects, MATLAB to C, Parallel Computing, GPU processing]

14

What is a Graphics Processing Unit (GPU)

Originally for graphics acceleration, now also used for scientific calculations

Massively parallel array of integer and floating point processors

– Typically hundreds of processors per card

– GPU cores complement CPU cores

Dedicated high-speed memory

15

Why would you want to use a GPU?

Speed up execution of computationally intensive simulations

For example:

– Performance: A\b with Double Precision

16

Options for Targeting GPUs

1) Use GPU with MATLAB built-in functions

2) Execute MATLAB functions elementwise on the GPU

3) Create kernels from existing CUDA code and PTX files

[Slide arrows: Ease of Use, Greater Control]

17

Data Transfer between MATLAB and GPU

% Push data from CPU to GPU memory

Agpu = gpuArray(A)

% Bring results from GPU memory back to CPU

B = gather(Bgpu)
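A short round-trip sketch, assuming Parallel Computing Toolbox and a supported GPU; the matrix size and the fallback branch are assumptions added so the script also runs on a machine without a GPU.

```matlab
A = rand(200);

% Detect a usable GPU; gpuDeviceCount needs Parallel Computing Toolbox.
try
    useGpu = gpuDeviceCount > 0;
catch
    useGpu = false;
end

if useGpu
    Agpu = gpuArray(A);        % push data from CPU to GPU memory
    Bgpu = Agpu * Agpu';       % built-in operation executes on the GPU
    B    = gather(Bgpu);       % bring the result back to CPU memory
else
    B = A * A';                % CPU fallback so the sketch still runs
end

fprintf('Max deviation from CPU result: %.3g\n', max(max(abs(B - A*A'))));
```

Keeping intermediate results such as Bgpu on the GPU and calling gather only once at the end limits the transfer cost.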

18

GPU Processing with Communications System Toolbox

Alternative implementations of many System objects take advantage of GPU processing

Use Parallel Computing Toolbox to execute many communications algorithms directly on the GPU

Easy-to-use syntax

Dramatically accelerate simulations

GPU System objects

comm.gpu.TurboDecoder

comm.gpu.ViterbiDecoder

comm.gpu.LDPCDecoder

comm.gpu.PSKDemodulator

comm.gpu.AWGNChannel

19

Example: Turbo Coding

Impressive coding gain

High computational complexity

Bit-error rate performance as a function of number of iterations

= comm.TurboDecoder(…

'NumIterations', numIter,…

20

Acceleration with GPU System objects

Version             Elapsed time   Acceleration
CPU                 8 hours         1.0
1 GPU               40 minutes     12.0
Cluster of 4 GPUs   11 minutes     43.0

Same numerical results

CPU: = comm.TurboDecoder(… 'NumIterations', N,…
GPU: = comm.gpu.TurboDecoder(… 'NumIterations', N,…

CPU: = comm.AWGNChannel(…
GPU: = comm.gpu.AWGNChannel(…

21

Key Operations in Turbo Coding Function

Two listings follow: CPU first, then GPU Version 1.

% CPU version

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);

% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');

% BER measurement
hBER = comm.ErrorRate;

% Turbo Decoder
hTDec = comm.TurboDecoder(...
    'TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);

ber = zeros(3,1); % initialize BER output: [error rate; number of errors; number of bits]

%% Processing loop
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = step(hTEnc, data);
    % Modulate, add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, decData);
end

% GPU Version 1

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);

% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');

% BER measurement
hBER = comm.ErrorRate;

% Turbo Decoder (GPU implementation)
hTDec = comm.gpu.TurboDecoder(...
    'TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);

ber = zeros(3,1); % initialize BER output: [error rate; number of errors; number of bits]

%% Processing loop
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = step(hTEnc, data);
    % Modulate, add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, decData);
end

22

Profile results in Turbo Coding Function

Two profiled listings follow: CPU first, then GPU Version 1.

% CPU version (left column: time reported by the Profiler)

        % Turbo Encoder
 <0.01  hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise
 <0.01  hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
 <0.01  hBER = comm.ErrorRate;
        % Turbo Decoder
 <0.01  hTDec = comm.TurboDecoder(...
            'TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
 <0.01  ber = zeros(3,1); % initialize BER output: [error rate; number of errors; number of bits]
        %% Processing loop
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
  0.30      data = randn(blkLength, 1)>0.5;
            % Encode random data bits
  2.33      yEnc = step(hTEnc, data);
            % Modulate, add noise to real bipolar data
  0.05      modout = 1-2*yEnc;
  1.50      rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
  0.03      llrData = (-2/noiseVar).*rData;
            % Turbo Decode
330.54      decData = step(hTDec, llrData);
            % Calculate errors
  0.17      ber = step(hBER, data, decData);
        end

% GPU Version 1 (left column: time reported by the Profiler)

        % Turbo Encoder
 <0.01  hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise
 <0.01  hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
 <0.01  hBER = comm.ErrorRate;
        % Turbo Decoder (GPU implementation)
  0.02  hTDec = comm.gpu.TurboDecoder(...
            'TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
 <0.01  ber = zeros(3,1); % initialize BER output: [error rate; number of errors; number of bits]
        %% Processing loop
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
  0.28      data = randn(blkLength, 1)>0.5;
            % Encode random data bits
  2.38      yEnc = step(hTEnc, data);
            % Modulate, add noise to real bipolar data
  0.05      modout = 1-2*yEnc;
  1.45      rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
  0.04      llrData = (-2/noiseVar).*rData;
            % Turbo Decode
 98.18      decData = step(hTDec, llrData);
            % Calculate errors
  0.17      ber = step(hBER, data, decData);
        end

23

Key Operations in Turbo Coding Function

Two listings follow: CPU first, then GPU Version 2.

% CPU version

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);

% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');

% BER measurement
hBER = comm.ErrorRate;

% Turbo Decoder
hTDec = comm.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);

%% Processing loop
% ber is [error rate; number of errors; number of bits]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = step(hTEnc, data);
    % Modulate, add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, decData);
end

% GPU Version 2

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);

% AWG Noise (GPU implementation)
hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');

% BER measurement
hBER = comm.ErrorRate;

% Turbo Decoder - setup for Multi-frame or Multi-user processing
numFrames = 30;
hTDec = comm.gpu.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations',numIter,...
    'NumFrames',numFrames);

%% Processing loop
% ber is [error rate; number of errors; number of bits]
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(numFrames*blkLength, 1)>0.5;
    % Encode random data bits (multiframeStep is a helper not shown on the slides)
    yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
    % Modulate, add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors (bring decoded bits back to the CPU first)
    ber = step(hBER, data, gather(decData));
end

24

Profile results in Turbo Coding Function

Two profiled listings follow: CPU first, then GPU Version 2.

% CPU version (left column: time reported by the Profiler)

        % Turbo Encoder
 <0.01  hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise
 <0.01  hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
 <0.01  hBER = comm.ErrorRate;
        % Turbo Decoder
 <0.01  hTDec = comm.TurboDecoder(...
            'TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
        %% Processing loop
        % ber is [error rate; number of errors; number of bits]
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
  0.30      data = randn(blkLength, 1)>0.5;
            % Encode random data bits
  2.33      yEnc = step(hTEnc, data);
            % Modulate, add noise to real bipolar data
  0.05      modout = 1-2*yEnc;
  1.50      rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
  0.03      llrData = (-2/noiseVar).*rData;
            % Turbo Decode
330.54      decData = step(hTDec, llrData);
            % Calculate errors
  0.17      ber = step(hBER, data, decData);
        end

% GPU Version 2 (left column: time reported by the Profiler)

        % Turbo Encoder
 <0.01  hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
            'InterleaverIndices', intrlvrIndices);
        % AWG Noise (GPU implementation)
  0.03  hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
        % BER measurement
 <0.01  hBER = comm.ErrorRate;
        % Turbo Decoder - setup for Multi-frame or Multi-user processing
  0.01  numFrames = 30;
  0.01  hTDec = comm.gpu.TurboDecoder('TrellisStructure',...
            poly2trellis(4, [13 15], 13),'InterleaverIndices', intrlvrIndices,...
            'NumIterations',numIter,'NumFrames',numFrames);
        %% Processing loop
        % ber is [error rate; number of errors; number of bits]
        while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
  0.22      data = randn(numFrames*blkLength, 1)>0.5;
            % Encode random data bits (multiframeStep is a helper not shown on the slides)
  2.45      yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
            % Modulate, add noise to real bipolar data
  0.02      modout = 1-2*yEnc;
  0.31      rData = step(hAWGN, modout);
            % Convert to log-likelihood ratios for decoding
  0.01      llrData = (-2/noiseVar).*rData;
            % Turbo Decode
 20.89      decData = step(hTDec, llrData);
            % Calculate errors (bring decoded bits back to the CPU first)
  0.09      ber = step(hBER, data, gather(decData));
        end

25

Things to note when targeting GPU

Minimize data transfer between CPU and GPU.

Using GPU only makes sense if data size is large.

Some functions in MATLAB are optimized and can be faster than the GPU equivalent (e.g. FFT).

Use arrayfun to explicitly specify elementwise operations.

26

Summary

Acceleration methodologies in MATLAB & Simulink, each with its technology / product

1. Best Practices in Programming
• Vectorization & pre-allocation
• Environment tools (e.g. Profiler, Code Analyzer)
Technology / Product: MATLAB, Toolboxes, System Toolboxes

2. Better Algorithms
• Ideal environment for algorithm exploration
• Rich set of functionality (e.g. System objects)
Technology / Product: MATLAB, Toolboxes, System Toolboxes

3. More Processors or Cores
• High level parallel constructs (e.g. parfor, matlabpool)
• Utilize cluster, clouds, and grids
Technology / Product: Parallel Computing Toolbox, MATLAB Distributed Computing Server

4. Refactoring the Implementation
• Compiled code (MEX)
• GPUs, FPGA-in-the-Loop
Technology / Product: MATLAB, MATLAB Coder, Parallel Computing Toolbox

27

Q & A

Thank You
