Accelerating System Simulations

김용정, General Manager, Senior Applications Engineer

Why simulation acceleration?

From algorithm exploration to system design

– Size and complexity of models increase

– Time needed for a single simulation increases

– Number of test cases increases

– Test cases become larger

Need to reduce

– simulation time during design

– simulation time for large scale testing during prototyping

MATLAB is quite fast

Optimized and widely-used libraries

– BLAS: Basic Linear Algebra Subprograms (multithreaded)

– LAPACK: Linear Algebra Package

JIT (Just In Time) Acceleration

– On-the-fly multithreaded code generation for increased speed

Built-in support for vector and matrix operations

Application

LTE Physical Downlink Control Channel (PDCCH)

Workflow

Start with a baseline algorithm

Profile it to establish a performance yardstick

Introduce the following optimizations:

– Better MATLAB serial programming techniques

– Using System objects

– MATLAB to C code generation (MEX)

– Parallel Computing

– GPU-optimized System objects

– Rapid Accelerator mode of simulation in Simulink

Simulation acceleration options in MATLAB

[Diagram: User's Code, with the options Better MATLAB, System objects, MATLAB to C, Parallel Computing, and "processing"]

Profiling MATLAB algorithms

Profiler summarizes MATLAB code execution

– total time spent within each function

– which lines of code use the most processing time

Helps identify algorithm bottlenecks
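A minimal sketch of a profiling session, assuming a small stand-in workload rather than the PDCCH algorithm itself. The loop and the size `N` are illustrative assumptions; `profile on`, `profile off`, `profile('info')` and `profile viewer` are standard MATLAB commands.

```matlab
% Stand-in workload (hypothetical): an array that grows inside a loop,
% which is the kind of bottleneck the Profiler report points to.
N = 2e4;

profile on                       % start collecting timing data
y = [];
for n = 1:N
    y = [y; n^2];                %#ok<AGROW> deliberate bottleneck
end
profile off                      % stop collecting

stats = profile('info');         % results as a struct (FunctionTable, ...)
% profile viewer                 % uncomment to open the interactive report
```

The recorded times serve as the yardstick against which each later optimization is compared.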

Effective MATLAB programming techniques

Pre-allocation

– Initialize an array using its final size

– Helps avoid dynamically resizing arrays in a loop

Vectorization

– Convert code from using scalar loops to using matrix/vector operations

– Helps MATLAB leverage processor-optimized libraries for vector processing
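As an illustration of vectorization, the sketch below evaluates the same expression once with a scalar loop and once with vector operations. The signal (a damped sinusoid) and the size `N` are assumptions chosen only for the comparison.

```matlab
N = 1e6;                          % assumed number of samples
t = (0:N-1).' / N;                % time axis as a column vector

% Scalar loop: one element per iteration
yLoop = zeros(N,1);               % pre-allocated, so only the loop is compared
tic
for k = 1:N
    yLoop(k) = sin(2*pi*5*t(k)) * exp(-3*t(k));
end
tLoop = toc;

% Vectorized: the same expression applied to whole vectors
tic
yVec = sin(2*pi*5*t) .* exp(-3*t);
tVec = toc;

fprintf('loop %.3f s, vectorized %.3f s\n', tLoop, tVec);
maxDiff = max(abs(yLoop - yVec)); % both forms should agree to rounding error
```

The measured ratio depends on the MATLAB release and machine, since JIT acceleration also speeds up simple loops.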

Example of pre-allocation

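% Before: y is resized on every iteration
y=[];
for n=1:LEN/Tx
    G=[u(idx1(n)) u(idx2(n));...
       -conj(u(idx2(n))) conj(u(idx1(n)))];
    y=[y;G];
end

% After: y is allocated once at its final size and filled by indexing
y=complex(zeros(LEN,Tx));
y(idx1,1)=u(idx1);
y(idx1,2)=u(idx2);
y(idx2,1)=-conj(u(idx2));
y(idx2,2)=conj(u(idx1));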
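The two forms can be checked against each other with a short test harness. This is a sketch under assumptions: the slide does not define `LEN`, `Tx`, `u`, `idx1` or `idx2`, so the values below (two antennas, random complex input, odd and even sample indices) are hypothetical.

```matlab
LEN = 2000;  Tx = 2;                          % assumed frame length, antennas
u = complex(randn(LEN,1), randn(LEN,1));      % assumed random input symbols
idx1 = 1:Tx:LEN;  idx2 = 2:Tx:LEN;            % assumed odd / even indices

% Growing the array inside the loop
tic
yGrow = [];
for n = 1:LEN/Tx
    G = [u(idx1(n)) u(idx2(n)); -conj(u(idx2(n))) conj(u(idx1(n)))];
    yGrow = [yGrow; G];                       %#ok<AGROW>
end
tGrow = toc;

% Pre-allocated array filled by indexing
tic
yPre = complex(zeros(LEN,Tx));
yPre(idx1,1) = u(idx1);
yPre(idx1,2) = u(idx2);
yPre(idx2,1) = -conj(u(idx2));
yPre(idx2,2) = conj(u(idx1));
tPre = toc;

sameResult = isequal(yGrow, yPre);            % both forms build the same matrix
fprintf('growing %.4f s, pre-allocated %.4f s\n', tGrow, tPre);
```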

Using System objects of DSP & Communications System Toolboxes

System objects facilitate stream processing

Can accelerate simulation because

– Decouple declaration from the execution of the algorithms

– Reduce overhead of parameter handling in the loop

– Most of them implemented as MATLAB executables (MEX)

Example of System objects

function s = Alamouti_DecoderS(u,H) %#codegen
% STBC Combiner
persistent hTDDec
if isempty(hTDDec)
    hTDDec= comm.OSTBCCombiner(...
        'NumTransmitAntennas',2,'NumReceiveAntennas',2);
end
s = step(hTDDec, u, H);

MATLAB to C code generation

MATLAB Coder

Automatically generate a MEX function

Call the generated MEX file within testbench

Verify same numerical results

Assess the baseline function and the generated MEX function for speed
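The four steps above can be sketched as follows, assuming MATLAB Coder and a supported C compiler are installed. The function `scaleAndSum` is a hypothetical stand-in written to disk only so that the example is self-contained; it is not part of the presentation.

```matlab
% Hypothetical function to compile, written to the current folder
src = sprintf('function y = scaleAndSum(x) %%#codegen\ny = sum(2*x);\nend\n');
fid = fopen('scaleAndSum.m', 'w');
fprintf(fid, '%s', src);
fclose(fid);
rehash                                  % make sure MATLAB sees the new file

x = rand(1000,1);                       % example input that fixes size and type

% 1) Generate the MEX function (produces scaleAndSum_mex)
codegen scaleAndSum -args {x}

% 2) Call the baseline and the generated MEX function from the testbench
yRef = scaleAndSum(x);
yMex = scaleAndSum_mex(x);

% 3) Verify the numerical results agree (to rounding error)
relErr = abs(yRef - yMex) / abs(yRef);

% 4) Assess both versions for speed
tRef = timeit(@() scaleAndSum(x));
tMex = timeit(@() scaleAndSum_mex(x));
fprintf('MATLAB %.2e s, MEX %.2e s\n', tRef, tMex);
```

For a function this small the MEX call overhead can outweigh any gain; the benefit usually shows on loop-heavy code.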

Parallel Simulation Runs

[Figure: two timelines of Task 1 to Task 4 (axis: Time), with TOOLBOXES, BLOCKSETS and Worker labels]

>> Demo

Summary

Use matlabpool to open the available workers

No modification of algorithm

Use parfor loop instead of for loop

Parallel computation or simulation leads to further acceleration

More cores = more speed
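A minimal sketch of the `for` to `parfor` change, assuming a simple BPSK-over-AWGN bit-error-rate sweep as the workload (the test points and bit count are illustrative, not taken from the demo). In current releases the pool is opened with `parpool`, which replaced `matlabpool`; without Parallel Computing Toolbox the loop still runs, but serially.

```matlab
EbNo_dB = 0:2:6;                  % assumed test points
numBits = 1e5;                    % assumed bits per test point
ber = zeros(size(EbNo_dB));

% Each iteration is independent, so "for" can be replaced by "parfor"
% without touching the algorithm inside the loop.
parfor k = 1:numel(EbNo_dB)
    bits  = rand(numBits,1) > 0.5;                  % random data bits
    tx    = 1 - 2*bits;                             % bipolar mapping: 0->+1, 1->-1
    sigma = sqrt(1 / (2 * 10^(EbNo_dB(k)/10)));     % noise standard deviation
    rx    = tx + sigma*randn(numBits,1);            % AWGN channel
    ber(k) = mean((rx < 0) ~= bits);                % hard decision and error rate
end

berTheory = 0.5*erfc(sqrt(10.^(EbNo_dB/10)));       % closed-form BPSK reference
disp([EbNo_dB; ber; berTheory].');
```

The speed-up approaches the number of workers only when each iteration does enough work to outweigh the scheduling and data-transfer overhead.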


What is a Graphics Processing Unit (GPU)?

Originally for graphics acceleration, now also used for scientific calculations

Massively parallel array of integer and floating point processors

– Typically hundreds of processors per card

– GPU cores complement CPU cores

Dedicated high-speed memory

Why would you want to use a GPU?

Speed up execution of computationally intensive simulations

For example:

– Performance: A\b with Double Precision

Options for Targeting GPUs

1) Use GPU with MATLAB built-in functions

2) Execute MATLAB functions elementwise on the GPU

3) Create kernels from existing CUDA code and PTX files

Data Transfer between MATLAB and GPU

% Push data from CPU to GPU memory

Agpu = gpuArray(A)

% Bring results from GPU memory back to CPU

B = gather(Bgpu)
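A short sketch of the full round trip for option 1 (built-in functions on GPU data), using the `A\b` solve as the workload. The matrix size and the CPU fallback are assumptions added so the snippet also runs on a machine without a supported GPU or without Parallel Computing Toolbox.

```matlab
n = 1000;                                   % assumed problem size
A = rand(n) + n*eye(n);                     % well-conditioned test matrix
b = rand(n,1);

% Use the GPU only if the toolbox function exists and a device is present
haveGPU = ~isempty(which('gpuDeviceCount')) && gpuDeviceCount > 0;

if haveGPU
    Agpu = gpuArray(A);                     % push data from CPU to GPU memory
    bgpu = gpuArray(b);
    xgpu = Agpu \ bgpu;                     % built-in mldivide runs on the GPU
    x = gather(xgpu);                       % bring the result back to the CPU
else
    x = A \ b;                              % same operation on the CPU
end

relResidual = norm(A*x - b) / norm(b);      % sanity check on the solution
```

Keeping intermediate results on the GPU and calling `gather` once at the end avoids repeated transfers.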

GPU Processing with Communications System Toolbox

Alternative implementations of many System objects take advantage of GPU processing

Use Parallel Computing Toolbox to execute many communications algorithms directly on the GPU

Easy-to-use syntax

Dramatically accelerate simulations

GPU System objects

comm.gpu.TurboDecoder

comm.gpu.ViterbiDecoder

comm.gpu.LDPCDecoder

comm.gpu.PSKDemodulator

comm.gpu.AWGNChannel

Example: Turbo Coding

Impressive coding gain

High computational complexity

Bit-error rate performance as a function of number of iterations

= comm.TurboDecoder(…

'NumIterations', numIter,…

Acceleration with GPU System objects

Version         Elapsed time    Acceleration
CPU             8 hours         1.0
1 GPU           40 minutes      12.0
Cluster of 4    11 minutes      43.0

Same numerical results

CPU:  = comm.TurboDecoder(… 'NumIterations', N,…
GPU:  = comm.gpu.TurboDecoder(… 'NumIterations', N,…

CPU:  = comm.AWGNChannel(…
GPU:  = comm.gpu.AWGNChannel(…

Key Operations in Turbo Coding Function

CPU

% Turbo Encoder
hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices);
% AWG Noise
hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
hBER = comm.ErrorRate;
% Turbo Decoder
hTDec = comm.TurboDecoder(...
    'TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations', numIter);
ber = zeros(3,1); %initialize BER output
%% Processing loop
% comm.ErrorRate returns [error rate; number of errors; number of bits compared],
% so the stopping test uses elements 2 and 3
while ( ber(2) < MaxNumErrs && ber(3) < MaxNumBits)
    data = randn(blkLength, 1)>0.5;
    % Encode random data bits
    yEnc = step(hTEnc, data);
    %Modulate, Add noise to real bipolar data
    modout = 1-2*yEnc;
    rData = step(hAWGN, modout);
    % Convert to log-likelihood ratios for decoding
    llrData = (-2/noiseVar).*rData;
    % Turbo Decode
    decData = step(hTDec, llrData);
    % Calculate errors
    ber = step(hBER, data, decData);
end

GPU Version 1 (decoder replaced by its GPU counterpart)

% Turbo Decoder
hTDec = comm.gpu.TurboDecoder(…

Profile results in Turbo Coding Function

CPU

Time (s)   Line
<0.01      hTEnc = comm.TurboEncoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
<0.01      hAWGN = comm.AWGNChannel('NoiseMethod', 'Variance');
<0.01      hBER = comm.ErrorRate;
<0.01      hTDec = comm.TurboDecoder(…
<0.01      ber = zeros(3,1); %initialize BER output
0.30       data = randn(blkLength, 1)>0.5;
2.33       yEnc = step(hTEnc, data);
0.05       modout = 1-2*yEnc;
1.50       rData = step(hAWGN, modout);
0.03       llrData = (-2/noiseVar).*rData;
330.54     decData = step(hTDec, llrData);
0.17       ber = step(hBER, data, decData);

GPU Version 1

Time (s)   Line
0.02       hTDec = comm.gpu.TurboDecoder(…
<0.01      ber = zeros(3,1); %initialize BER output

Key Operations in Turbo Coding Function

CPU

% Turbo Decoder
hTDec = comm.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
%% Processing loop
modout = 1-2*yEnc;

GPU Version 2

% Turbo Encoder
% AWG Noise
hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
% BER measurement
% Turbo Decoder - setup for Multi-frame or Multi-user processing
numFrames = 30;
hTDec = comm.gpu.TurboDecoder('TrellisStructure',poly2trellis(4, [13 15], 13),...
    'InterleaverIndices', intrlvrIndices,'NumIterations',numIter,...
    'NumFrames',numFrames);
%% Processing loop
data = randn(numFrames*blkLength, 1)>0.5;
yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
modout = 1-2*yEnc;
% Turbo Decode
% Calculate errors
ber=step(hBER, data, gather(decData));

Profile results in Turbo Coding Function

CPU

Time (s)   Line
<0.01      hTDec = comm.TurboDecoder(…

GPU Version 2

Time (s)   Line
0.03       hAWGN = comm.gpu.AWGNChannel('NoiseMethod', 'Variance');
0.01       numFrames = 30;
0.01       hTDec = comm.gpu.TurboDecoder('TrellisStructure',...
               poly2trellis(4, [13 15], 13),'InterleaverIndices', intrlvrIndices,...
               'NumIterations',numIter, 'NumFrames',numFrames);
0.22       data = randn(numFrames*blkLength, 1)>0.5;
2.45       yEnc = gpuArray(multiframeStep(hTEnc, data, numFrames));
0.09       ber=step(hBER, data, gather(decData));

Things to note when targeting GPU

Minimize data transfer between CPU and GPU.

Using GPU only makes sense if data size is large.

Some functions in MATLAB are optimized and can be faster than the GPU equivalent (e.g. FFT).

Use arrayfun to explicitly specify elementwise operations.
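As a sketch of the last point, the snippet below applies one element-wise function through `arrayfun`. The soft-limiter function and the input vector are hypothetical examples; the CPU branch is an assumption added so the code runs without a GPU, where `arrayfun` gives the same values but no speed benefit.

```matlab
x = linspace(-5, 5, 1e5).';                 % assumed input samples

% Element-wise function (hypothetical): soft limiter applied per sample
limiter = @(v) v ./ sqrt(1 + v.^2);

haveGPU = ~isempty(which('gpuDeviceCount')) && gpuDeviceCount > 0;

if haveGPU
    xg = gpuArray(x);                       % one transfer to the GPU
    yg = arrayfun(limiter, xg);             % function applied per element on the GPU
    y  = gather(yg);                        % one transfer back
else
    y = arrayfun(limiter, x);               % same call on the CPU
end

yRef = x ./ sqrt(1 + x.^2);                 % vectorized reference for comparison
maxDiff = max(abs(y - yRef));
```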

Summary

Acceleration methodologies in MATLAB & Simulink, with the technology / product for each

1. Best Practices in Programming
• Vectorization & pre-allocation
• Environment tools (e.g. Profiler, Code Analyzer)
Technology / Product: MATLAB, Toolboxes, System Toolboxes

2. Better Algorithms
• Ideal environment for algorithm exploration
• Rich set of functionality (e.g. System objects)
Technology / Product: MATLAB, Toolboxes, System Toolboxes

3. More Processors or Cores
• High level parallel constructs (e.g. parfor, matlabpool)
• Utilize cluster, clouds, and grids
Technology / Product: Parallel Computing Toolbox, MATLAB Distributed Computing Server

4. Refactoring the Implementation
• Compiled code (MEX)
• GPUs, FPGA-in-the-Loop
Technology / Product: MATLAB, MATLAB Coder, Parallel Computing Toolbox

Thank You
