Leveraging DSP from Linux & RTOS
Parallel Processing & Software Support (OpenMP, OpenCL)
Agenda
• Digital signal processing in the marketplace
• Introduction to parallel processing (multiple cores)
• Software support for parallel processing (OpenMP, OpenCL)
• Hands‐on Lab: Develop a simple OpenCL application
Digital Signal Processing in the Marketplace
Traditional DSP Applications
• Audio
• Image Processing
• Video Processing
• Speech Processing
• Communication
• Radar/Sonar
• Medical Imaging
• Seismology
• Finance/Stock Market
Characteristics of DSP
• Time‐frequency transforms: FFT
• Filtering: FIR, IIR, convolution, correlation
• Dot product, linear algebra
• Search and compare
• Wide variety of other algorithms that involve signal/data processing
TI DSP Architectures
• C54x: Classic DSP architecture
  – Typical instruction: MAC *AR5+, #1234h, A
• C6000 family: Multiple functional units
  – More than 300 assembly instructions (C66x)
Introduction to Parallel Processing: Multiple Cores
Multicore: Forefront of Computing Technology
“We’re not going to have faster processors. Instead, making software run faster in the future will mean using parallel programming techniques. This will be a huge shift.”
— Katherine Yelick, Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory (quoted in The Economist, “Parallel Bars”)
Parallel Processing: Master‐Slave Model
(Diagram: one Master core controlling multiple Slave cores)
• Centralized control and distributed execution
• The master is responsible for execution, scheduling, and data availability.
• On SoCs with ARM, the ARM core can be the master core and the DSP cores can be the slaves.
Parallel Processing: Data Flow Model
(Diagram: Core 0 → Core 1 → Core 2 pipeline)
• Distributed control and execution
• The algorithm is partitioned into multiple blocks:
  – Each block is processed by a core.
  – The output of one core is the input to the next core.
  – Data and messages are exchanged between all cores.
• Partition the algorithm to optimize performance.
Partitioning Considerations
• Can a certain algorithm be executed on multiple cores in parallel?
  – Can the data be divided between two cores?
  – An FIR filter can be divided; an IIR filter cannot.
• What are the dependencies between two (or more) algorithms?
  – Can they be processed in parallel?
  – Must one algorithm wait for the previous one to finish?
  – Example: Identification based on fingerprint and face recognition can be done in parallel. Pre‐filtering and then image reconstruction in CT must be done in sequence.
• Can the application run concurrently on two sets of data?
  – JPEG2000 video encoder: Yes
  – H.264 video encoder: No
Common Partitioning Methods
• Function‐driven partition
  – Large tasks are divided into function blocks.
  – Function blocks are assigned to each core.
  – The output of one core is the input of the next core.
• Data‐driven partition
  – Large data sets are divided into smaller data sets.
  – All cores perform the same process on different blocks of data.
• Mixed partition
  – Consists of both function‐driven and data‐driven partitioning.
Video Compression Algorithm: Function‐Driven Partition
• Video compression is performed frame after frame.
• Within a frame, the processing is done row after row.
• Video compression is done on macroblocks (16x16 pixels).
• Video compression can be divided into three parts:
  – Pre‐processing
  – Main processing
  – Post‐processing
Algorithm for Very Large DFT: Data‐Driven Partition

y(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N}, \quad k = 0, 1, \ldots, N-1

A very large DFT of size N = N1*N2 can be computed as follows:
1) Formulate the input into an N1 x N2 matrix.
2) Compute N2 FFTs, each of size N1.
3) Multiply by the global twiddle factors.
4) Matrix transpose: N2 x N1 -> N1 x N2.
5) Compute N1 FFTs, each of size N2.
Implementing VLFFT on Multiple Cores
• 1st iteration
  – 1024 FFTs (size 1024) are distributed across all the cores.
  – Each core implements the matrix transpose, computes 128 FFTs, and multiplies by the global twiddle factors.
• Synchronization
• 2nd iteration
  – 1024 FFTs of size 1024 are distributed across all the cores.
  – Each core computes 128 FFTs and implements the matrix transpose before and after the FFT computation.
Software Support for Parallel Processing
OpenMP: Parallel Language for the Homogeneous Model
OpenMP:
• Master‐slave model
• Based on fork and join
• Thread distribution
OpenMP standards:
• API for writing multi‐threaded applications
• API includes compiler directives and library routines
• C, C++, and Fortran support
OpenMP Example: Multiple Hello World
OpenMP: One Last Example

#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
OpenCL: Parallel Language for the Heterogeneous Model
• The content of this slide originates from the OpenCL standards body, Khronos.
• AM57x has an ARM Cortex‐A15 as the host and DSP cores as accelerators.
• TI is compliant with OpenCL 1.1.
OpenCL Platform Model
• A host is connected to one or more OpenCL devices.
• An OpenCL device is a collection of one or more compute units.
• A compute unit may have multiple processing elements.
OpenCL TI Platform Model
• ARM A15 is the host: Commands are submitted from the host to the OpenCL devices (execution and memory movement).
• All C66x CorePacs are OpenCL devices; each DSP core is a compute unit.
• An OpenCL device is viewed by the OpenCL programmer as a single virtual processor, which means the programmer does not need to know how many cores are in the device. The OpenCL runtime efficiently divides the total processing effort across the cores.
• NOTE: AM57x and K2H12 have the same OpenCL code.
66AK2H12: KeyStone II Multicore DSP + ARM
(Diagram: multiple ARM A15 cores and multiple C66x DSP cores, each DSP with its functional units, connected to multicore shared memory)
OpenCL Application Model
• Serial code: Host
• Parallel code: Multiple DSP cores
• Two parts: Execution model and memory model
OpenCL Execution Model
• Host: Context — defines devices and state
• Compute device: One or more compute units
• Processing algorithm: One or more kernels
• Compute unit: One or more processing elements; executes a work group
• Processing element: Executes a work item
• Work items are grouped into work groups.
OpenCL Execution Model: Definitions
• Context
  – Device
  – Command queue
  – Global buffers
• Build kernels
  – Get the source from a file (or part of the code) and compile it at run time, OR
  – Get binaries, either as a stand‐alone .out or from a library
• Manipulate memory & buffers
  – Move data and define local memory
• Execute
  – Dispatch all work items
OpenCL Execution Model (sequence)
Definitions -> Build Kernels -> Manipulate Memory & Buffers -> Execute
Kernel Definition – part of main.cpp
OpenCL Memory Model
• Private memory
  – Per work item
• Local memory
  – Shared within a work group; local to a compute unit (core)
• Global/Constant memory
  – Shared across all compute units (cores) in a device
• Host memory
  – Attached to the host CPU
  – Can be distinct from global memory (read/write buffer model)
  – Can be the same as global memory (map/unmap buffer model)
Memory management is explicit: Commands move data from host -> global -> local and back.
© Copyright Khronos Group, 2009
For More Information
• Hands‐on Lab: Develop OpenCL Simple Application (PDF attached; see Resources)
• TI OpenCL Wiki: http://processors.wiki.ti.com/index.php/OpenCL
• For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.