Leveraging DSP from Linux & RTOS: Parallel Processing & Software Support (OpenMP, OpenCL)

Leveraging DSP from Linux RTOS - TI Training · Leveraging DSP from Linux & RTOS 3. ... Classic DSP architecture Typical Instruction: MAC*AR5+ #1234h A ... (PDF attached, see Resources)


Leveraging DSP from Linux & RTOS

Parallel Processing & Software Support (OpenMP, OpenCL)


Agenda
• Digital signal processing in the marketplace
• Introduction to parallel processing (multiple cores)
• Software support for parallel processing (OpenMP, OpenCL)
• Hands‐on Lab: Develop OpenCL simple application


Digital Signal Processing in the Marketplace

Leveraging DSP from Linux & RTOS


Traditional DSP Applications

• Audio
• Image Processing
• Video Processing
• Speech Processing
• Communication
• Radar/Sonar
• Medical Imaging
• Seismology
• Finance/Stock Market

Characteristics of DSP 

• Time‐frequency: FFT, FIR, IIR, convolution, correlation
• Dot product, linear algebra
• Search and compare

Wide variety of other algorithms that involve signal/data processing
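Many of the algorithms listed above reduce to repeated multiply‐accumulate operations. As a minimal sketch (the function name fir_filter and the convention that samples before x[0] are zero are assumptions made for illustration, not taken from the slides), a direct‐form FIR filter computes each output sample as a dot product of the coefficients with the most recent inputs:

```c
#include <assert.h>
#include <stddef.h>

/* Direct-form FIR filter: y[n] = sum_{k=0}^{taps-1} h[k] * x[n-k].
 * Samples before x[0] are treated as zero (illustrative convention). */
void fir_filter(const float *x, size_t len,
                const float *h, size_t taps, float *y)
{
    assert(x && h && y);
    for (size_t n = 0; n < len; n++) {
        float acc = 0.0f;                 /* accumulator, as in a MAC unit */
        for (size_t k = 0; k < taps && k <= n; k++) {
            acc += h[k] * x[n - k];       /* one multiply-accumulate */
        }
        y[n] = acc;
    }
}
```

The inner loop is the kind of work a DSP multiply‐accumulate instruction is designed to speed up.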


TI DSP Architectures

C54x: Classic DSP architecture

Typical instruction: MAC *AR5+, #1234h, A

C6000 family: multiple functional units

More than 300 assembly instructions (C66x)


Introduction to Parallel Processing: Multiple Cores

Leveraging DSP from Linux & RTOS

Embedded Processing


Multicore: Forefront of Computing Technology 

“We’re not going to have faster processors. Instead, making software run faster in the future will mean using parallel programming techniques. This will be a huge shift.” ‐‐ Katherine Yelick, Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory, from The Economist: Parallel Bars


Parallel Processing: Master‐Slave Model 

Figure: one master connected to three slaves.

• Centralized control and distributed execution

• Master is responsible for execution, scheduling, and data availability.

• For SoCs with ARM, the ARM core can be the master core and the DSP cores can be the slaves.


Parallel Processing: Data Flow Model

Figure: Core 0, Core 1, and Core 2 connected in a chain.

• Distributed control and execution

• The algorithm is partitioned into multiple blocks:
– Each block is processed by a core.
– The output of one core is the input to the next core.
– Data and messages are exchanged between all cores.

• Partition the algorithm to optimize performance.


Partitioning Considerations

• Can a certain algorithm be executed on multiple cores in parallel?
– Can the data be divided between two cores?
– FIR filter can be, IIR filter cannot.

• What are the dependencies between two (or more) algorithms?
– Can they be processed in parallel?
– Must one algorithm wait for the previous one to finish?
Example: Identification based on fingerprint and face recognition can be done in parallel. Pre‐filter and then image reconstruction in CT must be done in sequence.

• Can the application run concurrently on two sets of data?
– JPEG2000 video encoder: Yes
– H264 video encoder: No
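The FIR/IIR contrast can be illustrated with a short sketch. Each FIR output depends only on the input samples, so the output range can be split between two cores; each IIR output depends on the previous output, so the same simple split does not work. The helper names (fir_range, iir_first_order) are hypothetical, and plain sequential calls stand in for the two cores:

```c
#include <assert.h>
#include <stddef.h>

/* Compute FIR outputs y[begin..end). Each y[n] reads only the input x,
 * so disjoint ranges can be handed to different cores. */
void fir_range(const float *x, const float *h, size_t taps,
               float *y, size_t begin, size_t end)
{
    assert(begin <= end);
    for (size_t n = begin; n < end; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < taps && k <= n; k++)
            acc += h[k] * x[n - k];
        y[n] = acc;
    }
}

/* First-order IIR: y[n] = x[n] + a * y[n-1].
 * y[n] needs y[n-1], so the loop cannot simply be cut in half. */
void iir_first_order(const float *x, float a, float *y, size_t len)
{
    float prev = 0.0f;
    for (size_t n = 0; n < len; n++) {
        prev = x[n] + a * prev;
        y[n] = prev;
    }
}
```

Calling fir_range(x, h, taps, y, 0, len / 2) on one core and fir_range(x, h, taps, y, len / 2, len) on another fills y with the same values as a single call over the full range.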


Common Partitioning Methods

• Function‐driven Partition
– Large tasks are divided into function blocks.
– Function blocks are assigned to each core.
– The output of one core is the input of the next core.

• Data‐driven Partition
– Large data sets are divided into smaller data sets.
– All cores perform the same process on different blocks of data.

• Mixed Partition
– Consists of both function‐driven and data‐driven partitioning.

Figures: function‐driven partitioning chains Core 0, Core 1, and Core 2; data‐driven partitioning splits the data across Core 0, Core 1, and Core 2.


Video Compression Algorithm: Function‐Driven Partition

• Video compression is performed frame after frame.
• Within a frame, the processing is done row after row.
• Video compression is done on macroblocks (16x16 pixels).
• Video compression can be divided into three parts:
– Pre‐processing
– Main processing
– Post‐processing


Algorithm for Very Large DFT: Data‐Driven Partition

$$ y(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j \frac{2\pi}{N} n k}, \qquad k = 0, \ldots, N-1 $$

A very large DFT of size N=N1*N2 can be computed as follows:

1) Formulate input into N1xN2 matrix

2) Compute N2 FFTs of size N1

3) Multiply global twiddle factors

4) Matrix transpose: N2xN1 ‐> N1xN2

5) Compute N1 FFTs. Each is N2 size.
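One way to see why these steps work is to split the input and output indices. The derivation below is a sketch under an assumed index convention (row index n_1, column index n_2, output index k = k_1 + N_1 k_2); the slides do not specify the layout, and W_M = e^{-j 2\pi / M} is notation introduced here.

```latex
\begin{aligned}
n &= N_2 n_1 + n_2, \qquad k = k_1 + N_1 k_2, \qquad
  0 \le n_1, k_1 < N_1, \quad 0 \le n_2, k_2 < N_2, \\
y(k_1 + N_1 k_2)
  &= \sum_{n_2=0}^{N_2-1} \sum_{n_1=0}^{N_1-1}
     x(N_2 n_1 + n_2)\, W_N^{(N_2 n_1 + n_2)(k_1 + N_1 k_2)} \\
  &= \sum_{n_2=0}^{N_2-1}
     \Bigl[ W_N^{n_2 k_1}
     \sum_{n_1=0}^{N_1-1} x(N_2 n_1 + n_2)\, W_{N_1}^{n_1 k_1}
     \Bigr] W_{N_2}^{n_2 k_2}
\end{aligned}
```

The last line uses W_N^{N_2 n_1 k_1} = W_{N_1}^{n_1 k_1}, W_N^{N_1 n_2 k_2} = W_{N_2}^{n_2 k_2}, and W_N^{N n_1 k_2} = 1. The inner sums are the N2 FFTs of size N1 (step 2), the factors W_N^{n_2 k_1} are the global twiddle factors (step 3), and the outer sums are the N1 FFTs of size N2 (step 5). Under this reading, the transpose in step 4 only rearranges memory so that each FFT reads contiguous data.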


Implementing VLFFT on Multiple Cores

• 1st iteration
– 1024 FFTs (size 1024) are distributed across all the cores.
– Each core implements matrix transpose, computes 128 FFTs, and multiplies by the global twiddle factors.

• Synchronization

• 2nd iteration
– 1024 FFTs of size 1024 are distributed across all the cores.
– Each core computes 128 FFTs and implements matrix transpose before and after FFT computation.

Figure: data split across Core 0, Core 1, and Core 2.
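The distribution in each iteration can be sketched as a block partition. In the minimal sketch below, the names range_t and core_range are hypothetical, and the count of 8 cores used in the usage note is an assumption inferred from 1024 FFTs / 128 FFTs per core (the slides do not state the core count):

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t begin;  /* first work unit owned by the core */
    size_t end;    /* one past the last work unit */
} range_t;

/* Split `total` work units (e.g. FFTs) as evenly as possible across
 * `num_cores`; the first (total % num_cores) cores get one extra unit. */
range_t core_range(size_t total, size_t num_cores, size_t core_id)
{
    assert(num_cores > 0 && core_id < num_cores);
    size_t base  = total / num_cores;
    size_t extra = total % num_cores;
    range_t r;
    r.begin = core_id * base + (core_id < extra ? core_id : extra);
    r.end   = r.begin + base + (core_id < extra ? 1 : 0);
    return r;
}
```

With total = 1024 and num_cores = 8, every core receives 128 FFTs. A barrier, corresponding to the "Synchronization" step, would separate the two iterations.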


Software Support for Parallel Processing

Leveraging DSP from Linux & RTOS

Embedded Processing


OpenMP: Parallel Language for Homogeneous Model

OpenMP:
• Master‐slave model
• Based on fork and join
• Thread distribution

OpenMP standards:
• API for writing multi‐threaded applications
• API includes compiler directives and library routines
• C, C++, and Fortran support


OpenMP Example: Multiple Hello World 
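The slide's code listing did not survive extraction. A minimal sketch of what such an example typically looks like follows; it is not the original listing. The function name hello_from_threads is hypothetical, and the code falls back to a single thread when compiled without OpenMP support (for example, without -fopenmp on GCC):

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Every thread in the team prints a greeting; returns how many did. */
int hello_from_threads(void)
{
    int greeted = 0;
    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();   /* thread index within the team */
#endif
        printf("Hello World from thread %d\n", tid);
        #pragma omp atomic
        greeted++;
    }   /* implicit join: all threads finish here */
    assert(greeted >= 1);
    return greeted;
}
```

The parallel directive forks a team of threads, and the closing brace of the region is the join point, which matches the fork‐and‐join model described earlier.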


OpenMP: One Last Example 

#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
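A self‐contained version of this loop might look like the following sketch. The function name vector_add and the use of the combined parallel for directive are illustrative choices, not taken from the slides; if OpenMP is not enabled at compile time, the pragma is ignored and the loop runs serially:

```c
#include <assert.h>

/* a[i] += b[i] for i in [0, n). The iterations are independent, so
 * OpenMP may divide them among the threads of the team. */
void vector_add(float *a, const float *b, int n)
{
    assert(a && b && n >= 0);
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
    }
}
```

The loop variable of a work‐shared for loop is private to each thread, so no extra clauses are needed here.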


OpenCL: Parallel Language for Heterogeneous Model

• The content of this slide originates from the OpenCL standards body, Khronos.
• AM57x has an ARM Cortex‐A15 as the host and DSP cores as accelerators.
• TI is compliant with OpenCL 1.1.


OpenCL Platform Model

• A host is connected to one or more OpenCL devices.
• An OpenCL device is a collection of one or more compute units.
• A compute unit may have multiple processing elements.


OpenCL TI Platform Model

• ARM A15 is the host: Commands are submitted from the host to the OpenCL devices (execution and memory move).

• All C66 CorePacs are OpenCL devices. Each DSP core is a compute unit.

• An OpenCL device is viewed by the OpenCL programmer as a single virtual processor, which means the programmer does not need to know how many cores are in the device. The OpenCL runtime efficiently divides the total processing effort across the cores.

NOTE: AM57x and K2H12 have the same OpenCL code.

Figure: 66AK2H12 KeyStone II Multicore DSP + ARM block diagram, showing ARM A15 cores, C66x DSP cores, and multicore shared memory.

OpenCL Applications Model

• Serial code: Host
• Parallel code: Multiple DSP cores

Execution Model

Memory Model

OpenCL Execution Model

• Host: Context (define device and state)
• Compute Device (one or more Compute Units): Processing Algorithm (one or more kernels)
• Compute Unit (one or more Compute Elements): Work Group
• Compute (Processing) Element: Work Item

Work items => Work group
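The relationship between work items and work groups can be sketched in plain C. This is only an emulation of the indexing scheme, not OpenCL API code: the names kernel_fn, scale_kernel, and run_ndrange are hypothetical, and the sequential loops stand in for compute units and processing elements that would run concurrently on a real device.

```c
#include <assert.h>
#include <stddef.h>

/* A "kernel": the work done by one work item, identified by its global ID. */
typedef void (*kernel_fn)(size_t global_id, float *data);

/* Example kernel body: each work item doubles one element. */
void scale_kernel(size_t global_id, float *data)
{
    data[global_id] *= 2.0f;
}

/* Emulate dispatching global_size work items in groups of local_size.
 * Each outer iteration models one work group (mapped to a compute unit);
 * each inner iteration models one work item. */
void run_ndrange(kernel_fn kernel, float *data,
                 size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    size_t num_groups = global_size / local_size;
    for (size_t group = 0; group < num_groups; group++) {
        for (size_t local = 0; local < local_size; local++) {
            size_t global_id = group * local_size + local;
            kernel(global_id, data);
        }
    }
}
```

For example, run_ndrange(scale_kernel, data, 8, 4) models two work groups of four work items each.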


OpenCL Execution Model

• Definitions
– Context
– Device
– Command queue
– Global buffers

• Build Kernels
– Get source from file (or part of the code) and compile it at run‐time
OR
– Get binaries, either as stand‐alone .out or from a library

• Manipulate Memory & Buffers
– Move data and define local memory

• Execute
– Dispatch all work items


OpenCL Execution Model

Figure: annotated code listing with sections labeled Definitions, Manipulate Memory & Buffers, Build Kernels, and Execute.

Kernel Definition – part of main.cpp


OpenCL Memory Model

• Private Memory
– Per work‐item

• Local Memory
– Shared within a workgroup, local to a compute unit (core)

• Global/Constant Memory
– Shared across all compute units (cores) in a device

• Host Memory
– Attached to the Host CPU
– Can be distinct from global memory (Read / Write buffer model)
– Can be same as global memory (Map / Unmap buffer model)

Memory management is explicit: commands move data from host ‐> global ‐> local and back.

Figure: a compute device containing workgroups of work‐items, each work‐item with private memory, each workgroup with local memory, and global/constant memory shared by the device; the host has its own host memory.

© Copyright Khronos Group, 2009
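The explicit host ‐> global ‐> local ‐> global ‐> host data movement can be modeled in plain C. The sketch below is an analogy only: the arrays stand in for memory regions, memcpy stands in for the transfer commands, and the names (process_via_local, LOCAL_SIZE) are hypothetical rather than part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define LOCAL_SIZE 4   /* elements per work group's local buffer (assumed) */

/* Copy host data to "global" memory, process it tile by tile through a
 * small "local" buffer, and copy the result back to the host. */
void process_via_local(float *host, float *global_mem, size_t n)
{
    assert(n % LOCAL_SIZE == 0);
    memcpy(global_mem, host, n * sizeof(float));            /* host -> global */

    for (size_t base = 0; base < n; base += LOCAL_SIZE) {
        float local_mem[LOCAL_SIZE];                        /* per work group */
        memcpy(local_mem, global_mem + base, sizeof local_mem); /* global -> local */
        for (size_t i = 0; i < LOCAL_SIZE; i++) {
            float priv = local_mem[i] + 1.0f;               /* private, per work item */
            local_mem[i] = priv;
        }
        memcpy(global_mem + base, local_mem, sizeof local_mem); /* local -> global */
    }

    memcpy(host, global_mem, n * sizeof(float));            /* global -> host */
}
```

Each copy in the sketch corresponds to a transfer that the programmer must request explicitly in this memory model.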


For More Information

• Hands‐on Lab: Develop OpenCL Simple Application (PDF attached, see Resources)

• TI OpenCL Wiki: http://processors.wiki.ti.com/index.php/OpenCL

• For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.
