Leveraging DSP from Linux & RTOS
Parallel Processing & Software Support (OpenMP, OpenCL)
Agenda
• Digital signal processing in the marketplace
• Introduction to parallel processing (multiple cores)
• Software support for parallel processing (OpenMP, OpenCL)
• Hands‐on Lab: Develop a simple OpenCL application
Digital Signal Processing in the Marketplace
Traditional DSP Applications
• Audio
• Image Processing
• Video Processing
• Speech Processing
• Communication
• Radar/Sonar
• Medical Imaging
• Seismology
• Finance/Stock Market
Characteristics of DSP
• Time‐frequency transforms: FFT
• Filtering: FIR, IIR, convolution, correlation
• Dot product, linear algebra
• Search and compare
• Wide variety of other algorithms that involve signal/data processing
TI DSP Architectures
• C54x: Classic DSP architecture
  – Typical instruction: MAC *AR5+, #1234h, A
• C6000 family: Multiple functional units
  – More than 300 assembly instructions (C66x)
Introduction to Parallel Processing: Multiple Cores
Multicore: Forefront of Computing Technology
“We’re not going to have faster processors. Instead, making software run faster in the future will mean using parallel programming techniques. This will be a huge shift.”
— Katherine Yelick, Associate Laboratory Director for Computing Sciences, Lawrence Berkeley National Laboratory (quoted in The Economist, “Parallel Bars”)
Parallel Processing: Master‐Slave Model
(Diagram: one Master core controlling multiple Slave cores)
• Centralized control and distributed execution
• The master is responsible for execution, scheduling, and data availability.
• On SoCs with ARM, the ARM core can be the master core and the DSP cores can be the slaves.
Parallel Processing: Data Flow Model
(Diagram: Core 0 → Core 1 → Core 2 pipeline)
• Distributed control and execution
• The algorithm is partitioned into multiple blocks:
  – Each block is processed by a core.
  – The output of one core is the input to the next core.
  – Data and messages are exchanged between all cores.
• Partition the algorithm to optimize performance.
Partitioning Considerations
• Can a certain algorithm be executed on multiple cores in parallel?
  – Can the data be divided between two cores?
  – An FIR filter can be divided; an IIR filter cannot.
• What are the dependencies between two (or more) algorithms?
  – Can they be processed in parallel?
  – Must one algorithm wait for the previous one to finish?
  – Example: Identification based on fingerprint and face recognition can be done in parallel. Pre‐filtering and then image reconstruction in CT must be done in sequence.
• Can the application run concurrently on two sets of data?
  – JPEG2000 video encoder: Yes
  – H.264 video encoder: No
Common Partitioning Methods
• Function‐driven partition
  – Large tasks are divided into function blocks.
  – Function blocks are assigned to each core.
  – The output of one core is the input of the next core.
• Data‐driven partition
  – Large data sets are divided into smaller data sets.
  – All cores perform the same process on different blocks of data.
• Mixed partition
  – Consists of both function‐driven and data‐driven partitioning.
Video Compression Algorithm: Function‐Driven Partition
• Video compression is performed frame after frame.
• Within a frame, the processing is done row after row.
• Video compression is done on macroblocks (16x16 pixels).
• Video compression can be divided into three parts:
  – Pre‐processing
  – Main processing
  – Post‐processing
Algorithm for Very Large DFT: Data‐Driven Partition

y(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N}, \quad k = 0, 1, \ldots, N-1

A very large DFT of size N = N1*N2 can be computed as follows:
1) Formulate the input into an N1 x N2 matrix.
2) Compute N2 FFTs, each of size N1.
3) Multiply by the global twiddle factors.
4) Matrix transpose: N2 x N1 -> N1 x N2.
5) Compute N1 FFTs, each of size N2.
Implementing VLFFT on Multiple Cores
• 1st iteration
  – 1024 FFTs (size 1024) are distributed across all the cores.
  – Each core implements the matrix transpose, computes 128 FFTs, and multiplies by the global twiddle factors.
• Synchronization
• 2nd iteration
  – 1024 FFTs of size 1024 are distributed across all the cores.
  – Each core computes 128 FFTs and implements the matrix transpose before and after the FFT computation.
Software Support for Parallel Processing
OpenMP: Parallel Language for the Homogeneous Model
OpenMP:
• Master‐slave model
• Based on fork and join
• Thread distribution
OpenMP standards:
• API for writing multi‐threaded applications
• API includes compiler directives and library routines
• C, C++, and Fortran support
OpenMP Example: Multiple Hello World
OpenMP: One Last Example

#pragma omp parallel
#pragma omp for
for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
OpenCL: Parallel Language for the Heterogeneous Model
• The content of this slide originates from the OpenCL standards body, Khronos.
• AM57x has an ARM Cortex‐A15 as the host and DSP cores as accelerators.
• TI is compliant with OpenCL 1.1.
OpenCL Platform Model
• A host is connected to one or more OpenCL devices.
• An OpenCL device is a collection of one or more compute units.
• A compute unit may have multiple processing elements.
OpenCL TI Platform Model
• ARM A15 is the host: Commands are submitted from the host to the OpenCL devices (execution and memory movement).
• All C66x CorePacs are OpenCL devices; each DSP core is a compute unit.
• An OpenCL device is viewed by the OpenCL programmer as a single virtual processor, which means the programmer does not need to know how many cores are in the device. The OpenCL runtime efficiently divides the total processing effort across the cores.
• NOTE: AM57x and K2H12 have the same OpenCL code.
66AK2H12: KeyStone II Multicore DSP + ARM
(Diagram: multiple ARM A15 cores and multiple C66x DSP cores, each DSP with its functional units, connected to multicore shared memory)
OpenCL Application Model
• Serial code: Host
• Parallel code: Multiple DSP cores
• Two parts: Execution model and memory model
OpenCL Execution Model
• Host: Context — defines devices and state
• Compute device: One or more compute units
• Processing algorithm: One or more kernels
• Compute unit: One or more processing elements; executes a work group
• Processing element: Executes a work item
• Work items are grouped into work groups.
OpenCL Execution Model: Definitions
• Context
  – Device
  – Command queue
  – Global buffers
• Build kernels
  – Get the source from a file (or part of the code) and compile it at run time, OR
  – Get binaries, either as a stand‐alone .out or from a library
• Manipulate memory & buffers
  – Move data and define local memory
• Execute
  – Dispatch all work items
OpenCL Execution Model (sequence)
Definitions -> Build Kernels -> Manipulate Memory & Buffers -> Execute
Kernel Definition – part of main.cpp
OpenCL Memory Model
• Private memory
  – Per work item
• Local memory
  – Shared within a work group; local to a compute unit (core)
• Global/Constant memory
  – Shared across all compute units (cores) in a device
• Host memory
  – Attached to the host CPU
  – Can be distinct from global memory (read/write buffer model)
  – Can be the same as global memory (map/unmap buffer model)
Memory management is explicit: Commands move data from host -> global -> local and back.
© Copyright Khronos Group, 2009
For More Information
• Hands‐on Lab: Develop OpenCL Simple Application (PDF attached; see Resources)
• TI OpenCL Wiki: http://processors.wiki.ti.com/index.php/OpenCL
• For questions regarding topics covered in this training, visit the support forums at the TI E2E Community website.