How to leverage Multicore Architecture for Compute Intensive Applications – FTF-SDS-F0598
Huang Yun
Wind River Confidential – NDA Disclosure.
Agenda
Freescale hardware
– QorIQ
– i.MX6
– SMP/AMP
Multi-core Software Architectures
– MCAPI/MRAPI
– OpenMP
– OpenCL
– Cilk/Cilk++
– Proprietary
QorIQ
QorIQ T4240 CPU architecture
12 CPU cores – 64-bit e6500
– 1.8 GHz at 1 V
– Dual-threaded, for 24 hardware threads
– Hardware virtualization
L1 cache shared between the two threads of each core
L2 cache shared by each cluster of 4 cores
i.MX6 Quad
4 CPU cores – 32-bit ARM Cortex-A9
– 1.2 GHz
L1: 32 KB I-cache & 32 KB D-cache per core
L2: 1 MB shared cache
Hardware graphics accelerator
– OpenGL & OpenCL capable
Maximizing Multi-core Benefits
Multi-core platforms can deliver more performance at lower power, but that outcome is not guaranteed.
Successfully mapping a single-core application onto a multi-core architecture is a journey challenged by what you don't know.
Operating environments are not created equal: their configuration options determine whether your specific applications reach maximum performance on your chosen multi-core platform.
Single to Multi-Core
[Diagram: on single-core, each application (App 1, App 2, App 3, …) runs on its own subsystem with its own OS and single core; on multi-core, Cores 1–8 share one OS that hosts App 1 through App 8.]
Multi-Core Architecture: SMP
Symmetric multiprocessing (SMP)
• Many computing resources for OS and applications to share
• Single RTOS and scheduler
• Priority based assumptions might cause timing issues
[Diagram: Cores 1–8 under a single SMP OS running App 1 through App 8.]
Best suited for
• Heavy processing tasks such as data manipulation and image processing
Not as suitable for
• Hard real-time response requirements
Multi-Core Architecture: uAMP
Unsupervised asymmetric multiprocessing (uAMP)
• Same or different copies of an RTOS are running on all cores in an unsupervised AMP environment
• OS and applications do not share computing resources
[Diagram: each of Cores 1–8 runs its own AMP OS instance with its own application (App 1 through App 8).]
Best suited for
• Small independent deterministic tasks
Not as suitable for
• Heavy processing tasks
Multi-Core Architecture: Mixed
SMP and uAMP
• An SMP operating system controls the first couple of cores, while the rest of the cores run unsupervised AMP images
• AMP OS instances do not have to be the same
[Diagram: Cores 1 and 2 run an SMP OS hosting App 1; each of the remaining cores (3–8) runs its own AMP OS instance with its own application (App 2 … App 8).]
Best suited for
• Consolidation that brings mix of tasks into one platform
Provisioning the system
System Resources (an illustrative partition plan is sketched below)
• Which CPUs belong to which OS domains
• Where to map memory, both RAM and flash
• Interrupts – which interrupts are handled by which cores
• Devices – which devices provide connectivity to each OS
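Purely as an illustration of what such a partition plan captures (every name, CPU mask, address, and interrupt/device list below is hypothetical, not taken from any real board support package), the provisioning decisions can be written down as a simple table in C before being translated into device trees or hypervisor configuration:

/* Illustrative partition plan for a hypothetical three-domain system.
   All names, CPU masks, addresses and interrupt/device lists are invented. */
#include <stdint.h>
#include <stdio.h>

struct os_domain {
    const char *name;      /* OS instance */
    unsigned    cpu_mask;  /* which CPUs belong to this OS domain */
    uint64_t    ram_base;  /* start of this domain's private RAM */
    uint64_t    ram_size;  /* size of this domain's private RAM */
    const char *irqs;      /* interrupts routed to this domain */
    const char *devices;   /* devices owned by this domain */
};

static const struct os_domain plan[] = {
    { "OS 0 (SMP GPOS)", 0x3, 0x00000000, 256u << 20, "timer, eth0",  "eth0, sd0" },
    { "OS 1 (RTOS)",     0x4, 0x10000000, 128u << 20, "timer, can0",  "can0"      },
    { "OS 2 (RTOS)",     0x8, 0x18000000, 128u << 20, "timer, uart1", "uart1"     },
};

int main(void)
{
    for (size_t i = 0; i < sizeof plan / sizeof plan[0]; i++)
        printf("%-16s cpus=0x%x ram=0x%08llx (%llu MB) irqs=[%s] devices=[%s]\n",
               plan[i].name, plan[i].cpu_mask,
               (unsigned long long)plan[i].ram_base,
               (unsigned long long)(plan[i].ram_size >> 20),
               plan[i].irqs, plan[i].devices);
    return 0;
}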
Take Full Advantage of Multicore with Multi-OS
[Diagram: CPUs 0 and 1 run OS 0 with Memory 0 and their own interrupts and devices; CPU 2 runs OS 1 with Memory 1, interrupts and devices; CPU 3 runs OS 2 with Memory 2, interrupts and devices; the OS instances are linked by inter-process communication.]
Inter-Process Communication
[Diagram: a GPOS and an RTOS, each with its own memory, exchange commands and data by sending and receiving through a shared-memory pool.]
• Proprietary
• Roll your own (sketched below)
• Use MCAPI / MRAPI
System Resources
• Interrupt
• Shared Memory
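To make the "roll your own" option concrete, here is a minimal sketch of a proprietary shared-memory mailbox between two OS instances. The base address, layout, and polling loop are all assumptions for illustration; a production design would add cache maintenance and raise an interrupt rather than poll (__sync_synchronize() is a GCC builtin memory barrier):

/* "Roll your own" IPC: a one-slot mailbox in shared memory.
   SHM_BASE and the layout are hypothetical; real code must match the
   memory map provisioned for both OS domains. */
#include <stdint.h>

#define SHM_BASE 0x20000000UL                     /* hypothetical shared window */
#define MBOX     ((volatile struct mailbox *)SHM_BASE)

struct mailbox {
    uint32_t cmd;        /* command word written by the sender  */
    uint32_t len;        /* number of valid bytes in data[]     */
    uint32_t ready;      /* 1 = message pending, 0 = consumed   */
    uint8_t  data[244];  /* payload                             */
};

/* Sender side (e.g. the GPOS). The receiver clears 'ready' when done. */
static void mbox_send(uint32_t cmd, const void *buf, uint32_t len)
{
    while (MBOX->ready)                   /* wait until the slot is free */
        ;
    for (uint32_t i = 0; i < len; i++)
        MBOX->data[i] = ((const uint8_t *)buf)[i];
    MBOX->cmd = cmd;
    MBOX->len = len;
    __sync_synchronize();                 /* publish payload before flag */
    MBOX->ready = 1;                      /* signal the receiving OS     */
}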
MCAPI / MRAPI
MCAPI (Multicore Communications API)
• Node: CPU, OS or Process/Thread instance
• Endpoint: Connected / Connectionless
• Channel: Scalar or Datagram
Sequence of events
1. Define Topology
- Nodes
- Endpoints
2. Create channels
- Connected
- Connectionless
3. Send/Receive Data (see the sketch below)
MRAPI (Multicore Resource API)
• Shared Memory
• Shared Semaphores
• Interrupts
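A minimal sketch of the sending side of that sequence, assuming MCAPI 2.0-style signatures from <mcapi.h>; exact prototypes, constants, and the domain/node/port numbering below vary by MCAPI version and vendor implementation and are only illustrative here:

/* Hedged sketch: connectionless MCAPI messaging from this node to a peer.
   Assumes MCAPI 2.0-style calls; check your vendor's mcapi.h for exact forms. */
#include <mcapi.h>
#include <string.h>

#define MY_DOMAIN  0   /* hypothetical topology: one domain          */
#define MY_NODE    0   /* this OS instance is node 0                 */
#define MY_PORT    1   /* local endpoint is (node 0, port 1)         */
#define PEER_NODE  1   /* the RTOS side is node 1                    */
#define PEER_PORT  1   /* remote endpoint is (node 1, port 1)        */

void send_command(const char *cmd)
{
    mcapi_status_t status;
    mcapi_info_t   info;

    /* 1. Define the topology: join as (domain, node). */
    mcapi_initialize(MY_DOMAIN, MY_NODE, NULL, NULL, &info, &status);

    /* 2. Create the local endpoint and look up the peer's endpoint. */
    mcapi_endpoint_t local  = mcapi_endpoint_create(MY_PORT, &status);
    mcapi_endpoint_t remote = mcapi_endpoint_get(MY_DOMAIN, PEER_NODE, PEER_PORT,
                                                 MCA_INFINITE /* timeout; constant name varies */,
                                                 &status);

    /* 3. Send a connectionless (datagram) message to the peer. */
    mcapi_msg_send(local, remote, (void *)cmd, strlen(cmd) + 1,
                   1 /* priority */, &status);

    mcapi_finalize(&status);
}

On the receiving node, the mirror-image calls (mcapi_endpoint_create on its own port, then mcapi_msg_recv) complete the exchange; connected scalar or packet channels follow the same pattern with additional channel open/connect calls.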
OpenMP
• Shared memory between compute cores (one address space)
• Can use Pthreads underneath
PI formula in C – single-threaded (the "Hello World" of parallel programming)

#include <stdio.h>

static long num_steps = 100000;
double step;

int main()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++)
    {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf (" PI is %f\n", pi);
    return 0;
}
PI formula in C - OpenMP

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;
double step;
#define PAD 8               /* pad each thread's slot to avoid false sharing */
static int num_threads = 4;
static long thrd_step;

int main()
{
    double pi = 0.0;
    double sum = 0.0;
    double my_sum[num_threads][PAD];
    int j;
    double start_time;
    double end_time;

    step = 1.0/(double) num_steps;
    thrd_step = num_steps / num_threads;
    omp_set_num_threads(num_threads);
    start_time = omp_get_wtime();

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        int i;
        double x;
        int startat = ID * thrd_step;
        my_sum[ID][0] = 0.0;
        for (i = startat; i < startat+thrd_step; i++)
        {
            x = (i+0.5)*step;
            my_sum[ID][0] += 4.0/(1.0+x*x);
        }
    } // end of parallel region

    for (j = 0; j < num_threads; j++)
        sum += my_sum[j][0];
    pi = step * sum;
    end_time = omp_get_wtime();
    printf (" PI is %f (%f s)\n", pi, end_time - start_time);
    return 0;
}
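The manual decomposition above (per-thread partial sums, padded to avoid false sharing) is exactly what OpenMP's reduction clause automates; a shorter sketch of the same computation:

/* Same PI computation using an OpenMP reduction: the runtime gives each
   thread a private copy of 'sum' and combines the copies after the loop. */
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf(" PI is %f\n", step * sum);
    return 0;
}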
OpenCL
OpenCL – Open Computing Language
• Khronos standard – started by Apple in 2008
• Used with symmetric CPU cores
• Used with GPGPU (General Purpose GPU)
• Needs an OpenCL driver for the GPU
PI formula in OpenCL - C

int main(void)
{
…
char *kernelsource = getKernelSource("../pi_ocl.cl"); // Kernel source
cl_int err;
cl_device_id device_id; // compute device id
cl_context context; // compute context
cl_command_queue commands; // compute command queue
cl_program program; // compute program
cl_kernel kernel_pi; // compute kernel
// Set up OpenCL context, queue, kernel, etc.
cl_uint numPlatforms; // Find number of platforms
err = clGetPlatformIDs(0, NULL, &numPlatforms);
…
// Get all platforms
cl_platform_id Platform[numPlatforms];
err = clGetPlatformIDs(numPlatforms, Platform, NULL);
…
https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c
// Secure a device
for (int i = 0; i < numPlatforms; i++)
{
err = clGetDeviceIDs(Platform[i], DEVICE, 1, &device_id, NULL);
}
// Output information
err = output_device_info(device_id);
// Create a compute context
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
…
// Create a command queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
…
// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1, (const char **)
&kernelsource, NULL, &err);
// Build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
// Create the compute kernel from the program
kernel_pi = clCreateKernel(program, "pi", &err);
// Find kernel work-group size
err = clGetKernelWorkGroupInfo (kernel_pi, device_id,
CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &work_group_size, NULL);
// Now that we know the size of the work-groups, we can set the number of
// work-groups, the actual number of steps, and the step size
nwork_groups = in_nsteps/(work_group_size*niters);
if (nwork_groups < 1)
{ err = clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(size_t), &nwork_groups, NULL);
work_group_size = in_nsteps / (nwork_groups * niters);
}
nsteps = work_group_size * niters * nwork_groups;
…
d_partial_sums = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) *
nwork_groups, NULL, &err);
// Set kernel arguments
err = clSetKernelArg(kernel_pi, 0, sizeof(int), &niters);
err |= clSetKernelArg(kernel_pi, 1, sizeof(float), &step_size);
err |= clSetKernelArg(kernel_pi, 2, sizeof(float) * work_group_size, NULL);
…
// Execute the kernel over the entire range of our 1D input data set
// using the maximum number of work items for this device
size_t global = nwork_groups * work_group_size;
size_t local = work_group_size;
double rtime = wtime();
err = clEnqueueNDRangeKernel( commands, kernel_pi, 1, NULL,
&global, &local, 0, NULL, NULL);
if (err != CL_SUCCESS)
...
err = clEnqueueReadBuffer( commands, d_partial_sums, CL_TRUE,
0, sizeof(float) * nwork_groups, h_psum, 0, NULL, NULL);
…
PI formula in OpenCL - kernel

__kernel void pi( const int niters, const float step_size,
__local float* local_sums, __global float* partial_sums)
{
int num_wrk_items = get_local_size(0);
int local_id = get_local_id(0);
int group_id = get_group_id(0);
float x, accum = 0.0f;
int i,istart,iend;
istart = (group_id * num_wrk_items + local_id) * niters;
iend = istart+niters;
for(i= istart; i<iend; i++)
{
x = (i+0.5f)*step_size;
accum += 4.0f/(1.0f+x*x);
}
local_sums[local_id] = accum;
barrier(CLK_LOCAL_MEM_FENCE);
reduce(local_sums, partial_sums);
}
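The kernel ends by calling reduce(), which is defined elsewhere in pi_ocl.cl and not shown in the excerpt. In the referenced solution it performs the per-work-group reduction roughly along these lines (a sketch, not the verbatim source); the host then sums the per-group results in h_psum and multiplies by the step size:

// Sketch of the helper the kernel calls: work-item 0 of each work-group
// sums that group's local partial results and writes one float per group.
void reduce(__local float* local_sums, __global float* partial_sums)
{
    int num_wrk_items = get_local_size(0);
    int local_id      = get_local_id(0);
    int group_id      = get_group_id(0);

    if (local_id == 0) {
        float sum = 0.0f;
        for (int i = 0; i < num_wrk_items; i++)
            sum += local_sums[i];
        partial_sums[group_id] = sum;
    }
}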
OpenCL Communication
[Diagram: the RTOS (CPU) and the GPU exchange data through a shared-memory pool.]
• Data is presented in shared memory
• The compute kernel is loaded into the GPU by the CPU
• The GPU has local memory and global shared memory
High Performance Computer
https://community.freescale.com/docs/DOC-94464
Mini-HPC
• System
• 4 × i.MX6 Quad at 1.2 GHz
• Uses the CPU + GPU
• Hardware:
• 4 1.2 GHz Cortex-A9
• 1 Vivante GC2000 GPU
• 1 GB RAM
• 8 GB SD
• 100 Mbit Ethernet via USB
• Software
• Ubuntu 11.10 Linaro Linux
• OpenCL driver: Vivante GC2000
• GCC 4.6.1
• MPI Parallel Compute
• Results
• 100 GFLOPS
• 15 Watts
Cilk/Cilk++
• Shared memory between compute cores (one address space)
• Needs compiler support (a sketch follows)
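As a hedged sketch of the same PI computation with Cilk task parallelism (assuming a Cilk-capable compiler, e.g. GCC with -fcilkplus where available, or the Intel compiler; the decomposition mirrors the OpenMP example, with each spawned task writing to its own slot so no locks or reducers are needed):

/* Sketch: PI with Cilk task parallelism. Each spawned task fills its own
   slot in sums[], so the tasks never touch the same memory location. */
#include <stdio.h>
#include <cilk/cilk.h>

#define NUM_TASKS 4
static long num_steps = 100000;

static void partial(double *out, long start, long count, double step)
{
    double sum = 0.0;
    for (long i = start; i < start + count; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    *out = sum;
}

int main(void)
{
    double step = 1.0 / (double)num_steps;
    long chunk = num_steps / NUM_TASKS;
    double sums[NUM_TASKS];
    double total = 0.0;

    for (int t = 0; t < NUM_TASKS; t++)
        cilk_spawn partial(&sums[t], (long)t * chunk, chunk, step);  /* run tasks in parallel */
    cilk_sync;                                                       /* wait for all tasks    */

    for (int t = 0; t < NUM_TASKS; t++)
        total += sums[t];
    printf(" PI is %f\n", step * total);
    return 0;
}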
Conclusion
Compute-intensive applications are here
• SMP / AMP are very different approaches
• A hybrid may help optimize system performance
• MCAPI/MRAPI – good for AMP between OS instances
• Proprietary – similar to MCAPI but dependent on the provider
• OpenMP – easiest to implement; good for SMP
• OpenCL – high performance; needs tuning
• Cilk/Cilk++ – early days for PowerPC/ARM; stay tuned
Contact Us
To learn more, visit Wind River at http://www.windriver.com
Email: [email protected]
Wind River Sina Weibo: @Wind River, http://weibo.com/windriverchina
Beijing Office Tel:010-84777100
Shanghai Office Tel:021-63585586/87/89/90
Shenzhen Office Tel:0755-25333408/3418/4508/4518
Xi’an Office Tel:029-87607208
Chengdu Office Tel:028-65318000