
Source: cache.freescale.com/files/training/doc/ftf/2014/FTF-SDS-F0598.pdf

How to leverage Multicore Architecture for Compute Intensive Applications – FTF-SDS-F0598

Huang Yun

Wind River Confidential – NDA Disclosure.


Agenda

Freescale hardware

– QorIQ

– i.MX6

– SMP/AMP

Multi-core Software Architectures

– MCAPI/MRAPI

– OpenMP

– OpenCL

– Cilk/Cilk++

– Proprietary

© 2014 Wind River


QorIQ



QorIQ T4240 CPU architecture

12 CPU Cores – e6500, 64-bit

– 1.8 GHz at 1 V

– Dual-threaded, giving 24 hardware threads

– Hardware virtualization

L1 cache shared between the two threads of each core

L2 cache shared by a cluster of 4 cores


i.MX6 Quad

4 CPU Cores – ARM Cortex A9 - 32 bit

– 1.2 GHz

L1: 32 KB I + 32 KB D cache per core

L2: 1 MB shared cache

Hardware Graphics Accelerator

– OpenGL & OpenCL capable


Maximizing Multi-core Benefits

The potential of multi-core platforms to deliver increased performance with less power consumption is not a guaranteed outcome.

Successfully mapping your single-core application to multi-core architectures is a journey challenged by what you don’t know…

Not all operating environments are created equal when it comes to configuration options that yield maximum performance for your specific applications on your chosen multi-core platform


Single to Multi-Core

[Diagram: three independent single-core systems, each running its own OS and application (App 1/Sub 1 through App 3/Sub 3), consolidated onto one multi-core device where a single OS schedules App 1 through App 8 across Cores 1–8.]


Multi-Core Architecture: SMP

Symmetric multiprocessing (SMP)

• Many computing resources for OS and applications to share

• Single RTOS and scheduler

• Priority based assumptions might cause timing issues

[Diagram: a single SMP OS spanning Cores 1–8, scheduling App 1 through App 8.]

Best suited for

• Heavy processing tasks such as data manipulation and image processing

Not as suitable for

• Hard real-time response requirements


Multi-Core Architecture: uAMP

Unsupervised asymmetric multiprocessing (uAMP)

• The same or different copies of an RTOS run on the cores in an unsupervised AMP environment

• OS and applications do not share computing resources

[Diagram: Cores 1–8 each run their own AMP OS instance and application (App 1 through App 8).]

Best suited for

• Small independent deterministic tasks

Not as suitable for

• Heavy processing tasks


Multi-Core Architecture: Mixed

SMP and uAMP

• An SMP operating system controls the first couple of cores, while the rest of the cores run unsupervised AMP images

• AMP OS instances do not have to be the same

[Diagram: an SMP OS spans Cores 1–2 running App 1, while Cores 3–8 each run an independent AMP instance (App 2 through App 8).]

Best suited for

• Consolidation that brings a mix of tasks onto one platform


Provisioning the system

System Resources

• Which CPUs belong to which OS domains

• Where to map memory, both RAM and flash

• Interrupts – which interrupts are handled by which cores

• Devices – which devices provide connectivity to each OS

Take Full Advantage of Multicore with Multi-OS

[Diagram: OS 0 owns CPUs 0–1, OS 1 owns CPU 2, OS 2 owns CPU 3; each OS domain has its own memory region, interrupts, and devices.]


Inter-Process Communication

[Diagram: a GPOS and an RTOS exchange commands and data through a shared-memory pool; each side sends from and receives into its own memory.]

Inter-Process Communication

• Proprietary

• Roll your own

• Use MCAPI / MRAPI

System Resources

• Interrupt

• Shared Memory


Inter-Process Communication


MCAPI (Multicore Communications API)

• Node: CPU, OS or Process/Thread instance

• Endpoint: Connected / Connectionless

• Channel: Scalar or Datagram



MCAPI / MRAPI


MCAPI (Multicore Communications API)

• Node: CPU, OS or Process/Thread instance

• Endpoint: Connected / Connectionless

• Channel: Scalar or Datagram

Sequence of events

1. Define Topology

- Nodes

- Endpoints

2. Create channels

- Connected

- Connectionless

3. Send/Receive Data

MRAPI (Multicore Resource API)

• Shared Memory

• Shared Semaphores

• Interrupts
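The three-step sequence maps onto MCA-defined calls roughly as follows. This is pseudocode: the function names follow the MCAPI specification, but the argument lists are simplified and vary across MCAPI versions, so treat them as illustrative.

```
/* 1. Define topology: join as a node, create/look up endpoints */
mcapi_initialize(NODE_ID, ...);
ep_local  = mcapi_endpoint_create(LOCAL_PORT, ...);
ep_remote = mcapi_endpoint_get(REMOTE_NODE, REMOTE_PORT, ...);

/* 2. Create channels: connectionless messaging needs no setup;
 *    connected scalar/packet channels are opened explicitly */

/* 3. Send/receive data */
mcapi_msg_send(ep_local, ep_remote, buf, len, prio, &status);
mcapi_msg_recv(ep_local, buf, sizeof buf, &received, &status);

mcapi_finalize(...);
```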


OpenMP


• Shared Memory between compute nodes

• Can use Pthreads underneath


PI formula in C – Single Threaded Hello World for Parallel Programming

#include <stdio.h>

static long num_steps = 100000;
double step;

int main()
{
    int i;
    double x, pi, sum = 0.0;

    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++)
    {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf("PI is %f\n", pi);
    return 0;
}


PI formula in C – OpenMP

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;
double step;

#define PAD 8                        /* pad each thread's slot to avoid false sharing */
static int num_threads = 4;
static long thrd_step;

int main()
{
    double pi = 0.0;
    double sum = 0.0;
    double my_sum[num_threads][PAD];  /* one padded partial sum per thread */
    int j;
    double start_time, end_time;

    step = 1.0/(double) num_steps;
    thrd_step = num_steps / num_threads;
    omp_set_num_threads(num_threads);
    start_time = omp_get_wtime();

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        int i;
        double x;
        int startat = ID * thrd_step;

        my_sum[ID][0] = 0.0;
        for (i = startat; i < startat + thrd_step; i++)
        {
            x = (i+0.5)*step;
            my_sum[ID][0] += 4.0/(1.0+x*x);
        }
    } /* end of parallel region */

    for (j = 0; j < num_threads; j++)
        sum += my_sum[j][0];
    pi = step * sum;
    end_time = omp_get_wtime();
    printf("PI is %f (%f s)\n", pi, end_time - start_time);
    return 0;
}


OpenCL


OpenCL – Open Computing Language

• Khronos standard, originally initiated by Apple

• Used with Symmetric cores

• Used with GPGPU (General Purpose GPU)

• Need OpenCL drivers for GPU


PI formula in OpenCL – C

int main(void)

{

char *kernelsource = getKernelSource("../pi_ocl.cl"); // Kernel source

cl_int err;

cl_device_id device_id; // compute device id

cl_context context; // compute context

cl_command_queue commands; // compute command queue

cl_program program; // compute program

cl_kernel kernel_pi; // compute kernel

// Set up OpenCL context, queue, kernel, etc.

cl_uint numPlatforms; // Find number of platforms

err = clGetPlatformIDs(0, NULL, &numPlatforms);

// Get all platforms

cl_platform_id Platform[numPlatforms];

err = clGetPlatformIDs(numPlatforms, Platform, NULL);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c


PI formula in OpenCL – C

// Secure a device

for (int i = 0; i < numPlatforms; i++)

{

err = clGetDeviceIDs(Platform[i], DEVICE, 1, &device_id, NULL);

}

// Output information

err = output_device_info(device_id); // Create a compute context

context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

// Create a command queue

commands = clCreateCommandQueue(context, device_id, 0, &err);

// Create the compute program from the source buffer

program = clCreateProgramWithSource(context, 1, (const char **)

& kernelsource, NULL, &err);

// Build the program

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c


PI formula in OpenCL – C

// Create the compute kernel from the program

kernel_pi = clCreateKernel(program, "pi", &err);

// Find kernel work-group size

err = clGetKernelWorkGroupInfo (kernel_pi, device_id,

CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &work_group_size, NULL);

// Now that we know the size of the work-groups, we can set the number of

// work-groups, the actual number of steps, and the step size

nwork_groups = in_nsteps/(work_group_size*niters);

if (nwork_groups < 1)

{ err = clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,

sizeof(size_t), &nwork_groups, NULL);

work_group_size = in_nsteps / (nwork_groups * niters);

}

nsteps = work_group_size * niters * nwork_groups;

d_partial_sums = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) *

nwork_groups, NULL, &err);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c


PI formula in OpenCL – C

// Set kernel arguments

err = clSetKernelArg(kernel_pi, 0, sizeof(int), &niters);

err |= clSetKernelArg(kernel_pi, 1, sizeof(float), &step_size);

err |= clSetKernelArg(kernel_pi, 2, sizeof(float) * work_group_size, NULL);

// Execute the kernel over the entire range of our 1D input data set

// using the maximum number of work items for this device

size_t global = nwork_groups * work_group_size;

size_t local = work_group_size;

double rtime = wtime();

err = clEnqueueNDRangeKernel(commands, kernel_pi, 1, NULL,
                             &global, &local, 0, NULL, NULL);
if (err != CL_SUCCESS)

...

err = clEnqueueReadBuffer( commands, d_partial_sums, CL_TRUE,

0, sizeof(float) * nwork_groups, h_psum, 0, NULL, NULL);

https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c


PI formula in OpenCL – kernel

__kernel void pi( const int niters, const float step_size,

__local float* local_sums, __global float* partial_sums)

{

int num_wrk_items = get_local_size(0);

int local_id = get_local_id(0);

int group_id = get_group_id(0);

float x, accum = 0.0f;

int i,istart,iend;

istart = (group_id * num_wrk_items + local_id) * niters;

iend = istart+niters;

for(i= istart; i<iend; i++)

{

x = (i+0.5f)*step_size;

accum += 4.0f/(1.0f+x*x);

}

local_sums[local_id] = accum;

barrier(CLK_LOCAL_MEM_FENCE);

reduce(local_sums, partial_sums);

}
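The reduce() helper called on the kernel's last line is not shown on the slide. In the referenced HandsOnOpenCL solution it is a simple work-group reduction along these lines (OpenCL device code; a sketch, not the verbatim file):

```opencl
void reduce(__local float* local_sums, __global float* partial_sums)
{
    int num_wrk_items = get_local_size(0);
    int local_id      = get_local_id(0);
    int group_id      = get_group_id(0);
    float sum = 0.0f;
    int i;

    /* Work-item 0 of each group sums the group's local results and
     * writes one partial sum per group back to global memory; the
     * host then adds the partial sums. */
    if (local_id == 0) {
        for (i = 0; i < num_wrk_items; i++)
            sum += local_sums[i];
        partial_sums[group_id] = sum;
    }
}
```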


OpenCL Communication

[Diagram: the RTOS (on the CPU) and the GPU share a memory pool; the CPU sends commands and data through shared memory, and the GPU receives them.]

• Data is presented in shared memory

• The compute kernel is loaded into the GPU by the CPU

• The GPU has its own local memory plus global shared memory


High Performance Computer

https://community.freescale.com/docs/DOC-94464

Mini-HPC

• System

• 4× i.MX6 Quad at 1.2 GHz

• Uses the CPU + GPU

• Hardware:

• 4 Cortex-A9 cores at 1.2 GHz

• 1 Vivante GC2000 GPU

• 1 GB RAM

• 8 GB SD

• 100 Mbit Ethernet via USB

• Software

• Ubuntu 11.10 Linaro Linux

• OpenCL driver: Vivante GC2000

• GCC 4.6.1

• MPI Parallel Compute

• Results

• 100 GFLOPS

• 15 Watts


Cilk/Cilk++


• Shared Memory between compute nodes

• Needs compiler support
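A flavor of the Cilk model is the classic recursive fibonacci, where the spawn keyword exposes parallelism with almost no code change. A sketch, assuming a Cilk-enabled compiler (per the compiler-support note above), so it is shown for illustration rather than as a drop-in build:

```c
#include <cilk/cilk.h>

/* cilk_spawn lets fib(n-1) run in parallel with the caller's fib(n-2);
 * cilk_sync waits for the spawned child before combining the results. */
int fib(int n)
{
    if (n < 2)
        return n;
    int x = cilk_spawn fib(n - 1);
    int y = fib(n - 2);
    cilk_sync;
    return x + y;
}
```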


Conclusion


Compute Intensive Applications are here

• SMP / AMP are very different approaches

• Hybrid may help to optimize system performance

• MCAPI/MRAPI – Good for AMP between OS instances

• Proprietary – Similar to MCAPI, but dependent on the provider

• OpenMP – Easiest to implement; good for SMP

• OpenCL – High performance; needs tuning

• Cilk/Cilk++ – Early days on PowerPC/ARM. Stay tuned.


Contact Us

To learn more, visit Wind River at http://www.windriver.com

Email: [email protected]

Wind River Sina Weibo: @Wind River, http://weibo.com/windriverchina

Beijing Office Tel:010-84777100

Shanghai Office Tel:021-63585586/87/89/90

Shenzhen Office Tel:0755-25333408/3418/4508/4518

Xi’an Office Tel:029-87607208

Chengdu Office Tel:028-65318000