How to leverage Multicore Architecture for Compute Intensive Applications – FTF-SDS-F0598
Huang Yun
Wind River Confidential – NDA Disclosure.
Agenda
Freescale hardware
– QorIQ
– i.MX6
– SMP/AMP
Multi-core Software Architectures
– MCAPI/MRAPI
– OpenMP
– OpenCL
– Cilk/Cilk++
– Proprietary
QorIQ
QorIQ T4240 CPU architecture
12 CPU cores – 64-bit e6500
– 1.8 GHz at 1 V
– Dual-threaded, for 24 hardware threads
– Hardware virtualization
L1 cache shared between the two threads of each core
L2 cache shared by each cluster of 4 cores
i.MX6 Quad
4 CPU cores – 32-bit ARM Cortex-A9
– 1.2 GHz
L1: 32 KB I-cache & 32 KB D-cache per core
L2: 1 MB shared cache
Hardware graphics accelerator
– OpenGL & OpenCL capable
Maximizing Multi-core Benefits
Multi-core platforms can deliver more performance at lower power, but that outcome is not guaranteed.
Successfully mapping a single-core application onto a multi-core architecture is a journey challenged by what you don't know.
Operating environments are not created equal: their configuration options determine whether your specific applications reach maximum performance on your chosen multi-core platform.
Single to Multi-Core
[Diagram: on single-core, each application (App 1, App 2, App 3, …) runs on its own subsystem with its own OS and single core; on multi-core, Cores 1–8 share one OS that hosts App 1 through App 8.]
Multi-Core Architecture: SMP
Symmetric multiprocessing (SMP)
• Many computing resources for OS and applications to share
• Single RTOS and scheduler
• Priority based assumptions might cause timing issues
[Diagram: Cores 1–8 under a single SMP OS running App 1 through App 8.]
Best suited for
• Heavy processing tasks such as data manipulation and image processing
Not as suitable for
• Hard real-time response requirements
Multi-Core Architecture: uAMP
Unsupervised asymmetric multiprocessing (uAMP)
• Same or different copies of an RTOS are running on all cores in an unsupervised AMP environment
• OS and applications do not share computing resources
[Diagram: each of Cores 1–8 runs its own AMP OS instance with its own application (App 1 through App 8).]
Best suited for
• Small independent deterministic tasks
Not as suitable for
• Heavy processing tasks
Multi-Core Architecture: Mixed
SMP and uAMP
• An SMP operating system controls the first couple of cores, while the rest of the cores run unsupervised AMP images
• AMP OS instances do not have to be the same
[Diagram: Cores 1 and 2 run an SMP OS hosting App 1; each of the remaining cores (3–8) runs its own AMP OS instance with its own application (App 2 … App 8).]
Best suited for
• Consolidation that brings mix of tasks into one platform
Provisioning the system
System Resources (an illustrative partition plan is sketched below)
• Which CPUs belong to which OS domains
• Where to map memory, both RAM and flash
• Interrupts – which interrupts are handled by which cores
• Devices – which devices provide connectivity to each OS
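Purely as an illustration of what such a partition plan captures (every name, CPU mask, address, and interrupt/device list below is hypothetical, not taken from any real board support package), the provisioning decisions can be written down as a simple table in C before being translated into device trees or hypervisor configuration:

/* Illustrative partition plan for a hypothetical three-domain system.
   All names, CPU masks, addresses and interrupt/device lists are invented. */
#include <stdint.h>
#include <stdio.h>

struct os_domain {
    const char *name;      /* OS instance */
    unsigned    cpu_mask;  /* which CPUs belong to this OS domain */
    uint64_t    ram_base;  /* start of this domain's private RAM */
    uint64_t    ram_size;  /* size of this domain's private RAM */
    const char *irqs;      /* interrupts routed to this domain */
    const char *devices;   /* devices owned by this domain */
};

static const struct os_domain plan[] = {
    { "OS 0 (SMP GPOS)", 0x3, 0x00000000, 256u << 20, "timer, eth0",  "eth0, sd0" },
    { "OS 1 (RTOS)",     0x4, 0x10000000, 128u << 20, "timer, can0",  "can0"      },
    { "OS 2 (RTOS)",     0x8, 0x18000000, 128u << 20, "timer, uart1", "uart1"     },
};

int main(void)
{
    for (size_t i = 0; i < sizeof plan / sizeof plan[0]; i++)
        printf("%-16s cpus=0x%x ram=0x%08llx (%llu MB) irqs=[%s] devices=[%s]\n",
               plan[i].name, plan[i].cpu_mask,
               (unsigned long long)plan[i].ram_base,
               (unsigned long long)(plan[i].ram_size >> 20),
               plan[i].irqs, plan[i].devices);
    return 0;
}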
Take Full Advantage of Multicore with Multi-OS
[Diagram: CPUs 0 and 1 run OS 0 with Memory 0 and their own interrupts and devices; CPU 2 runs OS 1 with Memory 1, interrupts and devices; CPU 3 runs OS 2 with Memory 2, interrupts and devices; the OS instances are linked by inter-process communication.]
Inter-Process Communication
[Diagram: a GPOS and an RTOS, each with its own memory, exchange commands and data by sending and receiving through a shared-memory pool.]
• Proprietary
• Roll your own (sketched below)
• Use MCAPI / MRAPI
System Resources
• Interrupt
• Shared Memory
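To make the "roll your own" option concrete, here is a minimal sketch of a proprietary shared-memory mailbox between two OS instances. The base address, layout, and polling loop are all assumptions for illustration; a production design would add cache maintenance and raise an interrupt rather than poll (__sync_synchronize() is a GCC builtin memory barrier):

/* "Roll your own" IPC: a one-slot mailbox in shared memory.
   SHM_BASE and the layout are hypothetical; real code must match the
   memory map provisioned for both OS domains. */
#include <stdint.h>

#define SHM_BASE 0x20000000UL                     /* hypothetical shared window */
#define MBOX     ((volatile struct mailbox *)SHM_BASE)

struct mailbox {
    uint32_t cmd;        /* command word written by the sender  */
    uint32_t len;        /* number of valid bytes in data[]     */
    uint32_t ready;      /* 1 = message pending, 0 = consumed   */
    uint8_t  data[244];  /* payload                             */
};

/* Sender side (e.g. the GPOS). The receiver clears 'ready' when done. */
static void mbox_send(uint32_t cmd, const void *buf, uint32_t len)
{
    while (MBOX->ready)                   /* wait until the slot is free */
        ;
    for (uint32_t i = 0; i < len; i++)
        MBOX->data[i] = ((const uint8_t *)buf)[i];
    MBOX->cmd = cmd;
    MBOX->len = len;
    __sync_synchronize();                 /* publish payload before flag */
    MBOX->ready = 1;                      /* signal the receiving OS     */
}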
MCAPI / MRAPI
MCAPI (Multicore Communications API)
• Node: CPU, OS or Process/Thread instance
• Endpoint: Connected / Connectionless
• Channel: Scalar or Datagram
Sequence of events
1. Define Topology
- Nodes
- Endpoints
2. Create channels
- Connected
- Connectionless
3. Send/Receive Data (see the sketch below)
MRAPI (Multicore Resource API)
• Shared Memory
• Shared Semaphores
• Interrupts
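A minimal sketch of the sending side of that sequence, assuming MCAPI 2.0-style signatures from <mcapi.h>; exact prototypes, constants, and the domain/node/port numbering below vary by MCAPI version and vendor implementation and are only illustrative here:

/* Hedged sketch: connectionless MCAPI messaging from this node to a peer.
   Assumes MCAPI 2.0-style calls; check your vendor's mcapi.h for exact forms. */
#include <mcapi.h>
#include <string.h>

#define MY_DOMAIN  0   /* hypothetical topology: one domain          */
#define MY_NODE    0   /* this OS instance is node 0                 */
#define MY_PORT    1   /* local endpoint is (node 0, port 1)         */
#define PEER_NODE  1   /* the RTOS side is node 1                    */
#define PEER_PORT  1   /* remote endpoint is (node 1, port 1)        */

void send_command(const char *cmd)
{
    mcapi_status_t status;
    mcapi_info_t   info;

    /* 1. Define the topology: join as (domain, node). */
    mcapi_initialize(MY_DOMAIN, MY_NODE, NULL, NULL, &info, &status);

    /* 2. Create the local endpoint and look up the peer's endpoint. */
    mcapi_endpoint_t local  = mcapi_endpoint_create(MY_PORT, &status);
    mcapi_endpoint_t remote = mcapi_endpoint_get(MY_DOMAIN, PEER_NODE, PEER_PORT,
                                                 MCA_INFINITE /* timeout; constant name varies */,
                                                 &status);

    /* 3. Send a connectionless (datagram) message to the peer. */
    mcapi_msg_send(local, remote, (void *)cmd, strlen(cmd) + 1,
                   1 /* priority */, &status);

    mcapi_finalize(&status);
}

On the receiving node, the mirror-image calls (mcapi_endpoint_create on its own port, then mcapi_msg_recv) complete the exchange; connected scalar or packet channels follow the same pattern with additional channel open/connect calls.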
OpenMP
• Shared memory between compute cores (one address space)
• Can use Pthreads underneath
PI formula in C – single-threaded (the "Hello World" of parallel programming)

#include <stdio.h>

static long num_steps = 100000;
double step;

int main()
{
    int i; double x, pi, sum = 0.0;
    step = 1.0/(double) num_steps;
    for (i = 0; i < num_steps; i++)
    {
        x = (i+0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }
    pi = step * sum;
    printf (" PI is %f\n", pi);
    return 0;
}
PI formula in C - OpenMP

#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;
double step;
#define PAD 8               /* pad each thread's slot to avoid false sharing */
static int num_threads = 4;
static long thrd_step;

int main()
{
    double pi = 0.0;
    double sum = 0.0;
    double my_sum[num_threads][PAD];
    int j;
    double start_time;
    double end_time;

    step = 1.0/(double) num_steps;
    thrd_step = num_steps / num_threads;
    omp_set_num_threads(num_threads);
    start_time = omp_get_wtime();

    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        int i;
        double x;
        int startat = ID * thrd_step;
        my_sum[ID][0] = 0.0;
        for (i = startat; i < startat+thrd_step; i++)
        {
            x = (i+0.5)*step;
            my_sum[ID][0] += 4.0/(1.0+x*x);
        }
    } // end of parallel region

    for (j = 0; j < num_threads; j++)
        sum += my_sum[j][0];
    pi = step * sum;
    end_time = omp_get_wtime();
    printf (" PI is %f (%f s)\n", pi, end_time - start_time);
    return 0;
}
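The manual decomposition above (per-thread partial sums, padded to avoid false sharing) is exactly what OpenMP's reduction clause automates; a shorter sketch of the same computation:

/* Same PI computation using an OpenMP reduction: the runtime gives each
   thread a private copy of 'sum' and combines the copies after the loop. */
#include <stdio.h>
#include <omp.h>

static long num_steps = 100000;

int main(void)
{
    double step = 1.0 / (double)num_steps;
    double sum = 0.0;
    long i;

    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < num_steps; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }

    printf(" PI is %f\n", step * sum);
    return 0;
}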
OpenCL
OpenCL – Open Computing Language
• Khronos standard – started by Apple in 2008
• Used with symmetric CPU cores
• Used with GPGPU (General Purpose GPU)
• Needs an OpenCL driver for the GPU
PI formula in OpenCL - C

int main(void)
{
…
char *kernelsource = getKernelSource("../pi_ocl.cl"); // Kernel source
cl_int err;
cl_device_id device_id; // compute device id
cl_context context; // compute context
cl_command_queue commands; // compute command queue
cl_program program; // compute program
cl_kernel kernel_pi; // compute kernel
// Set up OpenCL context, queue, kernel, etc.
cl_uint numPlatforms; // Find number of platforms
err = clGetPlatformIDs(0, NULL, &numPlatforms);
…
// Get all platforms
cl_platform_id Platform[numPlatforms];
err = clGetPlatformIDs(numPlatforms, Platform, NULL);
…
https://raw.githubusercontent.com/HandsOnOpenCL/Exercises-Solutions/master/Solutions/Exercise09/C/pi_ocl.c
// Secure a device
for (int i = 0; i < numPlatforms; i++)
{
err = clGetDeviceIDs(Platform[i], DEVICE, 1, &device_id, NULL);
}
// Output information
err = output_device_info(device_id);
// Create a compute context
context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
…
// Create a command queue
commands = clCreateCommandQueue(context, device_id, 0, &err);
…
// Create the compute program from the source buffer
program = clCreateProgramWithSource(context, 1, (const char **)
&kernelsource, NULL, &err);
// Build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
// Create the compute kernel from the program
kernel_pi = clCreateKernel(program, "pi", &err);
// Find kernel work-group size
err = clGetKernelWorkGroupInfo (kernel_pi, device_id,
CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &work_group_size, NULL);
// Now that we know the size of the work-groups, we can set the number of
// work-groups, the actual number of steps, and the step size
nwork_groups = in_nsteps/(work_group_size*niters);
if (nwork_groups < 1)
{ err = clGetDeviceInfo(device_id, CL_DEVICE_MAX_COMPUTE_UNITS,
sizeof(size_t), &nwork_groups, NULL);
work_group_size = in_nsteps / (nwork_groups * niters);
}
nsteps = work_group_size * niters * nwork_groups;
…
d_partial_sums = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) *
nwork_groups, NULL, &err);
// Set kernel arguments
err = clSetKernelArg(kernel_pi, 0, sizeof(int), &niters);
err |= clSetKernelArg(kernel_pi, 1, sizeof(float), &step_size);
err |= clSetKernelArg(kernel_pi, 2, sizeof(float) * work_group_size, NULL);
…
// Execute the kernel over the entire range of our 1D input data set
// using the maximum number of work items for this device
size_t global = nwork_groups * work_group_size;
size_t local = work_group_size;
double rtime = wtime();
err = clEnqueueNDRangeKernel( commands, kernel_pi, 1, NULL,
&global, &local, 0, NULL, NULL);
if (err != CL_SUCCESS)
...
err = clEnqueueReadBuffer( commands, d_partial_sums, CL_TRUE,
0, sizeof(float) * nwork_groups, h_psum, 0, NULL, NULL);
…
PI formula in OpenCL - kernel

__kernel void pi( const int niters, const float step_size,
__local float* local_sums, __global float* partial_sums)
{
int num_wrk_items = get_local_size(0);
int local_id = get_local_id(0);
int group_id = get_group_id(0);
float x, accum = 0.0f;
int i,istart,iend;
istart = (group_id * num_wrk_items + local_id) * niters;
iend = istart+niters;
for(i= istart; i<iend; i++)
{
x = (i+0.5f)*step_size;
accum += 4.0f/(1.0f+x*x);
}
local_sums[local_id] = accum;
barrier(CLK_LOCAL_MEM_FENCE);
reduce(local_sums, partial_sums);
}
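The kernel ends by calling reduce(), which is defined elsewhere in pi_ocl.cl and not shown in the excerpt. In the referenced solution it performs the per-work-group reduction roughly along these lines (a sketch, not the verbatim source); the host then sums the per-group results in h_psum and multiplies by the step size:

// Sketch of the helper the kernel calls: work-item 0 of each work-group
// sums that group's local partial results and writes one float per group.
void reduce(__local float* local_sums, __global float* partial_sums)
{
    int num_wrk_items = get_local_size(0);
    int local_id      = get_local_id(0);
    int group_id      = get_group_id(0);

    if (local_id == 0) {
        float sum = 0.0f;
        for (int i = 0; i < num_wrk_items; i++)
            sum += local_sums[i];
        partial_sums[group_id] = sum;
    }
}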
OpenCL Communication
[Diagram: the RTOS (CPU) and the GPU exchange data through a shared-memory pool.]
• Data is presented in shared memory
• The compute kernel is loaded into the GPU by the CPU
• The GPU has local memory and global shared memory
High Performance Computer
https://community.freescale.com/docs/DOC-94464
Mini-HPC
• System
• 4 × i.MX6 Quad at 1.2 GHz
• Uses the CPU + GPU
• Hardware:
• 4 1.2 GHz Cortex-A9
• 1 Vivante GC2000 GPU
• 1 GB RAM
• 8 GB SD
• 100 Mbit Ethernet via USB
• Software
• Ubuntu 11.10 Linaro Linux
• OpenCL driver: Vivante GC2000
• GCC 4.6.1
• MPI Parallel Compute
• Results
• 100 GFLOPS
• 15 Watts
Cilk/Cilk++
• Shared memory between compute cores (one address space)
• Needs compiler support (a sketch follows)
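As a hedged sketch of the same PI computation with Cilk task parallelism (assuming a Cilk-capable compiler, e.g. GCC with -fcilkplus where available, or the Intel compiler; the decomposition mirrors the OpenMP example, with each spawned task writing to its own slot so no locks or reducers are needed):

/* Sketch: PI with Cilk task parallelism. Each spawned task fills its own
   slot in sums[], so the tasks never touch the same memory location. */
#include <stdio.h>
#include <cilk/cilk.h>

#define NUM_TASKS 4
static long num_steps = 100000;

static void partial(double *out, long start, long count, double step)
{
    double sum = 0.0;
    for (long i = start; i < start + count; i++) {
        double x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    *out = sum;
}

int main(void)
{
    double step = 1.0 / (double)num_steps;
    long chunk = num_steps / NUM_TASKS;
    double sums[NUM_TASKS];
    double total = 0.0;

    for (int t = 0; t < NUM_TASKS; t++)
        cilk_spawn partial(&sums[t], (long)t * chunk, chunk, step);  /* run tasks in parallel */
    cilk_sync;                                                       /* wait for all tasks    */

    for (int t = 0; t < NUM_TASKS; t++)
        total += sums[t];
    printf(" PI is %f\n", step * total);
    return 0;
}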
Conclusion
Compute-intensive applications are here
• SMP / AMP are very different approaches
• A hybrid may help optimize system performance
• MCAPI/MRAPI – good for AMP between OS instances
• Proprietary – similar to MCAPI but dependent on the provider
• OpenMP – easiest to implement; good for SMP
• OpenCL – high performance; needs tuning
• Cilk/Cilk++ – early days for PowerPC/ARM; stay tuned
Contact Us
To learn more, visit Wind River at http://www.windriver.com
Email: [email protected]
Wind River Sina Weibo: @Wind River, http://weibo.com/windriverchina
Beijing Office Tel:010-84777100
Shanghai Office Tel:021-63585586/87/89/90
Shenzhen Office Tel:0755-25333408/3418/4508/4518
Xi’an Office Tel:029-87607208
Chengdu Office Tel:028-65318000