Introduction to OpenCL*
Ohad Shacham
Intel Software and Services Group
Thanks to Elior Malul, Arik Narkis, and Doron Singer
Copyright© 2013, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos
Evolution of OpenCL*
Sequential Programs
void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main()
{
    // read input
    scalar_mul(…);
    return 0;
}
Evolution of OpenCL*
Multi-threaded Programs
void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

int main()
{
    // read input
    pthread_create(…, scalar_mul);
    scalar_mul(n/2, …);
    pthread_join(…);
    return 0;
}
Problems – concurrent programs
• Writing concurrent programs is hard
• Concurrent algorithms
• Threads
• Work balancing
• Need to update programs when adding new cores to the system
• Data races, livelocks, deadlocks
• Solving bugs in concurrent programs is harder
Evolution of OpenCL*
Vector instruction utilization
void scalar_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    // assumes n is a multiple of 4 and 16-byte aligned arrays
    for (i = 0; i < n; i += 4) {
        __m128 a_vec = _mm_load_ps(a + i);
        __m128 b_vec = _mm_load_ps(b + i);
        __m128 c_vec = _mm_mul_ps(a_vec, b_vec);
        _mm_store_ps(c + i, c_vec);
    }
}

int main()
{
    // read input
    scalar_mul(…);
    return 0;
}
Problems – vector instructions usage
• Utilizing vector instructions is also not a trivial task
• Vendor-dependent code
• Usage is not future proof
• New efficient instructions
• Wider vector registers
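One way to soften the vendor dependence listed above is to guard the intrinsics and keep a scalar fallback. The following is a minimal sketch, assuming a compiler that defines `__SSE__` on x86 targets; the name `array_mul_sse` is hypothetical and does not come from the slides. It uses unaligned loads, so the arrays need no special alignment, and a scalar tail for lengths that are not a multiple of 4.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics, x86 only */
#endif

/* Hypothetical helper: c[i] = a[i] * b[i] for i in [0, n). */
void array_mul_sse(size_t n, const float *a, const float *b, float *c)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Vector part: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 a_vec = _mm_loadu_ps(a + i);
        __m128 b_vec = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_mul_ps(a_vec, b_vec));
    }
#endif
    /* Scalar tail, or the whole loop when SSE is unavailable. */
    for (; i < n; i++)
        c[i] = a[i] * b[i];
}
```

Even with the guard, moving to wider registers or newer instructions still means adding another code path by hand, which is the maintenance cost the slide points at.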
GPGPU
GPGPU stands for General-Purpose computation on Graphics Processing Units (GPUs). GPUs are high-performance many-core processors that can be used to accelerate a wide range of applications
(www.gpgpu.org)
Photo taken from: http://folding.stanford.edu/English/FAQ-NVIDIA
GPU utilization
• Many cores can be utilized for computation
• GPUs become programmable - GPGPU
• CUDA*
• Problems
• Each vendor has its own language
• Requires tweaking to get performance
• How can I run on both CPUs and GPUs?
What do we need?
• Heterogeneous
• Automatically utilizes all available processing units
• Portable
• High performance
• Utilizes hardware characteristics
• Future Proof
• Abstract concurrency from the user
OpenCL* – heterogeneous computing
Diagram based on deck presented in OpenCL* BOF at SIGGRAPH 2010 by Neil Trevett, NVIDIA, OpenCL* Chair
OpenCL* in a nutshell
An OpenCL* application consists of two parts:
• A set of C APIs that allows compiling and running OpenCL* “kernels”
• Code that is executed on the device by the OpenCL* runtime
Data parallelism
A fundamental pattern in high-performance parallel algorithms
Applying same computation logic across multiple data elements
[Figure: a sequential loop (i = 0; C[i] = A[i] * B[i]; i = i + 1) contrasted with N independent instances of C[i] = A[i] * B[i], one for each i = 0, 1, 2, 3, …, N-2, N-1]
Data parallelism usage
Client machines
• Video transcoding and editing
• Pro image editing
• Facial recognition
Workstations
• CAD tools
• 3D data content creation
Servers
• Science and simulations
• Medical imaging
• Oil & Gas
• Finance (e.g., Black-Scholes)
• …
OpenCL* kernel example
void array_mul(int n, const float *a, const float *b, float *c)
{
    int i;
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];
}

__kernel void array_mul(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}
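To make the mapping between the loop and the kernel concrete, the following plain C sketch emulates what a runtime conceptually does: it calls the kernel body once per global id. This is not OpenCL* API code; `array_mul_body` and `run_array_mul` are hypothetical names, and a real runtime is free to execute the instances in any order or in parallel.

```c
#include <stddef.h>

/* Kernel body for one work item; gid plays the role of get_global_id(0). */
static void array_mul_body(size_t gid, const float *a, const float *b, float *c)
{
    c[gid] = a[gid] * b[gid];
}

/* Emulated launch: one call per index in the global range [0, global_size). */
void run_array_mul(size_t global_size, const float *a, const float *b, float *c)
{
    for (size_t gid = 0; gid < global_size; gid++)
        array_mul_body(gid, a, b, c);
}
```

The loop bound `n` of the sequential version does not appear in the kernel at all; it becomes the size of the index space chosen at launch time.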
OpenCL* kernel example
__kernel void array_mul(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}
[Figure: arrays a, b, and c, with get_global_id(0) selecting the element that each work item processes]
Execution Model
[Figure: the global index space is divided into work groups, and each work group consists of work items]
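The relation between the global index space, work groups, and work items can be sketched in plain C. The example assumes one dimension, a zero global offset, and a global size that is a multiple of the work-group size; `work_item_pos`, `split_global_id`, and `make_global_id` are hypothetical names used only for illustration.

```c
#include <stddef.h>

/* Position of one work item in a one-dimensional index space. */
typedef struct {
    size_t group_id;  /* which work group            */
    size_t local_id;  /* index inside the work group */
} work_item_pos;

/* Split a global id into (group id, local id) for a given work-group size. */
work_item_pos split_global_id(size_t global_id, size_t local_size)
{
    work_item_pos pos;
    pos.group_id = global_id / local_size;
    pos.local_id = global_id % local_size;
    return pos;
}

/* Inverse mapping: global id = group id * local size + local id. */
size_t make_global_id(work_item_pos pos, size_t local_size)
{
    return pos.group_id * local_size + pos.local_id;
}
```

For example, with a work-group size of 4, global id 10 is work item 2 of work group 2.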
The OpenCL* model
• OpenCL* runtime is invoked on the host CPU (using the OpenCL* API)
– Choose target device(s) for parallel computation
• Data-parallel functions, called kernels, are compiled (on the host)
• Compiled for specific target devices (CPU, GPU, etc.)
• Data chunks (called buffers) are moved across devices
• Kernel “commands” are queued for execution on target devices
– Asynchronous execution
The OpenCL* C language
• Derived from ISO C99
• A few restrictions, e.g., no recursion and no function pointers
• Short vector types, e.g., float4, short2, int16
• Built-in functions: math (e.g., sin), geometric, common (e.g., min, clamp)
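The behavior of a short vector type and of a common built-in such as clamp can be illustrated in plain C. This is only an emulation under stated assumptions: `float4_t`, `mul4`, `clampf`, and `clamp4` are hypothetical stand-ins, whereas OpenCL C provides `float4`, component-wise operators, and `clamp` natively.

```c
/* Plain C stand-in for the OpenCL C float4 type (hypothetical name). */
typedef struct { float x, y, z, w; } float4_t;

/* Component-wise product, as a * b behaves for float4 operands in OpenCL C. */
float4_t mul4(float4_t a, float4_t b)
{
    float4_t r = { a.x * b.x, a.y * b.y, a.z * b.z, a.w * b.w };
    return r;
}

/* Scalar clamp: limit v to the range [lo, hi]. */
float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Clamp every component of v to [lo, hi]. */
float4_t clamp4(float4_t v, float lo, float hi)
{
    float4_t r = { clampf(v.x, lo, hi), clampf(v.y, lo, hi),
                   clampf(v.z, lo, hi), clampf(v.w, lo, hi) };
    return r;
}
```

In a kernel the same computation would be written directly on `float4` values, leaving the choice of vector instructions to the device compiler.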
OpenCL* key features
Unified programming model for all devices
• Develop once, run everywhere
Designed for massive data-parallelism
• Implicitly takes care of threading and intrinsics for optimal performance
OpenCL* key features
Dynamic compilation model (Just In Time - JIT)
• Future proof, provided vendors update their implementations
Enables heterogeneous computing
• A clever application can use all resources of the platform simultaneously
Benefits to User
• Hardware abstraction
• Write once, run everywhere
• Cross devices, cross vendors
• Automatic parallelization
• Good tradeoff between development simplicity and performance
• Future proof optimizations
• Open standard
• Supported by many vendors
Benefits to Hardware Vendor
• Enables good hardware ‘time to market’
• Programming model enables good hardware utilization
• Applications are automatically portable and future proof
– JIT compilation
OpenCL* Cons
• Low level – based on C99
• No heap!
• Lean framework
• Expert tool
• In terms of correctness and performance
• OpenCL* is not performance portable
• Tweaking is needed for each vendor
• Future specs and implementations may require no tweaking?
Vector dot multiplication
void vectorDotMul(int* vecA, int* vecB, int size, int* result)
{
    *result = 0;
    for (int i=0; i < size; ++i)
        *result += vecA[i] * vecB[i];
}
Single work item
[Figure: one work item multiplies the 1s of one vector by the 2s of the other, element by element, and accumulates the running sum 2, 4, 6, …, 16]
Vector dot multiplication in OpenCL*
__kernel void vectorDotMul(__global int* vecA, __global int* vecB, int size, __global int* result)
{
    // only the first work item does any work
    if (get_global_id(0) == 0) {
        *result = 0;
        for (int i=0; i<size; ++i)
            *result += vecA[i] * vecB[i];
    }
}
Single work group
[Figure: each work item multiplies its share of the elements and produces a partial sum of 4; the partial sums are then added one after another: 4, 8, 12, 16]
__kernel void vectorDotMul(__global int* vecA, __global int* vecB, int size, __global int* result)
{
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];
    int localSize = get_local_size(0);
    int work = size/localSize;
    int start = id*work;
    int end = start+work;

    // Work item calculation
    partialSum[id] = 0;
    for (int j=start; j<end; ++j)
        partialSum[id] += vecA[j] * vecB[j];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Reduction (performed by work item 0 only)
    if (id == 0) {
        *result = 0;
        for (int i=0; i<localSize; ++i)
            *result += partialSum[i];
    }
}
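The two phases of the kernel above can be emulated in plain C to check the arithmetic. This is a sketch, not device code: `dot_single_group` and `MAX_ITEMS` are hypothetical names, the work items are run one after another, and it assumes that `local_size` does not exceed `MAX_ITEMS` and divides `size` evenly.

```c
#include <stddef.h>

#define MAX_ITEMS 64  /* assumed upper bound on the work-group size */

/* Emulates one work group of local_size work items computing a dot product. */
int dot_single_group(const int *vecA, const int *vecB, size_t size, size_t local_size)
{
    int partial_sum[MAX_ITEMS] = {0};
    size_t work = size / local_size;

    /* Work item calculation: item id handles the chunk [id*work, (id+1)*work). */
    for (size_t id = 0; id < local_size; id++)
        for (size_t j = id * work; j < (id + 1) * work; j++)
            partial_sum[id] += vecA[j] * vecB[j];

    /* The barrier in the kernel corresponds to finishing the loop above. */

    /* Reduction: work item 0 adds up all partial sums. */
    int result = 0;
    for (size_t i = 0; i < local_size; i++)
        result += partial_sum[i];
    return result;
}
```

With eight 1s, eight 2s, and four work items, every partial sum is 4 and the result is 16, matching the numbers on the slides.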
Efficient reduction
[Figure: each work item produces a partial sum of 4; the partial sums 4, 4, 4, 4 are combined pairwise into 8 and 8, and then into 16]
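One common way to realize such a pairwise reduction is a tree with a halving stride. The plain C sketch below emulates it sequentially; `tree_reduce` is a hypothetical name, and the element count is assumed to be a power of two.

```c
#include <stddef.h>

/* In-place tree reduction; n must be a power of two.
   After the call, partial_sum[0] holds the total, which is also returned. */
int tree_reduce(int *partial_sum, size_t n)
{
    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* In a kernel, each id < stride would be a separate work item,
           followed by a barrier before the next step. */
        for (size_t id = 0; id < stride; id++)
            partial_sum[id] += partial_sum[id + stride];
    }
    return partial_sum[0];
}
```

For n partial sums this takes log2(n) steps instead of the n additions performed by a single work item.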
Vectorization
• Processors provide vector units
• SIMD on CPUs
• Warp on GPUs
• Used to perform several operations in parallel
– Arithmetic operations
– Binary operations
– Memory operations
Loop vectorization
void mul(int size, int* a, int* b, int* c)
{
    for (int i=0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}
Loop vectorization
void mul(int size, int* a, int* b, int* c)
{
    for (int i=0; i < size; i += 4) {
        c[i] = a[i] * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}
Loop vectorization
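void mul(int size, int* a, int* b, int* c)
{
    // integer SSE intrinsics (_mm_mullo_epi32 needs SSE4.1), since the arrays hold int
    for (int i=0; i < size; i += 4) {
        __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i c_vec = _mm_mullo_epi32(a_vec, b_vec);
        _mm_storeu_si128((__m128i *)(c + i), c_vec);
    }
}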
Automatic loop vectorization
Is there a dependency between a, b, and c?
void mul(int size, int* a, int* b, int* c)
{
    for (int i=0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}
Automatic loop vectorization
[Figure: arrays c and b in memory]
void mul(int size, int* a, int* b, int* c)
{
    for (int i=0; i < size; ++i) {
        c[i] = a[i] * b[i];
    }
}
Automatic loop vectorization
[Figure: arrays c and b in memory]
void mul(int size, int* a, int* b, int* c)
{
    for (int i=0; i < size; i += 4) {
        c[i] = a[i] * b[i];
        c[i+1] = a[i+1] * b[i+1];
        c[i+2] = a[i+2] * b[i+2];
        c[i+3] = a[i+3] * b[i+3];
    }
}
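Why the dependency question matters can be shown with a small plain C experiment. The sketch below is an illustration under stated assumptions: `mul_scalar` and `mul_vec4` are hypothetical names, and `mul_vec4` only emulates a 4-wide vector loop by performing all four loads and multiplies before the four stores. When c overlaps b, the two versions produce different results, which is why a compiler cannot vectorize the loop without knowing that the arrays do not overlap.

```c
/* Scalar loop: element i is written before element i + 1 is read. */
void mul_scalar(int size, const int *a, const int *b, int *c)
{
    for (int i = 0; i < size; ++i)
        c[i] = a[i] * b[i];
}

/* Emulated 4-wide vector loop: all four loads happen before the four stores.
   Assumes size is a multiple of 4. */
void mul_vec4(int size, const int *a, const int *b, int *c)
{
    for (int i = 0; i < size; i += 4) {
        int t[4];
        for (int k = 0; k < 4; ++k)
            t[k] = a[i + k] * b[i + k];  /* vector load and multiply */
        for (int k = 0; k < 4; ++k)
            c[i + k] = t[k];             /* vector store */
    }
}
```

If both functions are called with c pointing one element past b, the scalar loop feeds each product into the next iteration while the emulated vector loop does not. In C99 the programmer can promise the absence of such overlap with the `restrict` qualifier.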
Automatic vectorization in OpenCL*
__kernel void mul(int size, __global int* a, __global int* b, __global int* c)
{
    int id = get_global_id(0);
    c[id] = a[id] * b[id];
}
Automatic vectorization in OpenCL*
for (int id=workGroupIdStart; id < workGroupIdEnd; ++id) {
    c[id] = a[id] * b[id];
}
Automatic vectorization in OpenCL*
for (int id=workGroupIdStart; id < workGroupIdEnd; id += 4) {
    c[id] = a[id] * b[id];
    c[id+1] = a[id+1] * b[id+1];
    c[id+2] = a[id+2] * b[id+2];
    c[id+3] = a[id+3] * b[id+3];
}
Automatic vectorization in OpenCL*
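// integer SSE intrinsics (_mm_mullo_epi32 needs SSE4.1), since the arrays hold int
for (int id=workGroupIdStart; id < workGroupIdEnd; id += 4) {
    __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + id));
    __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + id));
    __m128i c_vec = _mm_mullo_epi32(a_vec, b_vec);
    _mm_storeu_si128((__m128i *)(c + id), c_vec);
}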
Single work group
[Figure: each work item multiplies a contiguous block of elements into a partial sum of 4; the partial sums 4, 4, 4, 4 are combined pairwise into 8 and 8, and then into 16]
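Vectorizer friendly
[Figure: the elements are interleaved across work items, so that neighboring work items multiply neighboring elements; the partial sums 4, 4, 4, 4 are combined pairwise into 8 and 8, and then into 16]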
__kernel void vectorDotMul(__global int* vecA, __global int* vecB, int size, __global int* result)
{
    int id = get_local_id(0);
    __local volatile int partialSum[MAX_SIZE];
    int localSize = get_local_size(0);

    // Work item calculation: work item id handles elements id, id + localSize, ...
    partialSum[id] = 0;
    for (int j=id; j < size; j += localSize)
        partialSum[id] += vecA[j] * vecB[j];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Reduction (performed by work item 0 only)
    if (id == 0) {
        *result = 0;
        for (int i=0; i<localSize; ++i)
            *result += partialSum[i];
    }
}
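The difference between the two element assignments comes down to index arithmetic, which the following plain C sketch makes explicit. `blocked_index` and `strided_index` are hypothetical helper names; `step` is the iteration number of a work item's loop.

```c
#include <stddef.h>

/* Blocked assignment: work item id owns a contiguous chunk of `work` elements,
   so neighboring work items are `work` elements apart at every step. */
size_t blocked_index(size_t id, size_t step, size_t work)
{
    return id * work + step;
}

/* Strided assignment: at every step, neighboring work items touch
   neighboring elements, which maps well onto vector loads. */
size_t strided_index(size_t id, size_t step, size_t local_size)
{
    return id + step * local_size;
}
```

With 16 elements and 4 work items, the first step touches elements 0, 4, 8, 12 in the blocked scheme and elements 0, 1, 2, 3 in the strided one.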
Predication
__kernel void mul(int size, __global int* a, __global int* b, __global int* c)
{
    int id = get_global_id(0);
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}
Predication
for (int id=workGroupIdStart; id < workGroupIdEnd; ++id) {
    if (id > 6) {
        c[id] = a[id] * b[id];
    } else {
        c[id] = a[id] + b[id];
    }
}
How can we vectorize the loop?
Predication
for (int id=workGroupIdStart; id < workGroupIdEnd; ++id) {
    bool mask = (id > 6);
    // compute both sides, then select by the mask
    int c1 = a[id] * b[id];
    int c2 = a[id] + b[id];
    c[id] = (mask) ? c1 : c2;
}
Predication
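for (int id=workGroupIdStart; id < workGroupIdEnd; id += 4) {
    // vector of consecutive ids: id, id+1, id+2, id+3
    __m128i idVec = _mm_add_epi32(_mm_set1_epi32(id), _mm_setr_epi32(0, 1, 2, 3));
    __m128i mask = _mm_cmpgt_epi32(idVec, _mm_set1_epi32(6));
    __m128i a_vec = _mm_loadu_si128((const __m128i *)(a + id));
    __m128i b_vec = _mm_loadu_si128((const __m128i *)(b + id));
    __m128i c1_vec = _mm_mullo_epi32(a_vec, b_vec);
    __m128i c2_vec = _mm_add_epi32(a_vec, b_vec);
    // the blend picks its second operand where the mask is set
    __m128i c3_vec = _mm_blendv_epi8(c2_vec, c1_vec, mask);
    _mm_storeu_si128((__m128i *)(c + id), c3_vec);
}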
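The same predication idea can be written in portable C without intrinsics, which may help when reading the vector version. This is a sketch under stated assumptions: `select_i32` and `mul_or_add` are hypothetical names, and the mask is built as all ones or all zeros so that a bitwise select replaces the branch.

```c
#include <stdint.h>

/* Branch-free select: returns if_true where mask is all ones,
   if_false where mask is all zeros. */
static int32_t select_i32(int32_t mask, int32_t if_true, int32_t if_false)
{
    return (if_true & mask) | (if_false & ~mask);
}

/* Predicated form of: c[id] = (id > 6) ? a[id] * b[id] : a[id] + b[id]. */
void mul_or_add(int size, const int32_t *a, const int32_t *b, int32_t *c)
{
    for (int id = 0; id < size; ++id) {
        int32_t mask = -(int32_t)(id > 6);  /* -1 (all ones) or 0 */
        int32_t c1 = a[id] * b[id];         /* both sides are always computed */
        int32_t c2 = a[id] + b[id];
        c[id] = select_i32(mask, c1, c2);
    }
}
```

Both sides of the condition are evaluated for every element, so predication trades extra arithmetic for the absence of branches.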
General tweaking
• Consecutive memory accesses
• SIMD, WARP
• How can we vectorize with control flow?
• Can we somehow create efficient code with control flow?
• Uniform CF
• CF that diverges in units of the SIMD size
• Enough work groups to utilize the machine
Architecture tweaking
CPU
• Locality
• No local memory (also slow in some GPUs)
• Enough compute for a work group
• Overcome thread creation overhead
GPU
• Use local memory
• Avoid bank conflicts
Conclusion
• OpenCL* is an open standard that lets developers:
– Write the same code for any type of processor
• Use all existing resources of a platform in their application
• Automatic parallelism
• OpenCL* applications are automatically portable and forward compatible
• OpenCL* is still an expert tool
– OpenCL* is not performance portable
– Tweaking for each vendor should be done