©2013-2014 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.


Using Qualcomm® Snapdragon™ LLVM compiler to optimize apps for 32 and 64 Bit

Zino Benaissa
Engineer, Principal/Manager
Qualcomm Innovation Center, Inc.

Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc.


Outline

• Introduction

• Coding guidelines for performance

• LLVM optimization pragmas

• LLVM internal flags

• Summary


Introduction

Software engineering

Software applications are growing exponentially

• Software quality and security
  − Many tools exist to fight bugs and scrutinize source code for security holes. The LLVM community is developing such tools:
    − Static analyzer
    − Sanitizers:
      − Address
      − Undefined behavior
    − Loop coverage tools
• Performance
  − Hardware and compilers are expected to be smart, and they are!
  − But performance goals are often not met. In this case programmers are on their own
    − Costly analysis is required
    − Ad hoc methods are used
      − Inspection of assembly code and code rewrite

Compilers

Compilers are formidable tools

• They have evolved along with the hardware
  − Superscalar, SIMD, multi-core, 64 bits
• A typical industrial compiler includes over a hundred optimizations
• Many powerful optimizations have been actively researched and developed to target hardware features
  − Loop auto-vectorization targeting SIMD execution units
  − Loop auto-parallelization targeting multi-cores

Programmer expectations:

• Work correctly on any program
• Produce fast code
• Maximize utilization of hardware capabilities

Compilers

Compilers are just programs. Programmers should be aware that they:

• Contain thousands of bugs, like any other large piece of software
• Apply optimizations that have limitations
  − An optimization can fail to apply to a legitimate piece of code
• May lack an “expected” optimization
  − Make no assumptions about what the compiler will do
• Are systematic, but typically unable to infer the critical knowledge of domain experts

Compilers

The good news: minor rewrites of source code often trigger optimizations

• Following simple coding guidelines can significantly increase compiler effectiveness
• The compiler knows why an optimization did not apply
  − The LLVM community is actively developing an optimization reporting feature targeted for release 3.6
  − The Snapdragon LLVM team is extending this feature
    − An early preview of this feature is possible


Coding Guidelines for Performance


Sample code included in this presentation is made available subject to The Clear BSD License

Copyright (c) 2014 Qualcomm Innovation Center, Inc.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted (subject to the limitations in the disclaimer below) provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

* Neither the name of Qualcomm Innovation Center, Inc. nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Guideline 1: Make the loop trip count known

Loop:

    void foo(int *A) {
      for (int i = 0; i < computeN(); i++)
        A[i] += 1;
    }

computeN() needs to be evaluated on every loop iteration.

Rewrite to:

    void foo(int *A) {
      int n = computeN();
      for (int i = 0; i < n; i++)
        A[i] += 1;
    }

computeN() is evaluated only once.
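The difference between the two forms can be made observable with a minimal sketch. It assumes a hypothetical computeN() that counts its own calls and returns a fixed value; neither the counter nor the function names fooEveryIteration and fooHoisted come from the slides.

```cpp
#include <cassert>

// Hypothetical stand-in for the slide's computeN(): it counts its own calls
// so the difference between the two loops is observable.
static int g_calls = 0;

static int computeN() {
  ++g_calls;
  return 8;
}

// Original form: the loop condition calls computeN() before every iteration.
void fooEveryIteration(int *A) {
  for (int i = 0; i < computeN(); i++)
    A[i] += 1;
}

// Rewritten form: the trip count is read once and stays loop-invariant.
void fooHoisted(int *A) {
  const int n = computeN();
  assert(n >= 0);
  for (int i = 0; i < n; i++)
    A[i] += 1;
}
```

With an 8-element array, the first form calls computeN() nine times (eight passing tests plus the failing one) and the second calls it once. When the compiler cannot see the body of computeN(), it generally has to keep all of those calls.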

Guideline 2: Use signed type

Loop:

    void foo(int *myArray, unsigned n) {
      for (unsigned i = 0; i < n; i += 2)
        myArray[i] += 1;
    }

An unsigned type has modulo (wrap) semantics. Because i can wrap around before reaching n, the compiler cannot assume that the loop runs a known number of iterations, or even that it terminates.

Rewrite to:

    void foo(int *myArray, unsigned n) {
      for (int i = 0; i < n; i += 2)
        myArray[i] += 1;
    }

Overflow of a signed type is undefined, so the compiler assumes the loop counter never overflows.
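A small sketch of the arithmetic behind this guideline. The helper names stepUnsigned and signedTripCount are hypothetical and only illustrate the reasoning; they are not part of the slides.

```cpp
#include <cassert>
#include <climits>

// One step of the slide's unsigned loop counter. Unsigned arithmetic is
// defined to wrap, so stepping from UINT_MAX - 1 lands back on 0. With
// n == UINT_MAX the test i < n therefore never fails.
unsigned stepUnsigned(unsigned i) { return i + 2u; }

// Trip count of `for (int i = 0; i < n; i += 2)` with a signed n. Signed
// overflow is undefined, so the compiler may assume i never wraps and may
// rely on a closed form like this one.
int signedTripCount(int n) {
  assert(n < INT_MAX);  // n == INT_MAX would step i past INT_MAX (undefined)
  return n > 0 ? (n - 1) / 2 + 1 : 0;
}
```

For example, signedTripCount(5) is 3 (i takes the values 0, 2 and 4), while stepUnsigned(UINT_MAX - 1u) wraps to 0.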

Guideline 3: Beware of pointer aliasing

Loop:

    void foo(MyStruct *s) {
      for (int i = 0; i < s->NumElm; i++)
        s->MyArray[i] += 1;
    }

The programmer should not assume that the compiler will be able to hoist s->NumElm: the compiler may be unable to prove that a store to s->MyArray[i] never modifies s->NumElm, so the field is reloaded on every iteration.

Rewrite to:

    void foo(MyStruct *s) {
      int n = s->NumElm;
      for (int i = 0; i < n; i++)
        s->MyArray[i] += 1;
    }

The compiler knows the number of loop iterations.
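The following sketch shows an aliasing case that the compiler has to allow for. The slides never show the definition of MyStruct, so the layout below is an assumption, and fooHoistedCount is a hypothetical name for the rewritten loop.

```cpp
#include <cassert>

// Assumed layout: the slides do not define MyStruct.
struct MyStruct {
  int NumElm;
  int *MyArray;
};

// Hoisted form from the slide: the trip count is read exactly once, so later
// stores through MyArray cannot change how many iterations run.
void fooHoistedCount(MyStruct *s) {
  const int n = s->NumElm;
  assert(n >= 0);
  for (int i = 0; i < n; i++)
    s->MyArray[i] += 1;
}
```

If MyArray is set to point at the struct's own NumElm field and NumElm is 1, the hoisted loop runs once and leaves NumElm at 2. The original loop would re-read the updated count and keep going, so the two forms are not equivalent under aliasing. That is why the compiler cannot perform the hoist on its own unless it can rule the aliasing out.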

Guideline 4: Hoist complex pointer indirections

Loop:

    typedef struct { int **b; } S;

    void foo(S *A) {
      for (int i = 0; i < 100; i++)
        A->b[i] = nullptr;
    }

A->b is evaluated on every iteration.

Rewrite to:

    typedef struct { int **b; } S;

    void foo(S *A) {
      int **ptr = A->b;
      for (int i = 0; i < 100; i++)
        ptr[i] = nullptr;
    }

If there are more than two levels of pointer/struct indirection, hoist them outside the loop.

Guideline 5: Use restrict keyword

Loop:

    void foo(int *A, int *B) {
      for (int i = 0; i < 100; i++)
        A[i] += B[i];
    }

The loop cannot be parallelized because the compiler has to worry about one case: A points to B + 1, so that A[i] is B[i+1] and each iteration reads the value written by the previous one.

Rewrite to:

    void foo(int *__restrict A, int *__restrict B) {
      for (int i = 0; i < 100; i++)
        A[i] += B[i];
    }

__restrict tells the compiler that A and B point to separate arrays.

LLVM vectorizes this loop even without restrict: it generates run-time checks to verify that A and B do not overlap.
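The overlapping case can be reproduced with a minimal sketch. It uses the same loop body with a hypothetical length parameter n (the slide fixes the count at 100) and deliberately omits __restrict, because calling a __restrict version with overlapping buffers would be undefined behavior. The name addArrays is an assumption.

```cpp
#include <cassert>

// Same loop body as the slide, without __restrict, so that overlapping
// arguments are still legal. n is a hypothetical length parameter.
void addArrays(int *A, const int *B, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; i++)
    A[i] += B[i];  // when A == B + 1 this is B[i+1] += B[i]
}
```

Calling addArrays(buf + 1, buf, 4) on a five-element buffer of ones produces 1, 2, 3, 4, 5: every iteration consumes the result of the previous one. This loop-carried dependence is what the run-time overlap checks guard against, and what __restrict promises will not happen.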

Guideline 6: Avoid complex control-flow

Loop:

    void foo(int *A, int n, int m) {
      for (int i = 0; i < n; i++) {
        for (int j = 0; j < m; j++) {
          if (j != m - 1) *A |= 1;   // last iteration excluded
          if (i != n - 1) *A |= 2;   // last iteration excluded
          if (j != 0)     *A |= 4;   // first iteration excluded
          if (i != 0)     *A |= 8;   // first iteration excluded
          A++;
        }
      }
    }

Most elements of A will be set with *A | 15.

Rewrite to:

    void foo(int *A, int n, int m) {
      // Handle cases n == 1 and m == 1
      // Peel iteration when i is 0
      // Most executed loop
      for (int i = 1; i < n - 1; i++) {
        *A++ |= 11;                      /* iter j = 0 */
        for (int j = 1; j < m - 1; j++)
          *A++ |= 15;                    /* most commonly executed code */
        *A++ |= 14;                      /* iter j = m - 1 */
      }
      // Peel iteration i = n - 1
    }

The first and last iterations are peeled.
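The slide leaves the peeled iterations as comments. One possible way to fill them in is sketched below, next to the original loop nest so the two can be compared. The helper orRow and the names fooBranchy and fooPeeled are hypothetical, and this is only one of several ways to arrange the peeling.

```cpp
#include <cassert>

// Reference: the original loop nest with all four conditions in the body.
void fooBranchy(int *A, int n, int m) {
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < m; j++) {
      if (j != m - 1) *A |= 1;
      if (i != n - 1) *A |= 2;
      if (j != 0) *A |= 4;
      if (i != 0) *A |= 8;
      A++;
    }
  }
}

// Hypothetical helper: ORs one row of m elements. rowBits carries the bits
// that depend only on i (2 and/or 8). Returns the pointer past the row.
static int *orRow(int *A, int m, int rowBits) {
  if (m == 1) {            // j is both first and last: neither 1 nor 4 is set
    *A++ |= rowBits;
    return A;
  }
  *A++ |= rowBits | 1;     // j = 0
  for (int j = 1; j < m - 1; j++)
    *A++ |= rowBits | 5;   // interior: bits 1 and 4, branch-free
  *A++ |= rowBits | 4;     // j = m - 1
  return A;
}

// Peeled version: the first and last rows are handled outside the main loop.
void fooPeeled(int *A, int n, int m) {
  assert(n >= 0 && m >= 0);
  if (n == 0 || m == 0) return;
  if (n == 1) {            // i is both first and last: neither 2 nor 8 is set
    orRow(A, m, 0);
    return;
  }
  A = orRow(A, m, 2);      // i = 0
  for (int i = 1; i < n - 1; i++)
    A = orRow(A, m, 10);   // interior rows: 11, 15 ..., 14 as on the slide
  orRow(A, m, 8);          // i = n - 1
}
```

The innermost loop in orRow now has a single store and no conditions, which gives the compiler a much simpler candidate for vectorization than the original body.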


LLVM Optimization Pragmas

Guideline 7: Use pragma vectorize

Loop:

    void foo(int *A, int n) {
      for (int i = 0; i < n % 4; i++)
        A[i] += 1;
    }

The loop has too few iterations to benefit from vectorization. The compiler often has no way to know that the trip count is at most three, and the programmer cannot assume the compiler will figure out that the loop has fewer than four iterations.

Rewrite to:

    void foo(int *A, int n) {
    #pragma clang loop vectorize(disable)
      for (int i = 0; i < n % 4; i++)
        A[i] += 1;
    }

• Beware: pragmas are often target dependent. Apply them only to the intended target
• Pragmas override command line flags
• Pragmas will be supported in the upcoming Snapdragon LLVM release

Guideline 7: Use pragma vectorize (Example 2)

Loop:

    void foo(char *A, int n) {
      n = min(14, n);
      for (int i = 0; i < n; i++)
        A[i] += 1;
    }

The compiler is unaware that there are at most 14 iterations. It will attempt to vectorize using a factor of 16 to fill the 128-bit ARM NEON registers.

Rewrite to:

    void foo(char *A, int n) {
      n = min(14, n);
    #pragma clang loop vectorize_width(8)
      for (int i = 0; i < n; i++)
        A[i] += 1;
    }

The compiler will vectorize using a factor of 8. When n >= 8, vector instructions are used.
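A self-contained variant of this example is sketched below. It assumes std::min in place of the slide's unqualified min, and incrementCapped is a hypothetical name. The pragma is only a hint: the result is the same whether or not the compiler honors it, and compilers that are not Clang-based are expected to ignore it, possibly with an unknown-pragma warning.

```cpp
#include <algorithm>
#include <cassert>

// Capped loop from the slide, made self-contained with std::min.
// vectorize_width(8) asks a Clang-based compiler for 8 lanes instead of the
// 16 it would pick to fill a 128-bit register with char elements.
void incrementCapped(char *A, int n) {
  assert(n >= 0);
  n = std::min(14, n);
#pragma clang loop vectorize_width(8)
  for (int i = 0; i < n; i++)
    A[i] += 1;
}
```

Called on a 20-element buffer with n = 20, only the first 14 elements are incremented, so the hint changes how the loop is compiled and not what it computes.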


LLVM Internal Flags

LLVM hidden optimization flags

• The compiler utilizes various heuristics and optimization thresholds
  − Preset depending on the optimization level
• Many optimizations are experimental and remain turned off
• They are controlled by command line compiler flags
  − “clang --help-hidden” displays all available flags
• These flags are difficult to use
  − They can significantly accelerate specific pieces of code
  − They are unsafe to use in general
• Typically reserved for advanced programmers and compiler developers
  − In the future, compiler reporting may suggest usage of a subset of these flags

Summary

• Coding guidelines can make compilers significantly more effective
  − Significant speed-up
• Guidelines are only useful while the code remains readable
  − Avoid obscure and complex source changes
• Use domain-expert knowledge
  − LLVM-supported pragmas
• The Snapdragon LLVM compiler is available at Qualcomm Developer Network


For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog

©2013-2014 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm and Snapdragon are trademarks of Qualcomm Incorporated, registered in the United States and other countries, used with permission. Uplinq is a trademark of Qualcomm Incorporated, used with permission. Other products and brand names may be trademarks or registered trademarks of their respective owners. References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.

Thank you