LCU14 BURLINGAME: M. Collison, M. Kuvyrkov & Will Newton, LCU14-311: Advanced Toolchain Usage

LCU14 311- Advanced Toolchain Usage (parts 1&2)


DESCRIPTION

LCU14 311- Advanced Toolchain Usage (parts 1&2)
Speakers: M. Collison, M. Kuvyrkov & Will Newton
Date: September 17, 2014

★ Session Summary ★
This set of sessions goes into detail on many toolchain topics and helps the attendee get the most out of their toolchain usage. Topics covered include:
- Inline assembly
- Link Time Optimization (LTO)
- Feedback Directed Optimization (FDO)
- Proper code annotation for: promoting vectorization, avoiding false sharing, memory aliasing, restrict keyword usage
- Optimization levels and what they mean
- Demystifying -march, -mfpu, -mcpu, -mtune, -with-mode
- Linking options
- Libatomic usage
- Debugging binaries compiled with optimizations

★ Resources ★
Zerista: http://lcu14.zerista.com/event/member/137758
Google Event: https://plus.google.com/u/0/events/croj226kas0gjarm1mqbm42db9o
Video: https://www.youtube.com/watch?v=cy69u5n3qWA&list=UUIVqQKxCyQLJS6xvSmfndLA
Etherpad: http://pad.linaro.org/p/lcu14-311

★ Event Details ★
Linaro Connect USA - #LCU14
September 15-19, 2014
Hyatt Regency San Francisco Airport

http://www.linaro.org
http://connect.linaro.org


Page 1: LCU14 311- Advanced Toolchain Usage (parts 1&2)

LCU14 BURLINGAME

M. Collison, M. Kuvyrkov & Will Newton, LCU14-311: Advanced Toolchain Usage

Page 2: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Part 1:
● GCC optimization levels
● Using random compiler options
● Toolchain defaults by vendor
● How to select target flags
● Feedback directed optimization
● Link-time optimization

Part 2:
● Inline assembly
● Auto-vectorization
● Minimizing global symbols
● Section garbage collection
● GNU symbol hash

Page 3: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Optimization Level 0 (-O0)
● Optimization Level 1 (-O1)
● Optimization Level 2 (-O2)
● Optimization Level 3 (-O3)
● Code Size Optimization (-Os)
● Optimize for debugging (-Og)

GCC Optimization Levels

Page 4: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● -O0 performs no optimization; it is equivalent to providing no optimization option at all
● THIS IS THE DEFAULT

Optimization Level 0

Page 5: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Enables basic optimizations that attempt to reduce code size and execution time
● Debugging of generated code is minimally affected
● Important optimizations enabled:
  ● Dead code and store elimination on trees and RTL
  ● Basic loop optimizations
  ● Register allocation
  ● If conversion: convert conditional jumps into "branch-less" equivalents
  ● Constant propagation
  ● Elimination of redundant jumps to jumps

Optimization Level 1 (-O or -O1)

Page 6: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Enables all optimizations from the -O1 level
● Adds more aggressive optimizations at the expense of debuggability
● Important optimizations enabled:
  ● Global CSE, constant and copy propagation ("global" means within an entire function, not across function boundaries)
  ● Instruction scheduling to take advantage of the processor pipeline
  ● Inlining of small functions
  ● Interprocedural constant propagation
  ● Reordering of basic blocks to improve cache locality
  ● Partial redundancy elimination

Optimization Level 2 (-O2)

Page 7: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Enables all optimizations from -O2
● Optimizes more aggressively to reduce execution time at the expense of code size
● (Potentially) inlines any function
● Loop vectorization to utilize SIMD instructions
● Function cloning to make interprocedural constant propagation more powerful
● Loop unswitching

Optimization Level 3 (-O3)

Page 8: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Enables all -O2 optimizations that do not increase code size
● Disables the following -O2 optimizations:
  ● Alignment of the start of functions, loops, branch targets and labels
  ● Reordering of basic blocks

Optimize for Code Size (-Os)

Page 9: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Enables optimizations that do not interfere with debugging
● Debug info ("-g") must still be enabled explicitly
● I use "-Og" and "-g" for the edit-compile development cycle

Optimize for Debugging (-Og)

Page 10: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Use -Og and -g for the edit-compile-debug cycle
● Use -O2 where both code size and execution speed are important
● Use -O3 when execution speed is the primary requirement
● Use -Os when code size is the primary requirement

Recommendation for Optimization Options

Page 11: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● 3 years ago I spent 3 days finding the best combination of GCC flags for my project / board / benchmark:
  -O2 -funroll-loops -fno-schedule-insns --param <some>=<thing>
● … 3 major versions of the compiler later …
● Why does plain -Os outperform my custom-tuned options?
  ● I thought loop unrolling makes loops go faster.
  ● I saw "-fno-schedule-insns" on the internet.
  ● I hand-tuned --param <some>=<thing>.

But I’m experienced, I /know/ the good flags!

Page 12: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Feature flags (these are OK):
  ● -std=c++11 -- language standard
  ● -fno-common -- language feature
  ● -mcpu=cortex-a15 -mfpu=neon-vfpv4 -- target feature
● Compatibility flags (not OK, please fix your code):
  ● -fno-strict-aliasing
  ● -fsigned-char
● Optimization flags (not OK, please use -Og/-Os/-O2/-O3/-Ofast):
  ● -f<optimization>
  ● -fno-<optimization>
  ● --param <some>=<thing>

But I’m experienced, I /know/ the good flags!

Page 13: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Linaro (cross):
  -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a9
● Ubuntu (native):
  -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8
● Debian armhf (native):
  -mthumb -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8
● Debian armel (native):
  -marm -march=armv4t -mfloat-abi=soft -mtune=arm7tdmi
● Fedora (native):
  -marm -march=armv7-a -mfloat-abi=hard -mfpu=vfpv3-d16 -mtune=cortex-a8
● CodeSourcery (cross):
  -marm -march=armv5te -mfloat-abi=soft -mtune=arm1026ejs (other multilibs available)

So many defaults (AArch32)

Page 14: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● -mcpu=CPU -mfpu=FPU, e.g. -mcpu=cortex-a15 -mfpu=neon-vfpv4
● -mcpu=FOO is the same as -march=<FOO's arch> -mtune=FOO (-mcpu is the preferred option)
● Using ABI options requires a matching set of libraries (multilib):
  ● There is always a default multilib for the default ABI options
  ● Linaro toolchains have a single -- default -- multilib per toolchain
  ● MEANING OF MULTILIB: a set of libraries, not libraries for multiple ABIs
● For different ABI configurations use different Linaro toolchain packages (or build your own with cbuild2!)

How to select target flags

Page 15: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Feedback directed optimization provides information to the compiler which is then used for making optimization decisions.

● Branch probabilities
● Inlining
● Hot/cold code reordering and partitioning (not on ARM)

The information used is generated by profiling, which can be done by one of two methods.

● gprof-style code instrumentation
● Statistical profiling with hardware counters

Feedback Directed Optimization

Page 16: LCU14 311- Advanced Toolchain Usage (parts 1&2)

1. Build the code with options that add profiling instrumentation:
   -fprofile-generate=dir, where dir is the output directory
2. Run the application with a representative workload.
3. Rebuild the code with the profile generated by the run:
   -fprofile-use=dir, where dir is the same directory as before

This results in two build types: the slower instrumented build and the final optimized build.

Using code instrumentation
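The three steps above, sketched end to end on a native gcc (fdo.c and the prof/ directory are hypothetical names):

```shell
# Hypothetical toy program; any workload-representative binary works.
cat > fdo.c <<'EOF'
#include <stdio.h>
int main(void) {
    long s = 0;
    for (long i = 0; i < 1000000; i++)
        s += (i % 3 == 0) ? i : -i;   /* branchy loop worth profiling */
    printf("%ld\n", s);
    return 0;
}
EOF
# 1. Instrumented build; profile data will be written under ./prof
gcc -O2 -fprofile-generate=prof fdo.c -o fdo_inst
# 2. Run with a representative workload to produce .gcda counter files.
./fdo_inst
# 3. Rebuild using the collected profile.
gcc -O2 -fprofile-use=prof fdo.c -o fdo_opt
./fdo_opt
```

Both binaries compute the same result; only the optimization decisions (branch layout, inlining choices) differ.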

Page 17: LCU14 311- Advanced Toolchain Usage (parts 1&2)

In this example I used the Opus 1.1 codec encoder test and gcc 4.8.3 on x86_64.

Performance

Build Type     Run Time   Relative Run Time
Default        27.727s    100%
Instrumented   34.008s    123%
Optimized      24.301s    88%

Page 18: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Instrumenting and optimizing based on profiles also adds some overhead to compile times.

Build Time

Build Type     Build Time   Relative Build Time
Default        42.410s      100%
Instrumented   55.508s      131%
Optimized      70.544s      166%

Page 19: LCU14 311- Advanced Toolchain Usage (parts 1&2)

AutoFDO is a new method of feedback directed optimization developed by Google. It uses perf to generate profiles from optimized binaries with debug information.

https://github.com/google/autofdo

1. Build a standard optimized build (with debug info).
2. Run the application with perf record branch profiling.
3. Convert the profile with the autofdo tool.
4. Build with -fauto-profile.

Only supported in Google's gcc branch, not on master. Provides around 70-80% of the performance benefit of the instrumentation method, but the profiling overhead is only around 2%.

AutoFDO

Page 20: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Allows optimizations that work on an entire file to work across the entire application
● Works by saving the compiler IL in object files and using the IL to optimize at "link time"
● Enabled with "-flto"
● -fuse-linker-plugin allows LTO to be applied to object files in libraries (assuming proper linker support)
● Limitation: use the same command line options when compiling source files:
  gcc -O2 -flto -c a.c
  gcc -O2 -flto -c b.c
  gcc -o a.out a.o b.o -flto
● LTO is production ready in gcc 4.9

Link Time Optimization (LTO)

Page 21: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Part 1:
● GCC optimization levels
● Using random compiler options
● Toolchain defaults by vendor
● How to select target flags
● Feedback directed optimization
● Link-time optimization

Part 2:
● Inline assembly
● Auto-vectorization
● Minimizing global symbols
● Section garbage collection
● GNU symbol hash

Page 22: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Use cases: instructions the compiler does not know about
  ● Are you sure? Check the latest built-ins / intrinsics first!
  ● Privileged instructions
  ● Syscall / interrupt instructions
● Basic Asm: asm ("INSN1");
  ● Limited use; all operands must already be in specific registers
  ● See docs: https://gcc.gnu.org/onlinedocs/gcc/Basic-Asm.html
● Extended Asm: asm ("TEMPLATE" : "OUTPUTS" : "INPUTS" : "CLOBBERS");
  ● See glibc or the Linux kernel for inspiration
  ● See docs: https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html

Inline Assembly

Page 23: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● "asm (:::);" is just another normal statement
● GCC optimizes asm statements just like any other statement
● The programmer is responsible for specifying ALL effects of the asm
● "asm volatile (:::);"
  ● The number of executions, not the presence in code, is guaranteed.

Inline Assembly -- statements

Page 24: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Wrong:

int func (int arg) {
  asm ("insn r0"); // I know the ABI
  return arg;
}

Correct:

int func (int _arg) {
  int arg asm ("r0") = _arg;
  asm ("insn %0" : "+r" (arg));
  return arg;
}

Inline Assembly -- variables

Page 25: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Vectorization performs multiple iterations of a loop (or repeated operation) using vector instructions that operate on multiple data items simultaneously. gcc is capable of identifying code that can be vectorized and applying this transformation.

Compiler flags to enable this optimization:
● -O3
● -ftree-vectorize

Auto-vectorization

Page 26: LCU14 311- Advanced Toolchain Usage (parts 1&2)

A simple loop to vectorize:

#define SIZE (1UL << 16)

void test1(double *a, double *b)
{
  for (size_t i = 0; i < SIZE; i++)
    a[i] += b[i];
}

Auto-vectorization Example

Page 27: LCU14 311- Advanced Toolchain Usage (parts 1&2)

What code is generated by gcc -std=c99 -O2 -mfpu=neon?

test1:
        movs    r3, #0
.L3:
        fldd    d16, [r0]
        fldmiad r1!, {d17}
        faddd   d16, d16, d17
        adds    r3, r3, #1
        cmp     r3, #65536
        fstmiad r0!, {d16}
        bne     .L3
        bx      lr

Auto-vectorization Example

Page 28: LCU14 311- Advanced Toolchain Usage (parts 1&2)

What code is generated by gcc -std=c99 -O3 -mfpu=neon?

The code is unchanged. Why did we not see any vectorization? gcc provides -ftree-vectorizer-verbose to help.

test.c:9: note: not vectorized: no vectype for stmt: _7 = *_6;

scalar_type: double

ARMv7 NEON does not support vectorizing double precision operations so gcc cannot vectorize the loop.

Auto-vectorization Example

Page 29: LCU14 311- Advanced Toolchain Usage (parts 1&2)

So how about we switch to float. Does it vectorize?

No. What do we get from -ftree-vectorizer-verbose?

test.c:8: note: not vectorized: relevant stmt not supported: _11 = _7 + _10;

test.c:8: note: bad operation or unsupported loop bound.

NEON does not support full IEEE 754, so gcc won’t use it.

Auto-vectorization Example

Page 30: LCU14 311- Advanced Toolchain Usage (parts 1&2)

If we know that our data does not contain any problematic values (denormals or non-default NaNs) and we can deal with the other restrictions (round to nearest, no traps) we can tell gcc NEON is OK with -funsafe-math-optimizations.

Finally, we see vector instructions!

Auto-vectorization Example

Page 31: LCU14 311- Advanced Toolchain Usage (parts 1&2)

test1:
        add     r3, r1, #16
        add     r2, r0, #16
        cmp     r0, r3
        it      cc
        cmpcc   r1, r2
        ite     cs
        movcs   r3, #1
        movcc   r3, #0
        bcc     .L5
        add     r3, r0, #262144
.L4:
        vld1.32 {q9}, [r1]!
        vld1.32 {q8}, [r0]
        vadd.f32 q8, q9, q8
        vst1.32 {q8}, [r0]!
        cmp     r0, r3
        bne     .L4
        bx      lr
.L5:
        flds    s15, [r0]
        fldmias r1!, {s14}
        fadds   s15, s14, s15
        adds    r3, r3, #1
        cmp     r3, #65536
        fstmias r0!, {s15}
        bne     .L5
        bx      lr

Auto-vectorization Example

Page 32: LCU14 311- Advanced Toolchain Usage (parts 1&2)

That’s still quite a lot of code, how can we improve it? Use the restrict keyword to annotate that the two arrays do not alias (overlap).

#define SIZE (1UL << 16)

void test1(float * restrict a, float * restrict b)
{
  for (size_t i = 0; i < SIZE; i++)
    a[i] += b[i];
}

Auto-vectorization Example

Page 33: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Well, that was unexpected! The generated code balloons to dozens of instructions: runtime alignment computation (sbfx, ands), scalar peeling loops before and after the main loop (flds / fadds / fsts), and, buried in the middle, the NEON vector core loop (vld1.64 {d16-d17} / vadd.f32 q8 / vst1.64).

Auto-vectorization Example

Page 34: LCU14 311- Advanced Toolchain Usage (parts 1&2)

gcc is expending a lot of instructions making sure the pointers are aligned to an 8 byte boundary. Often this can be guaranteed by the allocator or data structure layout.

void test1(float * restrict a_, float * restrict b_)
{
  float *a = __builtin_assume_aligned(a_, 8);
  float *b = __builtin_assume_aligned(b_, 8);
  for (size_t i = 0; i < SIZE; i++)
    a[i] += b[i];
}

Auto-vectorization Example

Page 35: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Now we have something that looks fairly optimal.

test1:
        add     r3, r0, #262144
.L3:
        vld1.64 {d16-d17}, [r0:64]
        vld1.64 {d18-d19}, [r1:64]!
        vadd.f32 q8, q8, q9
        vst1.64 {d16-d17}, [r0:64]!
        cmp     r0, r3
        bne     .L3
        bx      lr

Auto-vectorization Example

Page 36: LCU14 311- Advanced Toolchain Usage (parts 1&2)

● Use the right types
● Understand the implications for mathematical operations
● Use restrict annotations where possible
● Use vector-aligned pointers where possible and annotate them
● Use countable loop conditions, e.g. i < n
● Don't do control flow in the loop, e.g. break, function calls
● Experiment with -ftree-vectorizer-verbose

Auto-vectorization Tips

Page 37: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Reducing the number of global symbols in shared objects is beneficial for a number of reasons:
● Reduced startup time
● Faster function calls
● Smaller disk and memory footprint

There are a number of ways to achieve this goal:
● Make as many functions as possible static
● Use a version script to force symbols local
● Use -fvisibility=hidden and symbol attributes
● Use ld -Bsymbolic

Minimizing Global Symbols
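The version-script approach sketched with hypothetical names: lib.map exports only api_entry and forces every other symbol local, which nm -D confirms.

```shell
cat > lib.c <<'EOF'
int internal_helper(int x) { return x * 2; }
int api_entry(int x) { return internal_helper(x) + 1; }
EOF
# Version script: one exported symbol, everything else local.
cat > lib.map <<'EOF'
{ global: api_entry; local: *; };
EOF
gcc -O2 -fPIC -shared lib.c -Wl,--version-script=lib.map -o libdemo.so
# Only api_entry should appear in the dynamic symbol table.
nm -D libdemo.so
```

With internal_helper no longer exported, the call to it inside api_entry also no longer needs to go through the PLT.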

Page 38: LCU14 311- Advanced Toolchain Usage (parts 1&2)

-Bsymbolic binds global references within a shared library to definitions within the shared library where possible, bypassing the PLT for functions. -Bsymbolic-functions behaves similarly but applies only to functions.

This breaks symbol preemption and pointer comparison so cannot be applied without a certain amount of care. -Bsymbolic-functions is safer as comparison of function pointers is rarer than comparison of data pointers.

-Bsymbolic

Page 39: LCU14 311- Advanced Toolchain Usage (parts 1&2)

lib1.c:

int func2(int a);

int func1(int a)
{
  return 1 + func2(a);
}

lib2.c:

int func2(int a)
{
  return a*2;
}

-Bsymbolic Example

Page 40: LCU14 311- Advanced Toolchain Usage (parts 1&2)

gcc -O2 -shared -o lib.so lib1.o lib2.o

00000540 <func1>:

540: b508 push {r3, lr}

542: f7ff ef7e blx 440 <_init+0x38>

546: 3001 adds r0, #1

548: bd08 pop {r3, pc}

54a: bf00 nop

0000054c <func2>:

54c: 0040 lsls r0, r0, #1

54e: 4770 bx lr

-Bsymbolic Example

Page 41: LCU14 311- Advanced Toolchain Usage (parts 1&2)

DYNAMIC RELOCATION RECORDS

OFFSET TYPE VALUE

00008f14 R_ARM_RELATIVE *ABS*

00008f18 R_ARM_RELATIVE *ABS*

0000902c R_ARM_RELATIVE *ABS*

00009018 R_ARM_GLOB_DAT __cxa_finalize

0000901c R_ARM_GLOB_DAT _ITM_deregisterTMCloneTable

00009020 R_ARM_GLOB_DAT __gmon_start__

00009024 R_ARM_GLOB_DAT _Jv_RegisterClasses

00009028 R_ARM_GLOB_DAT _ITM_registerTMCloneTable

0000900c R_ARM_JUMP_SLOT __cxa_finalize

00009010 R_ARM_JUMP_SLOT __gmon_start__

00009014 R_ARM_JUMP_SLOT func2

-Bsymbolic Example

Page 42: LCU14 311- Advanced Toolchain Usage (parts 1&2)

gcc -O2 -shared -Wl,-Bsymbolic-functions -o liblib.so lib1.o lib2.o

0000052c <func1>:

52c: b508 push {r3, lr}

52e: f000 f803 bl 538 <func2>

532: 3001 adds r0, #1

534: bd08 pop {r3, pc}

536: bf00 nop

00000538 <func2>:

538: 0040 lsls r0, r0, #1

53a: 4770 bx lr

-Bsymbolic Example

Page 43: LCU14 311- Advanced Toolchain Usage (parts 1&2)

DYNAMIC RELOCATION RECORDS

OFFSET TYPE VALUE

00008f14 R_ARM_RELATIVE *ABS*

00008f18 R_ARM_RELATIVE *ABS*

00009028 R_ARM_RELATIVE *ABS*

00009014 R_ARM_GLOB_DAT __cxa_finalize

00009018 R_ARM_GLOB_DAT _ITM_deregisterTMCloneTable

0000901c R_ARM_GLOB_DAT __gmon_start__

00009020 R_ARM_GLOB_DAT _Jv_RegisterClasses

00009024 R_ARM_GLOB_DAT _ITM_registerTMCloneTable

0000900c R_ARM_JUMP_SLOT __cxa_finalize

00009010 R_ARM_JUMP_SLOT __gmon_start__

-Bsymbolic Example

Page 44: LCU14 311- Advanced Toolchain Usage (parts 1&2)

ld is capable of dropping any unused input sections from the final link. It does this by following references between sections starting from an entry point; un-referenced sections are removed (garbage collected).
● Compile with -ffunction-sections and -fdata-sections
● Link with --gc-sections
● Only helps on projects that contain some redundancy

Section Garbage Collection
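A sketch of the two-flag recipe with hypothetical files: each function gets its own section, and the linker drops the section for the never-called unused():

```shell
cat > funcs.c <<'EOF'
int used(void)   { return 1; }
int unused(void) { return 2; }   /* never referenced -> collectable */
EOF
cat > main.c <<'EOF'
int used(void);
int main(void) { return used() - 1; }
EOF
# One section per function/datum so ld can discard at fine granularity.
gcc -O2 -ffunction-sections -fdata-sections -c funcs.c main.c
gcc -Wl,--gc-sections -o gc_demo main.o funcs.o
# unused() should be gone from the final binary's symbol table.
nm gc_demo
```

Adding -Wl,--print-gc-sections to the link shows exactly which sections were discarded.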

Page 45: LCU14 311- Advanced Toolchain Usage (parts 1&2)

Dynamic objects contain a hash table to map symbol names to addresses. The GNU hash feature implemented in ld and glibc performs considerably better than the standard ELF hash:
● Fast hash function with good collision avoidance
● Bloom filter to quickly check for a symbol's presence in the hash
● Symbols sorted for cache locality

Creation of a GNU hash section can be enabled by passing --hash-style=gnu or --hash-style=both to ld. The Android dynamic linker does not currently support GNU hash sections!

GNU Symbol Hash

Page 46: LCU14 311- Advanced Toolchain Usage (parts 1&2)

More about Linaro Connect: connect.linaro.org
Linaro members: www.linaro.org/members
More about Linaro: www.linaro.org/about/