JASM: A Java Library for the Generation and Scheduling of PTX Assembly


JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Christos Kartsaklis

christos.kartsaklis@ichec.ie

ICHEC

Purpose

• NVIDIA GPUs
– Many mutually conflicting parameters affect performance.
– Some are not nvcc-tunable.

• JASM
– Similar to a compiler back-end, but itself programmable.
– Different constructs to generate variants of the same kernel.
– Explore the optimisation strategy space faster.
– For CUDA programmers and instruction bottlenecks.


Structure

1. Introduction
  1. Features absent from nvcc
  2. Code compilation
  3. JASM

2. Library Features
  1. Dependencies
  2. Aliasing
  3. Predication
  4. Snippets
  5. Reverse compilation

3. Summary


Features absent from nvcc – 1/3

• What is PTX?
– A virtual machine ISA for NVIDIA’s GPUs.
– Generated by nvcc (-ptx) and compiled by ptxas.

• nvcc limitation
– Cannot write inlined PTX in CUDA C/C++.
– Some extra #pragmas would be needed.


Features absent from nvcc – 2/3

• Predication
– Not exposed; we would like to have:
  predicate p = ...;          // some condition
  #pragma predicate using p
  if (p) { ...

• Address space derivation
– nvcc must be able to determine what space pointers refer to; cannot do:
  *(__shared__ float*) someint = ...
  double d = *(__constant__ double*) foo;


Features absent from nvcc – 3/3

• Latency Hints
– Hard for the compiler to determine (non-)coalesced accesses; ideally:
  #pragma !coalesced foo = bar[i];

• Reflection
– __device__ float foo(...) { ... }
  #pragma N=registersUsedBy(foo);


Code compilation – 1/2

• Instruction Generation
– The compiler does it for you:
1. High-level code → intermediary form (I/F).
2. Transform the I/F.
3. Generate machine code from the transformed I/F.
– How good is the generated code?
  • Need to manually inspect it.


Code compilation – 2/2

• Instruction Scheduling
– A core part of any compiler.
  • Determines the order in which instructions will execute.
– Purpose
  • Correctness, latency hiding and ILP.
– Problems
  • Hard to steer from a high-level language.
  • The compiler often generates its own code.
  • Only #pragma directives & compiler options are available as controls.


JASM – 1/3

• Dedicated tools, such as BAGEL
– The user selects instructions and BAGEL schedules them.
– The generating code is written in C using the BAGEL library.
– Uses the notion of “abstract instructions”.

• JASM
– Similar philosophy; enhanced functionality.

• Focus:
– Reflection.
– User-programmable instruction scheduling.


JASM – 2/3

• Basic Block (BB)
– A sequence of instructions that does not change the flow of execution.

• Control Flow Graph (CFG)
– A directed graph of BBs whose edges indicate changes in execution flow.

• Instruction Stream
– The order of instructions in memory.
– Each instruction is “notionally” part of a BB.
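The three concepts above can be sketched as plain Java data structures (an illustrative sketch only; these are not JASM’s actual classes):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a BasicBlock holds straight-line instructions, a Cfg
// links blocks with edges that mark changes in execution flow, and the
// instruction stream is the in-memory order across all blocks.
class BasicBlock {
    final String name;
    final List<String> instructions = new ArrayList<>();
    BasicBlock(String name) { this.name = name; }
    void append(String insn) { instructions.add(insn); }
}

class Cfg {
    // Adjacency list; insertion order doubles as the memory-layout order.
    final Map<BasicBlock, List<BasicBlock>> edges = new LinkedHashMap<>();
    void addBlock(BasicBlock bb) { edges.putIfAbsent(bb, new ArrayList<>()); }
    void branch(BasicBlock from, BasicBlock to) {
        addBlock(from);
        addBlock(to);
        edges.get(from).add(to);                 // an edge = a change of flow
    }
    // The instruction stream: every block's instructions in layout order.
    List<String> stream() {
        List<String> out = new ArrayList<>();
        for (BasicBlock bb : edges.keySet()) out.addAll(bb.instructions);
        return out;
    }
}
```

Because each instruction is notionally part of exactly one block, the stream is just the concatenation of the blocks in layout order.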


JASM – 3/3

• Examples
– Append an instruction to a basic block:
  lBB.append(LDSHAREDS32, lArg1, 4, lArg2);
– Branching:
  lSource.branchTo(lTargetBB, false);
– Reorder:
  lBB = lBB.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
– Predicate:
  lBB = lBB.predicate(lP, lRegisterFile);
– Obtain macro:
  SnippetDescriptor lSD = CToHLA.obtain("x*x / (y-z)", ...);
  BasicBlock lNewBB = lSD.instantiate(lArg1, ...);


Structure

1. Introduction
  1. Features absent from nvcc
  2. Code compilation
  3. JASM

2. Library Features
  1. Dependencies
  2. Aliasing
  3. Predication
  4. Snippets
  5. Reverse compilation

3. Summary


Dependencies – 1/2

• All dependencies contribute to the final instruction stream.
– What is the ideal layout?

• What complicates the compiler
– Not enough information to distinguish between true and false dependencies.
– Variable-latency instructions.
  • E.g. coalesced vs non-coalesced accesses.


Dependencies – 2/2

• JASM determines instruction order based on:– Dependency Analysis (DA) modules & Cost Function.

• Full space exploration – no heuristics:– DAs constrain instructions’ motion in the stream.– Cost function estimates execution time for any

stream.– Scheduling done by external constraints solver.

• Only applicable to basic blocks.
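A minimal sketch of this scheme in Java, with a brute-force search standing in for JASM’s external constraints solver (all names here are hypothetical): the dependency map prunes illegal permutations of a block, and the cost function picks the cheapest legal order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Illustrative full-space scheduler: enumerate every instruction order that
// respects the dependency constraints, keep the one the cost function
// scores lowest. Exponential, hence "only applicable to basic blocks".
class Scheduler {
    // deps.get(i) lists instruction indices that must precede instruction i.
    static List<Integer> bestOrder(int n, Map<Integer, List<Integer>> deps,
                                   ToIntFunction<List<Integer>> cost) {
        List<Integer> best = null;
        int bestCost = Integer.MAX_VALUE;
        for (List<Integer> perm : permutations(n)) {
            if (!respects(perm, deps)) continue;   // DA modules constrain motion
            int c = cost.applyAsInt(perm);         // estimated execution time
            if (c < bestCost) { bestCost = c; best = perm; }
        }
        return best;
    }
    static boolean respects(List<Integer> perm, Map<Integer, List<Integer>> deps) {
        for (int pos = 0; pos < perm.size(); pos++)
            for (int pred : deps.getOrDefault(perm.get(pos), List.of()))
                if (perm.indexOf(pred) > pos) return false;
        return true;
    }
    static List<List<Integer>> permutations(int n) {
        List<List<Integer>> out = new ArrayList<>();
        permute(new ArrayList<>(), n, out);
        return out;
    }
    static void permute(List<Integer> acc, int n, List<List<Integer>> out) {
        if (acc.size() == n) { out.add(new ArrayList<>(acc)); return; }
        for (int i = 0; i < n; i++)
            if (!acc.contains(i)) {
                acc.add(i);
                permute(acc, n, out);
                acc.remove(acc.size() - 1);
            }
    }
}
```

JASM delegates this search to a constraints solver rather than enumerating permutations, but the division of labour is the same: DAs supply the ordering constraints, the cost function supplies the objective.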


Aliasing – 1/3

• PTX is not the final thing.
– It is further optimised by ptxas before machine code generation.

• We want to specify exactly what is and what is not aliased.
– There is no #pragma aliased / disjoint in PTX.

• Goal:
– Simplify the declaration of aliasing/disjoint accesses.
– Handle all memory spaces.


Aliasing – 2/3

• JASM Addressable Memory Regions (AMRs)• An n-ary tree of AMRs where:

– Root nodes represent spaces (e.g. shared, global)– Each node is a memory region and a sub-region of its

parent’s.– Siblings are disjoint regions, collectively making their

parent’s

• Instructions are associated with AMRs.– AMRs predefined for CUDA memory spaces.
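The AMR tree can be sketched as follows (hypothetical Java for illustration, not the JASM API). Because siblings are disjoint by construction, two regions can alias only along an ancestor/descendant chain:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AMR sketch: an n-ary tree whose roots are memory spaces and
// whose siblings are pairwise-disjoint sub-regions of their parent.
class Amr {
    final String name;
    final Amr parent;
    final List<Amr> children = new ArrayList<>();
    Amr(String name, Amr parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) parent.children.add(this);   // root = memory space
    }
    boolean isAncestorOf(Amr other) {
        for (Amr a = other; a != null; a = a.parent)
            if (a == this) return true;
        return false;
    }
    // Siblings (and their subtrees) are disjoint, so aliasing is only
    // possible when one region contains the other.
    boolean mayAlias(Amr other) {
        return this.isAncestorOf(other) || other.isAncestorOf(this);
    }
}
```

Associating each instruction with an AMR node then answers the aliasing question without any pointer grouping.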


Aliasing – 3/3

• Example – instructions associated with the global memory AMR:
  01: st.global.f32 [%r0], %f5
  02: ld.global.f64 %lf8, [%r1]
  03: st.global.f32 [%r3], %f4
  04: st.global.s32 [%r8], %s1

• Generally:
– No need for pointer grouping (a la “#pragma disjoint” etc.).
– We work with instructions, not pointers.


Predication – 1/3

• Conditional execution
– if (foo==0) bar=5;
– Causes thread divergence.

• Predicated execution
– setp.eq.s32 %p, foo, 0;
  @%p mov.s32 bar, 5;
– A non-divergent cycle-burner.

• There is a fine line between the two.
• Cannot predicate code explicitly in CUDA.


Predication – 2/3

• Explicit
– Can allocate predicates, modify them and predicate instructions.

• Example:
– Direct:
  Register lP = lRegFile.allocate(BT.PRED, "%p1");
  // @%p1 mov.s32 bar, 5
  lBB.append(PMOVS32, lP, lBar, new Immediate(5));
– By reflection:
  Instruction lT = new InstructionImpl(MOVS32, lBar, new Immediate(5));
  lBB.append(lT.descriptor().toPredicatedVersion(), lP, lArg1, lArg2);


Predication – 3/3

• Any basic block can be predicated.
– Including already-predicated instructions.

• Example:
– @%p mov.s32 %d, %s; // if (%p) %d = %s;
– Predicate by %q.

• Output:
– @ %q and.pred %t, %p, %q;
  @!%q mov.pred %t, %q;   // i.e. %t = %q ? (%p && %q) : false;
  @ %t mov.s32 %d, %s;
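A truth-table check of the emitted sequence (illustrative Java, with a hypothetical combined method standing in for the and.pred / mov.pred pair): the doubly-predicated move fires exactly when both %p and %q hold.

```java
// Sketch of the predicate composition above: predicating "@%p insn" by %q
// computes %t = %q ? (%p && %q) : false, and the instruction then tests %t.
class PredicateCompose {
    static boolean combined(boolean p, boolean q) {
        // Mirrors the emitted pair:
        //   @ %q  and.pred %t, %p, %q;   // if q:  t = p && q
        //   @!%q  mov.pred %t, %q;       // if !q: t = q, i.e. false
        return q ? (p && q) : q;
    }
}
```

The composition therefore reduces to a plain conjunction of the old and new predicates.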


Snippets – 1/3

• Problem:
– Certain tasks require knowing in advance how the compiler treats a piece of code.

• Software pipelining
  template<typename T>
  void vmult(T* aDst, T* aSrc1, T* aSrc2) {
    for (int i = 0; i < N; i++)
      aDst[i] = aSrc1[i] * aSrc2[i];
  }
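What software pipelining does to this loop can be sketched in plain Java (a hypothetical vmultPipelined, not JASM output): the loads for iteration i+1 are issued before iteration i’s multiply is consumed, at the cost of two extra live values.

```java
// Illustrative software-pipelining sketch of the vmult loop: overlap the
// next iteration's loads with the current multiply. The extra live values
// (nextA/nextB) are the register-pressure side of the tradeoff.
class Vmult {
    static void vmultPipelined(float[] dst, float[] src1, float[] src2) {
        int n = dst.length;
        if (n == 0) return;
        float a = src1[0], b = src2[0];        // prologue: first loads
        for (int i = 0; i < n - 1; i++) {
            float nextA = src1[i + 1];         // loads for iteration i+1 ...
            float nextB = src2[i + 1];
            dst[i] = a * b;                    // ... overlap this multiply
            a = nextA;
            b = nextB;
        }
        dst[n - 1] = a * b;                    // epilogue: last multiply
    }
}
```

Pipelining deeper than one iteration hides more latency but keeps more values live, which is why register pressure (and hence occupancy) enters the picture on the next slide.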


Snippets – 2/3

• Consider H/W vs S/W instructions
– Tradeoff between pipeline stalls & register pressure.

• However, register pressure:
– Is also a function of the number of thread blocks.

• Ideally
– We want to generate pipelined code for a variable number of dependencies.

• Solution:
– Encapsulate the function in a reflective macro.
  • Reflect instructions & dependencies.


Snippets – 3/3

• Consider the complex “multiplication”
– (a+ib)*(c+id) → (ac+bd)+i(ad+bc)
– 2 stages: 2 muls, then 2 madds.

• Snippet descriptor organisation:
– Group 0:
  • mul.f32 ?x, ?a, ?c;  mul.f32 ?y, ?a, ?d
– Group 1:
  • mad.f32 ?x, ?x, ?b, ?d;  mad.f32 ?y, ?y, ?b, ?c
– The ?* items are parameters.

• Any basic block can be “snippetised”.


Reverse compilation – 1/3

• What to do with legacy CUDA code?
– Option: manually re-write it in JASM.

• No. Any PTX file can be loaded into JASM.
– Not just loaded in:
– It is organised into basic blocks within a Control Flow Graph.
– Malleable from there on, like any other JASM code.


Reverse compilation – 2/3

• Inlined C in JASM
– Idea: obtain a snippet from a C function.
– The opposite of “inlined assembly in C”.
– Why?
  • Reuse what nvcc makes available.
  • Enjoy the benefits that come with snippets.


Reverse compilation – 3/3

• At the moment, we can do the following:
– Code:
  SnippetDescriptor lSD = CToHLA.obtain("(x*y) % 3", "int", "r", "int", "x", "int", "y");
– r is the return parameter; x & y are arguments.
– Equivalent to:
  int r = (x*y) % 3;

• Now we can write:
– if (lSD.numberOfRegisters() > 5) { ...


Summary

• NVIDIA GPUs
– Many mutually conflicting parameters affect performance.
– Some are not nvcc-tunable.

• JASM
– Similar to a compiler back-end, but itself programmable.
– Different constructs to generate variants of the same kernel.
– Explore the optimisation strategy space faster.

• The optimisations are expressed as a function of the code.
