27
JASM: A Java Library for the Generation and Scheduling of PTX Assembly Christos Kartsaklis [email protected] ICHEC

JASM: A Java Library for the Generation and Scheduling of PTX Assembly

  • Upload
    alton

  • View
    29

  • Download
    0

Embed Size (px)

DESCRIPTION

JASM: A Java Library for the Generation and Scheduling of PTX Assembly. Christos Kartsaklis [email protected] ICHEC. Purpose. NVIDIA GPUs Many self-conflicting parameters affect performance. Some not nvcc -tunable . JASM Similar to a compiler back-end but programmable itself. - PowerPoint PPT Presentation

Citation preview

Page 1: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

JASM: A Java Library for the Generation and Scheduling of

PTX AssemblyChristos Kartsaklis

[email protected]

ICHEC

Page 2: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Purpose

• NVIDIA GPUs– Many self-conflicting parameters affect performance.– Some not nvcc-tunable.

• JASM– Similar to a compiler back-end but programmable

itself.– Different constructs to generate variants of the same

kernel.– Explore the optimisations strategy space faster.– For CUDA programmers and instruction bottlenecks.

2

Page 3: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Structure

1. Introduction1. Features absent from nvcc 2. Code compilation3. JASM

2. Library Features1. Dependencies2. Aliasing3. Predication4. Snippets5. Reverse compilation

3. Summary

3

Page 4: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Features absent from nvcc – 1/3

• What is PTX?– Virtual machine ISA for NVIDIA’s GPUs.– Generated by nvcc (-ptxas) and compiled by ptxas.

• nvcc limitation– Cannot write inlined PTX in CUDA C/C++.– Some extra #pragmas needed.

4

Page 5: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Features absent from nvcc – 2/3

• Predication– Not exposed; would like to have:– predicate p = ...; // some condition– #pragma predicate using p– if (p) { ...

• Address space derivation– nvcc must be able to determine what space pointers

refer to; cannot do:– *(__shared__ float* someint) = ...– double d = *(__constant__ double*) foo;

5

Page 6: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Features absent from nvcc – 3/3

• Latency Hints– Hard for the compiler to determine (non)coalesced

accesses; ideally:– #pragma !coalesced foo = bar[i];

• Reflection– __device__ float foo(...) { ... }#pragma N=registersUsedBy(foo);

6

Page 7: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Code compilation – 1/2

• Instruction Generation– Compiler does it for you

1. High-level code intermediary form (I/F).

2. Transform the I/F.

3. Generate machine code from the transformed I/F.

– How good the generated code is?• Need to manually inspect it.

7

Page 8: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Code compilation – 2/2

• Instruction Scheduling– Core part of any compiler.

• Determines the order that instructions will execute in.

– Purpose• Correctness, latency hiding and ILP.

– Problems• Hard to steer from a high-level language.• Compiler often generates its own code.• #pragma directives & compiler options.

8

Page 9: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

JASM – 1/3

• Dedicated tools, such as BAGEL– User selects instructions and BAGEL schedules.– Generating code written in C using the BAGEL lib.– Uses the notion of “abstract instructions”.

• JASM– Similar philosophy – enhanced functionality.

• Focus:– Reflection.– User-Programmable Instruction Scheduling.

9

Page 10: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

JASM – 2/3

• Basic Block (BB)– A bunch of instructions that do not change the flow of

execution.

• Control Flow Graph (CFG)– A directed graph of BBs where edges indicate

changes in execution flow.

• Instructions Stream– The order of instructions in memory.– Each instruction is “notionally” part of a BB.

10

Page 11: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

JASM – 3/3

• Examples– Append an instruction to a basic block:

• lBB.append(LDSHAREDS32, lArg1, 4, lArg2);

– Branching:• lSource.branchTo(lTargetBB, false);

– Reorder:• lBB = lBB.reorder(BASIC_COST_FUNCTION,

DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);

– Predicate:• lBB = lBB.predicate(lP, lRegisterFile);

– Obtain macro:• SnippetDescriptor lSD = CToHLA.obtain(“x*x / (y-z)”,...);

BasicBlock lNewBB = lSD.instantiate(lArg1, ...);

11

Page 12: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Structure

1. Introduction1. Features absent from nvcc 2. Code compilation3. JASM

2. Library Features1. Dependencies2. Aliasing3. Predication4. Snippets5. Reverse compilation

3. Summary

12

Page 13: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Dependencies – 1/2

• All contribute to the final instructions stream.– What is the ideal layout?

• What complicates the compiler– Not enough information to distinguish between true

and false dependencies.– Variable-latency instructions

• E.g. coalesced vs non-coalesced accesses.

13

Page 14: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Dependencies – 2/2

• JASM determines instruction order based on:– Dependency Analysis (DA) modules & Cost Function.

• Full space exploration – no heuristics:– DAs constrain instructions’ motion in the stream.– Cost function estimates execution time for any

stream.– Scheduling done by external constraints solver.

• Only applicable to basic blocks.

14

Page 15: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Aliasing – 1/3

• PTX is not the final thing.– Further optimised by ptxas before machine code

generation.

• Want to specify exactly what is and what’s not aliased.– No #pragma aliased / disjoint in PTX.

• Goal:– Simplify declaration of aliasing/disjoint accesses.– Handle all memory spaces.

15

Page 16: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Aliasing – 2/3

• JASM Addressable Memory Regions (AMRs)• An n-ary tree of AMRs where:

– Root nodes represent spaces (e.g. shared, global)– Each node is a memory region and a sub-region of its

parent’s.– Siblings are disjoint regions, collectively making their

parent’s

• Instructions are associated with AMRs.– AMRs predefined for CUDA memory spaces.

16

Page 17: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Aliasing – 3/3

• Example

• Generally:– No need for pointer grouping (a la “#pragma disjoint”

etc.)– We work with instructions, not pointers.

global mem AMR 01: st.global.f32 [%r0], %f5

02: ld.global.f64 [%r1], %lf8

03: st.global.f32 [%r3], %f4

04: st.global.s32 [%r8], %s1

17

Page 18: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Predication – 1/3

• Conditional execution– if (foo==0) bar=5;– Thread divergence.

• Predicated execution– setp.eq.s32 %p, foo, 0;@%p mov.s32 bar, 5;

– Non-divergent cycle-burner.

• Fine line between the two.• Cannot predicate code explicitly in CUDA.

18

Page 19: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Predication – 2/3

• Explicit– Can allocate predicates, modify them and predicate

instructions.

• Example:– Direct:

Register lP = lRegFile.allocate(BT.PRED, “%p1”);// @%p1 mov.s32 bar, 5lBB.append(PMOVS32, lP, lBar, new Immediate(5));

– By reflection:Instruction lT = new InstructionImpl( MOVS32, lBar, new Immediate(5));lBB.append(lT.descriptor().toPredicatedVersion(), lP, lArg1, lArg2);

19

Page 20: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Predication – 3/3

• Any basic block can be predicated.– Including already-predicated instructions.

• Example:– %p mov.s32 %d, %s; // if (%p) %d = %s;– Predicate by %q

• Output:– @ %q and.pred %t, %p, %q;@!%q mov.pred %t, %q; // i.e. %t = %q ? (%p && %q) : false;@ %t mov.s32 %d, %s;

20

Page 21: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Snippets – 1/3

• Problem:– Certain tasks require knowing in advance how the

compiler treats a piece of code.

• Software pipelining– template<typename T>vmult( T* aDst, T* aSrc1, T* aSrc2) { for(int i=0 ; i<N ; i++) aDst[i] = aSrc2[i] * aSrc2[i];}

21

Page 22: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Snippets – 2/3

• Consider H/W vs S/W instructions– Tradeoff between pipeline stall & register pressure.

• However, register pressure:– Is also a function of the # of thread blocks.

• Ideally– Want to generate pipelined code for a variable

number of dependencies.

• Solution:– Encapsulate function in a reflective macro

• Reflect instructions & dependencies.

22

Page 23: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Snippets – 3/3

• Consider the complex “multiplication”– (a+ib)*(c+id) (ac+bd)+i(ad+bc)– 2 stages: 2 muls, then 2 madds.

• Snippet descriptor organisation:– Group 0:

• mul.f32 ?x, ?a, ?c; mul.f32 ?y, ?a, ?d

– Group 1:• mad.f32 ?x, ?x, ?b, ?d; mad.f32 ?y, ?y, ?b, ?c

– ?* items are parameters.

• Any basic block can be “snippetised”.

23

Page 24: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Reverse compilation – 1/3

• What to do with legacy CUDA code?– Option: Manually re-write in JASM.

• No. Any PTX file can be loaded in JASM.– Not just loaded in. – Organised in basic blocks within a Control Flow

Graph.– Malleable from thereon like every JASM code.

24

Page 25: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Reverse compilation – 2/3

• Inlined C in JASM– Idea: obtain a snippet from a C function.– Opposite of “inlined assembly in C”.– Why?

• Reuse what nvcc makes available.• Enjoy the benefits that come with snippets.

25

Page 26: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Reverse compilation – 3/3

• At the moment, we can do the following:– Code:

• SnippetDescriptor lSD= CToHLA.obtain( “(x*y) % 3”, “int”, “r”, “int”, “x”, “int”, “y”);

– r is the return parameter; x & y are arguments.– Equivalent to:

• int r = (x*y) % 3;

• Now we can write:– if(lSD.numberOfRegisters() > 5) { ...

26

Page 27: JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Summary

• NVIDIA GPUs– Many self-conflicting parameters affect performance.– Some not nvcc-tunable.

• JASM– Similar to a compiler back-end but programmable

itself.– Different constructs to generate variants of the same

kernel.– Explore the optimisations strategy space faster.

• The optimisations are expressed as a function of the code.

27