JASM: A Java Library for the Generation and Scheduling of PTX Assembly


JASM: A Java Library for the Generation and Scheduling of PTX Assembly

Christos Kartsaklis

christos.kartsaklis@ichec.ie

ICHEC

Purpose

• NVIDIA GPUs
– Many mutually conflicting parameters affect performance.
– Some are not nvcc-tunable.

• JASM
– Similar to a compiler back-end, but itself programmable.
– Different constructs to generate variants of the same kernel.
– Explore the optimisation strategy space faster.
– For CUDA programmers and instruction bottlenecks.


Structure

1. Introduction
  1. Features absent from nvcc
  2. Code compilation
  3. JASM

2. Library Features
  1. Dependencies
  2. Aliasing
  3. Predication
  4. Snippets
  5. Reverse compilation

3. Summary


Features absent from nvcc – 1/3

• What is PTX?
– A virtual machine ISA for NVIDIA’s GPUs.
– Generated by nvcc (-ptx) and compiled by ptxas.

• nvcc limitation
– Cannot write inlined PTX in CUDA C/C++.
– Some extra #pragmas would be needed.


Features absent from nvcc – 2/3

• Predication
– Not exposed; we would like to have:
  predicate p = ...;          // some condition
  #pragma predicate using p
  if (p) { ...

• Address space derivation
– nvcc must be able to determine what space pointers refer to; cannot do:
  *(__shared__ float*) someint = ...
  double d = *(__constant__ double*) foo;


Features absent from nvcc – 3/3

• Latency Hints
– Hard for the compiler to determine (non-)coalesced accesses; ideally:
  #pragma !coalesced foo = bar[i];

• Reflection
– __device__ float foo(...) { ... }
  #pragma N=registersUsedBy(foo);


Code compilation – 1/2

• Instruction Generation
– The compiler does it for you:
1. High-level code → intermediary form (I/F).
2. Transform the I/F.
3. Generate machine code from the transformed I/F.
– How good is the generated code?
  • Need to manually inspect it.


Code compilation – 2/2

• Instruction Scheduling
– A core part of any compiler.
  • Determines the order in which instructions will execute.
– Purpose
  • Correctness, latency hiding and ILP.
– Problems
  • Hard to steer from a high-level language.
  • The compiler often generates its own code.
  • Only #pragma directives & compiler options are available as controls.


JASM – 1/3

• Dedicated tools, such as BAGEL
– The user selects instructions and BAGEL schedules them.
– The generating code is written in C using the BAGEL library.
– Uses the notion of “abstract instructions”.

• JASM
– Similar philosophy; enhanced functionality.

• Focus:
– Reflection.
– User-programmable instruction scheduling.


JASM – 2/3

• Basic Block (BB)
– A sequence of instructions that does not change the flow of execution.

• Control Flow Graph (CFG)
– A directed graph of BBs whose edges indicate changes in execution flow.

• Instruction Stream
– The order of instructions in memory.
– Each instruction is “notionally” part of a BB.
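The three concepts above can be sketched as plain Java data structures (an illustrative sketch only; these are not JASM’s actual classes):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a BasicBlock holds straight-line instructions, a Cfg
// links blocks with edges that mark changes in execution flow, and the
// instruction stream is the in-memory order across all blocks.
class BasicBlock {
    final String name;
    final List<String> instructions = new ArrayList<>();
    BasicBlock(String name) { this.name = name; }
    void append(String insn) { instructions.add(insn); }
}

class Cfg {
    // Adjacency list; insertion order doubles as the memory-layout order.
    final Map<BasicBlock, List<BasicBlock>> edges = new LinkedHashMap<>();
    void addBlock(BasicBlock bb) { edges.putIfAbsent(bb, new ArrayList<>()); }
    void branch(BasicBlock from, BasicBlock to) {
        addBlock(from);
        addBlock(to);
        edges.get(from).add(to);                 // an edge = a change of flow
    }
    // The instruction stream: every block's instructions in layout order.
    List<String> stream() {
        List<String> out = new ArrayList<>();
        for (BasicBlock bb : edges.keySet()) out.addAll(bb.instructions);
        return out;
    }
}
```

Because each instruction is notionally part of exactly one block, the stream is just the concatenation of the blocks in layout order.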


JASM – 3/3

• Examples
– Append an instruction to a basic block:
  lBB.append(LDSHAREDS32, lArg1, 4, lArg2);
– Branching:
  lSource.branchTo(lTargetBB, false);
– Reorder:
  lBB = lBB.reorder(BASIC_COST_FUNCTION, DA_PTX_DATA_REGISTER_HAZARDS, DA_ALIASING);
– Predicate:
  lBB = lBB.predicate(lP, lRegisterFile);
– Obtain macro:
  SnippetDescriptor lSD = CToHLA.obtain("x*x / (y-z)", ...);
  BasicBlock lNewBB = lSD.instantiate(lArg1, ...);


Structure

1. Introduction
  1. Features absent from nvcc
  2. Code compilation
  3. JASM

2. Library Features
  1. Dependencies
  2. Aliasing
  3. Predication
  4. Snippets
  5. Reverse compilation

3. Summary


Dependencies – 1/2

• All dependencies contribute to the final instruction stream.
– What is the ideal layout?

• What complicates the compiler
– Not enough information to distinguish between true and false dependencies.
– Variable-latency instructions.
  • E.g. coalesced vs non-coalesced accesses.


Dependencies – 2/2

• JASM determines instruction order based on:– Dependency Analysis (DA) modules & Cost Function.

• Full space exploration – no heuristics:– DAs constrain instructions’ motion in the stream.– Cost function estimates execution time for any

stream.– Scheduling done by external constraints solver.

• Only applicable to basic blocks.
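A minimal sketch of this scheme in Java, with a brute-force search standing in for JASM’s external constraints solver (all names here are hypothetical): the dependency map prunes illegal permutations of a block, and the cost function picks the cheapest legal order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Illustrative full-space scheduler: enumerate every instruction order that
// respects the dependency constraints, keep the one the cost function
// scores lowest. Exponential, hence "only applicable to basic blocks".
class Scheduler {
    // deps.get(i) lists instruction indices that must precede instruction i.
    static List<Integer> bestOrder(int n, Map<Integer, List<Integer>> deps,
                                   ToIntFunction<List<Integer>> cost) {
        List<Integer> best = null;
        int bestCost = Integer.MAX_VALUE;
        for (List<Integer> perm : permutations(n)) {
            if (!respects(perm, deps)) continue;   // DA modules constrain motion
            int c = cost.applyAsInt(perm);         // estimated execution time
            if (c < bestCost) { bestCost = c; best = perm; }
        }
        return best;
    }
    static boolean respects(List<Integer> perm, Map<Integer, List<Integer>> deps) {
        for (int pos = 0; pos < perm.size(); pos++)
            for (int pred : deps.getOrDefault(perm.get(pos), List.of()))
                if (perm.indexOf(pred) > pos) return false;
        return true;
    }
    static List<List<Integer>> permutations(int n) {
        List<List<Integer>> out = new ArrayList<>();
        permute(new ArrayList<>(), n, out);
        return out;
    }
    static void permute(List<Integer> acc, int n, List<List<Integer>> out) {
        if (acc.size() == n) { out.add(new ArrayList<>(acc)); return; }
        for (int i = 0; i < n; i++)
            if (!acc.contains(i)) {
                acc.add(i);
                permute(acc, n, out);
                acc.remove(acc.size() - 1);
            }
    }
}
```

JASM delegates this search to a constraints solver rather than enumerating permutations, but the division of labour is the same: DAs supply the ordering constraints, the cost function supplies the objective.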


Aliasing – 1/3

• PTX is not the final thing.
– It is further optimised by ptxas before machine code generation.

• We want to specify exactly what is and what is not aliased.
– There is no #pragma aliased / disjoint in PTX.

• Goal:
– Simplify the declaration of aliasing/disjoint accesses.
– Handle all memory spaces.


Aliasing – 2/3

• JASM Addressable Memory Regions (AMRs)• An n-ary tree of AMRs where:

– Root nodes represent spaces (e.g. shared, global)– Each node is a memory region and a sub-region of its

parent’s.– Siblings are disjoint regions, collectively making their

parent’s

• Instructions are associated with AMRs.– AMRs predefined for CUDA memory spaces.
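The AMR tree can be sketched as follows (hypothetical Java for illustration, not the JASM API). Because siblings are disjoint by construction, two regions can alias only along an ancestor/descendant chain:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical AMR sketch: an n-ary tree whose roots are memory spaces and
// whose siblings are pairwise-disjoint sub-regions of their parent.
class Amr {
    final String name;
    final Amr parent;
    final List<Amr> children = new ArrayList<>();
    Amr(String name, Amr parent) {
        this.name = name;
        this.parent = parent;
        if (parent != null) parent.children.add(this);   // root = memory space
    }
    boolean isAncestorOf(Amr other) {
        for (Amr a = other; a != null; a = a.parent)
            if (a == this) return true;
        return false;
    }
    // Siblings (and their subtrees) are disjoint, so aliasing is only
    // possible when one region contains the other.
    boolean mayAlias(Amr other) {
        return this.isAncestorOf(other) || other.isAncestorOf(this);
    }
}
```

Associating each instruction with an AMR node then answers the aliasing question without any pointer grouping.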


Aliasing – 3/3

• Example – instructions associated with the global memory AMR:
  01: st.global.f32 [%r0], %f5
  02: ld.global.f64 %lf8, [%r1]
  03: st.global.f32 [%r3], %f4
  04: st.global.s32 [%r8], %s1

• Generally:
– No need for pointer grouping (a la “#pragma disjoint” etc.).
– We work with instructions, not pointers.


Predication – 1/3

• Conditional execution
– if (foo==0) bar=5;
– Causes thread divergence.

• Predicated execution
– setp.eq.s32 %p, foo, 0;
  @%p mov.s32 bar, 5;
– A non-divergent cycle-burner.

• There is a fine line between the two.
• Cannot predicate code explicitly in CUDA.


Predication – 2/3

• Explicit
– Can allocate predicates, modify them and predicate instructions.

• Example:
– Direct:
  Register lP = lRegFile.allocate(BT.PRED, "%p1");
  // @%p1 mov.s32 bar, 5
  lBB.append(PMOVS32, lP, lBar, new Immediate(5));
– By reflection:
  Instruction lT = new InstructionImpl(MOVS32, lBar, new Immediate(5));
  lBB.append(lT.descriptor().toPredicatedVersion(), lP, lArg1, lArg2);


Predication – 3/3

• Any basic block can be predicated.
– Including already-predicated instructions.

• Example:
– @%p mov.s32 %d, %s; // if (%p) %d = %s;
– Predicate by %q.

• Output:
– @ %q and.pred %t, %p, %q;
  @!%q mov.pred %t, %q;   // i.e. %t = %q ? (%p && %q) : false;
  @ %t mov.s32 %d, %s;
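A truth-table check of the emitted sequence (illustrative Java, with a hypothetical combined method standing in for the and.pred / mov.pred pair): the doubly-predicated move fires exactly when both %p and %q hold.

```java
// Sketch of the predicate composition above: predicating "@%p insn" by %q
// computes %t = %q ? (%p && %q) : false, and the instruction then tests %t.
class PredicateCompose {
    static boolean combined(boolean p, boolean q) {
        // Mirrors the emitted pair:
        //   @ %q  and.pred %t, %p, %q;   // if q:  t = p && q
        //   @!%q  mov.pred %t, %q;       // if !q: t = q, i.e. false
        return q ? (p && q) : q;
    }
}
```

The composition therefore reduces to a plain conjunction of the old and new predicates.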


Snippets – 1/3

• Problem:
– Certain tasks require knowing in advance how the compiler treats a piece of code.

• Software pipelining
  template<typename T>
  void vmult(T* aDst, T* aSrc1, T* aSrc2) {
    for (int i = 0; i < N; i++)
      aDst[i] = aSrc1[i] * aSrc2[i];
  }
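What software pipelining does to this loop can be sketched in plain Java (a hypothetical vmultPipelined, not JASM output): the loads for iteration i+1 are issued before iteration i’s multiply is consumed, at the cost of two extra live values.

```java
// Illustrative software-pipelining sketch of the vmult loop: overlap the
// next iteration's loads with the current multiply. The extra live values
// (nextA/nextB) are the register-pressure side of the tradeoff.
class Vmult {
    static void vmultPipelined(float[] dst, float[] src1, float[] src2) {
        int n = dst.length;
        if (n == 0) return;
        float a = src1[0], b = src2[0];        // prologue: first loads
        for (int i = 0; i < n - 1; i++) {
            float nextA = src1[i + 1];         // loads for iteration i+1 ...
            float nextB = src2[i + 1];
            dst[i] = a * b;                    // ... overlap this multiply
            a = nextA;
            b = nextB;
        }
        dst[n - 1] = a * b;                    // epilogue: last multiply
    }
}
```

Pipelining deeper than one iteration hides more latency but keeps more values live, which is why register pressure (and hence occupancy) enters the picture on the next slide.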


Snippets – 2/3

• Consider H/W vs S/W instructions
– Tradeoff between pipeline stalls & register pressure.

• However, register pressure:
– Is also a function of the number of thread blocks.

• Ideally
– We want to generate pipelined code for a variable number of dependencies.

• Solution:
– Encapsulate the function in a reflective macro.
  • Reflect instructions & dependencies.


Snippets – 3/3

• Consider the complex “multiplication”
– (a+ib)*(c+id) → (ac+bd)+i(ad+bc)
– 2 stages: 2 muls, then 2 madds.

• Snippet descriptor organisation:
– Group 0:
  • mul.f32 ?x, ?a, ?c;  mul.f32 ?y, ?a, ?d
– Group 1:
  • mad.f32 ?x, ?x, ?b, ?d;  mad.f32 ?y, ?y, ?b, ?c
– The ?* items are parameters.

• Any basic block can be “snippetised”.


Reverse compilation – 1/3

• What to do with legacy CUDA code?
– Option: manually re-write it in JASM.

• No. Any PTX file can be loaded into JASM.
– Not just loaded in:
– It is organised into basic blocks within a Control Flow Graph.
– Malleable from there on, like any other JASM code.


Reverse compilation – 2/3

• Inlined C in JASM
– Idea: obtain a snippet from a C function.
– The opposite of “inlined assembly in C”.
– Why?
  • Reuse what nvcc makes available.
  • Enjoy the benefits that come with snippets.


Reverse compilation – 3/3

• At the moment, we can do the following:
– Code:
  SnippetDescriptor lSD = CToHLA.obtain("(x*y) % 3", "int", "r", "int", "x", "int", "y");
– r is the return parameter; x & y are arguments.
– Equivalent to:
  int r = (x*y) % 3;

• Now we can write:
– if (lSD.numberOfRegisters() > 5) { ...


Summary

• NVIDIA GPUs
– Many mutually conflicting parameters affect performance.
– Some are not nvcc-tunable.

• JASM
– Similar to a compiler back-end, but itself programmable.
– Different constructs to generate variants of the same kernel.
– Explore the optimisation strategy space faster.

• The optimisations are expressed as a function of the code.
