Making Sequential Consistency Practical in Titanium

Amir Kamil, Jimmy Su, and Katherine Yelick
Titanium Group

http://titanium.cs.berkeley.edu

U.C. Berkeley
November 15, 2005

SC|05: Practical Sequential Consistency, Amir Kamil

Reordering in Sequential Programs

• Two accesses can be reordered as long as the reordering does not violate a local dependency.

Initially, flag = data = 0

Original code:
  data = 1
  flag = 1

Reordered code:
  flag = 1
  data = 1

• In both orderings, the end result is {data == flag == 1}.

Reordering in Parallel Programs

• In parallel programs, a reordering can change the semantics even if no local dependencies exist.

Initially, flag = data = 0

Original code:
  T1: data = 1; flag = 1
  T2: f = flag; d = data

Reordered code:
  T1: flag = 1; data = 1
  T2: f = flag; d = data

• {f == 1, d == 0} is a possible result in the reordered code, but not in the original code.
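The claim above can be checked by brute force. The following Java sketch (not Titanium code; the class name `ReorderDemo` and the string encoding of operations are assumptions made for illustration) enumerates every sequentially consistent interleaving of T1 and T2 and collects the observable (f, d) pairs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Enumerates all interleavings of two threads and records the final (f, d) values. */
final class ReorderDemo {
    /** Operations are encoded as strings: "data=1", "flag=1", "f=flag", "d=data". */
    static Set<String> outcomes(List<String> t1, List<String> t2) {
        Set<String> results = new TreeSet<>();
        interleave(t1, 0, t2, 0, new ArrayList<>(), results);
        return results;
    }

    // Recursively picks the next operation from either thread, preserving each thread's order.
    private static void interleave(List<String> t1, int i, List<String> t2, int j,
                                   List<String> trace, Set<String> results) {
        if (i == t1.size() && j == t2.size()) {
            results.add(run(trace));
            return;
        }
        if (i < t1.size()) {
            trace.add(t1.get(i));
            interleave(t1, i + 1, t2, j, trace, results);
            trace.remove(trace.size() - 1);
        }
        if (j < t2.size()) {
            trace.add(t2.get(j));
            interleave(t1, i, t2, j + 1, trace, results);
            trace.remove(trace.size() - 1);
        }
    }

    // Executes one interleaving against a single shared memory, starting from flag = data = 0.
    private static String run(List<String> trace) {
        int data = 0, flag = 0, f = 0, d = 0;
        for (String op : trace) {
            switch (op) {
                case "data=1": data = 1; break;
                case "flag=1": flag = 1; break;
                case "f=flag": f = flag; break;
                case "d=data": d = data; break;
                default: throw new IllegalArgumentException(op);
            }
        }
        return "f=" + f + ",d=" + d;
    }

    public static void main(String[] args) {
        List<String> t2 = List.of("f=flag", "d=data");
        System.out.println("original:  " + outcomes(List.of("data=1", "flag=1"), t2));
        System.out.println("reordered: " + outcomes(List.of("flag=1", "data=1"), t2));
    }
}
```

With T1 in its original order, the outcome f=1,d=0 never appears; once T1's two writes are swapped, it does.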

Memory Models

• In relaxed consistency, reordering is allowed if no local dependencies or synchronization operations are violated
• In sequential consistency, a reordering is illegal if it can be observed by another thread
• Titanium, Java, UPC, and many other languages do not provide sequential consistency due to the (perceived) cost of enforcing it


Software and Hardware Reordering

• Compiler can reorder accesses as part of an optimization
  • Example: copy propagation
• Logical fences inserted where reordering is illegal
  • Optimizations respect these fences
• Hardware can reorder accesses
  • Examples: out of order execution, remote accesses
• Fence instructions inserted into generated code
  • A fence waits until all prior memory operations have completed
  • Can cost a complete round trip time due to remote accesses

Conflicts

• Reordering of an access is observable only if it conflicts with some other access:
  • The accesses can be to the same memory location
  • At least one access is a write
  • The accesses can run concurrently

  T1: data = 1; flag = 1
  T2: f = flag; d = data

  Conflicts: data = 1 (T1) with d = data (T2); flag = 1 (T1) with f = flag (T2)

• Fences only need to be inserted around accesses that conflict
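The three conditions above amount to a simple predicate over pairs of accesses. The sketch below is a hypothetical Java rendering: the `Access` class, the method names, and the single `mayRunConcurrently` flag (standing in for the result of a concurrency analysis) are all assumptions, not part of the Titanium compiler.

```java
import java.util.ArrayList;
import java.util.List;

/** Checks the three conflict conditions for pairs of memory accesses. */
final class ConflictDemo {
    /** One memory access: issuing thread, location touched, and whether it is a write. */
    static final class Access {
        final String thread;
        final String location;
        final boolean isWrite;

        Access(String thread, String location, boolean isWrite) {
            this.thread = thread;
            this.location = location;
            this.isWrite = isWrite;
        }

        @Override
        public String toString() {
            return thread + ":" + (isWrite ? "write " : "read ") + location;
        }
    }

    /** Same location, at least one write, and the two accesses may run concurrently. */
    static boolean conflicts(Access a, Access b, boolean mayRunConcurrently) {
        return a.location.equals(b.location)
                && (a.isWrite || b.isWrite)
                && mayRunConcurrently;
    }

    /** Lists conflicting pairs among accesses issued by different threads. */
    static List<String> conflictingPairs(List<Access> accesses, boolean mayRunConcurrently) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < accesses.size(); i++) {
            for (int j = i + 1; j < accesses.size(); j++) {
                Access a = accesses.get(i), b = accesses.get(j);
                if (!a.thread.equals(b.thread) && conflicts(a, b, mayRunConcurrently)) {
                    pairs.add(a + " <-> " + b);
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<Access> accesses = List.of(
                new Access("T1", "data", true), new Access("T1", "flag", true),
                new Access("T2", "flag", false), new Access("T2", "data", false));
        conflictingPairs(accesses, true).forEach(System.out::println);
    }
}
```

For the flag/data example this reports two conflicting pairs, one on `data` and one on `flag`. If the accesses cannot run concurrently, it reports none and no fences are needed.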

Sequential Consistency in Titanium

• Minimize number of fences: allow same optimizations as relaxed model
• Concurrency analysis identifies concurrent accesses
  • Relies on Titanium’s textual barriers and single-valued expressions
• Alias analysis identifies accesses to the same location
  • Relies on SPMD nature of Titanium


Barrier Alignment

• Many parallel languages make no attempt to ensure that barriers line up
  • Example code that is legal but will deadlock:

  if (Ti.thisProc() % 2 == 0)
    Ti.barrier(); // even ID threads
  else
    ; // odd ID threads


Structural Correctness

• Aiken and Gay introduced structural correctness (POPL’98)
  • Ensures that every thread executes the same number of barriers
  • Example of structurally correct code:

  if (Ti.thisProc() % 2 == 0)
    Ti.barrier(); // even ID threads
  else
    Ti.barrier(); // odd ID threads


Textual Barrier Alignment

• Titanium has textual barriers: all threads must execute the same textual sequence of barriers
  • Stronger guarantee than structural correctness; this example is illegal:

  if (Ti.thisProc() % 2 == 0)
    Ti.barrier(); // even ID threads
  else
    Ti.barrier(); // odd ID threads

• Single-valued expressions used to enforce textual barriers


Single-Valued Expressions

• A single-valued expression has the same value on all threads when evaluated
  • Example: Ti.numProcs() > 1
• All threads guaranteed to take the same branch of a conditional guarded by a single-valued expression
  • Only single-valued conditionals may have barriers
  • Example of legal barrier use:

  if (Ti.numProcs() > 1)
    Ti.barrier(); // multiple threads
  else
    ; // only one thread total

Concurrency Analysis (I)

• Graph generated from program as follows:
  • Node added for each code segment between barriers and single-valued conditionals
  • Edges added to represent control flow between segments

  // code segment 1
  if ([single])
    // code segment 2
  else
    // code segment 3
  // code segment 4
  Ti.barrier()
  // code segment 5

[Figure: control-flow graph with nodes 1, 2, 3, 4, and 5; a barrier lies between node 4 and node 5]

Concurrency Analysis (II)

• Two accesses can run concurrently if:
  • They are in the same node, or
  • One access’s node is reachable from the other access’s node without hitting a barrier
• Algorithm: remove barrier edges, do DFS

[Figure: control-flow graph with nodes 1, 2, 3, 4, and 5; a barrier lies between node 4 and node 5]

Concurrent Segments
     1  2  3  4  5
  1  X  X  X  X
  2  X  X     X
  3  X     X  X
  4  X  X  X  X
  5              X
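The algorithm above (remove barrier edges, then run a depth-first search) can be sketched in Java as follows. The class `ConcurrencyGraph`, its method names, and the exact edge list in `slideExample()` are assumptions: the edges are inferred from the code segment example (1→2, 1→3, 2→4, 3→4, and a barrier edge 4→5), not taken from the Titanium compiler.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Segment graph in which barrier edges are dropped before reachability is computed. */
final class ConcurrencyGraph {
    // Successors over non-barrier edges only.
    private final Map<Integer, List<Integer>> succ = new TreeMap<>();

    /** Adds a control-flow edge; an edge that crosses a barrier registers its nodes but is not stored. */
    void addEdge(int from, int to, boolean isBarrierEdge) {
        succ.computeIfAbsent(from, k -> new ArrayList<>());
        succ.computeIfAbsent(to, k -> new ArrayList<>());
        if (!isBarrierEdge) {
            succ.get(from).add(to);
        }
    }

    /** Iterative DFS; the result always contains the start node itself. */
    Set<Integer> reachable(int start) {
        Set<Integer> seen = new TreeSet<>();
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(start);
        while (!stack.isEmpty()) {
            int node = stack.pop();
            if (seen.add(node)) {
                for (int next : succ.getOrDefault(node, List.of())) {
                    stack.push(next);
                }
            }
        }
        return seen;
    }

    /** Same node, or one node reachable from the other without crossing a barrier. */
    boolean mayRunConcurrently(int a, int b) {
        return reachable(a).contains(b) || reachable(b).contains(a);
    }

    /** Assumed edges for the five-segment example. */
    static ConcurrencyGraph slideExample() {
        ConcurrencyGraph g = new ConcurrencyGraph();
        g.addEdge(1, 2, false);
        g.addEdge(1, 3, false);
        g.addEdge(2, 4, false);
        g.addEdge(3, 4, false);
        g.addEdge(4, 5, true); // Ti.barrier() separates segments 4 and 5
        return g;
    }

    public static void main(String[] args) {
        ConcurrencyGraph g = slideExample();
        for (int a = 1; a <= 5; a++) {
            StringBuilder row = new StringBuilder(a + ":");
            for (int b = 1; b <= 5; b++) {
                row.append(g.mayRunConcurrently(a, b) ? " X" : " .");
            }
            System.out.println(row);
        }
    }
}
```

Under these assumed edges, segments 2 and 3 are not concurrent with each other, because the single-valued conditional sends all threads down the same branch. Segment 5 is concurrent only with itself.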

Alias Analysis

• Allocation sites correspond to abstract locations (a-locs)
• All explicit and implicit program variables have points-to sets
• A-locs are typed and have points-to sets for each field of the corresponding type
  • Arrays have a single points-to set for all indices
• Analysis is flow- and context-insensitive
  • Experimental call-site sensitive version: doesn’t seem to help much


Thread-Aware Alias Analysis

• Two types of abstract locations: local and remote
  • Local locations reside in local thread’s memory
  • Remote locations reside on another thread
• Exploits SPMD property
  • Results are a summary over all threads
  • Independent of the number of threads at runtime


Alias Analysis: Allocation

• Creates new local abstract location
  • Result of allocation must reside in local memory

  class Foo {
    Object z;
  }
  static void bar() {
    L1: Foo a = new Foo();
    Foo b = broadcast a from 0;
    Foo c = a;
    L2: a.z = new Object();
  }

A-locs: 1, 2

Points-to Sets
  a
  b
  c

Alias Analysis: Assignment

• Copies source abstract locations into points-to set of target

  class Foo {
    Object z;
  }
  static void bar() {
    L1: Foo a = new Foo();
    Foo b = broadcast a from 0;
    Foo c = a;
    L2: a.z = new Object();
  }

A-locs: 1, 2

Points-to Sets
  a    1
  b
  c    1
  1.z  2

Alias Analysis: Broadcast

• Produces both local and remote versions of source abstract location
  • Remote a-loc points to remote analog of what local a-loc points to

  class Foo {
    Object z;
  }
  static void bar() {
    L1: Foo a = new Foo();
    Foo b = broadcast a from 0;
    Foo c = a;
    L2: a.z = new Object();
  }

A-locs: 1, 2, 1r

Points-to Sets
  a     1
  b     1, 1r
  c     1
  1.z   2
  1r.z  2r

Aliasing Results

• Two variables A and B may alias if:
    ∃ x ∈ pointsTo(A). x ∈ pointsTo(B)
• Two variables A and B may alias across threads if:
    ∃ x ∈ pointsTo(A). R(x) ∈ pointsTo(B)
  (where R(x) is the remote counterpart of x)

Points-to Sets
  a  1
  b  1, 1r
  c  1

Alias [Across Threads]:
  a  b, c [b]
  b  a, c [a, c]
  c  a, b [b]
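The two alias rules above can be evaluated directly on the example points-to sets. The Java sketch below is illustrative only: the class `AliasDemo`, the string encoding of a-locs, and the "r" suffix for R(x) are assumptions. Because the results table lists the across-threads relation in both directions (a [b] and b [a, c]), the sketch applies the second rule symmetrically.

```java
import java.util.Set;

/** Evaluates the may-alias rules over points-to sets encoded as sets of a-loc names. */
final class AliasDemo {
    /** R(x): the remote counterpart of a local a-loc, written with an "r" suffix. */
    static String remote(String aloc) {
        return aloc + "r";
    }

    /** A and B may alias if their points-to sets share an a-loc. */
    static boolean mayAlias(Set<String> ptsA, Set<String> ptsB) {
        for (String x : ptsA) {
            if (ptsB.contains(x)) {
                return true;
            }
        }
        return false;
    }

    /** A and B may alias across threads if R(x) of some x in one set is in the other set. */
    static boolean mayAliasAcrossThreads(Set<String> ptsA, Set<String> ptsB) {
        return hasRemoteCounterpart(ptsA, ptsB) || hasRemoteCounterpart(ptsB, ptsA);
    }

    private static boolean hasRemoteCounterpart(Set<String> from, Set<String> to) {
        for (String x : from) {
            // Only local a-locs have a remote counterpart in this encoding.
            if (!x.endsWith("r") && to.contains(remote(x))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("1");
        Set<String> b = Set.of("1", "1r");
        Set<String> c = Set.of("1");
        System.out.println("a,b: " + mayAlias(a, b) + " across: " + mayAliasAcrossThreads(a, b));
        System.out.println("a,c: " + mayAlias(a, c) + " across: " + mayAliasAcrossThreads(a, c));
        System.out.println("b,c: " + mayAlias(b, c) + " across: " + mayAliasAcrossThreads(b, c));
    }
}
```

On these sets, all three variables may alias locally, but only the pairs involving `b` may alias across threads, since `b` is the only variable whose points-to set contains the remote a-loc 1r.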


Benchmarks

  Benchmark     Lines [1]   Description
  pi            56          Monte Carlo integration
  demv          122         Dense matrix-vector multiply
  sample-sort   321         Parallel sort
  lu-fact       420         Dense linear algebra
  3d-fft        614         Fourier transform
  gsrb          1090        Computational fluid dynamics kernel
  gsrb*         1099        Slightly modified version of gsrb
  spmv          1493        Sparse matrix-vector multiply
  gas           8841        Hyperbolic solver for gas dynamics

[1] Line counts do not include the reachable portion of the 37,000 line Titanium/Java 1.0 libraries

Analysis Levels

• We tested analyses of varying levels of precision

  Analysis           Description
  naïve              All heap accesses
  sharing            All shared accesses
  concur             Concurrency analysis + type-based AA
  concur/saa         Concurrency analysis + sequential AA
  concur/taa         Concurrency analysis + thread-aware AA
  concur/taa/cycle   Concurrency analysis + thread-aware AA + cycle detection

Static (Logical) Fences

[Chart: Static Fence Removal. Percentage (scale 0 to 120) at each analysis level (naïve, sharing, concur, concur/saa, concur/taa, concur/taa/cycle) for each benchmark (pi, demv, sample-sort, lu-fact, 3d-fft, gsrb, gsrb*, spmv, gas).]

Percentages are for number of static fences reduced over naive

Dynamic (Executed) Fences

[Chart: Dynamic Fence Removal. Percentage (scale 0 to 120) at each analysis level (naïve, sharing, concur, concur/saa, concur/taa, concur/taa/cycle) for each benchmark (pi, demv, sample-sort, lu-fact, 3d-fft, gsrb, gsrb*, spmv, gas).]

Percentages are for number of dynamic fences reduced over naive

Dynamic Fences: gsrb

• gsrb relies on dynamic locality checks
  • Slight modification to remove checks (gsrb*) greatly increases precision of analysis

[Chart: gsrb Dynamic Fence Removal. Percentage (scale 0 to 120) at each analysis level (naïve, sharing, concur, concur/saa, concur/taa, concur/taa/cycle) for gsrb and gsrb*.]

Two Example Optimizations

• Consider two optimizations for GAS languages
  1. Overlap bulk memory copies
  2. Communication aggregation for irregular array accesses (i.e. a[b[i]])
• Both optimizations reorder accesses, so sequential consistency can inhibit them
• Both are addressing network performance, so potential payoff is high


Array Copies in Titanium

• Array copy operations are commonly used

  dst.copy(src);

• Content in the domain intersection of the two arrays is copied from src to dst

[Figure: overlapping src and dst arrays]

• Communication (possibly with packing) required if arrays reside on different threads
• Processor blocks until the operation is complete.

Non-Blocking Array Copy Optimization

• Automatically convert blocking array copies into non-blocking array copies
  • Push sync as far down the instruction stream as possible to allow overlap with computation
• Interprocedural: syncs can be moved across method boundaries
• Optimization reorders memory accesses, so it may be illegal under sequential consistency
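The shape of this transformation can be sketched in plain Java, using `CompletableFuture` as a stand-in for a non-blocking copy handle. This is only an analogy under stated assumptions: the class `NonBlockingCopy`, its methods, and the use of `System.arraycopy` in place of a Titanium array copy are hypothetical and do not reflect the code the Titanium compiler generates.

```java
import java.util.concurrent.CompletableFuture;

/** Contrasts a blocking copy with a copy whose sync is pushed past independent work. */
final class NonBlockingCopy {
    /** Starts the copy and returns a handle; join() on the handle plays the role of the sync. */
    static CompletableFuture<Void> copyAsync(int[] dst, int[] src) {
        int n = Math.min(dst.length, src.length);
        return CompletableFuture.runAsync(() -> System.arraycopy(src, 0, dst, 0, n));
    }

    /** Work that does not touch dst or src, so it can overlap with the copy. */
    static int independentWork(int[] work) {
        int sum = 0;
        for (int v : work) {
            sum += v;
        }
        return sum;
    }

    /** Blocking form: the copy completes before any later statement runs. */
    static int blockingVersion(int[] dst, int[] src, int[] work) {
        System.arraycopy(src, 0, dst, 0, Math.min(dst.length, src.length));
        int sum = independentWork(work);
        return sum + dst[0];
    }

    /** Non-blocking form: the sync is delayed until just before the first use of dst. */
    static int overlappedVersion(int[] dst, int[] src, int[] work) {
        CompletableFuture<Void> handle = copyAsync(dst, src);
        int sum = independentWork(work); // overlaps with the copy
        handle.join();                   // sync: the copy is complete after this point
        return sum + dst[0];
    }

    public static void main(String[] args) {
        int[] src = {5, 6};
        int[] work = {1, 2, 3};
        System.out.println(blockingVersion(new int[2], src, work));
        System.out.println(overlappedVersion(new int[2], src, work));
    }
}
```

Both versions return the same value here because nothing else observes `dst` or `src` during the copy. If another thread could read or write them in that window, moving the sync would become observable, which is why the optimization needs the conflict analysis under sequential consistency.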


Communication Aggregation on Irregular Array Accesses (Inspector/Executor)

• A loop containing indirect array accesses is split into phases
  • Inspector examines loop and computes reference targets
  • Required remote data gathered in a bulk operation
  • Executor uses data to perform actual computation

Original loop:
  for (...) {
    a[i] = remote[b[i]];
    // other accesses
  }

Transformed code:
  schd = inspect(remote, b);
  tmp = get(remote, schd);
  for (...) {
    a[i] = tmp[i];
    // other accesses
  }

• Can be illegal under sequential consistency
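A single-threaded Java sketch can show both the transformation and why it may be observable. Everything here is an assumption made for illustration: `remote` is an ordinary local array, `inspect` and `get` are hypothetical helpers named after the slide's pseudocode, and the `otherThread` callback simulates a write by another thread becoming visible after the first loop iteration.

```java
/** Compares the original indirect-access loop with its inspector/executor form. */
final class InspectorExecutor {
    /** Inspector: checks the indices and returns the schedule of remote elements needed. */
    static int[] inspect(int[] remote, int[] b) {
        for (int index : b) {
            if (index < 0 || index >= remote.length) {
                throw new IndexOutOfBoundsException("bad remote index " + index);
            }
        }
        return b.clone();
    }

    /** Bulk gather: fetches every scheduled element in one operation. */
    static int[] get(int[] remote, int[] schedule) {
        int[] tmp = new int[schedule.length];
        for (int i = 0; i < schedule.length; i++) {
            tmp[i] = remote[schedule[i]];
        }
        return tmp;
    }

    /** Original loop: each iteration reads the remote array directly. */
    static int[] original(int[] remote, int[] b, Runnable otherThread) {
        int[] a = new int[b.length];
        for (int i = 0; i < b.length; i++) {
            a[i] = remote[b[i]];
            if (i == 0) {
                otherThread.run(); // simulated concurrent write
            }
        }
        return a;
    }

    /** Transformed loop: all remote reads are hoisted into one gather before the loop. */
    static int[] transformed(int[] remote, int[] b, Runnable otherThread) {
        int[] schd = inspect(remote, b);
        int[] tmp = get(remote, schd);
        int[] a = new int[b.length];
        for (int i = 0; i < b.length; i++) {
            a[i] = tmp[i];
            if (i == 0) {
                otherThread.run(); // same simulated write, now too late to be seen
            }
        }
        return a;
    }

    public static void main(String[] args) {
        int[] b = {2, 0};
        int[] r1 = {10, 20, 30};
        int[] r2 = {10, 20, 30};
        System.out.println(java.util.Arrays.toString(original(r1, b, () -> r1[0] = 99)));
        System.out.println(java.util.Arrays.toString(transformed(r2, b, () -> r2[0] = 99)));
    }
}
```

Without the simulated write, the two forms compute the same array. With it, the original loop picks up the new value of `remote[0]` while the transformed loop uses the stale gathered copy, so the reordering is observable and may be illegal under sequential consistency unless the analysis shows that no such conflicting write can occur.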

Relaxed + SC with 3 Analyses

• We tested performance using analyses of varying levels of precision

  Name               Description
  relaxed            Uses Titanium’s relaxed memory model
  naïve              Uses sequential consistency, puts fences around every heap access
  sharing            Uses sequential consistency, puts fences around every shared heap access
  concur/taa/cycle   Uses sequential consistency, uses our most aggressive analysis

Dense Matrix Vector Multiply

[Chart: Dense Matrix Vector Multiply. Speedup (scale 0 to 2) against # of processors (1, 2, 4, 8, 16) for relaxed, naive, sharing, and concur/taa/cycle.]

• Non-blocking array copy optimization applied
• Strongest analysis is necessary: other SC implementations suffer relative to relaxed

Sparse Matrix Vector Multiply

[Chart: Sparse Matrix Vector Multiply. Speedup (scale 0 to 100) against # of processors (1, 2, 4, 8, 16) for relaxed, naive, sharing, and concur/taa/cycle.]

• Inspector/executor optimization applied
• Strongest analysis is again necessary and sufficient

Conclusion

• Titanium’s textual barriers and single-valued expressions allow for simple but precise concurrency analysis
• Under sequential consistency, the analysis can eliminate nearly all fences for the benchmarks tested
• On two linear algebra kernels, sequential consistency can be provided with little or no performance cost with our analysis
  • Analysis allows the same optimizations to be performed as in the relaxed memory model
