
Rethinking Parallel Execution

Guri Sohi (along with Matthew Allen, Srinath Sridharan, Gagan Gupta)
University of Wisconsin-Madison

November 4, 2009 Talk at Northwestern University 2

Outline

• Reminiscing: Instruction Level Parallelism (ILP)
• Canonical parallel processing and execution
• Rethinking canonical parallel execution
• Dynamic Serialization
• Some results

November 30, 2009 Workshop on Deterministic Execution 3

Reminiscing about ILP

• Late 1980s to mid-1990s
• Search for a “post RISC” architecture
  – More accurately, an instruction processing model
• Desire to do more than one instruction per cycle: exploit ILP

• Majority school of thought: VLIW/EPIC
• Minority: out-of-order (OOO) superscalar


VLIW/EPIC School

• Parallel execution requires a parallel ISA
• Parallel execution determined statically (by compiler)
• Parallel execution expressed in the static program
• Take program/algorithm parallelism and mold it to a given execution schedule


VLIW/EPIC School

• Creating effective parallel representations (statically) introduces several problems
  – Predication
  – Statically scheduling loads
  – Exception handling
  – Recovery code

• Lots of research addressing these problems


OOO Superscalar

• Create dynamic parallel execution from a sequential static representation
  – Dynamic dependence information is accurate
  – The execution schedule is flexible

• None of the problems associated with trying to create a parallel representation statically


The Multicore Generation

• How to achieve parallel execution on multiple processors?

• Over four decades of conventional wisdom in parallel processing
  – Mostly in the scientific application/HPC arena
  – Use this as the basis

Parallel Execution Requires a Parallel Representation


Canonical Parallel Execution Model

A: Analyze the program to identify independence
  – independent portions executed in parallel
B: Create a static representation of independence
  – synchronization to satisfy the independence assumption
C: Dynamic parallel execution unwinds as per the static representation
  – potential consequences due to static assumptions


Canonical Parallel Execution Model

• Like VLIW/EPIC, the canonical model creates a variety of problems that have led to a vast body of research
  – identifying independence
  – creating a static representation
  – dynamic unwinding


Control and Data-Driven

• Parallelism is due to operations on disjoint sets of data

• Static representation typically control-driven
  – Most if not all practical programming languages
  – Need to ensure that execution is on disjoint data
    • Use synchronization to ensure this
• But the nature of the data is revealed only dynamically
  – Potentially obscures application parallelism
• Conflates parallelism with execution schedule


Control and Data-Driven

• Data-driven focuses on data dependence
  – Naturally separates operations on disjoint data
  – Can be easily derived from a total (sequential) order
• Remember VLIW (control-driven and parallel) and OOO superscalar (data-driven from a sequential order)
• My view: data-driven models are much more powerful and practical than control-driven ones
  – How to get such a model for multicore?


Static Program Representation

Issue               Sequential   Parallel
Bugs                Yes          Yes (more)
Data races          No           Yes
Locks/Synch         No           Yes
Deadlock            No           Yes
Nondeterminism      No           Yes
Parallel execution  ?            Yes


• Can we get parallel execution without a parallel representation? Yes
• Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes

Dynamic Serialization: What?

• Data-driven parallel execution from a sequential program
  – Data-centric (dynamic) expression of dependence
  – Determinate, race-free execution
  – No locks and no explicit synchronization
  – Easier to write, debug, and maintain
  – No speculation a la TLS or TM
  – Comparable or better performance than conventional parallel models


How? Big Picture

• Write the program in a well-structured object-oriented style
  – A method operates on the data of its associated object (ver. 1)
• Identify parts of the program for potential parallel execution
  – Make suitable annotations as needed
  – Don’t impose how parallelism is “executed”
• Dynamically determine the data object touched by the selected code
  – Identify dependence
• The program thread assigns selected code to bins
  – in a determined (sequential) order


How? Big Picture

• Serialize computations to the same object
  – Enforce dependence
  – Assign them to the same bin; a delegate thread executes computations in the same bin sequentially
• Do not look for or represent independence
  – Falls out as an effect of enforcing dependence
  – Computations in different bins execute in parallel
• Updates to given state occur in the same order as in the sequential program
  – Determinism
  – No races
  – If the sequential program is correct, the parallel execution is correct (same input)


Big Picture

February 16, 2009 PPoPP 2009 17

[Figure: a program thread delegates work to delegate threads 0, 1, and 2; example delegate assignment: SS % NUM_THREADS]

Serialization Sets: How?

• Sequential program with annotations
  – Identify potentially independent methods
  – Associate a serializer with each object to express dependence
• A serializer groups dependent method invocations into a serialization set
  – The runtime executes them in order to honor dependences
• Independent method invocations go in different sets
  – The runtime opportunistically parallelizes their execution


Serializers

• Parallel construct associated with data
  – Parallelism is inherently dynamic
• Serializer operations:
  – Delegate: enqueues a method invocation for potentially parallel execution
    • Invocations are executed by the runtime in the order they are delegated
  – Synchronize: wait for all outstanding methods to complete


Units of Parallelism

• Potentially independent methods
  – Modify only data owned by the object
    • Fields / data members
    • Pointers to non-shared data
  – Consistent with widespread OO practices and idioms
    • Modularity, encapsulation, information hiding
• Modifying methods for independence
  – Store the return value in the object; retrieve it with an accessor
  – Copy pointer data


Prometheus: C++ Library for SS

• Template library
  – Compile-time instantiation of SS data structures
  – Metaprogramming for static type checking
• Runtime orchestrates parallel execution
• Portable
  – x86, x86_64, SPARC V9
  – Linux, Solaris


Additional Prometheus Features

• Doall for parallel loops
  – For arrays of different objects
  – Parallel delegation
• Shared state via reductions
  – Operates on a local copy of the state
  – For associative operations
• Pipeline template for pipeline-parallel codes
  – Serializers work well for pipeline parallelism


Prometheus Runtime

• Version 1.0
  – Dynamically extracts parallelism
  – Statically scheduled
  – No nested parallelism
• Version 2.0
  – Dynamically extracts parallelism
  – Dynamically scheduled
    • Work-stealing scheduler
  – Supports nested parallelism

Packet Classification (No Locks!)


Statically Scheduled Results


4-socket AMD Barcelona (4-way multicore) = 16 total cores

Statically Scheduled Results


Dynamically Scheduled Results


Conclusions

• Sequential program with annotations
  – No explicit synchronization, no locks
• Programmers focus on keeping computation private to object state
  – Consistent with OO programming practices
• Dependence-based model
  – Determinate, race-free parallel execution
• Performance similar to or better than multithreading


Comments

• The applications we have shown have “natural parallelism”
  – Yes, for parallel execution you need parallelism in the application/algorithm
• Parallelism may manifest only dynamically
  – Other models require that it be found statically
  – And then made to unwind in a fixed manner
• For a suitable “grain size,” our approach will get parallel execution where others can’t


Related Work

• Actors / Active Objects
  – Hewitt [JAI 1977]
• MultiLisp
  – Halstead [ACM TOPLAS 1985]
• Inspector-Executor
  – Wu et al. [ICPP 1991]
• Jade
  – Rinard and Lam [ACM TOPLAS 1998]
• Cilk
  – Frigo et al. [PLDI 1998]
• OpenMP
• Apple Grand Central Dispatch

Questions?


Example: Debit/Credit Transactions


trans_t* trans;
while ((trans = get_trans ()) != NULL) {
  account_t* account = trans->account;
  if (trans->type == DEPOSIT)
    account->deposit (trans->amount);
  else if (trans->type == WITHDRAW)
    account->withdraw (trans->amount);
}

Several static unknowns!
• Number of transactions?
• What does trans->account point to?
• Loop-carried dependence?

Multithreading Strategy


1) Read all transactions into an array
2) Divide chunks of the array among multiple threads

// each thread processes its chunk [lo, hi) of the array
for (int i = lo; i < hi; i++) {
  trans_t* trans = trans_array[i];
  account_t* account = trans->account;
  if (trans->type == DEPOSIT)
    account->deposit (trans->amount);
  else if (trans->type == WITHDRAW)
    account->withdraw (trans->amount);
}

Oblivious to which accounts each thread may access!
→ Methods must lock the account to ensure mutual exclusion

Serializers

• Serializer per object for independent methods

• Share a serializer for inter-dependent methods


[Figure: each account object holds a base pointer to a serializer; delegated invocations (Withdraw $1000, Deposit $50, Deposit $5000, Deposit $300) queue on the owning account’s serializer, and a Transfer $500 involving two accounts shares one serializer]

Adding Serializers to Objects


class account_t {
private:
  float balance;

public:
  account_t (float balance) : balance (balance) {}

  void deposit (float amount);
  void withdraw (float amount);
};

Adding Serializers to Objects


class account_t : public private_base_t {
private:
  float balance;

public:
  account_t (float balance)
    : private_base_t (new serializer_t),  // construct this object with a new serializer
      balance (balance) {}

  void deposit (float amount);
  void withdraw (float amount);
};

The base class holds a pointer to the serializer; each object is constructed with a new serializer.

Wrapper Templates

• Wrappers perform implicit synchronization:

  typedef private<account_t> private_account_t;
  private_account_t account;

• The interface has two primary methods
  – delegate, for potentially independent methods:

    account.deposit (amount);            // sequential version
    account.delegate (deposit, amount);  // serialization-sets version

  – call, for dependent methods:

    float amount = account.get_balance ();      // sequential version
    float amount = account.call (get_balance);  // serialization-sets version


Enqueue a deposit for amount on serializer of this account

Synchronize the serializer for this account, then get balance

typedef private<account_t> private_account_t;

begin_nest ();
trans_t* trans;
while ((trans = get_trans ()) != NULL) {
  private_account_t* account = trans->account;
  if (trans->type == DEPOSIT)
    account->delegate (deposit, trans->amount);
  else if (trans->type == WITHDRAW)
    account->delegate (withdraw, trans->amount);
}
end_nest ();

End nesting level, implicit barrier

Example with Serialization Sets


Declare wrapped account type

Initiate nesting level

Delegate indicates potentially-independent operations

At execution, delegate:
1) Creates a method invocation structure
2) Gets the serializer pointer from the base class
3) Enqueues the invocation in the serialization set

[Figure: the program thread delegates invocations in program order; each lands in its account’s serialization set. SS #100: deposit $2000, withdraw $50, withdraw $20, deposit $300. SS #200: withdraw $1000, withdraw $1000. SS #300: withdraw $350, deposit $5000.]


[Figure: delegate threads drain the serialization sets; invocations within a set execute sequentially in delegation order, while different sets execute in parallel. Delegate 0 runs SS #100 (deposit $2000, withdraw $50, withdraw $20, deposit $300); Delegate 1 runs SS #200 (withdraw $1000, withdraw $1000) and SS #300 (withdraw $350, deposit $5000).]

Race-free, determinate execution without synchronization!

Network Packet Classification


packet_t* packet;
classify_t* classifier;
vector<int> ruleCount (num_rules);
vector<packet_queue_t> packet_queues;
int packetCount = 0;

for (i = 0; i < packet_queues.size (); i++) {
  while ((packet = packet_queues[i].get_pkt ()) != NULL) {
    ruleID = classifier->softClassify (packet);
    ruleCount[ruleID]++;
    packetCount++;
  }
}

Example with Serialization Sets


typedef private<classify_t> private_classify_t;
vector<private_classify_t> classifiers;
int packetCount = 0;
vector<int> ruleCount (numRules, 0);
int size = packet_queues.size ();

begin_nest ();
for (i = 0; i < size; i++) {
  classifiers[i].delegate (&classify_t::softClassify, packet_queues[i]);
}
end_nest ();

for (i = 0; i < size; i++) {
  ruleCount += classifiers[i].getRuleCount ();
  packetCount += classifiers[i].getPacketCount ();
}

Evaluation Methodology

• Benchmarks
  – Lonestar, NU-MineBench, PARSEC, Phoenix
• Conventional parallelization
  – pthreads, OpenMP
• Prometheus versions
  – Port the program to a sequential C++ program
  – Idiomatic C++: OO, inheritance, STL
  – Parallelize with serialization sets
