Upload
eleanore-fletcher
View
221
Download
3
Embed Size (px)
Citation preview
Rethinking Parallel Execution
Guri Sohi(along with Matthew Allen, Srinath
Sridharan, Gagan Gupta)University of Wisconsin-Madison
Outline
• Reminiscing: Instruction Level Parallelism (ILP)• Canonical parallel processing and execution• Rethinking canonical parallel execution• Dynamic Serialization• Some results
November 30, 2009 Workshop on Deterministic Execution 3
Reminiscing about ILP
• Late 1980s to mid 1990s• Search for “post RISC” architecture
– More accurately, instruction processing model
• Desire to do more than one instruction per cycle—exploit ILP
• Majority school of thought: VLIW/EPIC• Minority: out-of-order (OOO) superscalar
4
VLIW/EPIC School
• Parallel execution requires a parallel ISA• Parallel execution determined statically (by
compiler)• Parallel execution expressed in static program
• Take program/algorithm parallelism and mold it to given execution schedule for exploiting parallelism
5
VLIW/EPIC School
• Creating effective parallel representations (statically) introduces several problems– Predication– Statically scheduling loads– Exception handling– Recovery code
• Lots of research addressing these problems
6
OOO Superscalar
• Create dynamic parallel execution from sequential static representation– dynamic dependence information accurate– execution schedule flexible
• None of the problems associated with trying to create a parallel representation statically
7
The Multicore Generation
• How to achieve parallel execution on multiple processors?
• Over four decades of conventional wisdom in parallel processing– Mostly in the scientific application/HPC arena– Use this as basis
Parallel Execution Requires a Parallel Representation
8
Canonical Parallel Execution Model
A: Analyze program to identify independence in program– independent portions executed in parallel
B: Create static representation of independence– synchronization to satisfy independence
assumptionC: Dynamic parallel execution unwinds as per
static representation– potential consequences due to static assumptions
9
Canonical Parallel Execution Model
• Like VLIW/EPIC, canonical model creates a variety of problems that have lead to a vast body of research– identifying independence– creating static representation– dynamic unwinding
November 30, 2009 Workshop on Deterministic Execution 10
Control and Data-Driven
• Parallelism is due to operations on disjoint sets of data
• Static representation typically control-driven– Most if not all practical programming languages– Need to ensure that execution is on disjoint data
• Use synchronization to ensure• But nature of data revealed dynamically
– Potentially obscures application parallelism• Conflates parallelism with execution schedule
November 30, 2009 Workshop on Deterministic Execution 11
Control and Data-Driven
• Data-driven focuses on data dependence– Naturally separates operations on disjoint data– Can be easily derived from total (sequential) order
• Remember VLIW (control driven and parallel) and OOO superscalar (data-driven from sequential)
• My view: data-driven models much more powerful and practical than control-driven– How to get such a model for multicore?
November 30, 2009 Workshop on Deterministic Execution 12
Static Program RepresentationIssues Sequential ParallelBugs Yes Yes (more)
Data races No Yes
Locks/Synch No Yes
Deadlock No Yes
Nondeterminism No Yes
Parallel Execution ? Yes
November 30, 2009 Workshop on Deterministic Execution 13
• Can we get parallel execution without a parallel representation? Yes
• Can dynamic parallelization extract parallelism that is inaccessible to static methods? Yes
Dynamic Serialization: What?
• Data-driven parallel execution from sequential program– Data-centric (dynamic) expression of dependence– Determinate, race-free execution– No locks and no explicit synchronization– Easier to write, debug, and maintain– No speculation a la TLS or TM– Comparable or better performance than
conventional parallel models
November 30, 2009 Workshop on Deterministic Execution 14
How? Big Picture• Write program in well object-oriented style
– Method operates on data of associated object (ver. 1)• Identify parts of program for potential parallel
execution– Make suitable annotations as needed– Don’t impose how parallelism is “executed”
• Dynamically determine data object touched by selected code– Identify dependence
• Program thread assigns selected code to bins– in a determined (sequential) order
November 30, 2009 Workshop on Deterministic Execution 15
How? Big Picture• Serialize computations to same object
– Enforce dependence– Assign them to same bin; delegate thread executes
computations in same bin sequentially• Do not look for/represent independence
– Falls out as an effect of enforcing dependence– Computations in different bins execute in parallel
• Updates to given state in same order as in sequential program– Determinism– No races– If sequential correct; parallel execution is correct (same input)
November 30, 2009 Workshop on Deterministic Execution 16
Big Picture
February 16, 2009 PPoPP 2009 17
Program Thread
Delegate Thread 0
Delegate Thread 2
Delegate Thread 1
Ex delegate assign:SS % NUM_THREADS
Serialization Sets: How?• Sequential program with annotations
– Identify potentially independent methods– Associate a serializers with objects to express
dependence
• Serializer groups dependent method invocations into a serialization set– Runtime executes in order to honor dependences
• Independent method invocations in different sets– Runtime opportunistically parallelizes execution
November 30, 2009 Workshop on Deterministic Execution 18
Serializers
• Parallel construct associated with data– Parallelism is inherently dynamic
• Serializer operations:– Delegate: enqueues a method invocation for
potentially parallel execution• Invocations executed by runtime in the order they are
delegated
– Synchronize: wait for all outstanding methods to complete
November 30, 2009 Workshop on Deterministic Execution 19
Units of Parallelism
• Potentially independent methods– Modify only data owned by object– Fields / Data members– Pointers to non-shared data– Consistent with widespread OO practices and idioms
• Modularity, encapsulation, information hiding
• Modifying methods for independence– Store return value in object, retrieve with accessor– Copy pointer data
November 30, 2009 Workshop on Deterministic Execution 20
Prometheus: C++ Library for SS
• Template library– Compile-time instantiation of SS data structures– Metaprogramming for static type checking
• Runtime orchestrates parallel execution
• Portable– x86, x86_64, SPARC V9– Linux, Solaris
November 30, 2009 Workshop on Deterministic Execution 21
Additional Prometheus Features
• Doall for parallel loops– For arrays of different objects– Parallel delegation
• Shared state via reductions– Operates on local copy of state– For associative operations
• Pipeline template for pipeline-parallel codes– Serializers work well for pipeline parallelism
November 30, 2009 Workshop on Deterministic Execution 22
Prometheus Runtime
• Version 1.0– Dynamically extracts parallelism– Statically scheduled– No nested parallelism
• Version 2.0– Dynamically extracts parallelism– Dynamically scheduled
• Work-stealing scheduler
– Supports nested parallelismNovember 30, 2009 Workshop on Deterministic Execution 23
Statically Scheduled Results
November 30, 2009 Workshop on Deterministic Execution 25
4 Socket AMD Barcelona (4-way multicore) = 16 total cores
Conclusions
• Sequential program with annotations– No explicit synchronization, no locks
• Programmers focus on keeping computation private to object state– Consistent with OO programming practices
• Dependence-based model– Determinate race-free parallel execution
• Performance similar or better than multithreading
November 30, 2009 Workshop on Deterministic Execution 28
Comments
• Applications we have shown have “natural parallelism”– Yes, for parallel execution you need parallelism in
the application/algorithm• Parallelism may manifest only dynamically
– Other models require that it be found statically– And then made to unwind in a fixed manner
• For suitable “grain size” our approach will get parallel execution where others can’t
November 30, 2009 Workshop on Deterministic Execution 29
Related Work• Actors / Active Objects
– Hewitt [JAI 1977]• MultiLisp
– Halstead [ACM TOPLAS 1985]• Inspector-Executor
– Wu et al. [ICPP 1991]• Jade
– Rinard and Lam [ACM TOPLAS 1998]• Cilk
– Frigo et al. [PLDI 1998]• OpenMP• Apple Grand Central DispatchNovember 30, 2009 Workshop on Deterministic Execution 30
Example: Debit/Credit Transactions
November 30, 2009 Workshop on Deterministic Execution 32
trans_t* trans;while ((trans = get_trans ()) != NULL) {
account_t* account = trans->account;
if (trans->type == DEPOSIT) account->deposit (trans->amount);
else if (trans->type == WITHDRAW) account->withdraw (trans->amount);}
Several static unknowns!
# of transactions?
Points to?
Loop-carried dependence?
Multithreading Strategy
November 30, 2009 Workshop on Deterministic Execution 33
trans_t* trans;while ((trans = get_trans ()) != NULL) {
account_t* account = trans[i]->account;
if (trans->type == DEPOSIT) account->deposit (trans->amount);
else if (trans->type == WITHDRAW) account->withdraw (trans->amount);}
1) Read all transactions into an array2) Divide chunks of array among multiple threads
Oblivious to what accounts each thread may access!→ Methods must lock account to→ ensure mutual exclusion
Serializers
• Serializer per object for independent methods
• Share a serializer for inter-dependent methods
November 30, 2009 Workshop on Deterministic Execution 34
base
acct
base
acct
serializerWithdraw
$1000
serializerDeposit
$50
Deposit$5000
Deposit$300
base
acct
base
acct
serializerTransfer
$500
Adding Serializers to Objects
November 30, 2009 Workshop on Deterministic Execution 35
class account_t
{ private: float balance;
public: account_t (float balance) : balance (balance) {}
void deposit (float amount); void withdraw (float amount);};
Adding Serializers to Objects
November 30, 2009 Workshop on Deterministic Execution 36
class account_t : public private_base_t{ private: float balance;
public: account_t (float balance) : private_base_t (new serializer_t), balance (balance) {}
void deposit (float amount); void withdraw (float amount);};
Base class has pointer to serializer
Construct this object with a new serializer
Wrapper Templates
• Wrappers perform implicit synchronization:typedef private<account_t> private_account_t;private_account_t account;
• Interface has two primary methods– delegate for potentially independent methodsaccount.deposit (amount); account.delegate (deposit, amount);
– call for dependent methodsfloat amount = account.get_balance ();float amount = account.call (get_balance);
November 30, 2009 Workshop on Deterministic Execution 37
Enqueue a deposit for amount on serializer of this account
Synchronize the serializer for this account, then get balance
private <account_t> private_account_t;
begin_nest ();trans_t* trans;while ((trans = get_trans ()) != NULL) { private_account_t* account = trans->account;
if (trans->type == DEPOSIT) account->delegate(deposit, trans->amount);
else if (trans->type == WITHDRAW) account->delegate(withdraw, trans->amount);}end_nest ();
End nesting level, implicit barrier
Example with Serialization Sets
November 30, 2009 Workshop on Deterministic Execution 38
Declare wrapped account type
Initiate nesting level
Delegate indicates potentially-independent operations
At execution, delegate:1)Creates method invocation structure2)Gets serializer pointer from base class3)Enqueues invocation in serialization set
delegate
delegate
delegate
delegate
delegate
delegate
delegate
delegate
November 30, 2009 Workshop on Deterministic Execution 39
depositacct=100
$2000
SS #100 SS #200 SS #300
withdrawacct=300
$350
withdrawacct=200
$1000withdrawacct=100
$50deposit
acct=300$5000
withdrawacct=100
$20
withdrawacct=200
$1000deposit
acct=100$300
Program context
Delegate context
Program thread
Delegate threadsProgram context
November 30, 2009 Workshop on Deterministic Execution 40
depositacct=100
$2000
SS #100 SS #200 SS #300
withdrawacct=300
$350
withdrawacct=200
$1000withdrawacct=100
$50deposit
acct=300$5000
withdrawacct=100
$20
withdrawacct=200
$1000deposit
acct=100$300
Delegate contextDelegate 0 Delegate 1
depositacct=100
$2000
withdrawacct=100
$50
withdrawacct=100
$20
depositacct=100
$300
withdrawacct=200
$1000
withdrawacct=300
$350
depositacct=300
$5000
withdrawacct=200
$1000
delegate
delegate
delegate
delegate
delegate
delegate
delegate
delegate
Race-free, determinate execution without synchronization!
Network Packet Classification
41
packet_t* packet;classify_t* classifier;vector<int> ruleCount(num_rules);Vector<packet_queue_t> packet_queues;int packetCount = 0;
for(i=0;i<packet_queues.size();i++){
while ((packet = packet_queues[i].get_pkt()) != NULL){ruleID = classifier->softClassify (packet);ruleCount[ruleID]++;packetCount++;
}}
Example with Serialization Sets
42
Private <classify_t> private_classify_t;vector<private_classify_t> classifiers;int packetCount = 0;vector<int> ruleCount(numRules,0);int size = packet_queues.size();begin_nest ();for (i=0;i<size;i++){
classifiers[i].delegate (&classifier_t::softClassify,
packet_queues[i]);}end_nest ();
for(i=0;i<size;i++){ruleCount += classifier[i].getRuleCount();packetCount += classifier[i].getPacketCount();
}
Evaluation Methodology
• Benchmarks– Lonestar, NU-MineBench, PARSEC, Phoenix
• Conventional Parallelization– pthreads, OpenMP
• Prometheus versions– Port program to sequential C++ program– Idiomatic C++: OO, inheritance, STL– Parallelize with serialization sets
November 30, 2009 Workshop on Deterministic Execution 43