CX: A Scalable, Robust Network for Parallel Computing
Peter Cappello & Dimitrios Mourloukos
Computer Science
UCSB
2
Outline
1. Introduction
2. Related work
3. API
4. Architecture
5. Experimental results
6. Current & future work
Introduction
• “Listen to the technology!” – Carver Mead
• What is the technology telling us?
– The Internet’s idle cycles/sec are growing rapidly
– Bandwidth is increasing & getting cheaper
– Communication latency is not decreasing
– Human technology is getting neither cheaper nor faster.
Introduction
Project Goals
1. Minimize job completion time despite large communication latency
2. Jobs complete with high probability despite faulty components
3. The application program is oblivious to:
• Number of processors
• Inter-process communication
• Fault tolerance
Introduction
Fundamental Issue: Heterogeneity
[Diagram: heterogeneous machines & operating systems (M1/OS1 … M5/OS5), made functionally homogeneous by the JVM]
15
Related work
• Cilk / Cilk-NOW / Atlas
– DAG computational model
– Work-stealing
16
Related work
• Linda Piranha JavaSpaces
– Space-based coordination
– Decoupled communication
17
Related work
• Charlotte (Milan project / Calypso prototype)
– High performance
– Fault tolerance achieved via eager scheduling, not via transactions
18
Related work
• SuperWeb / Javelin / Javelin++
– Architecture: client, broker, host
20
API
DAG Computational model
int f( int n )
{
if ( n < 2 )
return n;
else
return f( n-1 ) + f( n-2 );
}
DAG Computational Model

int f( int n ) {
    if ( n < 2 ) return n;
    else return f( n-1 ) + f( n-2 );
}

Method invocation tree for f(4):

f(4)
├── f(3)
│   ├── f(2)
│   │   ├── f(1)
│   │   └── f(0)
│   └── f(1)
└── f(2)
    ├── f(1)
    └── f(0)
25
DAG Computational Model / API

execute( ) {
    if ( n < 2 )
        setArg( ArgAddr, n );
    else {
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

execute( ) {   // the “+” composition task
    setArg( ArgAddr, in[0] + in[1] );
}

[Animation, slides 25–28: each f(n) task spawns f(n-1), f(n-2), and a “+” task; the “+” tasks compose the results as they arrive, rebuilding the invocation tree bottom-up.]
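The spawn/setArg pattern above can be sketched as a tiny single-JVM scheduler. The class names (DagSketch, FibTask, SumTask) and the explicit READY queue are illustrative stand-ins, not CX’s actual API:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DagSketch {
    abstract static class Task {
        Task successor;          // task that receives our result
        int slot;                // which input slot of the successor we fill
        int[] in = new int[2];   // argument slots
        int missing;             // inputs still outstanding
        abstract void execute(DagSketch s);
    }

    static class FibTask extends Task {
        final int n;
        FibTask(int n) { this.n = n; }
        void execute(DagSketch s) {
            if (n < 2) {
                s.setArg(this, n);              // base case: emit n directly
            } else {
                SumTask plus = new SumTask();   // the "+" task waits for two args
                plus.successor = successor;
                plus.slot = slot;
                plus.missing = 2;
                s.spawn(child(n - 1, plus, 0));
                s.spawn(child(n - 2, plus, 1));
            }
        }
        static FibTask child(int n, Task succ, int slot) {
            FibTask t = new FibTask(n);
            t.successor = succ;
            t.slot = slot;
            return t;
        }
    }

    static class SumTask extends Task {
        void execute(DagSketch s) { s.setArg(this, in[0] + in[1]); }
    }

    private final Deque<Task> ready = new ArrayDeque<>(); // READY queue
    private Integer result;                               // final DAG output

    void spawn(Task t) { ready.push(t); }

    // Deliver a result to the successor; a waiting "+" becomes READY
    // once both of its arguments have arrived.
    void setArg(Task from, int value) {
        if (from.successor == null) { result = value; return; }
        from.successor.in[from.slot] = value;
        if (--from.successor.missing == 0) ready.push(from.successor);
    }

    int run(int n) {
        spawn(new FibTask(n));
        while (result == null) ready.pop().execute(this);
        return result;
    }

    public static void main(String[] args) {
        System.out.println(new DagSketch().run(10));
    }
}
```

In CX the READY queue lives on a task server and many producers pop from it; here a single loop plays both roles to show the dataflow.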
30
Architecture: Basic Entities
[Diagram: Consumer ↔ Production Network ↔ Cluster Network]
Consumer session protocol: register ( spawn | getResult )* unregister
31
Architecture: Cluster
TASKSERVERPRODUCER
PRODUCER
PRODUCER
PRODUCER
A Cluster at Work
[Animation, slides 32–35: the DAG for f(4); the consumer spawns f(4), the task server places it on its READY queue and assigns it to a producer, which caches a copy while computing.]
36
Decompose
execute( )
{
if ( n < 2 )
setArg( ArgAddr, n );
else
{
spawn ( + );
spawn ( f(n-1) );
spawn ( f(n-2) );
}
}
A Cluster at Work
[Animation, slides 37–44: the producer returns the decomposition of f(4) (a “+” task plus f(3) and f(2)) to the task server; the “+” enters WAITING while f(3) and f(2) enter READY, are fetched by producers, and are decomposed in turn.]
45
Compute Base Case
execute( )
{
if ( n < 2 )
setArg( ArgAddr, n );
else
{
spawn ( + );
spawn ( f(n-1) );
spawn ( f(n-2) );
}
}
A Cluster at Work
[Animation, slides 46–54: base cases f(1) and f(0) are computed directly; their values flow into the waiting “+” tasks via setArg, and each “+” moves from WAITING to READY once both of its arguments have arrived.]
55
Compose
execute( )
{
setArg( ArgAddr, in[0] + in[1] );
}
A Cluster at Work
[Animation, slides 56–76: “+” tasks execute as their arguments arrive, composing partial results up the DAG until a single Result object R remains.]
1. The Result object is sent to the Production Network.
2. The Production Network returns it to the Consumer.
77
Task Server Proxy: Overlap Communication with Computation
[Diagram: each Producer runs a Task Server Proxy with an INBOX and OUTBOX; a COMM thread exchanges tasks with the Task Server’s READY/WAITING priority queues while the COMP thread computes.]
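The overlap idea can be sketched with a bounded prefetch buffer: a COMM thread keeps a small INBOX full while the COMP loop consumes it, hiding fetch latency. The serve() method, the buffer size, and the integer “tasks” are illustrative, not CX’s actual classes:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProxySketch {
    // COMP-side loop: consumes prefetched tasks while COMM refills the INBOX.
    static int serve(int taskCount) {
        BlockingQueue<Integer> inbox = new ArrayBlockingQueue<>(4); // prefetch buffer

        // COMM thread: stands in for fetching tasks from the task server.
        Thread comm = new Thread(() -> {
            try {
                for (int t = 0; t < taskCount; t++) inbox.put(t); // blocks when full
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });
        comm.start();

        int sum = 0;
        try {
            for (int i = 0; i < taskCount; i++) sum += inbox.take(); // "compute"
            comm.join();
        } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(serve(20)); // sums the 20 fetched "tasks"
    }
}
```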
78
Architecture: Work stealing & eager scheduling
• A task is removed from the server only after a completion signal is received.
• A task may be assigned to multiple producers:
– Balances task load among producers of varying processor speeds
– Tasks on failed/retreating producers are re-assigned.
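A minimal sketch of this policy, with illustrative class and method names (submit/assign/complete are assumptions, not CX’s API): the server keeps every assigned task until it is explicitly completed, so a failed producer’s tasks are simply re-queued.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class EagerSketch {
    private final Deque<String> ready = new ArrayDeque<>();       // READY queue
    private final Map<String, String> assigned = new HashMap<>(); // task -> producer

    void submit(String task) { ready.add(task); }

    // Hand the next ready task to a producer, remembering the assignment.
    String assign(String producer) {
        String task = ready.poll();
        if (task != null) assigned.put(task, producer);
        return task;
    }

    // Only an explicit completion signal removes the task for good.
    void complete(String task) { assigned.remove(task); }

    // On producer failure or retreat, its uncompleted tasks return to READY.
    void producerFailed(String producer) {
        assigned.entrySet().removeIf(e -> {
            if (e.getValue().equals(producer)) { ready.add(e.getKey()); return true; }
            return false;
        });
    }

    public static void main(String[] args) {
        EagerSketch server = new EagerSketch();
        server.submit("f(3)");
        server.submit("f(2)");
        server.assign("p1");                     // p1 takes f(3)
        server.assign("p2");                     // p2 takes f(2)
        server.producerFailed("p1");             // f(3) is re-queued, not lost
        System.out.println(server.assign("p2")); // p2 now gets f(3)
    }
}
```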
79
Architecture: Scalability
• A cluster tolerates producer:
– Retreat
– Failure
• 1 task server however is a:
– Bottleneck
– Single point of failure.
• We introduce a network of task servers.
80
Scalability: Class loading
1. CX class loader loads classes (Consumer JAR) in each server’s class cache
2. Producer loads classes from its server
81
Scalability: Fault-tolerance
• Replicate a server’s tasks on its sibling.
• When a server fails, its sibling restores the state to a replacement server.
84
Architecture
Production network of clusters
• The network tolerates a single server failure.
• Restoring the ability to tolerate a single failure ⇒ the network tolerates a sequence of failures.
86
Preliminary experiments
• Experiments run on Linux cluster
– 100 port Lucent P550 Cajun Gigabit Switch
• Machine
– 2 Intel EtherExpress Pro 100 Mb/s Ethernet cards
– Red Hat Linux 6.0
– JDK 1.2.2_RC3
– Heterogeneous
• processor speeds
• processors/machine
87
Fibonacci Tasks with Synthetic Load

execute( ) {
    if ( n < 2 ) {
        syntheticWorkload();
        setArg( ArgAddr, n );
    } else {
        syntheticWorkload();
        spawn ( + );
        spawn ( f(n-1) );
        spawn ( f(n-2) );
    }
}

execute( ) {   // the “+” composition task
    syntheticWorkload();
    setArg( ArgAddr, in[0] + in[1] );
}
88
TSEQ vs. T1 (seconds), computing F(8)

Workload   TSEQ      T1        Efficiency
4.522      497.420   518.816   0.96
3.740      415.140   436.897   0.95
2.504      280.448   297.474   0.94
1.576      179.664   199.423   0.90
0.914      106.024   120.807   0.88
0.468      56.160    65.767    0.85
0.198      24.750    29.553    0.84
0.058      8.120     11.386    0.71
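The Efficiency column is consistent with TSEQ / T1, the one-producer overhead ratio; a quick check (class and method names are illustrative):

```java
public class EfficiencyCheck {
    // Efficiency of one producer relative to the sequential program.
    static double eff(double tSeq, double t1) { return tSeq / t1; }

    public static void main(String[] args) {
        double[][] rows = {   // {TSEQ, T1, table's Efficiency}
            {497.420, 518.816, 0.96}, {415.140, 436.897, 0.95},
            {280.448, 297.474, 0.94}, {179.664, 199.423, 0.90},
            {106.024, 120.807, 0.88}, { 56.160,  65.767, 0.85},
            { 24.750,  29.553, 0.84}, {  8.120,  11.386, 0.71},
        };
        for (double[] r : rows) {
            // round to two decimals before comparing with the table
            double rounded = Math.round(eff(r[0], r[1]) * 100) / 100.0;
            System.out.println(rounded == r[2]);
        }
    }
}
```

Every row should print true, i.e. each tabulated efficiency matches TSEQ / T1 to two decimals.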
89
Parallel Efficiency over 60 nodes
0
0.2
0.4
0.6
0.8
1
1.2
F(13) Fib(14) Fib(15) Fib(16) Fib(17) Fib(18)
Par
alle
l E
ffic
ien
cy
Workload 1
Workload 2
Parallel efficiency for F(13) = 0.87Parallel efficiency for F(18) = 0.99
Average task time:Workload 1 = 1.8 sec.Workload 2 = 3.7 sec.
91
Current work
• Implement a CX market maker (broker)
– Solves the discovery problem between Consumers & Production networks
• Enhance Producer with Lea’s Fork/Join Framework
– See gee.cs.oswego.edu
[Diagram: many Consumers and Production Networks matched through a Market Maker, implemented as a Jini service]
92
Current work
• Enhance the computational model: branch & bound.
– Propagate new bounds through the production network: 3 steps
[Diagram: a search tree mapped onto the production network; a sharper bound propagates, pruning dominated subtrees (BRANCH) and terminating their tasks (TERMINATE)]
94
Current work
• Investigate computations that appear
ill-suited to adaptive parallelism
– SOR
– N-body.
95
End of CX Presentation
• www.cs.ucsb.edu/research/cx
• Next release: End of June, includes source.
• E-mail: [email protected]
96
Introduction
Fundamental Issues
• Communication latency
– Long latency ⇒ overlap computation with communication.
• Robustness
– Massive parallelism ⇒ faults.
• Scalability
– Massive parallelism ⇒ login privileges cannot be required.
• Ease of use
– Jini ⇒ easy upgrade of system components.
97
Related work
• Market mechanisms– Huberman, Waldspurger, Malone, Miller &
Drexler, Newhouse & Darlington
98
Related work
• CX integrates
– DAG computational model
– Work-stealing scheduler
– Space-based, decoupled communication
– Fault-tolerance via eager scheduling
– Market mechanisms (incentive to participate)
99
Architecture: Task identifier
• The DAG has a spawn tree.
• TaskID = path id in the spawn tree.
• Root.TaskID = 0.
• The TaskID is used to detect duplicate:
– Tasks
– Results
[Diagram: spawn tree for F(4); each spawned child is labeled by its index (0, 1, 2), so a task’s ID is the concatenated path from the root]
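A sketch of path-based identifiers: each child appends its spawn index to the parent’s ID, so an ID names a unique path from the root, and a server can drop duplicates by keeping a set of IDs it has seen. The string encoding here is an assumption for illustration:

```java
import java.util.HashSet;
import java.util.Set;

public class TaskIdSketch {
    // Child i of a task gets the parent's ID with i appended; the root is "0".
    static String childId(String parentId, int childIndex) {
        return parentId + childIndex;
    }

    public static void main(String[] args) {
        Set<String> seen = new HashSet<>();
        String root = "0";
        String plus = childId(root, 0);    // the "+" task spawned by the root
        String f3   = childId(root, 1);    // f(3)
        String f2   = childId(root, 2);    // f(2)

        // The duplicate f2 (e.g. re-issued by eager scheduling) is detected.
        for (String id : new String[] { root, plus, f3, f2, f2 }) {
            boolean fresh = seen.add(id);  // false on the second f2
            System.out.println(id + " " + fresh);
        }
    }
}
```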
100
Architecture: Basic Entities
• Consumer
Seeks computing resources.
• Producer
Offers computing resources.
• Task Server
Coordinates task distribution among its producers.
• Production Network
A network of task servers & their associated producers.
101
Defining Parallel Efficiency
• Scalar: homogeneous set of P machines:
Parallel efficiency = ( T1 / P ) / TP
• Vector: heterogeeneous set of P machines:
P = [ P1, P2, …, Pd ], where there are
P1 machines of type 1, P2 machines of type 2, …, Pd machines of type d:
Parallel efficiency = ( P1/T1 + P2/T2 + … + Pd/Td )⁻¹ / TP
where Ti is the 1-machine time on a machine of type i.
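The two definitions as code: the heterogeneous form sums the per-type rates Pi/Ti and inverts the total to get the ideal time, so with a single machine type it reduces to the scalar formula. Class and method names are illustrative:

```java
public class Efficiency {
    // Homogeneous case: ideal time T1/P divided by measured time TP.
    static double scalar(double t1, int p, double tp) {
        return (t1 / p) / tp;
    }

    // Heterogeneous case: aggregate rate is sum of P_i / T_i; its inverse
    // is the ideal time, which we divide by the measured time TP.
    static double vector(double[] t, int[] p, double tp) {
        double rate = 0;
        for (int i = 0; i < t.length; i++) rate += p[i] / t[i];
        return (1.0 / rate) / tp;
    }

    public static void main(String[] args) {
        // One machine type: both formulas give the same efficiency.
        System.out.println(scalar(100.0, 10, 12.5));
        System.out.println(vector(new double[] {100.0}, new int[] {10}, 12.5));
    }
}
```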
102
Future work
• Support special hardware / data: inter-server task
movement.
– Diffusion model:
Tasks are homogeneous gas atoms diffusing through network.
– N-body model: Each kind of atom (task) has its own:
• Mass (resistance to movement: code size, input size, …)
• attraction/repulsion to different servers
Or other “massive” entities, such as:
» special processors
» large data base.
103
Future Work
• CX preprocessor to simplify API.