The MRNet Tree-based Overlay Network:
Where We've Been, Where We're Going!

Dorian Arnold
Paradyn Project, University of Wisconsin

Philip C. Roth
Future Technologies Group, Oak Ridge National Laboratory
2
Abstract
• Large scale systems are here
• Tree-based Overlay Networks (TBŌNs)
  – Intuitive, seemingly restrictive
  – Effective model for tool scalability
  – Prototype: www.paradyn.org/mrnet
• Where we've been
  – Tool scalability
  – Programming model for a large class of applications
• Where we're going
  – Topology studies
  – TBŌNs on high-performance networks
  – Filters on hardware accelerators
  – Reliability
3
HPC Trends
[Chart: growth in the number of 1024-processor systems from Jun-99 through Nov-05, and the November '05 processor-count distribution across bins from 0-63 up to >8192 processors.]
Easier than ever to deploy thousands of processors (one BG/L rack!)
4
An Example: ORNL National Center for Computational Sciences
5
Hierarchical Distributed Systems
• Hierarchical Topologies
  – Application control
  – Data collection
  – Data reduction/analysis
• As scale increases, front-end becomes a bottleneck
[Diagram: front-end (FE) at the root, back-ends (BE) at the leaves]
6
TBŌNs for Scalable Systems
TBŌNs for scalability:
– Scalable multicast
– Scalable gather
– Scalable data aggregation
[Diagram: FE at the root, a tree of communication processes (CP), BEs at the leaves]
7
TBŌN Model
• Application front-end (FE)
• Application back-ends (BE)
• Tree of communication processes (CP)
8
TBŌN Model
Reliable FIFO channels:
– Non-lossy
– Duplicate suppressing
– Non-corrupting
9
TBŌN Model
[Diagram: the TBŌN annotated with its per-node components: application-level packets, packet filters, filter state, and channel state]
10
TBŌN Model
Filter function (see the sketch below):
– Inputs a packet from each child
– Outputs a single packet
– Updates filter state

{output, new_state} ≡ f(inputs, cur_state)
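As a concrete illustration of this model, here is a minimal sketch (hypothetical types, not the actual MRNet filter API) of a filter that sums one integer packet from each child while keeping the running total as its persistent state:

    #include <vector>

    // Sketch of the TBON filter model: {output, new_state} = f(inputs, cur_state).
    struct Packet { int value; };                  // illustrative packet type
    struct FilterState { long long total = 0; };   // persistent per-node state

    // Consumes one packet from each child, emits a single packet upward,
    // and updates the filter state as a side effect.
    Packet sum_filter(const std::vector<Packet>& inputs, FilterState& state) {
        long long wave_sum = 0;
        for (const Packet& p : inputs)
            wave_sum += p.value;
        state.total += wave_sum;                      // new_state
        return Packet{ static_cast<int>(wave_sum) };  // single output packet
    }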
11
TBŌNs at Work
• Multicast
  – ALMI [Pendarakis, Shi, Verma and Waldvogel '01]
  – End System Multicast [Chu, Rao, Seshan and Zhang '02]
  – Overcast [Jannotti, Gifford, Johnson, Kaashoek and O'Toole '00]
  – RMX [Chawathe, McCanne and Brewer '00]
• Multicast/gather (reduction)
  – Bistro (no reduction) [Bhattacharjee et al. '00]
  – Gathercast [Badrinath and Sudame '00]
  – Lilith [Evensky, Gentile, Camp, and Armstrong '97]
  – MRNet [Roth, Arnold and Miller '03]
  – Ygdrasil [Balle, Brett, Chen, LaFrance-Linden '02]
• Distributed monitoring/sensing
  – Ganglia [Sacerdoti, Katz, Massie, Culler '03]
  – Supermon (reduction) [Sottile and Minnich '02]
  – TAG (reduction) [Madden, Franklin, Hellerstein and Hong '02]
12
Example TBŌN Reductions
• Simple
  – Min, max, sum, count, average
  – Concatenate
• Complex
  – Clock synchronization [Roth, Arnold, Miller '03]
  – Time-aligned aggregation [Roth, Arnold, Miller '03]
  – Graph merging [Roth, Miller '05]
  – Equivalence relations [Roth, Arnold, Miller '03]
  – Mean-shift image segmentation [Arnold, Pack, Miller '06]
13
TBŌNs for Tool Scalability
MRNet integrated into Paradyn:
• Efficient tool startup
• Performance data analysis
• Scalable visualization
14
TBŌNs for Scalable Applications
• Many algorithms are equivalence computations
  – Use equivalence/non-equivalence to summarize and analyze input data
• Streaming programming models
• Possibly even Bulk Synchronous Parallel programs

Application     | Input      | Filter                               | Output
Trace Analysis  | Trace file | Trace equivalence / anomaly detector | Compressed traces, anomalous traces
Graph Merging   | Sub-graphs | Sub-graph equivalence                | Merged graphs
Data Clustering | Data files | Object classifiers                   | Partitioned data
15
TBŌNs for Scalable Applications: Mean-Shift Algorithm
• Clustering points in feature spaces
• Useful for image segmentation
• Prohibitively expensive as feature space complexity increases
[Figure: a window in feature space shifting toward its centroid]
16
TBŌNs for Scalable Applications: Mean-Shift Algorithm
1. Partition data into windows and calculate window densities
2. Keep windows above chosen density threshold
3. Run mean-shift on remaining windows
4. Keep local maxima as peaks (see the sketch below)
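For intuition, here is a minimal single-window sketch of steps 3-4 (assuming 1-D points and a fixed-radius window; the actual implementation distributes this work across custom MRNet filters):

    #include <cmath>
    #include <vector>

    // Shift one window to the centroid of the points it covers until the
    // shift falls below a tolerance; the converged center is a density peak.
    double mean_shift(const std::vector<double>& points, double center,
                      double radius, double tol = 1e-6) {
        for (;;) {
            double sum = 0.0;
            int count = 0;
            for (double x : points)
                if (std::fabs(x - center) <= radius) { sum += x; ++count; }
            if (count == 0)
                return center;                       // empty window: no density here
            double centroid = sum / count;
            if (std::fabs(centroid - center) < tol)
                return centroid;                     // converged: local maximum
            center = centroid;                       // shift window to the centroid
        }
    }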
17
TBŌNs for Scalable Applications: Mean-Shift Algorithm
• Uses MRNet as general purpose programming paradigm
• Implements mean-shift in custom MRNet filters
~6x speedup with only 6% more nodes
18
TBŌN Computational Model
• At large scales, suitable for algorithms with:
  – Complexity ≥ O(n), for input size n
  – Output size ≤ total input size*
    • Sometimes the algorithm simply runs faster on the output (better-behaved input)
  – Output in the same form as the inputs
    • E.g., if the inputs are sets of elements, the output should be a set of elements
19
Research and Development Directions
• TBŌN topology studies
• TBŌNs and high-performance networks
• Use of emerging technologies for TBŌN filters
• TBŌN Reliability
20
TBŌN Topology
We expect many factors to influence the "best" topology:
– Physical network topology and capabilities
– Expected traffic (type and volume)
– Desired reliability guarantees
– Cost of “extra” nodes
21
TBŌN Topology Investigation
• Previous studies used reasonable topologies
• How factors influence performance remains an open question
• Beginning a rigorous effort to investigate this issue
  – Performance modeling
  – Empirical studies on a variety of systems
22
High-end Network Support
• Current MRNet implementation uses TCP/IP sockets
• Many high-end networks provide TCP/IP support
  – E.g., IP over Quadrics QsNet
  – Flexible, but undesirable for performance reasons
• Effort underway to support alternative data transports
  – One-sided, OS/application bypass
  – Complements topology investigations
  – Initially targeting Portals on the Cray XT3
23
High-Performance Filters on Hardware Accelerators
• Multi-paradigm computing (MPC) systems are here
  – MPC systems include several types of processors, such as FPGAs, multi-core processors, GPUs, PPUs, and MTA processors
  – E.g., Cray Adaptive Supercomputing strategy, SRC Computers, Linux Networx, DRC FPGA co-processors
• Streaming approach expected to work well for some of these processor types
• Running filters on accelerators is a natural fit for some applications, e.g., the Sloan Digital Sky Survey and the Large Synoptic Survey Telescope
24
TBŌN Reliability

Given the emergence of TBŌNs for scalable computing, low-cost reliability for TBŌN environments becomes critical!

[Plot: MTTF shrinking as system size grows]
25
TBŌN Reliability
• Goal
  – Tolerate process failures
  – Avoid checkpoint overhead
• General concept: leverage TBŌN properties
  – Natural information redundancies
  – Computational semantics
    • Lost state may be replaced by non-identical state
    • Computational equivalence: a relaxed consistency model
• Zero-cost: no additional computation, storage, or network overhead during normal operation
  – Define operations that compensate for lost state
  – Maintain computational equivalence
26
TBŌN Information Redundancies
Fundamental to the TBŌN Model
1. Input streams propagate toward root
2. Persistent state summarizes input history
3. Therefore, the summary is naturally replicated as inputs propagate upstream
27
Recovery Strategy (sketched in code below)

if failure is detected then
  1. Reconstruct the tree
  2. Regenerate compensatory state
  3. Reintegrate the state into the tree
  4. Resume normal operation
end if
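In code form, the strategy might look like this minimal sketch (all names here are hypothetical placeholders, not MRNet API; how compensatory state is regenerated by composition is the subject of the next slides):

    #include <set>
    #include <vector>

    using State = std::set<int>;   // illustrative filter state: the set of integers seen

    struct Node {
        State state;
        std::vector<Node*> children;
    };

    // Regenerate compensatory state by composing the children's states.
    // For this filter the composition operator is set union (the following
    // slides show why the filter function itself can play this role).
    State compose(const std::vector<Node*>& orphans) {
        State s;
        for (const Node* c : orphans)
            s.insert(c->state.begin(), c->state.end());
        return s;
    }

    // Recovery driver, called once a failure has been detected.
    void recover(Node& new_parent, const std::vector<Node*>& orphans) {
        new_parent.children = orphans;       // 1. reconstruct the tree
        new_parent.state = compose(orphans); // 2.-3. regenerate and reintegrate state
        // 4. resume normal operation: packets now flow through new_parent
    }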
28
State Regeneration: Composition
[Diagram: parent CPi with children CPj and CPk; the parent's state fs(CPi) is the composition of the children's states fs(CPj) and fs(CPk)]
29
State Regeneration: Composition
State composition:
– Input: filter state from the children
– Output: computationally-equivalent state for the parent

fs(CPi) ≡ fs(CPj) ⊕ fs(CPk)

(⊕ denotes the composition operator; fs(CPj) and fs(CPk) are the children's states, fs(CPi) the parent's.)
30
State Regeneration: Composition
Where does this mysterious composition operator come from?

Recall the filter definition:

{output, new_state} ≡ f(inputs, cur_state)

When the filter's new_state is a copy of its output, f itself becomes the composition operator.
31
State Regeneration: Composition
Proof outline:
– State is the history of processed inputs
– Children's output becomes the parent's input
– Updated state is a copy of the output
  • So it can be used as input to the filter function
– Therefore, executing the filter on the children's states produces computationally-equivalent state for the parent
32
State Regeneration: Composition
Composition can also work when the output is not a copy of the state!
– Requires a mapping operation from the filter-state representation to the output form

(See the set-union sketch below, then the worked example on the following slides.)
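Here is a minimal sketch of the idea using the integer set-union filter from the example that follows (hypothetical types, not the MRNet API): because this filter's new state is a copy of its output, calling f on the children's states regenerates a computationally-equivalent parent state.

    #include <cassert>
    #include <set>
    #include <vector>

    using State = std::set<int>;   // filter state and output share one representation

    // Union filter: output and new_state are the same set, so f itself
    // can serve as the composition operator for state regeneration.
    State union_filter(const std::vector<State>& inputs, State& cur_state) {
        for (const State& in : inputs)
            cur_state.insert(in.begin(), in.end());
        return cur_state;          // output == new_state
    }

    int main() {
        // Children's states as in the worked example that follows.
        State fs_cp1{1, 3, 4, 5}, fs_cp2{1, 5, 8};
        State regenerated;         // stands in for the lost parent state
        union_filter({fs_cp1, fs_cp2}, regenerated);
        assert(regenerated == State({1, 3, 4, 5, 8}));  // computationally equivalent
    }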
33
State Composition Example
[Diagram: a seven-node TBŌN (root CP0; internal nodes CP1, CP2; leaves CP3-CP6). Streams of integer packets (values such as 4, 3, 1, 5, 8, 9) flow upward; every node's filter state starts as the empty set { }. The filter computes the set of distinct integers seen.]
35
State Composition Example
[Snapshot: the leaves have forwarded their first packets; CP1's state is now {1,3}, CP2's is {1}, and CP0's is still { }. The remaining packets are in flight.]
36
State Composition Example
[Snapshot: CP1's state has grown to {1,3,4,5}, CP2's to {1,5,8}, and CP0's to {1,3}; several packets remain in flight.]
37
State Composition Example
[Snapshot: CP1 = {1,3,4,5}, CP2 = {1,5,8,9}, CP0 = {1,3,4,5,8}; the last few packets are in flight.]
38
State Composition Example
[Snapshot: all packets consumed; CP0's state reaches {1,3,4,5,8,9}.]
39
State Composition Example
[Snapshot: the failure-free run completes with root state and output {1,3,4,5,8,9}.]
40
State Composition Example
[Snapshot: rewind to the point where CP1 = {1,3,4,5}, CP2 = {1,5,8}, and CP0 = {1,3}, with packets still in flight — and CP0 crashes!]
41
State Composition Example
• Use f on the children's states to regenerate a computationally-consistent version of the lost state:

fs(CP0) ≡ fs(CP1) ⊕ fs(CP2)

[Diagram: the crashed CP0's state is recomputed from CP1 = {1,3,4,5} and CP2 = {1,5,8}.]
42
State Composition Example
[Side-by-side: in the failure-free run CP0's state was {1,3}; the regenerated state is fs(CP0) ≡ fs(CP1) ⊕ fs(CP2) = {1,3,4,5} ∪ {1,5,8} = {1,3,4,5,8}.]

Non-identical, but computationally-consistent!
43
State Composition Example
[Side-by-side: both runs process the remaining packets; each reaches CP1 = {1,3,4,5} and CP2 = {1,5,8,9}, with the recovered run's CP0 already holding {1,3,4,5,8}.]
44
State Composition Example
[Side-by-side: both runs converge; CP0's state reaches {1,3,4,5,8,9} in each.]
45
State Composition Example

[Side-by-side: final state; the failure-free and the recovered runs emit the same root output, {1,3,4,5,8,9}.]
46
Reliability Highlights
• Zero-cost TBŌN reliability requirements:
  1. Associative/commutative filter function
  2. Filter state and output have the same representation, or
  3. A known mapping from the filter-state representation to the output form
• The filter function itself is used for regeneration
• Many computations meet these requirements (see the sketch below)
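As a quick illustration (a hypothetical check, not part of MRNet), set union from the preceding example satisfies these requirements: it is associative and commutative, and its state and output share one representation.

    #include <cassert>
    #include <set>

    using State = std::set<int>;

    // Set union: state and output share the same representation.
    State u(const State& a, const State& b) {
        State r = a;
        r.insert(b.begin(), b.end());
        return r;
    }

    int main() {
        State A{1, 3}, B{4, 5}, C{1, 8};
        assert(u(u(A, B), C) == u(A, u(B, C)));  // associative
        assert(u(A, B) == u(B, A));              // commutative
    }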
47
Other Issues
• Compensating for lost messages
  – Use computational state to compensate
  – Idempotent vs. non-idempotent computations
• Other state regeneration mechanisms
  – Decomposition
• Failure detection
• Tree reconstruction
• Evaluation of the recovery process
48
MRNet References
• Arnold, Pack, and Miller, "Tree-based Overlay Networks for Scalable Applications", Workshop on High-Level Parallel Programming Models and Supportive Environments, April 2006.
• Roth and Miller, "The Distributed Performance Consultant and the Sub-Graph Folding Algorithm: On-line Automated Performance Diagnosis on Thousands of Processes", Principles and Practice of Parallel Programming, March 2006.
• Schulz et al., "Scalable Dynamic Binary Instrumentation for Blue Gene/L", Workshop on Binary Instrumentation and Applications, September 2005.
• Roth, Arnold, and Miller, "Benchmarking the MRNet Distributed Tool Infrastructure: Lessons Learned", 2004 High-Performance Grid Computing Workshop, April 2004.
• Roth, Arnold, and Miller, "MRNet: A Software-Based Multicast/Reduction Network for Scalable Tools", SC 2003, November 2003.
49
Summary
• The TBŌN model is suitable for many types of tools, applications, and algorithms
• Future work:
  – Evaluation of reliability mechanisms (coming real soon!)
  – Performance modeling to support topology decisions
  – TBŌNs on emerging HPC networks and technologies
  – Other application areas: GIS, bioinformatics, data mining, ...
50
Funding Acknowledgements
• This research is sponsored in part by the National Science Foundation under Grant EIA-0320708.
• This research is also sponsored in part by the Office of Mathematical, Information, and Computational Sciences, Office of Science, U.S. Department of Energy under Contract No. DE-AC05-00OR22725 with UT-Battelle, LLC.
• Accordingly, the U.S. Government retains a non-exclusive, royalty-free license to publish or reproduce the published form of this contribution, or allow others to do so, for U.S. Government purposes.
51
EXTRA SLIDES
52
MRNet Front-end Interface

void front_end_main() {
    Network *net = new Network(topology_conf);              // build the tree from a topology configuration
    Communicator *comm = net->get_BroadcastCommunicator();  // reaches all back-ends
    Stream *stream = new Stream(comm, IMAX_FILT, WAITFORALL); // integer-max filter, synchronized

    int result;
    stream->send("%s", "go");     // multicast "go" downstream
    stream->recv("%d", &result);  // blocks until the reduced max arrives
}
53
MRNet Back-end Interface
void back_end_main() {
    Stream *stream;
    char *s;

    Network *net = new Network();  // back-end attaches to the existing tree

    net->recv("%s", &s, &stream);  // wait for a packet; also yields its stream

    if (strcmp(s, "go") == 0) {    // compare string contents, not pointers
        int rand_int = rand();     // this back-end's value
        stream->send("%d", rand_int); // send it upstream for reduction
    }
}
54
MRNet Filter Interface
void imax_filter(vector<Packet>& packets_in, vector<Packet>& packets_out) {
    int result = packets_in[0].get_int();              // seed with the first value
    for (unsigned i = 1; i < packets_in.size(); i++)
        result = max(result, packets_in[i].get_int()); // running integer max

    Packet p("%d", result);        // single output packet
    packets_out.push_back(p);      // forward the max upstream
}