



The Network Simplex Method on a Multiprocessor

Jorg Peters

Center of Mathematical Sciences and Computer Sciences Department, University of Wisconsin-Madison, Madison, Wisconsin 53706

We compare several implemented approaches to parallelizing the network simplex method on the SEQUENT shared memory multiprocessor. The experiments underscore the importance of parallel pricing and show that specialized processes with single pivots are more efficient than uniform processes with parallel pivots. We describe the PARNET implementation, which combines the best features of the experimental codes. In its least parallel version, PARNET outperforms NETFLO, a standard sequential code, by a factor of 12. The total execution time (including I/O and statistics) for any problem with 5000 nodes and 25,000 arcs taken from a standard set of NETGEN benchmark problems is less than 25 sec wall clock time on the Sequent Symmetry S-81 using six processors. The incremental speedup is linear up to six processors on the test set and improves with the problem size and the density of the underlying graph. For a problem with 1000 nodes and 500,000 arcs, PARNET achieves incremental linear speedup up to 12 processors.

1. INTRODUCTION

Because of superior efficiency and stability, the network simplex algorithm has emerged as the standard method for solving the minimal cost network flow problem:

min cx
s.t. Ax = r        (NF)
     0 ≤ x ≤ u,

where c, u, and x ∈ Z^n, r ∈ Z^m, and A is a node-arc incidence matrix of dimension m × n with m ≤ n. This paper reports a series of experiments to determine a suitable strategy for encoding the network simplex algorithm on the SEQUENT shared memory multiprocessor and describes the PARNET implementation

This research was supported in part by NSF grant CCR-8709952 and AFOSR grant AFOSR-86-0194 and by NSF DMS-8701275.

NETWORKS, Vol. 20 (1990) 845-859. © 1990 John Wiley & Sons, Inc. CCC 0028-3045/90/0700845-015$04.00


that combines the best features of the experimental codes. Based on extensive tests, PARNET uses specialized parallel processes and stresses the importance of parallel pricing over parallel pivoting.

We briefly review the basics of the network simplex method, referring to [3] for a detailed exposition. The minimal cost network flow problem is a linear program. It is special, however, in that A represents a graph and is more naturally dealt with in this sparse representation. The two nonzero entries in each column of A correspond to an edge of the graph: a "1" entry in row i and a "-1" entry in row j represent an arc from node i to node j. The unit costs of the flows, x, on the arcs are stored in c and the arc capacities in u. A negative (positive) entry in r indicates demand (supply) at the corresponding node. The efficiency of the network simplex algorithm is due to the fact that the subgraph corresponding to the basic part A_B of A is a tree. To define the three major subalgorithms of the simplex approach, we split A, c, and x into a basic and a nonbasic part, e.g., A = (A_B, A_N), and define the "dual" variables π_B := c_B A_B^{-1}. Each subalgorithm corresponds to a graph operation.

Pricing [Selection of the pivot column]. The Pricing operation seeks (nonbasic) arcs i for which the "reduced cost" (c_N − π_B A_N)_i is negative. Conceptually, we solve Ax = r for x_B to obtain x_B = A_B^{-1}(r − A_N x_N). Since cx = c_B x_B + c_N x_N = const + (c_N − π_B A_N) x_N, arcs with a negative reduced cost can decrease the objective function as they enter the basis. By maintaining the dual variables, the reduced cost of an arc i, from node k to node l, is easily checked as c_i − π_k + π_l < 0.
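As a concrete illustration, the following minimal C sketch scans a block of arcs for the most negative reduced cost; the array names follow Section 3, but the routine is illustrative rather than PARNET's actual code, and the FLIP bookkeeping for arcs at their upper bound is omitted.

```c
/* Sketch of the pricing test: scan arcs [lo, hi) and return the index
 * of the arc with the most negative reduced cost, or -1 if none
 * prices out. Array names follow Section 3 but are illustrative. */
static int price_arcs(int lo, int hi, const int *cost,
                      const int *from, const int *into, const int *dual)
{
    int best = -1, best_rc = 0;   /* only rc < 0 qualifies */
    for (int a = lo; a < hi; ++a) {
        /* reduced cost of arc a from node k = from[a] to l = into[a] */
        int rc = cost[a] - dual[from[a]] + dual[into[a]];
        if (rc < best_rc) { best_rc = rc; best = a; }
    }
    return best;
}
```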

Cycling [Selection of the pivot row]. The Cycling operation determines how the structure and flows of the basis tree change as a new arc enters the basis. Adding an arc to the basis tree produces a cycle on which flow can be redistributed. If the flow change is infinite, the problem is unbounded. Otherwise, the (first) arc that allows the least flow change leaves the basis. (If this arc is the entering arc, it is set to its upper bound and the basis tree remains unaltered.) In our implementations, Cycling determines the bounding arc, the bounding flow, and the four nodes cut, notcut, join, and severed that contain the information for restructuring the tree after the basis change. As illustrated in Figure 1.1, cut and notcut are the nodes connected by the entering arc. Cut is the node above which the old tree is cut as the bounding arc leaves the basis. Severed is the node on the bounding arc furthest away from the root of the basis tree. Join is the first common ancestor of cut and notcut.
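Given the predecessor (PRED) and subtree-size (NDES) arrays introduced in Section 3, join can be located by a standard walk, sketched below for illustration only (not PARNET's actual routine):

```c
/* Sketch: nearest common ancestor (join) of nodes u and v in the
 * basis tree. Since an ancestor roots a strictly larger subtree, the
 * node with the smaller subtree cannot be an ancestor of the other,
 * so stepping it toward the root never overshoots the join. */
int find_join(int u, int v, const int *pred, const int *ndes)
{
    while (u != v) {
        if (ndes[u] < ndes[v]) u = pred[u];
        else                   v = pred[v];
    }
    return u;
}
```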

Updating changes the flow, the dual variables, and the structure of the tree. The flow within the cycle changes according to the Cycling analysis, and so do the dual variables of the reattached subtree, that is, of the subtree originally attached to severed and now reattached as a subtree of notcut. In our implementation, arcs leaving the basis at the upper bound reverse orientation.
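Since the preorder list SUCC (Section 3) visits the nodes of a subtree contiguously, the dual update over the reattached subtree reduces to adding one constant along that segment. A hedged C sketch, assuming NDES counts a node together with its descendants:

```c
/* Sketch: shift the duals of the subtree rooted at `root` by `delta`,
 * walking its ndes[root] nodes along the preorder thread SUCC.
 * (If NDES excludes the root itself, the bound is ndes[root] + 1.) */
void shift_subtree_duals(int root, int delta,
                         const int *succ, const int *ndes, int *dual)
{
    int v = root;
    for (int k = 0; k < ndes[root]; ++k) {
        dual[v] += delta;
        v = succ[v];
    }
}
```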

Sections and Content

In Section 2, we characterize two approaches to parallelizing the network simplex method, namely, "uniform parallelism" with parallel pivots and "specialized parallelism" with single pivots.


[Figure: schematic basis tree showing the root, join, the entering arc between cut and notcut, and the affected subtrees.]

FIG. 1.1. Schema of a network pivot.

Uniform parallelism is shown to be more complex and less efficient. PARNET relies on specialized parallelism, as detailed in Section 3. The performance of the code on the SEQUENT shared memory multiprocessor is charted in Section 4. We solve standard benchmark problems with 12,500 to 500,000 variables and 1000 to 50,000 constraint equations. We compare PARNET with a standard sequential code, NETFLO, and show that the incremental linear speedup improves with the problem size and the density of the underlying graph. Section 5 concludes with experimental data showing that PARNET executes fewer expensive pivots as the number of processors increases.

2. TWO APPROACHES TO PARALLELIZING SIMPLEX-BASED ALGORITHMS

We now examine the basic operations in the multiprocessor setup. As a general principle, arc information is only read and, hence, need not be protected by locks. Pricing is easily parallelized by associating each processor with a fixed set of arcs, e.g., the ith processor with the ith part of the arc list sorted by nodes (see the sketch below). However, for the results to be accurate, the duals must be correct. This implies that a "timely" update of the values and structure of the basis tree is the main challenge. Since changes in the tree structure interfere with traversal, Cycling and Updating have to be synchronized. This gives rise to two general approaches to parallelizing the network simplex method.
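A minimal sketch of such a static partition, with all names illustrative:

```c
/* Sketch: assign pricing process i the ith contiguous block of the
 * arc list; the process then prices only arcs in [*lo, *hi). */
void arc_block(int i, int nprocs, int narcs, int *lo, int *hi)
{
    int chunk = (narcs + nprocs - 1) / nprocs;   /* ceiling division */
    *lo = i * chunk;
    *hi = (i + 1) * chunk < narcs ? (i + 1) * chunk : narcs;
    /* for small narcs the last blocks may be empty ([lo, hi) vacuous) */
}
```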


2.1. Uniform Parallelism with Parallel Pivots

If each processor runs identical code and performs Pricing, Cycling, and Updating on its own, we call the parallelism uniform. Note that uniform parallelism implies concurrent pivots. Two specific strategies are characterized by the granularity of the objects locked for synchronization, i.e., by how dynamically processors can be associated with parts of the underlying graph.

In the subgraph locking approach, every process owns a collection of subgraphs, each protected by a lock. If a process needs nodes or arcs outside its collection, it acquires subgraphs from other processes. For network flow problems, the subgraphs are subtrees and the efficiency of the approach depends crucially on the existence of a large number of “independent” subtrees at all stages of the computation, so that processors need not compete and wait for locked parts of the graph.

In the cycle locking approach, each process tries to acquire all nodes of a cycle. Each node is protected by a lock. The locks (nodes) are owned just as long as is necessary to perform a pivot. If a process cannot obtain a complete cycle, it backtracks and relinquishes the nodes to avoid deadlock (see the sketch below). The efficiency of cycle locking depends on the existence of a large number of nonoverlapping cycles throughout the computation. Figure 2.1 illustrates the concept of uniform parallelism.
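A hedged sketch of the deadlock-avoiding acquisition, with POSIX mutexes standing in for the Sequent's locking primitives:

```c
#include <pthread.h>

/* Sketch: try to lock every node on a candidate cycle; on any failure,
 * release what was taken and report failure so the process can back
 * off and retry, avoiding deadlock. Returns 1 on success. */
int try_lock_cycle(pthread_mutex_t *node_lock, const int *cycle, int len)
{
    for (int k = 0; k < len; ++k) {
        if (pthread_mutex_trylock(&node_lock[cycle[k]]) != 0) {
            while (k-- > 0)                     /* backtrack */
                pthread_mutex_unlock(&node_lock[cycle[k]]);
            return 0;
        }
    }
    return 1;   /* whole cycle owned; the pivot may proceed */
}
```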

Uniform parallelism is conceptually well suited for largely independent subproblems, e.g., for "staircase"-structured constraint matrices. For standard NETGEN problems (see Section 4), however, the approach turns out to be inefficient. To assess the usefulness of the subgraph locking approach, we counted independent subtrees during the execution of PARNET runs. Over all problems with 5000 nodes, we counted an average of just 2.5 (maximally 10) independent subtrees of size 30 or larger at any point in time. The same standard set of problems also caused considerable amounts of backtracking and idle waiting for the cycle locking code.

[Figure: time line in which several processes, each running identical code, repeatedly perform Pricing, Cycling, and Updating, followed by joint output.]

FIG. 2.1. Uniform parallelism: processes have identical code.


The number of aborted pivots increased dramatically in the second half of the computation. A hybrid approach, with PARNET as a second stage, turned out to be inefficient, since the subtrees assembled in the "efficient" first half were far from optimal and needed as much time for restructuring as PARNET alone does for solving the problem from scratch. A likely explanation for the poor performance of the approach is that arcs causing large cycles have a low chance of entering the basis, yet may be crucial for an efficient path to the global optimum. We refer to [4] for details of analysis and implementation.

2.2. Specialized Parallelism with Single Pivots

If processors have differing, specialized codes and perform only one or two of the three basic operations, we call the parallelism specialized. Locking overhead can be minimized by performing only one pivot per time slot. Again, we distinguish two strategies.

The pricing heuristics approach emphasizes Pricing: n − 1 processors compete in the search for the most promising arc (in the sense of the heuristic), while the nth processor performs Cycling and Updating.

[Figure: two time lines. Top (pricing heuristics): after input and initialization, processes p_1, p_2, ... price continuously while one process cycles and updates. Bottom (parallel update): all processes price, one cycles, and all update; both end with output.]

FIG. 2.2. Pricing heuristics schema (top) and parallel update schema (bottom).


Pricing, on possibly incorrect data, continues during the update. The efficiency of the approach depends on the robustness of the pricing heuristic, since it is fed incorrect data and is interrupted as soon as the nth processor finishes its pivot. The approach has no idle waiting.

The parallel update approach consists of a parallel pricing step, a stage during which one processor determines the cycle, and a parallel update phase including all processors as illustrated in Figure 2.2.

Since updating turns out to be relatively cheap, but requires some overhead when done in parallel, the pure parallel update strategy is not competitive. The (simple) pricing heuristics approach, however, performs surprisingly well. It serves as a basis for PARNET.

3. PARNET

PARNET combines the best features of the parallel updating and the pricing heuristics approaches. We first describe the data structures. The m × n constraint matrix A is represented as a graph with n arcs: for each arc i, we record the node from which the arc emanates (FROM_i), the node to which it leads (INTO_i), the associated cost (COST_i), and the capacity (CAP_i). A bit array (FLIP) keeps track of variables at upper bounds. The primal variables (FLOW) and the dual variables (DUAL) need only be recorded for basic variables and, hence, are, like all further arrays, of length m. The basis tree is defined by specifying a predecessor (PRED) for each node. A preorder list of nodes (SUCC) reduces the cost of visiting the nodes of a subtree, and the number of descendants (NDES), stored with each node, helps locate join during the Cycling operation. Since NDES records the size of subtrees, PARNET also uses it to decide whether to divide work among several processors or avoid distribution overhead. Finally, ASUC, with i = ASUC[SUCC[i]], helps to efficiently locate the predecessor of severed.
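Gathered in one place, these arrays might be declared as follows; this is a sketch only, since the paper gives no declarations, and the types and grouping are assumptions:

```c
/* Hedged sketch of PARNET's data structures for m nodes and n arcs. */
typedef struct {
    /* per arc (length n); read-only during the solve */
    int *FROM, *INTO, *COST, *CAP;
    unsigned char *FLIP;  /* one bit per arc: nonbasic at upper bound */
    /* per node (length m); describe the basis tree */
    int *FLOW, *DUAL;     /* primal flow and duals of basic variables */
    int *PRED;            /* predecessor in the basis tree */
    int *SUCC;            /* preorder thread through the tree */
    int *NDES;            /* subtree sizes */
    int *ASUC;            /* inverse thread: i == ASUC[SUCC[i]] */
} network;
```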

Communication between the processes is restricted to two priority queues (cf. [5]). Both play a crucial role in the efficiency of PARNET. The setup is illustrated in Figure 3.1; p_0 denotes processor 0 and p_i any processor in {1, . . . , n − 1}.

The head of the price-queue stores the currently most promising arc. PARNET chooses the priority queue to be a stack. The stack is filled by the p_i and emptied by p_0, and the most promising arc is the arc with the most negative reduced cost. When one of the p_i finds an arc with negative reduced cost, it locks the price-queue and deposits the newly found arc on top, provided the arc is more promising than the arc currently at the top. This enforces a global comparison of the candidate arcs as opposed to the local choice of uniform parallelism. The p_i price throughout the update at a rate measured as one arc for each node updated by p_0. After the update is completed, p_0 acquires the stack and returns a new empty stack to the p_i.
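A hedged C sketch of the locked deposit; POSIX locks stand in for the Sequent's primitives, and the fixed capacity is an assumption:

```c
#include <pthread.h>

#define PQ_CAP 1024               /* assumed capacity */

typedef struct {
    pthread_mutex_t lock;
    int top;                      /* number of entries on the stack */
    int arc[PQ_CAP];              /* arc indices, most promising on top */
    int rc[PQ_CAP];               /* their (negative) reduced costs */
} price_queue;

/* A p_i deposits arc a with reduced cost rc only if it is more
 * promising (more negative) than the arc currently on top. */
void pq_offer(price_queue *q, int a, int rc)
{
    pthread_mutex_lock(&q->lock);
    if ((q->top == 0 || rc < q->rc[q->top - 1]) && q->top < PQ_CAP) {
        q->arc[q->top] = a;
        q->rc[q->top]  = rc;
        q->top++;
    }
    pthread_mutex_unlock(&q->lock);
}
```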



FIG. 3.1. PARNET schema.

If the stack acquired by p_0 is empty (this occurs usually three, at most eight, times at the end of the computation), p_0 starts to price all arcs itself. If p_0 finds no arc with negative reduced cost and the search is not interrupted by any new entry in the price-queue, p_0 notifies the p_i and the algorithm terminates. Otherwise, p_0 reprices the first few arcs on the stack to make sure that the prices and the ordering are correct. Since the update progresses as the p_i price out, the ordering of the stack is more reliable toward the top. In its current version, PARNET improves this basic setup by maintaining several unlocked price-queues, one for each p_i, i ∈ {2, . . . , n − 1}. Process p_1 reprices these local queues periodically and selects the best arcs for p_0.

The work-queue transfers work from p_0 to the p_i. PARNET uses a stack (currently limited to three groups of entries, since the update is not fully parallelized). When the number of nodes on the path from join to notcut (counted during Cycling) exceeds the parameter wqmin, p_0 enters the structure and flow changes of the path into the work-queue. Similarly, if NDES[cut] exceeds wqmin, the data for updating the subtree of cut are entered. The p_i check the work-queue frequently and, if there is an entry, lock the work-queue, remove the entry, unlock, and perform the update. PARNET uses wqmin = 10, so that the extra work for p_0 in entering the information is less than the update work.
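The consumer side might look as follows; the entry layout is invented for illustration, and POSIX locks again stand in for the Sequent's primitives:

```c
#include <pthread.h>

/* Hypothetical work-queue entry: which subtree or path to update and
 * by how much its flows and duals change. */
typedef struct { int root; int flow_delta; int dual_delta; } work_item;

typedef struct {
    pthread_mutex_t lock;
    int top;
    work_item item[3];    /* at most three groups of entries */
} work_queue;

/* Called frequently by each p_i: returns 1 and fills *w if work was
 * acquired; the update itself runs outside the critical section. */
int wq_try_acquire(work_queue *q, work_item *w)
{
    int got = 0;
    pthread_mutex_lock(&q->lock);
    if (q->top > 0) { *w = q->item[--q->top]; got = 1; }
    pthread_mutex_unlock(&q->lock);
    return got;
}
```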

PARNET is summarized as follows:

Step 0: p_i and p_0: Read and initialize the artificial starting basis for the big-M method.

Step 1: p_i: Price out. Check the work-queue. If the work-queue is nonempty, acquire work. If notified by p_0, go to Step 2. Repeat Step 1.
p_0: If there is no valid entry in the price-queue, check for termination. If no arc with negative reduced cost is found, notify the p_i and go to Step 2. Take the most promising arc from the price-queue. Cycle and update (fill the work-queue). Repeat Step 1.

Step 2: p_i and p_0: Output flows. Compute the objective function value.


4. TESTING PARNET

We tested PARNET extensively on NETGEN benchmark problems. A NETGEN problem is generated from a list of parameters, e.g., the number of nodes and the percentage of bounded arcs (cf. [2]). The basic test problem, 101, and the variations are displayed in Table II at the end of this section. PARNET's performance was measured on two versions of the SEQUENT multiprocessor, the older Balance B-21000 with 8 NS 32032 processors and the (January 1988) Symmetry S-81 [6] with 10 Intel 80386 processors. The newer version has fewer integer registers, but the processors are approximately three times as fast and the cache size is doubled to 16 kbytes. Physical memory is currently limited to 40 Mbytes.

We compare PARNET first with NETFLO [3], a standard sequential code, and then measure the speedup of PARNET(j + 1) over PARNET(j), i.e., of PARNET with j + 1 processors over PARNET with j processors.

4.1. PARNET(3) Compared to a Single Processor Code

In its current version, PARNET uses at least 3 processors. Table I below compares PARNET(3) with NETFLO on the Symmetry version for a representative set of NETGEN problems. Depending on the problem, PARNET(3) is 10-20 times faster than NETFLO. The conspicuous difference in the number of iterations is explained in Section 5.

TABLE I. PARNET(3) vs. NETFLO.

        Time (sec)                              Iterations
pbm     NETFLO    PARNET(3)   PARNET(7)     NETFLO    PARNET(3)
101      729.30      58.97       16.18       50307      14739
104      802.60      65.33       25.96       51006      36095
106      399.60      29.98       13.32       32528      11421
110      430.50      28.19       12.42       31242      10804
115      698.10      50.11       14.22       44857      13041
116      636.30      65.99       16.69       36618      13303
117      344.90      24.54       10.50       23707       9269
121      802.70      81.62       30.31       89088      26535
122      802.60      72.28       21.91       78026      21378
126      441.60      31.63       14.70       41036      12342
130      500.50      32.26       16.06       43681      12470
134      195.40      23.94        6.26       15749       9891
138     1308.80     101.33       42.15      108822      31022
142      802.60      48.21       14.24       64964      16829
144      794.20      71.03       19.05       71496      19249
147     2530.70     116.40       37.59      195150      27297
150      802.60      71.62       15.13       85584      20797


4.2. Speedup

An important aspect of any parallel code is how efficiently it uses additional processors. Since the amount of data collected for PARNET is considerable (50 test problems, three runs, three to 13 processors) and since the details have been reported in [4], Figure 4.2 displays only the aggregated and averaged results for problems of the same arc size but otherwise differing characteristics. PARNET's primary goal is to achieve a low total execution time. Consequently, performance is measured in wall clock time (in seconds) from start to end of the program. This includes input, initialization, computation, and output of the primal (and dual) solution, as well as some statistics. Figure 4.2 shows real time vs. the number of processors rather than the time for 1 processor divided by the time for n processors vs. n, since the diagonal of such a speedup diagram is meaningless unless the single processor program is optimal. (The speedup improves as the single processor program is made more inefficient.) Two types of lines indicate changes in efficiency as processors are added. The solid line gives the measured times as the number of processors increases (with the data points connected to help the eye).

[Figure: wall clock time (sec) vs. number of processors (1-14); one curve per arc size: 125 = 12,500 arcs, 250 = 25,000 arcs, 375 = 37,500 arcs, 500 = 50,000 arcs, 750 = 75,000 arcs; dotted segments mark linear speedup.]

FIG. 4.2. Average running times for PARNET on NETGEN benchmark problems.


[Figure: wall clock time (sec) vs. number of processors (1-10) for the large problems: 20,000 nodes by 101,454 arcs; 30,000 by 152,057; 40,000 by 202,725; 50,000 by 253,419; dotted segments mark linear speedup.]

FIG. 4.3. PARNET running times for large NETGEN problems.

The dotted line emanating from each data point shows the incremental linear speedup from j to j + 1 processors, that is, the maximal possible reduction in the computation time as the (j + 1)st processor is added to the workforce, assuming that the total amount of work remains unchanged: with j processors taking time t(j), perfect use of one more processor yields t(j + 1) = t(j) · j/(j + 1). If the additional processor decreases the overall work, e.g., by choosing better pivots, the dotted line can lie above the solid line. PARNET(j) would then do better by simulating PARNET(j + 1) (which is, however, difficult, since the parallel computations are nondeterministic). We say that an additional processor is used efficiently if the incremental speedup is at least linear.

Figure 4.2 shows that, depending on the problem size, there is a break-even point for efficiency at around six processors. The precise point is a function of the pricing rate per dual update (cf. Section 3) and the size and density of the underlying graph (see below); if k processors can price all arcs during one update, a (k + 1)st processor will only increase the total execution time, since p_1 is slowed down by the additional price-queue.

However, Figure 4.2 suggests that more processors can be used efficiently as the problem size increases. To test this hypothesis, we created four large problems, based on the characteristics of problem 101, but with 4, 6, 8, and 10 times as many nodes, arcs, sinks, sources, and supplies. We ran each problem three times.


[Figure: wall clock time (sec) vs. number of processors (1-13) for the dense problems 134 (1000 nodes by 25,000 arcs), 238 (1000 by 100,000), 239 (1000 by 250,000), and 240 (1000 by 500,000); dotted segments mark linear speedup.]

FIG. 4.4. PARNET running times for dense NETGEN problems.

Figure 4.3 supports the claim that the parallelism improves with increasing problem size, but the effect is weaker than we hoped for. Only one additional processor is used efficiently as the size of the problem increases from (5000 × 50,000) to (50,000 × 250,000). However, Figure 4.4 asserts that more processors can be used efficiently as the density of the underlying graph increases; that is, more processors can be used efficiently to deal with the combinatorial aspect of the network flow problem, quite in contrast to uniform parallelism, which depends on sparse graphs so that subgraphs (trees or cycles) do not overlap. The problems of Figure 4.4 are of type 101, but with the number of nodes fixed at 1000 and the number of arcs growing from 25,000 to half the number of arcs of the complete graph. Problem 134 exhibits incremental linear speedup up to eight processors, problem 238 up to nine, problem 239 up to 10, and problem 240 up to 12 processors.

5. PARNET REDUCES THE NUMBER OF COSTLY PIVOTS

This section attempts to explain why the PARNET approach is efficient. To this end, we analyze the distribution of work over time. The data were collected during the timed runs to make sure that the analysis applies to the performance measured in Section 4. (The overhead was minimal and is included in the timings.)

[Figure: two panels of time (sec) vs. number of pivots (in 1000s); curves are labeled problem-processors (102-k in the top panel, 123-k in the bottom panel, for k = 2, 3, 4 processors).]

FIG. 5.1. Pricing heuristics decrease the number of expensive pivots.

[Figure: average cycle size and average reattached-subtree size vs. number of pivots (in 1000s) for problem 123 solved with nine processors (123-9).]

FIG. 5.2. Expensive updates coincide with the reattachment of large subtrees.


TABLE II. NETGEN benchmark problems (transshipment problems).

Parameter                  Problem 101     Variations
Total supply               250,000         12,500 to 6,250,000
Nodes                      5,000           1,000 to 10,000
Sources                    2,500 (500)     50 to 2,500
Sinks                      2,500 (500)     50 to 2,500
Transportation sources     0 (500)         (50 to 1,500)
Transportation sinks       0 (500)         (50 to 1,500)
Arcs                       25,000          12,500 to 75,000
Min. cost                  1               -100 to 1,001
Max. cost                  100             -1 to 1,100
% high cost arcs           0
% capacitated arcs         100             0 to 100
Min. capacity              1
Max. capacity              1,000           50 to 5,000

Figure 5.1 plots time vs. pivots to show the cost of pivots as the computation proceeds. Since averaging the data over several problems would defeat the purpose, we display two typical examples (from the earlier, locked price-queue version of PARNET). The number of processors used is indicated after the problem name, i.e., 123-3 labels the graph of problem 123 solved with PARNET(3). The first 6,000 pivots of problem 102 with PARNET(5), for example, take about 19 sec and the first 14,000 pivots take 51 sec.

From the data, it is clear that not only the number of pivots but also the number of expensive pivots decreases as more processors enter the pricing process. In fact, for all problems of Section 4, the characteristic "knee", i.e., the maximum of the second difference of each graph, remains fixed as the number of processors increases and the total number of pivots decreases.

Finally, Figure 5.2 explains why later pivots are more expensive. We plot the average cycle size and the average size of the reattached subtrees against the number of pivots. The data show that updates are expensive because large subtrees are reattached. Conversely, this implies that PARNET is efficient because its basis subtrees change little. Since we use additional processing power largely to improve pricing, we conjecture that thorough pricing has a stabilizing influence on the evolution of basis trees.

Further research is under way to test whether the ideas of this approach carry over to more general simplex-based methods (see e.g. [7]).

I thank R. R. Meyer for his support, J. L. Kennington for his version of NETFLO, and J. Mote for NETGEN and the test problem set.

REFERENCES

[1] M. D. Chang, M. Engquist, R. Finkel, and R. R. Meyer, A parallel algorithm for generalized networks. Parallel Optimization on Novel Computer Architectures 14 (1988) 125-145.

[2] D. Klingman, A. Napier, and J. Stutz, NETGEN: A program for generating large scale capacitated assignment, transportation, and minimum cost flow network problems. Management Sci. 20(5) (1974).

[3] J. Kennington and R. Helgason, Algorithms for Network Flow Programming. John Wiley, New York (1982).

[4] J. Peters, A parallel algorithm for minimal cost network flow problems. Technical Report No. 762, Department of Computer Sciences, University of Wisconsin-Madison, April 1988.

[5] R. Sedgewick, Algorithms. Addison-Wesley, Reading, Massachusetts (1982).

[6] Symmetry technical summary. Sequent Computer Systems, Inc. (1987).

[7] R. H. Clark and R. R. Meyer, Parallel arc-allocation algorithms optimizing generalized networks. Technical Report No. 862, Department of Computer Sciences, University of Wisconsin-Madison, July 1989.

Received July 1988; accepted September 1989