
J. Parallel Distrib. Comput. 65 (2005) 654–665
www.elsevier.com/locate/jpdc

Iterative list scheduling for heterogeneous computing

G.Q. Liu∗, K.L. Poh, M. Xie
Department of Industrial and Systems Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260, Singapore

Received 23 March 2003; received in revised form 6 July 2004; accepted 15 January 2005

Abstract

Optimal scheduling of parallel applications on distributed computing systems represented by directed acyclic graphs (DAGs) is NP-complete in the general case. List scheduling is a very popular heuristic method for DAG-based scheduling. However, it is more suited to homogeneous distributed computing systems. This paper presents an iterative list scheduling algorithm to deal with scheduling on heterogeneous computing systems. The main idea in this iterative scheduling algorithm is to improve the quality of the schedule in an iterative manner using results from previous iterations. The algorithm first uses the heterogeneous earliest-finish-time (HEFT) algorithm to find an initial schedule and then iteratively improves it. Hence the algorithm can potentially produce shorter schedule lengths. The simulation results show that in the majority of the cases, there is significant improvement to the initial schedule. The algorithm is also found to perform best when the tasks to processors ratio is large.
© 2005 Elsevier Inc. All rights reserved.

Keywords: Task scheduling; Heterogeneous computing systems; List scheduling; Randomly generated DAGs

1. Introduction

The scheduling of parallel applications is highly critical to the effective performance of a distributed computing system. A popular representation of a parallel application is the directed acyclic graph (DAG), in which the nodes represent application tasks and the directed arcs or edges represent inter-task dependencies, such as precedence constraints. As the problem of finding the optimal schedule is NP-complete [9] in the general case, several heuristic algorithms have been proposed. These algorithms may be broadly classified into the following four categories:

• Task-duplication-based (TDB) scheduling [1–5,12,21–23].

• Bounded number of processors (BNP) scheduling [6,7,10,17,18,20,24,25,29,31].

• Unbounded number of clusters (UNC) scheduling [13,14,26,28,31,32].

• Arbitrary network topology (ANP) scheduling [8,19,27].

∗ Corresponding author. E-mail address: [email protected] (G.Q. Liu).

0743-7315/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jpdc.2005.01.002

In TDB scheduling, the basic idea is to reduce the communication overhead by allocating some tasks to multiple processors. Non-TDB algorithms, which assume arbitrary task graphs with arbitrary times on nodes and edges, can be divided into two categories: one category assumes that the processors are fully connected to each other, which means that there is no communication contention; the other category assumes that the processors are linked by an arbitrary network topology (ANP), which means the scheduling process must consider the communication contention. The former category can be further divided into two categories: unbounded number of clusters (UNC) scheduling algorithms and bounded number of processors (BNP) scheduling algorithms. The algorithm presented in this paper belongs to the last category. More detailed descriptions and classifications of various scheduling strategies can be found in [16].

List scheduling is a very popular method for BNP scheduling. The basic idea of list scheduling is to assign priorities to the tasks of the DAG and place the tasks in a list arranged in descending order of priorities. A task with a higher priority is scheduled before a task with a lower priority, and ties are broken using some method.


To compute the priorities of the tasks, the DAG must be labeled with the computation times of the tasks and the communication times of the edges. We differentiate the computation time label of a task in a DAG from the actual computation times of a task on all the processors, and refer to the former as the time-weight of the task. Similarly, we refer to the communication time label of an edge on the DAG as the time-weight of the edge.

In a homogeneous distributed computing system, the computation times of a task on different processors are the same. Hence, the time-weight of a node is its computation time on any processor. Similarly, in a homogeneous distributed computing system, the communication times between two tasks on any link are the same. Hence the time-weight of an edge is the communication time between the corresponding two tasks on any link. In a heterogeneous distributed computing system, on the other hand, the computation time of a task on different processors may be different, and so may be the communication time between two tasks on different links. Hence, the time-weight of every node and the time-weight of every edge labeled on the DAG have to be computed during the scheduling process.

Several variant list scheduling algorithms have been proposed to deal with the heterogeneous environment, for example, the mapping heuristic (MH) [8], the dynamic-level scheduling (DLS) algorithm [27], the levelized-min time (LMT) algorithm [11], and the heterogeneous earliest-finish-time (HEFT) algorithm [29]. The HEFT algorithm [29] significantly outperforms the DLS algorithm, MH, and the LMT algorithm in terms of average schedule length ratio, speedup, etc. The HEFT algorithm selects the task with the so-called highest upward rank value at each step and assigns the selected task to the processor which minimizes its earliest finish time with an insertion-based policy. When computing the priorities, the algorithm uses the task's mean computation time on all processors and the mean communication rates on all links. We believe that the mean is inadequate for task scheduling.

In this paper, an iterative algorithm that uses list scheduling for task allocation in heterogeneous computing systems is proposed and investigated. The algorithm generates an initial solution of moderate quality and then improves the solution iteratively. The priority for constructing the scheduling list and the processor selection policy are selected according to the conclusions of Kwok and Ahmad [15]. In each iteration step, the time-weights of the nodes and edges of the DAG are updated using results from the previous iteration. The initial solution is obtained by using the mean computation time of each task on all processors as the time-weight of the corresponding node and the mean communication time over all communication links as the time-weight of the corresponding edge. During the iterative steps, the results of the previous iteration are used to compute and update the time-weights of the nodes and edges in order to construct a new list. The algorithm keeps the best solution found during the iterations and returns it on termination. The initial step of our algorithm is the same as the HEFT algorithm [29]. However, with subsequent schedule improvements, it can potentially find better schedules than most of the algorithms mentioned earlier. The algorithm has been tested on a large number of randomly generated problems of different sizes and two real applications. It is found that in the majority of the cases, there is a significant improvement made to the initial schedules, which means that the proposed algorithm outperforms the HEFT algorithm, the DLS algorithm, MH, and the LMT algorithm in terms of the average schedule length. In particular, the algorithm performs better when the tasks to processors ratio is large.

This paper is organized as follows. In Section 2, a formal description of the task scheduling problem is given. In Section 3, the scheduling algorithm is introduced. In Section 4, a numerical example is shown. In Section 5, the performance of our algorithm on randomly generated application graphs is investigated, and in Section 6 it is examined on two real-world applications. Finally, Section 7 concludes the paper.

2. Task-scheduling problem

Notation

v: the number of tasks in the application
v_i: the ith task in the application
e_{i,j}: the directed link from the ith task to the jth task
p: the number of processors available in the system
p_i: the ith processor in the system
w_{i,j}: the computation time to complete task v_i on processor p_j
d_{i,j}: the data transfer size (in bytes) from task v_i to task v_j
r_{i,j}: the communication rate (in bytes/s) between processor p_i and processor p_j
c_{i,j,k,l}: the communication time from task v_i to task v_j when task v_i was assigned to processor p_k and task v_j was assigned to processor p_l
w_i^s: the time-weight of task v_i during the sth iteration, which is used to compute the priorities of the tasks
c_{i,j}^s: the time-weight of the directed edge from task v_i to task v_j during the sth iteration, which is used to compute the priorities of the tasks
EST(v_i, p_j): the earliest computation start time of task v_i on processor p_j
EFT(v_i, p_j): the earliest computation finish time of task v_i on processor p_j

An application is represented by a directed acyclic graph G = (V, E), where V is the set of v tasks that can be executed on any of the available processors, and E ⊆ V × V is the set of e directed arcs or edges between the tasks representing the dependencies between the tasks. For example, if e_{i,j} ∈ E, then task v_j cannot start before task v_i completes its execution. A task may have one or more inputs. When all its inputs are available, the task is triggered to execute.


After its execution, a task generates its outputs. A task with no parent node in the DAG is called an entry task and a task with no child node in the DAG is called an exit task. Without loss of generality, we assume that the DAG has exactly one entry task v_entry and one exit task v_exit. If multiple entry or exit tasks exist, they may be connected with zero time-weight edges to a single pseudo-entry or pseudo-exit task that itself has zero time-weight. In addition, the system includes a set of p processors which are assumed to be fully linked to each other; hence there is no communication contention [5].

The communication time c_{i,j,k,l} from task v_i to task v_j, when task v_i was assigned to processor p_k and task v_j was assigned to processor p_l, is

c_{i,j,k,l} = d_{i,j} / r_{k,l}.   (1)

The earliest execution start time of the entry task v_entry on processor p_j is

EST(v_entry, p_j) = 0.   (2)

To compute the earliest execution start time of other tasks, the assignment of the immediate predecessor tasks must be known. Assume that v_k is one of the immediate predecessor tasks of v_i and that v_k was assigned to processor p_{l_k}. The earliest execution start time of task v_i on processor p_j is

EST(v_i, p_j) = max{ Available(v_i, p_j), max_{v_k ∈ pred(v_i)} ( EFT(v_k, p_{l_k}) + c_{k,i,l_k,j} ) },   (3)

where Available(v_i, p_j) is the earliest time when processor p_j is available for the execution of task v_i; pred(v_i) = {v_j ∈ V | e_{j,i} ∈ E} is the set of immediate predecessors of task v_i; and c_{k,i,l_k,j} is the communication time between task v_k and task v_i given that task v_k was assigned to processor p_{l_k} and task v_i was assigned to processor p_j. The inner maximization block in Eq. (3) returns the ready_time, i.e., the time when all data needed by task v_i has arrived at processor p_j.

The earliest execution finish time of the entry task v_entry on processor p_j is

EFT(v_entry, p_j) = w_{entry,j}.   (4)

For other tasks, the earliest execution finish time of task v_i on processor p_j is

EFT(v_i, p_j) = w_{i,j} + EST(v_i, p_j).   (5)

After all tasks in the DAG are scheduled so as to satisfy all precedence constraints, the schedule length L is the earliest finish time of the exit task v_exit. That is,

L = EFT(v_exit, p_j),   (6)

where the exit task v_exit has been assigned to processor p_j. The primary objective of the scheduling problem is to minimize the schedule length L by determining the assignment of tasks to processors subject to the task dependency constraints.
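To make Eqs. (1)-(6) concrete, the following is a minimal Python sketch that computes the schedule length for a fixed assignment of tasks to processors, with tasks visited in topological order and no slot insertion. All names (comm_time, schedule_length, the dictionary layouts) are illustrative, not from the paper; processors are assumed to be indexed integers, and the rate table r is keyed by pairs (k, l) with k < l.

    def comm_time(d, r, i, j, k, l):
        # Eq. (1); communication is free when both tasks share a processor.
        if k == l:
            return 0.0
        return d[(i, j)] / r[(min(k, l), max(k, l))]

    def schedule_length(topo, pred, assign, w, d, r):
        eft = {}                      # earliest finish time per task
        avail = {}                    # next free time per processor
        for v in topo:
            p = assign[v]
            # Eq. (3): ready time = latest data arrival from any immediate
            # predecessor (0 for the entry task, Eq. (2)).
            ready = max((eft[u] + comm_time(d, r, u, v, assign[u], p)
                         for u in pred[v]), default=0.0)
            start = max(avail.get(p, 0.0), ready)
            eft[v] = start + w[(v, p)]          # Eqs. (4)-(5)
            avail[p] = eft[v]
        return max(eft.values())                # Eq. (6), single exit task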

3. Iterative list scheduling algorithm

3.1. Some graph attributes used by our algorithm

We define the time length of a directed path from task v_i to task v_j as the sum of all the tasks' (including v_i and v_j) and the edges' time-weights along the path between v_i and v_j. The bottom-level (b-level) [15] of a task v_i is the longest time length from task v_i to the exit task and is bounded by the time length of the critical path of the graph. The b-level of a task is a dynamic attribute because the time-weight of an edge may be zeroed when the two incident tasks are scheduled to the same processor. A critical path (CP) of a DAG is a path from the entry task to the exit task whose time length is the maximum.

3.2. The priority selection

Kwok and Ahmad [15] compared several list scheduling algorithms on a common homogeneous platform and concluded that the modified critical-path (MCP) [31] algorithm performs better than the others in terms of schedule length and running time. The MCP algorithm uses the as-late-as-possible (ALAP) time of a task as the priority. The ALAP time of a task is computed by first computing the time length of the CP and then subtracting the b-level of the task from it. First, the MCP algorithm computes the ALAP times of all the tasks and then constructs a list of tasks in ascending order of ALAP times. Ties are broken by considering the ALAP times of the children of a task. The tasks on the list are then scheduled using the insertion approach, one by one, to a processor that allows the earliest possible start time.

Because the length of the CP is a constant, our algorithm uses the b-level of a task as the priority. Our algorithm first computes the b-levels of all tasks and then constructs a list of tasks in descending order of b-level values. Ties in b-levels are recursively broken using the tasks' children's b-levels.

3.3. Scheduling list construction

To construct the scheduling list for the initial solution, the time-weight of every task must be known. The initial time-weight of task v_i is assigned the mean value of the computation times of task v_i on all processors. That is,

w_i^0 = ( Σ_{j=1}^{p} w_{i,j} ) / p.   (7)

Similarly, the initial time-weight of the edge from task v_i to task v_j, based on the mean communication rate across all the fully connecting links, is

c_{i,j}^0 = d_{i,j} / ( ( Σ_{k=1}^{p−1} Σ_{l=k+1}^{p} r_{k,l} ) / ((p² − p)/2) ).   (8)
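A minimal sketch of the initial time-weights of Eqs. (7)-(8), under the same illustrative conventions as before (integer-indexed processors, rate table keyed by ordered pairs); the function names are assumptions, not from the paper.

    def initial_task_weight(i, w, procs):
        # Eq. (7): mean computation time of task i over all processors.
        return sum(w[(i, q)] for q in procs) / len(procs)

    def initial_edge_weight(i, j, d, r, procs):
        # Eq. (8): data size divided by the mean rate over all
        # p(p - 1)/2 distinct links.
        links = [(k, l) for k in procs for l in procs if k < l]
        mean_rate = sum(r[kl] for kl in links) / len(links)
        return d[(i, j)] / mean_rate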


At the sth iteration, suppose task v_i was allocated to processor p_k and task v_j was allocated to processor p_l at the previous iteration. Then the time-weight of task v_i is

w_i^s = ( α·w_{i,k} + Σ_{m=1, m≠k}^{p} w_{i,m} ) / (p + α − 1),   (9)

where α is a non-negative constant. The parameter α is referred to as the weighting factor, which has to be determined either heuristically or empirically. w_i^s is a weighted mean of the computation times w_{i,j} of task v_i on all processors. If α > 1, then more weight is put on w_{i,k}. Because processor p_k is the processor that task v_i was allocated to during the (s−1)th iteration, when we compute the time-weight of task v_i for the sth iteration, i.e., w_i^s, we put more weight on the computation time of task v_i on processor p_k, i.e., w_{i,k}. Hence, the time-weight of a task for the sth iteration depends on the assignments of the previous iteration, i.e., the (s−1)th iteration.

The time-weight of the edge from task v_i to task v_j at the sth iteration is

c_{i,j}^s = d_{i,j} / ( ( Σ_{m=1}^{p−1} Σ_{n=m+1}^{p} r_{m,n} + (α−1)·r_{k,l} ) / ((p² − p)/2 + α − 1) ).   (10)

The b-level of task v_i at the sth iteration is defined by

b^s(v_i) = w_i^s + max_{v_j ∈ succ(v_i)} ( c_{i,j}^s + b^s(v_j) ),   (11)

where succ(v_i) = {v_j ∈ V | e_{i,j} ∈ E} is the set of immediate successors of task v_i. For the exit task v_exit, since it has no successor, its b-level is

b^s(v_exit) = w_exit^s.   (12)

Based on the time-weights of the tasks and the time-weights of the edges, the scheduling list is constructed with respect to the b-level.
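A minimal sketch of Eqs. (9)-(12), with illustrative names. One caveat: the paper does not say how Eq. (10) applies when both tasks shared a processor in the previous iteration, so the sketch falls back to the plain mean rate in that case (an assumption, flagged in the code).

    def task_weight(i, k_prev, w, procs, alpha):
        # Eq. (9): weighted mean favoring processor k_prev, the task's
        # assignment in the previous iteration.
        others = sum(w[(i, m)] for m in procs if m != k_prev)
        return (alpha * w[(i, k_prev)] + others) / (len(procs) + alpha - 1)

    def edge_weight(i, j, k_prev, l_prev, d, r, procs, alpha):
        # Eq. (10): data size over a weighted-mean rate favoring the link
        # used in the previous iteration.
        links = [(k, l) for k in procs for l in procs if k < l]
        if k_prev == l_prev:
            # Same-processor case is unspecified in the paper; assume the
            # plain mean rate (equivalent to alpha = 1).
            rate = sum(r[kl] for kl in links) / len(links)
        else:
            key = (min(k_prev, l_prev), max(k_prev, l_prev))
            rate = ((sum(r[kl] for kl in links) + (alpha - 1) * r[key])
                    / (len(links) + alpha - 1))
        return d[(i, j)] / rate

    def b_level(v, succ, tw, ew, memo=None):
        # Eqs. (11)-(12): longest weighted path from v to the exit task;
        # memoized so shared sub-paths are computed once.
        memo = {} if memo is None else memo
        if v not in memo:
            memo[v] = tw[v] + max((ew[(v, u)] + b_level(u, succ, tw, ew, memo)
                                   for u in succ[v]), default=0.0)
        return memo[v]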

3.4. Processor selection step

Kwok and Ahmad [15] compared several list scheduling algorithms on a common homogeneous platform and concluded that the insertion-based policy is better than the non-insertion-based policy during the processor selection step. The insertion-based policy permits the insertion of a task into the earliest idle time slot between two tasks that are already scheduled on the same processor. Hence, our algorithm assigns the selected task to the processor which minimizes its earliest finish time with an insertion-based policy. The time slot must be larger than the computation time of the task being scheduled. In addition, the precedence constraints must be preserved. The procedure for looking for an idle time slot on one processor p_j for task v_i is as follows:

1. Compute the inner maximization block in Eq. (3) as ready_time(v_i, p_j), i.e., the time when all data needed by task v_i has arrived at processor p_j.
2. Available(v_i, p_j) = the finish time of the last task in the task list of processor p_j.
3. while ready_time(v_i, p_j) < start time of the last task in the task list of processor p_j && the task list of processor p_j is not empty do
4.   if (the finish time of the second last task >= ready_time(v_i, p_j)) && (the start time of the last task − the finish time of the second last task >= w_{i,j}) then
5.     Available(v_i, p_j) = the finish time of the second last task
6.   else if (the finish time of the second last task < ready_time(v_i, p_j)) && (the start time of the last task − ready_time(v_i, p_j) >= w_{i,j}) then
7.     Available(v_i, p_j) = ready_time(v_i, p_j)
8.   end if
9.   Delete the last task from the task list of processor p_j
10. end while
11. EST(v_i, p_j) = max(Available(v_i, p_j), ready_time(v_i, p_j))

where the task list of processor p_j consists of the tasks which have been assigned to processor p_j, sorted by ascending finish time.
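A minimal Python rendering of the slot-search procedure above. Here tasks holds the (start, finish) pairs already scheduled on processor p_j, sorted by ascending finish time; when fewer than two tasks remain, the "second last task" is undefined in the procedure, and the sketch assumes the processor is free from time 0. All names are illustrative.

    def earliest_start(tasks, ready_time, duration):
        # Step 2: by default, append after the last scheduled task.
        available = tasks[-1][1] if tasks else 0.0
        work = list(tasks)
        # Steps 3-10: scan idle gaps from the back of the task list.
        while work and ready_time < work[-1][0]:
            gap_end = work[-1][0]                       # start of last task
            gap_start = work[-2][1] if len(work) > 1 else 0.0
            if gap_start >= ready_time and gap_end - gap_start >= duration:
                available = gap_start      # step 5: slot opens at gap start
            elif gap_start < ready_time and gap_end - ready_time >= duration:
                available = ready_time     # step 7: slot opens at ready time
            work.pop()                     # step 9
        # Step 11: EST on this processor.
        return max(available, ready_time)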

3.5. The procedure of the algorithm

The procedure for the iterative scheduling algorithm is as follows:

1. s = 0.
2. Compute the time-weights of the tasks with Eq. (7).
3. Compute the time-weights of the edges with Eq. (8).
4. BestSL = a very large number.
5. while s <= s_max do
6.   Compute the b-levels for all tasks by traversing the graph from the exit task.
7.   Sort the tasks into a scheduling list by non-increasing order of b-level.
8.   while the scheduling list is not empty do
9.     Remove the first task v_i from the scheduling list.
10.    for each processor p_j do
11.      Compute EFT(v_i, p_j) using the insertion-based scheduling policy.
12.    end for
13.    Assign task v_i to the processor that minimizes the EFT of v_i.
14.  end while
15.  ScheduleLength = EFT(v_exit, p_{v_exit}).
16.  if ScheduleLength < BestSL then
17.    BestSL = ScheduleLength, and the current schedule is the best schedule.
18.  end if
19.  Compute the time-weights of the tasks with Eq. (9).
20.  Compute the time-weights of the edges with Eq. (10).
21.  s = s + 1.
22. end while
23. Return the best schedule.

The initial step is the same as the heterogeneous earliest-finish-time (HEFT) algorithm [29], which significantly outperforms the dynamic-level scheduling (DLS) algorithm [27], the mapping heuristic (MH) [8], and the levelized-min time (LMT) algorithm [11] in terms of average schedule length ratio, speedup, and so on. The improvement step of our algorithm has the potential to produce shorter schedule lengths than those of the HEFT, DLS, MH, and LMT algorithms.
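The following condensed sketch composes the helper functions sketched earlier (comm_time, initial_task_weight, initial_edge_weight, task_weight, edge_weight, b_level, earliest_start) into the full procedure. It is an illustration of the algorithm as described, not the authors' implementation; in particular, the recursive tie-breaking on children's b-levels is omitted, and descending b-level order is used directly as the scheduling list (valid as a topological order because all time-weights are positive).

    def iterative_list_schedule(succ, procs, w, d, r, alpha, s_max):
        # succ maps each task to its list of immediate successors.
        pred = {v: [u for u in succ if v in succ[u]] for v in succ}
        tw = {v: initial_task_weight(v, w, procs) for v in succ}      # step 2
        ew = {(u, x): initial_edge_weight(u, x, d, r, procs)          # step 3
              for u in succ for x in succ[u]}
        best_len, best = float("inf"), None                           # step 4
        for _ in range(s_max + 1):                                    # step 5
            memo = {}                                                 # steps 6-7
            order = sorted(succ, key=lambda v: -b_level(v, succ, tw, ew, memo))
            slots = {q: [] for q in procs}   # (start, finish) per processor
            assign, eft = {}, {}
            for v in order:                                           # steps 8-14
                choices = []
                for q in procs:
                    ready = max((eft[u] + comm_time(d, r, u, v, assign[u], q)
                                 for u in pred[v]), default=0.0)
                    start = earliest_start(slots[q], ready, w[(v, q)])
                    choices.append((start + w[(v, q)], start, q))
                fin, start, q = min(choices)     # processor minimizing EFT
                assign[v], eft[v] = q, fin
                slots[q].append((start, fin))
                slots[q].sort(key=lambda ival: ival[1])
            length = max(eft.values())           # step 15 (single exit task)
            if length < best_len:                                     # steps 16-18
                best_len, best = length, dict(assign)
            tw = {v: task_weight(v, assign[v], w, procs, alpha)       # step 19
                  for v in succ}
            ew = {(u, x): edge_weight(u, x, assign[u], assign[x], d, r,
                                      procs, alpha)                   # step 20
                  for u in succ for x in succ[u]}
        return best_len, best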

3.6. The time-complexity analysis

The time complexity of scheduling algorithms for DAGs is usually expressed in terms of the number of nodes v, the number of edges e, and the number of processors p. The time-complexity analysis for one iteration of our algorithm is as follows.

Computing the time-weights of the tasks and the edges can be done in time O(vp). Computing the b-levels can be done in time O(e + v). Sorting the tasks can be done in time O(v log v). The processor selection for all tasks can be done in time O((ep + v²/2) + vp), i.e., in time O(ep + v²). Hence, the time complexity for one iteration is

O(vp + (e + v) + v log v + ep + v²) = O(ep + v²).

If s_max denotes the maximum number of iterations, which is normally small, then the time complexity of the whole algorithm is O(s_max(ep + v²)) in the worst case.

For a dense graph, where the number of edges is proportional to v², the time complexity becomes O(s_max·v²p).

4. Numerical example

Fig. 1 shows a DAG with eight tasks and 11 edges. There are two processors available in the heterogeneous computing system. Table 1 shows the computation time of each task on every processor. For simplicity, we assume homogeneous communication, and the communication times are as labeled on the edges in Fig. 1.

The time-weights of the tasks are computed using Eq. (7) and the results are shown in Table 2.

Fig. 1. A sample directed acyclic graph with 8 tasks (edge labels give the communication times).

Table 1
Computation times of every task on every processor

Task  1   2   3   4   5   6   7   8
P1    70  68  78  89  30  66  25  94
P2    84  49  96  26  88  86  21  36

Table 2
Time-weights of the tasks and b-levels during the initial step

Task      1     2      3     4      5     6     7     8
w_i^0     77    58.5   87    57.5   59    76    23    65
b^0(v_i)  512   344.5  356   310.5  170   199   183   65

Table 3
Start time and finish time of every task during the initial step

Task       1     3       2        4        6        7        5        8
Processor  P1    P1      P2       P2       P2       P1       P1       P2
Time       0–70  70–148  155–204  204–230  230–316  300–325  148–178  420–456

The b-levels of the tasks are also shown in Table 2; see the appendix for the computation details. The initial scheduling list of the tasks is {v1, v3, v2, v4, v6, v7, v5, v8}.

The appendix provides the details of the processor selection procedure.

With the insertion policy we obtain the task schedule. Table 3 shows the start time and finish time of all the tasks. We also note from Table 3 that the initial schedule length is 456.

For the first iteration, we select 4 as the weighting factor. Then the time-weights of the tasks are computed as follows:

w_1^1 = (4·70 + 84)/(4 + 2 − 1) = 72.8.
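As a further check, task v2 was assigned to processor p2 during the initial scheduling (Table 3), so

w_2^1 = (4·49 + 68)/(4 + 2 − 1) = 52.8,

which agrees with the entry for task 2 in Table 4.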


Table 4
Time-weights of the tasks and b-levels during the first iteration

Task      1      2      3      4      5      6      7      8
w_i^1     72.8   52.8   81.6   38.6   41.6   82     24.2   47.6
b^1(v_i)  448.2  182.4  266.2  275.4  135.2  129.6  166.8  47.6

Table 5
Start time and finish time of every task during the first iteration

Task       1     4       3        2        7        5        6        8
Processor  P1    P1      P1       P2       P1       P1       P1       P1
Time       0–70  70–159  159–237  155–204  237–262  262–292  292–358  358–452

Table 6
Time-weights of the tasks and b-levels during the second iteration

Task      1     2      3     4     5     6      7      8
w_i^2     72.8  52.8   81.6  76.4  41.6  70     24.2   82.4
b^2(v_i)  450   292.2  234   183   124   152.4  106.6  82.4

Table 7
Start time and finish time of every task during the second iteration

Task       1     2       3        4        6        5        7        8
Processor  P1    P1      P1       P2       P1       P2       P1       P1
Time       0–70  70–138  138–216  170–196  216–282  196–284  282–307  330–424

Because task v1 was assigned to processor p1 during the initial scheduling, processor p1 is given more weight when the time-weight of task v1 is computed during the first iteration. We believe that the weighted mean of the computation times of a task on every processor represents the time-weight of the task better than the plain mean.

Table 4 shows the updated time-weights and b-levels of the tasks.

The new scheduling list of the tasks is {v1, v4, v3, v2, v7, v5, v6, v8}. Table 5 shows the start time and finish time of every task. We note from Table 5 that the new schedule length is 452, which is less than the 456 of the initial schedule. One possible reason for this is that we have used the weighted mean of the computation times of the task on every processor to represent the time-weight of the task, which places more weight on the processor to which the corresponding task was assigned during the immediately preceding iteration.

For the second iteration, the time-weights of the tasks and the b-levels are as shown in Table 6.

The new scheduling list of the tasks is {v1, v2, v3, v4, v6, v5, v7, v8}. Table 7 shows the start time and finish time of every task. We note from Table 7 that the current schedule length is 424, which is less than the 452 obtained during the first iteration.

By the exhaustive search algorithm, we obtain the optimal schedule of 364 for this case. To compare the result of the iterative algorithm with the optimal solution, we use the degradation from the best [15] criterion, which is defined as (the result − the best)/the best.

The degradation from the best for this case is (424 − 364)/364, or 16.48%. Compare this with that of HEFT, which is (456 − 364)/364, or 25.27%. We observe that the iterative algorithm has improved the schedule after two iterations for this case.

5. Performance analysis based on randomly generated application graphs

In order to analyze the performance of our algorithm, we randomly generate some application graphs. Our objective is to study the amount of improvement to the initial schedule length that can be achieved by our iterative algorithm.

5.1. Generation of random application graphs

The random graph generator requires some input and then outputs the weighted directed acyclic graph, the computation times of every task on every processor, the communication rate of every link, and the data transfer size between tasks. The input of the random graph generator is as follows (a sketch of such a generator follows the list):

• Number of tasks (v).
• Height of the DAG (h): the v tasks are randomly partitioned into h levels.
• The link density (β): the probability P_l(i,j) that there is a directed link from the tasks of level i to the tasks of level j is

  P_l(i,j) = β/(j − i),   (13)

  where j > i and i, j ∈ (1, h).
• Number of processors (p).
• The maximum computation time (C_max) and the minimum computation time (C_min): the computation time of every task on every processor is a uniform random variable on the interval (C_min, C_max).
• The maximum communication rate (R_max) and the minimum communication rate (R_min): the communication rate r_{i,j} between processor p_i and processor p_j is a uniform random variable on the interval (R_min, R_max).
• Communication-to-computation time ratio (CCR): the ratio of the average communication time to the average computation time. The average communication time between two tasks on every link is a uniform random variable on the interval (CCR·C_min, CCR·C_max). The data transfer size between tasks can then be obtained.
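A minimal sketch of such a generator under the stated assumptions, including the Eq. (13) link probability as reconstructed above; all names are illustrative, and the single pseudo-entry/pseudo-exit tasks of Section 2 would still have to be added afterwards.

    import random

    def random_dag(v, h, beta, p, cmin, cmax, rmin, rmax, ccr):
        levels = [[] for _ in range(h)]
        for t in range(v):                    # random partition into h levels
            levels[random.randrange(h)].append(t)
        succ = {t: [] for t in range(v)}
        mean_rate = (rmin + rmax) / 2.0
        d = {}
        for i in range(h):
            for j in range(i + 1, h):
                for a in levels[i]:
                    for b in levels[j]:
                        if random.random() < beta / (j - i):   # Eq. (13)
                            succ[a].append(b)
                            # Draw the desired average communication time,
                            # then back out the data size via the mean rate.
                            t_comm = random.uniform(ccr * cmin, ccr * cmax)
                            d[(a, b)] = t_comm * mean_rate
        w = {(t, q): random.uniform(cmin, cmax)       # computation times
             for t in range(v) for q in range(p)}
        r = {(k, l): random.uniform(rmin, rmax)       # link rates
             for k in range(p) for l in range(p) if k < l}
        return succ, w, d, r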

5.2. Comparison with optimal solutions

When the solution space is not very large, we can obtain the optimal solutions by the exhaustive search algorithm.


Table 8
The parameters for the base example

Number of tasks             8
Number of processors        2
DAG height                  4
Minimum computation time    20
Maximum computation time    100
Minimum communication rate  1
Maximum communication rate  4
Weighting factor            4
Number of iterations        5

Table 9
The parameters for DAG and scheduling

Number of tasks             40
Number of processors        3
DAG height                  10
Minimum computation time    20
Maximum computation time    100
Minimum communication rate  1
Maximum communication rate  4
Number of iterations        5

With the parameters in Table 8, the link density is varied from 0 to 1 with increments of 0.1, and the CCR is varied through the values 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5 and 10. We run the algorithm 1000 times and compute the "degradation from the best" value for each case. The results show that the average "degradation from the best" is 7.44%.

5.3. Simulation results

We investigate how the various parameters of the algorithm impact the degree to which the initial schedules are improved through the iterative steps. We define the schedule length improvement ratio as

r_i = (l_i − l_f) / l_i,   (14)

where l_i is the initial schedule length and l_f is the final schedule length. For the example of Section 4, r_i = (456 − 424)/456 ≈ 7.0%.

With the parameters shown in Table 9, the weighting factor is varied from 0 to 10 with increments of 1, and then varied as 20, 100, 1000; the link density is varied from 0 to 1 with increments of 0.1; and the CCR is varied through the values 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5 and 10. The simulation was run 1000 times under each case, resulting in a total of 10,780,000 runs.

The results showed that 71.30% of the cases resulted in an improved schedule and that the average improvement ratio is 5.4%.

5.4. Sensitivity analysis of link density, weighting factor and CCR

To investigate how the link density impacts the results, we compute the percentage of improved cases and the average improvement ratio at various link density levels.

Fig. 2. Percentage of improved cases varies with the link density.

Fig. 3. Average improvement ratio varies with the link density.

The results are shown in Figs. 2 and 3, respectively.

Figs. 2 and 3 show that as the link density is varied from 0 to 0.1, the percentage of improved cases and the average improvement ratio first increase, and then both gradually decrease as the link density increases further. When the link density is 0, most tasks are independent, which means that computing the time-weights of the edges in the iterative steps has hardly any impact on the schedule. Hence the percentage of improved cases and the average improvement ratio, when the link density is 0, are both lower than those when the link density is 0.1. As the link density increases, however, the task dependencies have more and more impact on the b-levels. At the same time, computing the time-weights of the edges in the iterative steps has less impact on the b-levels. Therefore, the percentage of improved cases and the average improvement ratio gradually decrease as the link density is increased.

To investigate how the weighting factor impacts the results, we compute the percentage of improved cases and the average improvement ratio for a series of weighting factors. The results are shown in Figs. 4 and 5.

Figs. 4 and 5 show that when the weighting factor is 0, both the percentage of improved cases and the average improvement ratio are the lowest, except when the weighting factor is 1.


Fig. 4. Percentage of improved cases varies with the weighting factor.

Fig. 5. Average improvement ratio varies with the weighting factor.

When the weighting factor is equal to 0, the equations for computing the time-weight of task v_i and the time-weight of the edge from task v_i to task v_j during the sth iteration reduce to the following two equations:

w_i^s = ( Σ_{m=1, m≠k}^{p} w_{i,m} ) / (p − 1),   (15)

c_{i,j}^s = d_{i,j} / ( ( Σ_{m=1}^{p−1} Σ_{n=m+1}^{p} r_{m,n} − r_{k,l} ) / ((p² − p)/2 − 1) ).   (16)

A weighting factor of 0 means that during the sth iteration the time-weights of the tasks are computed by ignoring the processor to which the corresponding task was assigned in the preceding iteration, and the time-weights of the edges are computed by ignoring the link between the two processors to which the corresponding tasks were assigned in the preceding iteration.

When the weighting factor is equal to 1, the equations for computing the time-weights of the tasks and the time-weights of the edges during the iterations are the same as those during the initial step. Therefore, the final schedule is the same as the initial one.

When the weighting factor is increased from 2 to 20, the percentage of improved cases and the average improvement ratio show only trivial differences. This means that the final schedule is not sensitive to the weighting factor in this range.

When the weighting factor is equal to 100 or higher, the percentage of improved cases and the average improvement ratio show a decreasing trend.

Fig. 6. Percentage of improved cases varies with the CCR.

Fig. 7. Average improvement ratio varies with the CCR.

When the number of processors is far less than the weighting factor, the equations for computing the time-weight of task v_i and the time-weight of the edge between task v_i and task v_j during the iterations reduce to the following two equations:

w_i^s = w_{i,k},   (17)

c_{i,j}^s = d_{i,j} / r_{k,l}.   (18)

This means that the time-weight of task v_i is the computation time of task v_i on the processor to which the task was assigned during the preceding iteration, and the time-weight of the edge from task v_i to task v_j is the communication time of the corresponding tasks on the processors to which they were assigned in the preceding iteration.

The following discusses how the CCR impacts the results. We compute the percentage of improved cases and the average improvement ratio at different CCR levels. The results are shown in Figs. 6 and 7, respectively. Fig. 7 shows that the average improvement ratio gradually increases when the CCR is increased, but Fig. 6 shows that the percentage of improved cases gradually decreases when the CCR is increased. Therefore, when the CCR is large we take a higher risk that the iterative steps do not improve the final schedule length, but we obtain a higher average improvement ratio if the final schedule length is indeed less than the initial one.

5.5. Sensitivity analysis of the task number and the processor number

Based on the parameters in Table 10, the weighting factor is varied from 0 to 10 with increments of 1, and then varied as 20, 100, 1000; the link density is varied from 0 to 1 with increments of 0.1; the CCR is varied as 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5 and 10; and the task number is varied as 3, 6, 10, 20, 40, 60, 80, 100.


Table 10
The parameters for DAG and scheduling

Number of processors        3
DAG height                  10
Minimum computation time    20
Maximum computation time    100
Minimum communication rate  1
Maximum communication rate  4
Number of iterations        5

Fig. 8. Percentage of improved cases varies with task number/processor number.

Fig. 9. Average improvement ratio varies with task number/processor number.

The simulation is run 100 times under each case. We compute the percentage of improved cases and the average improvement ratio for each task number. The results are shown in Figs. 8 and 9, respectively.

Fig. 8 shows that when the task number over processor number ratio is very small, the percentage of improved cases is very small, i.e., the iterative steps can hardly improve the initial schedule. When the ratio is increased, the percentage of improved cases increases. When the ratio is 13 or greater, the percentage of improved cases reaches its maximum and stops increasing.

Fig. 9 shows that the average improvement ratio has the same trend as the percentage of improved cases when the task number over processor number ratio is less than or equal to 13. The improvement ratio, however, begins to decrease when the task number over processor number ratio is greater.

Fig. 10. Percentage of improved cases varies with task number/processor number.

Fig. 11. Average improvement ratio varies with task number/processor number.

We repeated the above simulation except that the task number is fixed at 100 and the processor number is varied as 3, 6, 10, 20, 40, 60, 80, and 100. The results are shown in Figs. 10 and 11.

Fig. 10 shows that the percentage of improved cases increases when the task number over processor number ratio is increased. When the task number over processor number ratio exceeds some value, the percentage of improved cases reaches its maximum and stops increasing. The trend is similar to that in Fig. 8. This means that the iterative algorithm is more effective when the task number over processor number ratio is large.

Fig. 11 shows that the average improvement ratio always increases when the task number over processor number ratio is increased, which is different from Fig. 9.

6. Performance analysis on application graphs of real-world problems

Using real applications to test the performance of algorithms is very common [28–31]. Hence, in addition to randomly generated DAGs, we also ran the iterative algorithm on two real-world problems: a digital signal processing (DSP) example [30] and Gaussian elimination [31].

6.1. DSP

We select a DSP example to test the iterative algorithm because its computation times and communication data can be estimated very accurately.


Fig. 12. Percentage of improved cases varies with processor number.

Fig. 13. Average improvement ratio varies with processor number.

There are 119 tasks in the DSP task graph. The task graph of the DSP and the parameters for the DSP can be found in [30]. In this case, we just vary the CCR value and the processor number. The CCR is varied as 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5 and 10. The processor number is varied as 2, 4, 6, 8, 10, 20, 30, 40, 50, 60, 70, and 80. We selected the appropriate weighting factor and ran the algorithm 1000 times in each case. The results are shown in Figs. 12 and 13.

Fig. 12 shows that when the processor number is small, the percentage of improved cases is very close to 1, which means that in the majority of the cases there is improvement in the initial schedule. However, the percentage of improved cases decreases slightly when the processor number is increased. The trend is the same as the results obtained for the randomly generated DAGs. The results therefore confirm again that this iterative algorithm is more effective when the task number over processor number ratio is large.

Fig. 13 shows that there is no clear relation between the average improvement ratio and the processor number, which is different from the result obtained from the randomly generated DAGs.

6.2. Gaussian elimination

The task graph of the Gaussian elimination, with a matrix size of 5, can be found in [31]. The total number of tasks for this case is equal to (m² + 3m)/2 − 2, where m is the matrix size. We use a Gaussian elimination with a matrix size of 50, so the total number of tasks is (50² + 3·50)/2 − 2 = 1323. The CCR is varied as 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5 and 10. The processor number is varied from 2 to 40 with increments of 2.

Fig. 14. Percentage of improved cases varies with processor number.

Fig. 15. Average improvement ratio varies with processor number.

We selected the appropriate weighting factor and ran the algorithm 1000 times in each case. The results are shown in Figs. 14 and 15.

Fig. 14 shows that the percentage of improved cases is very close to 1 and is basically the same when the processor number is varied from 2 to 20, but decreases sharply when the processor number exceeds 20. Again, the results support the earlier conclusion that the iterative algorithm is more effective when the task number over processor number ratio is large.

Fig. 15 shows that the average improvement ratio increases with the percentage of improved cases when the processor number is less than 20. The improvement ratio, however, begins to decrease when the processor number is greater than 20. This is the same as the result obtained from the randomly generated DAGs.

7. Conclusion

In this paper an iterative list scheduling algorithm for heterogeneous distributed computing systems is proposed and studied. We select the bottom-level (b-level) as the priority to construct the scheduling list. The b-levels are computed with the mean of the computation times of a task on every processor and the mean of the communication times of an edge on every link during the initial step, and with the weighted mean during the iterations. The processor selection step uses the insertion-based policy that considers the possible insertion of a task into an idle time slot between two already-scheduled tasks. The initial step of our algorithm is the same as that of HEFT [29]. However, the iterative algorithm can produce, through subsequent iterations, shorter schedule lengths than those of the HEFT algorithm [29], the DLS algorithm [27], MH [8], and the LMT algorithm [11].

We determined the percentage of cases that result in an improved final schedule and the average improvement ratio with randomly generated task graphs under various parameters and two real applications. It is observed that when the task number over processor number ratio is very small, the iterative algorithm does not perform well; but when the task number over processor number ratio is greater than some value, an improvement in the final schedule is obtained in most of the cases that were simulated.

Sensitivity analysis was also carried out; it shows that the percentage of cases in which the final schedule length is less than the initial one and the average improvement ratio are both insensitive to the weighting factor, that is, the weight applied when computing the mean during the iterations.

Appendix.

For the numerical example in Section 4, during the initial step the b-levels of the tasks are computed as follows:

b^0(v_8) = w_8^0 = 65,

b^0(v_7) = w_7^0 + (c_{7,8}^0 + b^0(v_8)) = 23 + (95 + 65) = 183,

...

b^0(v_1) = w_1^0 + max{ (c_{1,2}^0 + b^0(v_2)), (c_{1,3}^0 + b^0(v_3)), (c_{1,4}^0 + b^0(v_4)), (c_{1,5}^0 + b^0(v_5)) }
         = 77 + max{ (85 + 344.5), (79 + 356), (100 + 310.5), (66 + 170) } = 512.

For the numerical example in Section 4, during the initial step the processor selection procedure is as follows:

EST(v_1, p_1) = 0,
EST(v_1, p_2) = 0,
EFT(v_1, p_1) = 70,
EFT(v_1, p_2) = 84,
EFT(v_1, p_1) < EFT(v_1, p_2), so task v_1 is assigned to processor p_1.

EST(v_3, p_1) = max{70, EFT(v_1, p_1) + c_{1,3,1,1}} = max{70, (70 + 0)} = 70,
EST(v_3, p_2) = max{0, EFT(v_1, p_1) + c_{1,3,1,2}} = max{0, (70 + 79)} = 149,
EFT(v_3, p_1) = w_{3,1} + EST(v_3, p_1) = 78 + 70 = 148,
EFT(v_3, p_2) = w_{3,2} + EST(v_3, p_2) = 96 + 149 = 245,
EFT(v_3, p_1) < EFT(v_3, p_2), so task v_3 is assigned to processor p_1.

...

There is a special case when assigning task v_5. There is an idle time slot on processor p_1 between task v_3 and task v_7, which have already been assigned to processor p_1, and the time slot is larger than the computation time of task v_5 on processor p_1. Hence, the earliest time when processor p_1 is available for the execution of task v_5 is the time just after task v_3 finishes execution (148), not the time just after task v_7 finishes execution (325):

EST(v_5, p_1) = max{148, EFT(v_1, p_1) + c_{1,5,1,1}} = max{148, 70} = 148,
EST(v_5, p_2) = max{316, EFT(v_1, p_1) + c_{1,5,1,2}} = max{316, (70 + 66)} = 316,
EFT(v_5, p_1) = w_{5,1} + EST(v_5, p_1) = 30 + 148 = 178,
EFT(v_5, p_2) = w_{5,2} + EST(v_5, p_2) = 88 + 316 = 404,
EFT(v_5, p_1) < EFT(v_5, p_2), so task v_5 is assigned to processor p_1.

EST(v_8, p_1) = max{325, max{(EFT(v_5, p_1) + c_{5,8,1,1}), (EFT(v_6, p_2) + c_{6,8,2,1}), (EFT(v_7, p_1) + c_{7,8,1,1})}}
             = max{325, max{(178 + 0), (316 + 58), (325 + 0)}} = 374,
EST(v_8, p_2) = max{316, max{(EFT(v_5, p_1) + c_{5,8,1,2}), (EFT(v_6, p_2) + c_{6,8,2,2}), (EFT(v_7, p_1) + c_{7,8,1,2})}}
             = max{316, max{(178 + 46), 316, (325 + 95)}} = 420,
EFT(v_8, p_1) = w_{8,1} + EST(v_8, p_1) = 94 + 374 = 468,
EFT(v_8, p_2) = w_{8,2} + EST(v_8, p_2) = 36 + 420 = 456,
EFT(v_8, p_1) > EFT(v_8, p_2), so task v_8 is assigned to processor p_2.

References

[1] I. Ahmad, Y.-K. Kwok, On exploiting task duplication in parallel program scheduling, IEEE Trans. Parallel Distrib. Systems 9 (9) (1998) 872–892.

[2] H. Chen, B. Shirazi, J. Marquis, Performance evaluation of a novel scheduling method: linear clustering with task duplication, in: Proceedings of the International Conference on Parallel and Distributed Systems, 1993, pp. 270–275.

[3] Y.C. Chung, S. Ranka, Application and performance analysis of a compile-time optimization approach for list scheduling algorithms on distributed-memory multiprocessors, in: Proceedings of Supercomputing '92, 1992, pp. 512–521.

[4] J.Y. Colin, P. Chretienne, C.P.M. scheduling with small computation delays and task duplication, Oper. Res. 39 (4) (1991) 680–684.

[5] S. Darbha, D.P. Agrawal, Optimal scheduling algorithm for distributed-memory machines, IEEE Trans. Parallel Distrib. Systems 9 (1) (1998) 87–95.

[6] M.K. Dhodhi, I. Ahmad, A. Yatama, I. Ahmad, An integrated technique for task matching and scheduling onto distributed heterogeneous computing systems, J. Parallel Distrib. Comput. 62 (9) (2002) 1338–1361.

[7] A.R. Diaz, A. Tchernykh, K.H. Ecker, Algorithms for dynamic scheduling of unit execution time tasks, European J. Oper. Res. 146 (2) (2003) 403–416.

[8] H. El-Rewini, T.G. Lewis, Scheduling parallel program tasks onto arbitrary target machines, J. Parallel Distrib. Comput. 9 (2) (1990) 138–153.

[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman and Co., San Francisco, CA, 1979.

[10] J.J. Hwang, Y.C. Chow, F.D. Anger, C.Y. Lee, Scheduling precedence graphs in systems with interprocessor communication times, SIAM J. Comput. 18 (2) (1989) 244–257.

[11] M. Iverson, F. Ozguner, G. Follen, Parallelizing existing applications in a distributed heterogeneous environment, in: Proceedings of the Heterogeneous Computing Workshop, 1995, pp. 93–100.

[12] O.H. Kang, D.P. Agrawal, Scalable scheduling for symmetric multiprocessors (SMP), J. Parallel Distrib. Comput. 63 (3) (2003) 273–285.

[13] D. Kim, B.G. Yi, A two-pass scheduling algorithm for parallel programs, Parallel Comput. 20 (6) (1994) 869–885.

[14] Y.-K. Kwok, I. Ahmad, Dynamic critical-path scheduling: an effective technique for allocating task graphs onto multiprocessors, IEEE Trans. Parallel Distrib. Systems 7 (5) (1996) 506–521.

[15] Y.-K. Kwok, I. Ahmad, Benchmarking and comparison of the task graph scheduling algorithms, J. Parallel Distrib. Comput. 59 (3) (1999) 381–422.

[16] Y.-K. Kwok, I. Ahmad, Static scheduling algorithms for allocating directed task graphs to multiprocessors, ACM Comput. Surveys 31 (4) (1999) 406–471.

[17] M. Maheswaran, H.J. Siegel, A dynamic matching and scheduling algorithm for heterogeneous computing systems, in: Proceedings of the Heterogeneous Computing Workshop, 1998, pp. 57–69.

[18] C. McCreary, H. Gill, Automatic determination of grain size for efficient parallel processing, Comm. ACM 32 (9) (1989) 1073–1078.

[19] N. Mehdiratta, K. Ghose, A bottom-up approach to task scheduling on distributed memory multiprocessors, in: Proceedings of the International Conference on Parallel Processing, vol. II, 1994, pp. 151–154.

[20] C. Oguz, M.F. Ercan, T.C.E. Cheng, Y.F. Fung, Heuristic algorithms for multiprocessor task scheduling in a two-stage hybrid flow-shop, European J. Oper. Res. 149 (2) (2003) 390–403.

[21] M.A. Palis, J.-C. Liou, D.S.L. Wei, Task clustering and scheduling for distributed memory parallel architectures, IEEE Trans. Parallel Distrib. Systems 7 (1) (1996) 46–55.

[22] C.H. Papadimitriou, M. Yannakakis, Towards an architecture-independent analysis of parallel algorithms, SIAM J. Comput. 19 (2) (1990) 322–328.

[23] C.-I. Park, T.-Y. Choe, An optimal scheduling algorithm based on task duplication, IEEE Trans. Comput. 51 (4) (2002) 444–448.

[24] H.J. Park, B.K. Kim, An optimal scheduling algorithm for minimizing the computing period of cyclic synchronous tasks on multiprocessors, J. Systems Software 56 (3) (2001) 213–229.

[25] A. Radulescu, A.J.C. van Gemund, Low-cost task scheduling for distributed-memory machines, IEEE Trans. Parallel Distrib. Systems 13 (6) (2002) 648–658.

[26] V. Sarkar, Partitioning and Scheduling Parallel Programs for Multiprocessors, MIT Press, Cambridge, MA, 1989.

[27] G.C. Sih, E.A. Lee, A compile-time scheduling heuristic for interconnection-constrained heterogeneous processor architectures, IEEE Trans. Parallel Distrib. Systems 4 (2) (1993) 75–87.

[28] S. Srinivasan, N.K. Jha, Safety and reliability driven task allocation in distributed systems, IEEE Trans. Parallel Distrib. Systems 10 (3) (1999) 238–251.

[29] H. Topcuoglu, S. Hariri, M.-Y. Wu, Performance-effective and low-complexity task scheduling for heterogeneous computing, IEEE Trans. Parallel Distrib. Systems 13 (3) (2002) 260–274.

[30] C.M. Woodside, G.G. Monforton, Fast allocation of processes in distributed and parallel systems, IEEE Trans. Parallel Distrib. Systems 4 (2) (1993) 164–174.

[31] M.-Y. Wu, D.D. Gajski, Hypertool: a programming aid for message-passing systems, IEEE Trans. Parallel Distrib. Systems 1 (3) (1990) 330–343.

[32] T. Yang, A. Gerasoulis, DSC: scheduling parallel tasks on an unbounded number of processors, IEEE Trans. Parallel Distrib. Systems 5 (9) (1994) 951–967.

G.Q. Liu received his master degree from Tsinghua University, China, in 1998. He is currently working towards the Ph.D. degree at the National University of Singapore. His research interests include modeling and scheduling for distributed computing systems, distributed system reliability, multi-objective optimization, and parallel algorithms.

K.L. Poh received his Ph.D. in Engineering-Economic Systems in 1993 from Stanford University. He is active in teaching and research in the areas of decision analysis, decision systems, and operations research; he has authored numerous papers in these areas and has co-authored the book "Computing Systems Reliability" published by Kluwer Academic Publishers. He was President of the Operational Research Society of Singapore from 1999 to 2002. He currently serves on the editorial board of the Asia-Pacific Journal of Operational Research and was an active program committee member of the Uncertainty in Artificial Intelligence annual conference series.

M. Xie received his Ph.D. in Quality Technology in 1987 from Linkoping University in Sweden. He is active in teaching and research in the area of quality and reliability problems. He was awarded the prestigious LKY research fellowship in 1991. He has authored numerous papers and six books, including "Computing Systems Reliability" published by Kluwer Academic Publishers, and "Weibull Models" by John Wiley & Sons. He serves on the editorial boards of IEEE Transactions on Reliability, IIE Transactions, Quality Engineering, and several other international journals. Dr. Xie is a senior member of ASQ, IEEE and IIE.