Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005

1

Pipelined and Parallel Computing

Partition for

Hongtao DuAICIP Research

Nov 3, 2005

2

Motivation

• Emergence of pipelined and parallel computing in multiprocessor systems.

• VLSI technologies with extensive parallel and pipelining computation capabilities have triggered the algorithm implementations directly on FPGA and ASIC.

• Performance impacts:– Number and type of computing resources– Interconnection and communication mechanisms– Partition among resources

3

Partitioning

• Definition: divide processes/data among computing resources such that critical processes / larger data sets are implemented on faster resources, while the less critical processes / smaller data sets implemented on slower and cheaper resources.

• Goal: optimizing the overall performance corresponding to resource constraints such as speed, power, and area.

• Spatial partitioning and temporal partitioning

4

Partition Scheme

5

Dependency

• Chain structure – Computation are divided into several stages, each of which

perform one operation of the whole procedure. – Partition task: keep the processing loads of all stages

roughly equal.– Serial computing and pipelined computing

• Tree structure– Independent branches are sent to multiple resources in

homogenous or heterogeneous computing environments. Each resource performs operations at the same time.

– Parallel computing and distributed computing

6

Combinations

• Chain-structured pipelined programs over chain-connected systems

• Multiple chain-structured parallel and pipelined programs over single-host multiple-satellite systems

• Multiple arbitrarily structured serial programs over single-host multiple-satellite system

• Single-tree structured parallel programs over single-host multiple identical satellites systems

7

Chain structure

• Partition principles– Challenge:

– Load balance– Shorten the execution time of the bottleneck stage

– Minimize overall computing time– Computing– Communication

8

Modeling – Layered Graph

• Definition– Node <i, j>: sub-process

– <i, j> connects to <j+1, *>

– Number of node: m: processn:resource

– Weight– Node: computing time– Edge: communication time

mji 1

nm2

9

Minimum Bottleneck Path

• Pre-definitions:– Label L(i) for each node i to record the minimum, over

all paths ending at i, of the maximum edge (bottleneck) on each path at any stage.

– Label W(e) as the weight for the edge e connecting node a (above) and b (below).

• For each node b, replace L(b) by

• Complexity

)]}(),(max[),(min{ aLeWbL

)()( 3nmOmNO

10

Dynamic Programming

• Pre-definition– : time required to run process i on resource j– : amount of data to be transmitted from i to (i+1)– : transmission speed between j and (j+1)

– The total time on j : (k to i are assigned to resource j)

– : time consumed by bottleneck resource given the first i are assigned to the first j resources.

– : pointer to the first process on the resource, given the first i are assigned to the first j resources.

–

• Objective: minimize

i

kl j

ilj

jki v

ctT

ijt

ic

jv

ijf

jip

thj

mnf

11

• Algorithm1. Initialization

where

2. First resourceLet and for

3. Recursion (Bellman equation)

where k is the smallest index holding the above equality

i

kl j

ilj

jki v

ctT

nmikj nj ,,2,1

111 ii Tf 11 ip 1,,2,1 nmi

)},{max(min 1,1,,

jkijk

ijkij Tff

jnmjji ,,1, nj ,,3,2

kp ji

12

Continue …

4. Assignment– Assign processes to resource n– Set and decrease n by 1– Iterate until all processes are assigned.

• Complexity

– Step1: – Step2: – Step3: – Step4:

mpp nm

nm ,,1,

1 nmpm

)( 2nmO

)(mO

)( 2nmO

)( 2nmO

)(nO

13

Modeling - Doubly Weighted Graph

• Definition:

– Each edge e:

– : sum weight– : bottleneck weight

– Path (P): the processing flow

END ,

)(),( ee )(e)(e

14

Sum-Bottleneck (SB) Path

• Pre-definition:– S(P): sum weight of path P

– B(P): the bottleneck weight

• Algorithm: search for the optimal path between two nodes such that the sum-bottleneck (SB) weight is minimum.

: sum-bottleneck weight of path POptimal SB path has weight 10.

)]}(),(min{max[ PBPS

)](),(max[ PBPS

)()( iePS

)](max[)( iePB

)(S

0)( B

15

Optimal SB Path Algorithm

• Algorithm: bottleneck weight 1. Remove all edges with 2. Delete all remaining weights. DWG

has been transformed into an ordinary weighted graph

3. If there is a path between s and t, then return the shortest path P between s and t.

• Using Dijkstra’s algorithm to find the shortest path P.

• Complexity:

TB

TBe )(

)log( 2 enO

16

Tree structure

• Current process depends only on their parent process.

• Independence between children processes permits the burden distribution from single resource to multi-resources.

17

Modeling – Directed Dual Graph

Direction: from lower ordered node to higher

18

: execution time on host : execution time on slave (all slaves are similar) : inter-processor communication (communication between parent to grandparent)

ih

is

ijc

ijtreesub

ki cs

ijki ch

19

• Objective: Finding the optimal assignment from A to H.

• Algorithm: Optimal SB path

• Complexity analysis– Multi-graph: multiple edge connecting the same pair– Given m nodes, there are m edges

)log( 2 mmO

20

Driving force

• Data-driven– How to divide data sets into different sizes for

multiple computing resources – How to coordinate data flows along different

directions such that brings appropriate data to the suitable processors at the right time.

• Function-driven– How to perform different functions of one task

on different computing resources at the same time.

21

Resource constraints

• Multi-processor– Homogenous system– Heterogeneous system

• Hardware/software (HW/SW) co-processing– Process scheduling

• VLSI– the communication time is ignorable– Capacity

– RAM size for image processing

– Computation complexity affects both design size and delay. Reducing computation complexity is necessary.

– Decrease image size– Reduce RAM size, therefore matching FPGA capacity constraint– Reduce clock cycle numbers of sub-processes, and increase overall

processing frequency

22

Analysis Tools

• DWG

• DAG (Directed Acyclic Graph)

• CDFG (Control and Data Flow Graph)

• Petri Net PN := (P, T, I, O, M)P - places; T – transitions; I - input arc connections between places and transitions; O - output arc connections between transitions and places; M – the number of tokens.

23

To be continued ……

Thank you!

Documents

Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Nov 3, 2005