Chapter 9 Pipeline and Vector Processing Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2009


Page 1: Lec18 pipeline

Chapter 9 Pipeline and Vector Processing

Dr. Bernard Chen Ph.D., University of Central Arkansas

Spring 2009

Page 2: Lec18 pipeline

Parallel processing

A parallel processing system is able to perform concurrent data processing to achieve faster execution time

The system may have two or more ALUs and be able to execute two or more instructions at the same time

Goal is to increase the throughput – the amount of processing that can be accomplished during a given interval of time

Page 3: Lec18 pipeline

Parallel processing classification

Single instruction stream, single data stream – SISD

Single instruction stream, multiple data stream – SIMD

Multiple instruction stream, single data stream – MISD

Multiple instruction stream, multiple data stream – MIMD

Page 4: Lec18 pipeline

Single instruction stream, single data stream – SISD

A single computer containing a control unit, a processor unit, and a memory unit

Instructions are executed sequentially. Parallel processing may be achieved by means of multiple functional units or by pipeline processing

Page 5: Lec18 pipeline

Single instruction stream, multiple data stream – SIMD

Represents an organization that includes many processing units under the supervision of a common control unit.

All processors receive the same instruction from the common control unit, but operate on different data.

Page 6: Lec18 pipeline

Multiple instruction stream, single data stream – MISD

Of theoretical interest only

Processors receive different instructions, but operate on the same data.

Page 7: Lec18 pipeline

Multiple instruction stream, multiple data stream – MIMD

A computer system capable of processing several programs at the same time.

Most multiprocessor and multicomputer systems can be classified in this category

Page 8: Lec18 pipeline

Pipelining: Laundry Example

A small laundry has one washer, one dryer, and one operator. It takes 90 minutes to finish one load:

Washer takes 30 minutes

Dryer takes 40 minutes

“Operator folding” takes 20 minutes

Four loads to process: A, B, C, D

Page 9: Lec18 pipeline

Sequential Laundry

This operator schedules loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new load until he is done with the previous one.

The process is sequential. Sequential laundry takes 6 hours for 4 loads.

[Figure: timeline from 6 PM to midnight. Loads A, B, C, D in task order, each taking 30 + 40 + 20 = 90 min, one after another.]

Page 10: Lec18 pipeline

Efficiently scheduled laundry: Pipelined Laundry

Operator starts work ASAP

Another operator asks for loads to be delivered to the laundry every 40 minutes. Pipelined laundry takes 3.5 hours for 4 loads.

[Figure: timeline starting at 6 PM. Loads A, B, C, D overlap in task order; stage times along the time axis are 30, 40, 40, 40, 40, 20 min, 210 min in total.]
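The 6-hour and 3.5-hour totals can be checked with a small simulation. This is only a sketch: the function name `finish_times` is made up for illustration, and it assumes every load is ready at time 0 and starts washing as soon as the washer is free, which gives the same totals as delivering loads every 40 minutes.

```python
def finish_times(stage_minutes, loads, pipelined):
    """Return the minute at which each load leaves the last stage."""
    stage_free = [0] * len(stage_minutes)  # when each machine is next available
    finished = []
    previous_done = 0
    for _ in range(loads):
        # Sequential: wait until the previous load is completely done.
        # Pipelined: a load may start as soon as the washer is free.
        t = 0 if pipelined else previous_done
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])  # wait if the machine is still busy
            t = start + minutes
            stage_free[s] = t
        finished.append(t)
        previous_done = t
    return finished


STAGES = [30, 40, 20]  # washer, dryer, folding (minutes)
print(finish_times(STAGES, 4, pipelined=False)[-1])  # 360 minutes = 6 hours
print(finish_times(STAGES, 4, pipelined=True)[-1])   # 210 minutes = 3.5 hours
```

In the pipelined run, each load after the first finishes 40 minutes after the one before it, because the dryer is the slowest stage.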

Page 11: Lec18 pipeline

Pipelining Facts

Multiple tasks operate simultaneously

Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload

Pipeline rate is limited by the slowest pipeline stage

Potential speedup = number of pipe stages

Unbalanced lengths of pipe stages reduce speedup

Time to “fill” the pipeline and time to “drain” it reduce speedup

[Figure: pipelined laundry timeline from 6 PM to 9 PM for loads A, B, C, D, with stage times 30, 40, 40, 40, 40, 20 min. Annotation: the washer waits for the dryer for 10 minutes.]

Page 12: Lec18 pipeline

9.2 Pipelining

• Decomposes a sequential process into segments.

• Divides the processor into segment processors, each dedicated to a particular segment.

• Each segment is executed in its dedicated segment processor, which operates concurrently with all other segments.

• Information flows through these multiple hardware segments.

Page 13: Lec18 pipeline

9.2 Pipelining

Instruction execution is divided into k segments or stages

Instruction exits pipe stage k-1 and proceeds into pipe stage k

All pipe stages take the same amount of time, called one processor cycle

Length of the processor cycle is determined by the slowest pipe stage

[Figure: k segments]

Page 14: Lec18 pipeline

9.2 Pipelining

Suppose we want to perform the combined multiply and add operations with a stream of numbers:

Ai * Bi + Ci for i =1,2,3,…,7

Page 15: Lec18 pipeline

9.2 Pipelining

The suboperations performed in each segment of the pipeline are as follows:

Segment 1: R1 ← Ai, R2 ← Bi

Segment 2: R3 ← R1 * R2, R4 ← Ci

Segment 3: R5 ← R3 + R4
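The register transfers above can be traced clock by clock. The sketch below is an illustration rather than the textbook's implementation: the function name `multiply_add_pipeline` is hypothetical, and `None` stands in for a register that holds no valid data yet.

```python
def multiply_add_pipeline(a, b, c):
    """Simulate the three segments clock by clock.

    Returns (results, clock_count).
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None
    results = []
    clock = 0
    while len(results) < n:
        clock += 1
        # Update from the last segment backwards, so each segment reads
        # what its predecessor latched on the previous clock.
        r5 = r3 + r4 if r3 is not None else None  # segment 3: add
        if r1 is not None:                        # segment 2: multiply, load Ci
            r3, r4 = r1 * r2, c[clock - 2]
        else:
            r3 = r4 = None
        if clock <= n:                            # segment 1: load Ai and Bi
            r1, r2 = a[clock - 1], b[clock - 1]
        else:
            r1 = r2 = None
        if r5 is not None:
            results.append(r5)
    return results, clock


A = [1, 2, 3, 4, 5, 6, 7]
B = [2, 2, 2, 2, 2, 2, 2]
C = [10, 20, 30, 40, 50, 60, 70]
values, clocks = multiply_add_pipeline(A, B, C)
print(values)  # [12, 24, 36, 48, 60, 72, 84]
print(clocks)  # 9 clock cycles for 7 data sets on 3 segments
```

The first result appears after 3 clocks, and one more result follows on every clock after that.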

Page 16: Lec18 pipeline
Page 17: Lec18 pipeline
Page 18: Lec18 pipeline

Pipeline Performance

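n: number of instructions

k: number of stages in the pipeline

τ: clock cycle time

Tk: total time to complete n instructions on the k-stage pipeline

$$T_k = (k + (n - 1))\,\tau$$

$$\text{Speedup} = \frac{T_1}{T_k} = \frac{n k \tau}{(k + (n - 1))\,\tau} = \frac{nk}{k + (n - 1)}$$

n is equivalent to the number of loads in the laundry example

k is the number of stages (washing, drying and folding)

Clock cycle is the slowest task time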

Page 19: Lec18 pipeline

SPEEDUP

• Consider a k-segment pipeline operating on n data sets. (In the laundry example, k = 3 and n = 4.)

> It takes k clock cycles to fill the pipeline and get the first result from the output of the pipeline.

> After that, the remaining (n - 1) results come out one per clock cycle.

> It therefore takes (k + n - 1) clock cycles to complete the task.

Page 20: Lec18 pipeline

SPEEDUP

If we execute the same task sequentially in a single processing unit, it takes (k * n) clock cycles.

• The speedup gained by using the pipeline is:

S = k * n / (k + n - 1 )
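As a quick check, the formula can be written as a few small functions. The function names are hypothetical, and all counts are in clock cycles.

```python
def pipeline_cycles(k, n):
    """Clock cycles for n data sets on a k-segment pipeline."""
    return k + n - 1


def sequential_cycles(k, n):
    """Clock cycles when each data set passes through all k steps alone."""
    return k * n


def speedup(k, n):
    return sequential_cycles(k, n) / pipeline_cycles(k, n)


print(speedup(3, 4))  # 12 / 6 = 2.0
```

With k = 3 and n = 4, the pipeline needs 6 cycles instead of 12, so S = 2.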

Page 21: Lec18 pipeline

SPEEDUP

S = k * n / (k + n - 1)

For n >> k (such as 1 million data sets on a 3-stage pipeline),

S ~ k

So for large data sets the speedup approaches the number of functional units. This is because the multiple functional units can work in parallel except during the filling and draining cycles.
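To see where S ~ k comes from, divide the numerator and denominator of the speedup formula by n. This worked step is added for illustration and uses the same symbols k and n as above.

```latex
S = \frac{k\,n}{k + n - 1}
  = \frac{k}{1 + \dfrac{k - 1}{n}}
  \;\longrightarrow\; k \quad \text{as } n \to \infty

% Example from the slide: k = 3, n = 10^6
S = \frac{3 \times 10^{6}}{10^{6} + 2} \approx 2.99999
```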

Page 22: Lec18 pipeline

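Example: 6 tasks, divided into 4 segments

Clock cycle:  1   2   3   4   5   6   7   8   9
Segment 1:    T1  T2  T3  T4  T5  T6  --  --  --
Segment 2:    --  T1  T2  T3  T4  T5  T6  --  --
Segment 3:    --  --  T1  T2  T3  T4  T5  T6  --
Segment 4:    --  --  --  T1  T2  T3  T4  T5  T6

(-- marks an idle segment)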
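A space-time diagram like this one can be generated for any k and n. The sketch below is illustrative, with a hypothetical function name, and assumes every task spends exactly one clock cycle in each segment.

```python
def space_time_diagram(k, n):
    """Rows are segments, columns are clock cycles.

    Each entry is the task number in that segment, or 0 if it is idle.
    """
    cycles = k + n - 1
    table = []
    for segment in range(k):
        row = []
        for cycle in range(cycles):
            # 0-based: task i occupies segment s during cycle i + s
            task = cycle - segment
            row.append(task + 1 if 0 <= task < n else 0)
        table.append(row)
    return table


for s, row in enumerate(space_time_diagram(4, 6), start=1):
    cells = " ".join(f"T{t}" if t else "--" for t in row)
    print(f"Segment {s}: {cells}")
```

For k = 4 and n = 6 the table has 4 + 6 - 1 = 9 columns, matching the 9 clock cycles in the example.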