
Page 1

Exam2 Review

Dr. Bernard Chen, Ph.D., University of Central Arkansas

Spring 2010

Page 2

Outline: Pipeline, Memory Hierarchy

Page 3

Parallel processing

A parallel processing system is able to perform concurrent data processing to achieve a faster execution time.

The system may have two or more ALUs and be able to execute two or more instructions at the same time.

The goal is to increase the throughput: the amount of processing that can be accomplished during a given interval of time.

Page 4

Parallel processing classification

Single instruction stream, single data stream – SISD

Single instruction stream, multiple data stream – SIMD

Multiple instruction stream, single data stream – MISD

Multiple instruction stream, multiple data stream – MIMD

Page 5

9.2 Pipelining

Instruction execution is divided into k segments, or stages.

An instruction exits pipe stage k-1 and proceeds into pipe stage k.

All pipe stages take the same amount of time, called one processor cycle.

The length of the processor cycle is determined by the slowest pipe stage.

Page 6

SPEEDUP

If we execute the same n tasks sequentially in a single processing unit, it takes n * k clock cycles.

The speedup gained by using the pipeline is:

Speedup_k = (n * k * T) / ((k + (n - 1)) * T) = n * k / (k + n - 1)

where T is the length of one clock cycle.

Page 7

Example

A non-pipeline system takes 100 ns to process a task; the same task can be processed in a FIVE-segment pipeline at 20 ns per segment.

Speedup ratio for 1000 tasks: (100 * 1000) / ((5 + 1000 - 1) * 20) = 4.98

However, if the task cannot be evenly divided…

Page 8

Example

A non-pipeline system takes 100 ns to process a task; the same task can be processed in a SIX-segment pipeline whose segment delays are 20 ns, 25 ns, 30 ns, 10 ns, 15 ns, and 30 ns.

Determine the speedup ratio of the pipeline for 10, 100, and 1000 tasks. What is the maximum speedup that can be achieved?

Page 9

Example Answer

Speedup ratio for 10 tasks: (100 * 10) / ((6 + 10 - 1) * 30) ≈ 2.22

Speedup ratio for 100 tasks: (100 * 100) / ((6 + 100 - 1) * 30) ≈ 3.17

Speedup ratio for 1000 tasks: (100 * 1000) / ((6 + 1000 - 1) * 30) ≈ 3.32

Maximum speedup: (100 * N) / ((6 + N - 1) * 30) → 100/30 = 10/3 as N grows
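
To check these numbers, a minimal Python sketch of the same arithmetic (the helper name pipeline_speedup is just illustrative; the delays and task counts are the ones from the example):

```python
# Sketch of the speedup arithmetic from the example above.
# The pipeline clock cycle is set by the slowest segment (30 ns here),
# while the non-pipelined system needs 100 ns per task.

def pipeline_speedup(t_nonpipe, t_cycle, k, n):
    """Speedup = (n * t_nonpipe) / ((k + n - 1) * t_cycle)."""
    return (n * t_nonpipe) / ((k + n - 1) * t_cycle)

segment_delays = [20, 25, 30, 10, 15, 30]   # ns, from the example
t_cycle = max(segment_delays)               # slowest stage sets the cycle: 30 ns
k = len(segment_delays)                     # 6 segments
t_nonpipe = 100                             # ns per task without pipelining

for n in (10, 100, 1000):
    print(n, "tasks:", round(pipeline_speedup(t_nonpipe, t_cycle, k, n), 2))

# As n grows, (k + n - 1) approaches n, so the speedup approaches
# t_nonpipe / t_cycle = 100 / 30 = 10/3.
print("maximum speedup:", round(t_nonpipe / t_cycle, 2))
```

The printed ratios (about 2.22, 3.17, and 3.32) match the expressions above and approach 10/3.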

Page 10

Instructions are separated into five steps:

1. Fetch the instruction
2. Decode the instruction
3. Fetch the operands from memory
4. Execute the instruction
5. Store the results in the proper place

Page 11

5-Stage Pipelining

S1: Fetch Instruction (FI), S2: Decode Instruction (DI), S3: Fetch Operand (FO), S4: Execute Instruction (EI), S5: Write Operand (WO)

[Space-time diagram: successive instructions enter the pipeline one cycle apart and flow through stages S1–S5 over clock cycles 1–9.]

Page 12

Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated cycle.

There are three classes of hazards:
Structural hazards
Data hazards
Branch hazards

Page 13

Data hazard

Example:
ADD R1 ← R2 + R3
SUB R4 ← R1 - R5
AND R6 ← R1 AND R7
OR R8 ← R1 OR R9
XOR R10 ← R1 XOR R11

Page 14

Data hazard

FO: fetch data value. WO: store the executed value.

[Space-time diagram: SUB reaches its FO stage before ADD's WO stage has stored R1, so the stale value would be fetched.]

Page 15

Data hazard

The delayed-load approach inserts no-operation instructions to avoid the data conflict:

ADD R1 ← R2 + R3
No-op
No-op
SUB R4 ← R1 - R5
AND R6 ← R1 AND R7
OR R8 ← R1 OR R9
XOR R10 ← R1 XOR R11
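
As a rough illustration of this delayed-load idea (a sketch, not code from the course), the following Python pads a register-transfer program with no-ops whenever an instruction reads a register that was written too recently; representing instructions as (name, dest, sources) tuples and using two delay slots are assumptions chosen to match the example above.

```python
# Sketch: insert no-ops so that a result-producing instruction and its
# consumers are separated by at least `slots` positions.

NOP = ("NOP", None, ())

def insert_noops(program, slots=2):
    out = []
    for name, dest, sources in program:
        # Look back at the most recently emitted instructions for one that
        # writes a register this instruction reads.
        for distance, (_, prev_dest, _) in enumerate(reversed(out), start=1):
            if distance > slots:
                break
            if prev_dest is not None and prev_dest in sources:
                out.extend([NOP] * (slots - distance + 1))   # pad the gap
                break
        out.append((name, dest, sources))
    return out

program = [
    ("ADD", "R1",  ("R2", "R3")),
    ("SUB", "R4",  ("R1", "R5")),
    ("AND", "R6",  ("R1", "R7")),
    ("OR",  "R8",  ("R1", "R9")),
    ("XOR", "R10", ("R1", "R11")),
]

for name, dest, sources in insert_noops(program):
    print(name if dest is None else f"{name} {dest} <- {', '.join(sources)}")
```

Run on the sequence above, it reproduces the listing on this slide: two no-ops after the ADD, and the remaining instructions unchanged because they are already far enough from it.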

Page 16

Data hazard

Page 17

Data hazard

It can be further solved by a simple hardware technique called forwarding (also called bypassing or short-circuiting).

The insight behind forwarding is that the result is not really needed by SUB until the ADD has actually produced it.

If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, the control logic selects the result from the ALU output instead of the value fetched from memory.
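
A minimal Python sketch of that check (register names and contents here are illustrative, not taken from the slides): if the previous instruction's destination matches the register about to be read, the value is taken straight from the ALU output.

```python
# Sketch of the forwarding (bypassing) selection described above.

def select_operand(reg, prev_dest, prev_alu_result, register_file):
    """Forward the previous ALU result if that instruction wrote `reg`."""
    if prev_dest is not None and reg == prev_dest:
        return prev_alu_result        # bypass path: value straight off the ALU
    return register_file[reg]         # normal path: read the stored value

register_file = {"R1": 0, "R2": 7, "R3": 5, "R5": 2}

# ADD R1 <- R2 + R3 has just executed; its result (12) is not stored yet.
prev_dest = "R1"
prev_alu_result = register_file["R2"] + register_file["R3"]

# SUB R4 <- R1 - R5 needs R1 immediately, so control forwards 12 rather than
# the stale 0 still held for R1.
a = select_operand("R1", prev_dest, prev_alu_result, register_file)
b = select_operand("R5", prev_dest, prev_alu_result, register_file)
print("SUB result:", a - b)           # 12 - 2 = 10
```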

Page 18

Data hazard

Page 19

Branch hazards

Branch hazards can cause a greater performance loss for pipelines than data hazards.

When a branch instruction is executed, it may or may not change the PC.

If a branch changes the PC to its target address, it is a taken branch.

Otherwise, it is untaken.

Page 20

Branch hazards

There are FOUR schemes to handle branch hazards:
Freeze scheme
Predict-untaken scheme
Predict-taken scheme
Delayed branch

Page 21

Branch Untaken (Freeze approach)

The simplest method of dealing with branches is to redo the fetch of the instruction following the branch.

[Space-time diagram: the untaken branch under the freeze approach; stages FI, DI, FO, EI, WO.]

Page 22

Branch Taken (Freeze approach)

The simplest method of dealing with branches is to redo the fetch of the instruction following the branch.

[Space-time diagram: the taken branch under the freeze approach.]

Page 23

Branch Untaken (Predicted-untaken)

[Space-time diagram: the untaken branch under the predict-untaken scheme.]

Page 24

Branch Taken (Predicted-untaken)

[Space-time diagram: the taken branch under the predict-untaken scheme.]

Page 25

Branch Untaken (Predicted-taken)

[Space-time diagram: the untaken branch under the predict-taken scheme.]

Page 26

Branch Taken (Predicted-taken)

[Space-time diagram: the taken branch under the predict-taken scheme.]

Page 27

Delayed Branch

A fourth scheme, used in some processors, is called delayed branch.

It is done at compile time: the compiler modifies the code.

The general format is:
branch instruction
delay slot
branch target if taken

Page 28

Delayed Branch Optimal

Page 29

Outline: Pipeline, Memory Hierarchy

Page 30

Memory Hierarchy

The main memory occupies a central position, able to communicate directly with the CPU and with auxiliary memory devices through an I/O processor.

A special very-high-speed memory called cache is used to increase the speed of processing by making current programs and data available to the CPU at a rapid rate.

Page 31

RAM

Page 32

ROM

Page 33

Memory Address Map

Page 34

Page 35

Cache memory

When the CPU refers to memory and finds the word in the cache, it is said to produce a hit.

Otherwise, it is a miss.

The performance of cache memory is frequently measured in terms of a quantity called the hit ratio:

Hit ratio = hits / (hits + misses)

Page 36

Cache memory

The basic characteristic of cache memory is its fast access time; therefore, very little or no time should be wasted when searching for words in the cache.

The transformation of data from main memory to cache memory is referred to as a mapping process. There are three types of mapping:

Associative mapping
Direct mapping
Set-associative mapping
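
To make direct mapping concrete, here is a small Python sketch; the cache geometry (8 lines, 4 words per block) and the address trace are illustrative assumptions, not values from the slides. Each address is split into a tag and a line index, and hits are counted with the hit-ratio definition given earlier.

```python
# Sketch of direct mapping: each main-memory block can live in exactly one
# cache line (index = block number mod number of lines).

NUM_LINES = 8        # cache lines (assumed)
BLOCK_SIZE = 4       # words per block (assumed)

def simulate(addresses):
    cache = {}                       # line index -> tag currently resident
    hits = misses = 0
    for address in addresses:
        block = address // BLOCK_SIZE
        index = block % NUM_LINES    # which cache line the block maps to
        tag = block // NUM_LINES     # identifies which block occupies that line
        if cache.get(index) == tag:
            hits += 1
        else:
            misses += 1
            cache[index] = tag       # load the block, evicting the old one
    return hits, misses

hits, misses = simulate([0, 1, 2, 3, 0, 64, 0, 1])
print("hit ratio =", hits / (hits + misses))
```

In this trace, address 64 maps to the same line as address 0, so the two blocks evict each other; that kind of conflict is what set-associative mapping relieves by giving each index a set of lines.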

Page 37

Associative mapping

Page 38

Direct Mapping

Page 39

Direct Mapping

Page 40

Set-Associative Mapping

Page 41

Average memory access time

Average memory access time = %instructions * (hit_time + instruction_miss_rate * miss_penalty) + %data * (hit_time + data_miss_rate * miss_penalty)

Page 42

Average memory access time

Assume 40% of the instructions are data-accessing instructions.

Let a hit take 1 clock cycle and the miss penalty be 100 clock cycles.

Assume the instruction miss rate is 4% and the data access miss rate is 12%. What is the average memory access time?

60% * (1 + 4% * 100) + 40% * (1 + 12% * 100)
= 0.6 * (5) + 0.4 * (13) = 8.2 (clock cycles)
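
The same computation as a one-function Python sketch, using exactly the percentages and rates from the example:

```python
# Sketch: average memory access time for the example above.

def average_access_time(frac_instr, frac_data, hit_time,
                        instr_miss_rate, data_miss_rate, miss_penalty):
    return (frac_instr * (hit_time + instr_miss_rate * miss_penalty) +
            frac_data  * (hit_time + data_miss_rate  * miss_penalty))

# 60% instruction references, 40% data references, 1-cycle hit,
# 100-cycle miss penalty, 4% instruction and 12% data miss rates.
print(round(average_access_time(0.6, 0.4, 1, 0.04, 0.12, 100), 1))   # 8.2
```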

Page 43

Page Fault

Page 44

Performance of Demand Paging

Page fault rate p, with 0 ≤ p ≤ 1.0: if p = 0, there are no page faults; if p = 1, every reference is a fault.

Effective Access Time (EAT) = (1 - p) * ma + p * page_fault_time
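
A quick Python sketch of the EAT formula; the 200 ns memory access time and the 8 ms page-fault service time are illustrative assumptions, not numbers from the slide.

```python
# Sketch: effective access time for demand paging,
# EAT = (1 - p) * ma + p * page_fault_time.

def effective_access_time(p, memory_access_ns, page_fault_ns):
    return (1 - p) * memory_access_ns + p * page_fault_ns

ma = 200                  # ns, assumed main-memory access time
fault = 8_000_000         # ns, assumed page-fault service time (8 ms)

for p in (0.0, 0.001, 1.0):
    print(f"p = {p}: EAT = {effective_access_time(p, ma, fault):,.1f} ns")
```

Even a fault rate of 0.1% pushes the effective access time from 200 ns to roughly 8,200 ns under these assumptions, which is why p must be kept very small.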

Page 45

9.4 Page Replacement

What if there is no free frame?

Page replacement: find some page in memory that is not really in use and swap it out.

In this case, the same page may be brought into memory several times.

Page 46

9.4 Page Replacement

Many approaches:
FIFO
Optimal page-replacement algorithm
Least recently used (LRU)
Second-chance algorithm
Least frequently used (LFU) page-replacement algorithm
Most frequently used (MFU) page-replacement algorithm
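
To compare two of these policies side by side, a small Python sketch that counts page faults under FIFO and LRU; the reference string and the three-frame memory are illustrative assumptions.

```python
# Sketch: page-fault counts for FIFO and LRU replacement on one trace.

from collections import OrderedDict, deque

def fifo_faults(refs, frames):
    queue, resident, faults = deque(), set(), 0
    for page in refs:
        if page not in resident:
            faults += 1
            if len(resident) == frames:
                resident.discard(queue.popleft())   # evict the oldest page
            queue.append(page)
            resident.add(page)
    return faults

def lru_faults(refs, frames):
    recent, faults = OrderedDict(), 0
    for page in refs:
        if page in recent:
            recent.move_to_end(page)                # mark as most recently used
        else:
            faults += 1
            if len(recent) == frames:
                recent.popitem(last=False)          # evict least recently used
            recent[page] = True
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2]
print("FIFO faults:", fifo_faults(refs, 3))
print("LRU  faults:", lru_faults(refs, 3))
```

On this particular trace LRU faults slightly less often than FIFO; the optimal algorithm would do better still, but it needs knowledge of future references.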