Exam 2 Review
Dr. Bernard Chen, Ph.D., University of Central Arkansas
Spring 2010
Outline: Pipeline, Memory Hierarchy
Parallel processing
A parallel processing system is able to perform concurrent data processing to achieve a faster execution time.
The system may have two or more ALUs and be able to execute two or more instructions at the same time.
The goal is to increase throughput: the amount of processing that can be accomplished during a given interval of time.
Parallel processing classification
Single instruction stream, single data stream – SISD
Single instruction stream, multiple data stream – SIMD
Multiple instruction stream, single data stream – MISD
Multiple instruction stream, multiple data stream – MIMD
9.2 Pipelining
Instruction execution is divided into k segments or stages.
An instruction exits pipe stage k-1 and proceeds into pipe stage k.
All pipe stages take the same amount of time, called one processor cycle.
The length of the processor cycle is determined by the slowest pipe stage.
[Figure: an instruction stream flowing through k pipeline segments.]
SPEEDUP
If we execute the same n tasks sequentially in a single processing unit, it takes n * k clock cycles; a k-segment pipeline completes them in k + n - 1 cycles.
The speedup gained by using the pipeline is:
Speedup_k = (n * k * t_p) / ((k + n - 1) * t_p) = n * k / (k + n - 1)
Example
A non-pipeline system takes 100 ns to process a task; the same task can be processed in a FIVE-segment pipeline at 20 ns per segment.
Speedup ratio for 1000 tasks: (100 * 1000) / ((5 + 1000 - 1) * 20) = 4.98
However, if the task cannot be evenly divided…
Example
A non-pipeline system takes 100 ns to process a task; the same task can be processed in a SIX-segment pipeline with segment delays of 20 ns, 25 ns, 30 ns, 10 ns, 15 ns, and 30 ns (the processor cycle is set by the slowest segment, 30 ns).
Determine the speedup ratio of the pipeline for 10, 100, and 1000 tasks. What is the maximum speedup that can be achieved?
Example Answer
Speedup ratio for 10 tasks: (100 * 10) / ((6 + 10 - 1) * 30) ≈ 2.22
Speedup ratio for 100 tasks: (100 * 100) / ((6 + 100 - 1) * 30) ≈ 3.17
Speedup ratio for 1000 tasks: (100 * 1000) / ((6 + 1000 - 1) * 30) ≈ 3.32
Maximum speedup: (100 * N) / ((6 + N - 1) * 30) → 10/3 as N grows
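To sanity-check these ratios, here is a small Python sketch (the function and variable names are my own, not from the slides) that evaluates the speedup formula for both examples:

```python
def pipeline_speedup(t_task, n_tasks, k_segments, t_cycle):
    """Speedup = sequential time / pipelined time, where a k-segment
    pipeline finishes n tasks in (k + n - 1) cycles of t_cycle each."""
    return (t_task * n_tasks) / ((k_segments + n_tasks - 1) * t_cycle)

# Five-segment pipeline, 20 ns per segment, 1000 tasks -> ~4.98
print(pipeline_speedup(100, 1000, 5, 20))

# Six-segment pipeline; the cycle time is the slowest segment, 30 ns
for n in (10, 100, 1000):
    print(n, round(pipeline_speedup(100, n, 6, 30), 2))

# As n grows, the ratio approaches 100 / 30 = 10/3
```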
Instruction execution is separated into five steps:
1. Fetch the instruction
2. Decode the instruction
3. Fetch the operands from memory
4. Execute the instruction
5. Store the results in the proper place
5-Stage Pipelining
S1: Fetch Instruction (FI), S2: Decode Instruction (DI), S3: Fetch Operand (FO), S4: Execute Instruction (EI), S5: Write Operand (WO)
[Space-time diagram: five instructions enter stages S1-S5 one cycle apart and complete in 9 clock cycles.]
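The space-time chart can also be generated programmatically; below is a minimal Python sketch (illustrative only) that prints which stage each of five instructions occupies in each of the nine clock cycles:

```python
STAGES = ["FI", "DI", "FO", "EI", "WO"]

def space_time_chart(n_instructions):
    """Print one row per instruction; instruction i (1-based) is in
    stage s (0-based) during clock cycle i + s."""
    total_cycles = len(STAGES) + n_instructions - 1
    for i in range(1, n_instructions + 1):
        cells = []
        for cycle in range(1, total_cycles + 1):
            s = cycle - i  # stage index for this instruction this cycle
            cells.append(STAGES[s] if 0 <= s < len(STAGES) else "--")
        print(f"I{i}: " + " ".join(cells))

space_time_chart(5)  # five instructions complete in 9 cycles
```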
Pipeline Hazards
There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated cycle.
There are three classes of hazards: structural hazards, data hazards, and branch hazards.
Data hazard
Example:
ADD R1 ← R2 + R3
SUB R4 ← R1 - R5
AND R6 ← R1 AND R7
OR R8 ← R1 OR R9
XOR R10 ← R1 XOR R11
Data hazard
In this pipeline, FO fetches the data value and WO stores the executed value.
[Space-time diagram: SUB reaches its FO stage before ADD's WO stage has written R1, so a stale value would be fetched.]
Data hazard
The delayed load approach inserts no-operation instructions to avoid the data conflict:
ADD R1 ← R2 + R3
No-op
No-op
SUB R4 ← R1 - R5
AND R6 ← R1 AND R7
OR R8 ← R1 OR R9
XOR R10 ← R1 XOR R11
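The insertion can be mechanized. The following Python sketch is a toy, assuming a two-slot hazard window as in the slide (the value written in WO is read in FO three stages later); it pads a program with no-ops until no instruction reads a register written in the previous two slots:

```python
def insert_noops(program, gap=2):
    """Pad with no-ops so no instruction reads a register written by
    any of the previous `gap` slots. Instructions are (dest, sources)
    pairs; a no-op is ("NOP", [])."""
    out = []
    for dest, sources in program:
        while any(src in {d for d, _ in out[-gap:]} for src in sources):
            out.append(("NOP", []))
        out.append((dest, sources))
    return out

prog = [("R1", ["R2", "R3"]),   # ADD R1 <- R2 + R3
        ("R4", ["R1", "R5"]),   # SUB R4 <- R1 - R5 (reads R1)
        ("R6", ["R1", "R7"])]   # AND R6 <- R1 AND R7
for dest, sources in insert_noops(prog):
    print(dest, sources)        # two NOPs appear between ADD and SUB
```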
Data hazard
The conflict can be further solved by a simple hardware technique called forwarding (also called bypassing or short-circuiting).
The insight behind forwarding is that the result is not really needed by SUB until after ADD has actually produced it.
If the forwarding hardware detects that the previous ALU operation has written the register that is a source for the current ALU operation, control logic selects the result from the ALU instead of fetching it from memory.
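A minimal sketch of that selection logic (Python; the function and latch names are hypothetical, chosen for illustration):

```python
def read_operand(reg, regfile, prev_dest, prev_alu_out):
    """Forwarding mux: if the previous ALU op wrote this register,
    take the value from the ALU output latch (the bypass path)
    instead of performing a normal operand fetch."""
    if reg == prev_dest:
        return prev_alu_out      # forwarded / short-circuited value
    return regfile[reg]          # normal operand fetch

regfile = {"R1": 0, "R2": 7, "R3": 5, "R5": 2}
# ADD R1 <- R2 + R3 has executed but not yet written back (R1 is stale)
prev_dest, prev_alu_out = "R1", regfile["R2"] + regfile["R3"]
# SUB R4 <- R1 - R5 picks up R1 through the forwarding path
print(read_operand("R1", regfile, prev_dest, prev_alu_out) - regfile["R5"])  # 10
```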
Branch hazards
Branch hazards can cause a greater performance loss for pipelines than data hazards can.
When a branch instruction is executed, it may or may not change the PC.
If a branch changes the PC to its target address, it is a taken branch; otherwise, it is untaken.
Branch hazards
There are FOUR schemes to handle branch hazards:
1. Freeze scheme
2. Predict-untaken scheme
3. Predict-taken scheme
4. Delayed branch
Branch Untaken (Freeze approach)
The simplest method of dealing with branches is to redo the fetch of the instruction following the branch.
[Space-time diagram: the pipeline refetches the fall-through instruction once the branch resolves as untaken.]
Branch Taken (Freeze approach)
The simplest method of dealing with branches is to redo the fetch of the instruction following the branch.
[Space-time diagram: the pipeline refetches, this time from the branch target, once the branch resolves as taken.]
Branch Untaken (Predicted-untaken)
[Space-time diagram: fetch continues down the fall-through path; the prediction is correct, so no cycles are lost.]
Branch Taken (Predicted-untaken)
[Space-time diagram: the fall-through instructions fetched after the branch are squashed, and fetch restarts at the branch target.]
Branch Untaken (Predicted-taken)
[Space-time diagram: the instructions fetched from the branch target are squashed when the branch falls through.]
Branch Taken (Predicted-taken)
[Space-time diagram: fetch proceeds at the branch target as predicted.]
Delayed Branch
A fourth scheme, in use in some processors, is called delayed branch. It is done at compile time: the compiler rearranges the code so the slot after the branch holds a useful instruction.
The general format is:
branch instruction
delay slot
branch target (if taken)
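As a sketch of what the compiler does, the toy Python pass below (the instruction format and helper names are my own) moves the instruction before a branch into the delay slot when the branch does not read its result, and otherwise falls back to a no-op:

```python
def dest(instr):
    """Destination register of an ALU instruction ('OP Rd,Rs,Rt')."""
    return instr.split()[1].split(",")[0]

def branch_sources(branch):
    """Registers a branch reads ('Bcc Rs,Rt,label'); drop the label."""
    return branch.split()[1].split(",")[:-1]

def fill_delay_slot(block):
    """`block` ends with a branch. If the branch is independent of the
    instruction just before it, that instruction fills the delay slot."""
    *body, branch = block
    if body and dest(body[-1]) not in branch_sources(branch):
        return body[:-1] + [branch, body[-1]]
    return body + [branch, "NOP"]

print(fill_delay_slot(["ADD R1,R2,R3", "BEQ R4,R7,loop"]))
# ['BEQ R4,R7,loop', 'ADD R1,R2,R3']  -- ADD fills the slot
print(fill_delay_slot(["SUB R4,R5,R6", "BEQ R4,R7,loop"]))
# ['SUB R4,R5,R6', 'BEQ R4,R7,loop', 'NOP']  -- dependent, so NOP
```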
Delayed Branch (Optimal)
[Figure: the best choice fills the delay slot with an independent instruction scheduled from before the branch.]
Outline: Pipeline, Memory Hierarchy
Memory Hierarchy
The main memory occupies a central position: it can communicate directly with the CPU, and with auxiliary memory devices through an I/O processor.
A special very-high-speed memory called a cache is used to increase the speed of processing by making current programs and data available to the CPU at a rapid rate.
Memory Address Map
[Figure: address map showing the RAM and ROM regions.]
Cache memory
When the CPU refers to memory and finds the word in the cache, it is said to produce a hit; otherwise, it is a miss.
The performance of cache memory is frequently measured in terms of a quantity called the hit ratio:
Hit ratio = hits / (hits + misses)
Cache memory
The basic characteristic of cache memory is its fast access time; therefore, very little or no time should be wasted when searching for words in the cache.
The transformation of data from main memory to cache memory is referred to as a mapping process. There are three types of mapping:
1. Associative mapping
2. Direct mapping
3. Set-associative mapping
[Figures: the associative, direct (two views), and set-associative mapping organizations.]
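For direct mapping in particular, the address arithmetic is easy to show in code. This Python sketch (the block size and line count are made-up example values) splits a byte address into tag, line index, and block offset:

```python
def direct_map(address, block_size=4, num_lines=128):
    """Direct-mapped placement: low-order bits of the block address
    pick the cache line; the remaining high-order bits are the tag."""
    offset = address % block_size
    block = address // block_size
    index = block % num_lines        # which cache line
    tag = block // num_lines         # stored to disambiguate lines
    return tag, index, offset

print(direct_map(0x1234))                # (9, 13, 0)
print(direct_map(0x1234 + 4 * 128))      # (10, 13, 0): same line, new tag
```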
Average memory access time
Average memory access time = %instructions * (hit_time + instruction_miss_rate * miss_penalty) + %data * (hit_time + data_miss_rate * miss_penalty)
Average memory access time
Assume 40% of the instructions are data-accessing instructions.
Let a hit take 1 clock cycle, and let the miss penalty be 100 clock cycles.
Assume the instruction miss rate is 4% and the data-access miss rate is 12%. What is the average memory access time?
60% * (1 + 4% * 100) + 40% * (1 + 12% * 100) = 0.6 * 5 + 0.4 * 13 = 8.2 clock cycles
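The same arithmetic as a small Python helper (the names are my own), handy for checking exam problems:

```python
def amat(frac_instr, frac_data, hit_time, instr_miss_rate, data_miss_rate, penalty):
    """Average memory access time, weighted over instruction and data
    references, using the formula on the previous slide."""
    return (frac_instr * (hit_time + instr_miss_rate * penalty)
            + frac_data * (hit_time + data_miss_rate * penalty))

print(amat(0.6, 0.4, 1, 0.04, 0.12, 100))  # 8.2 clock cycles
```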
Page Fault
Performance of Demand Paging
Page fault rate 0 ≤ p ≤ 1.0: if p = 0, there are no page faults; if p = 1, every reference is a fault.
Effective Access Time (EAT) = (1 - p) * ma + p * page_fault_time, where ma is the memory access time.
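For instance (Python sketch; the 200 ns memory access time and 8 ms fault service time are illustrative numbers, not from the slide):

```python
def effective_access_time(p, mem_access_ns, fault_service_ns):
    """EAT = (1 - p) * ma + p * page_fault_time."""
    return (1 - p) * mem_access_ns + p * fault_service_ns

# Even one fault per 1000 references dominates the average:
print(effective_access_time(0.001, 200, 8_000_000))  # 8199.8 ns
```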
9.4 Page Replacement
What if there is no free frame?
Page replacement: find some page in memory that is not really in use and swap it out.
In this case, the same page may be brought into memory several times.
9.4 Page Replacement
There are many approaches:
1. FIFO
2. Optimal page-replacement algorithm
3. Least-recently-used (LRU)
4. Second-chance algorithm
5. Least-frequently-used (LFU) page replacement
6. Most-frequently-used (MFU) page replacement
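A minimal sketch contrasting two of these policies (Python; the reference string is an arbitrary example) counts page faults under FIFO and LRU:

```python
from collections import OrderedDict

def fifo_faults(refs, frames):
    """Count page faults when the oldest-loaded page is evicted."""
    memory, faults = [], 0
    for page in refs:
        if page not in memory:
            faults += 1
            if len(memory) == frames:
                memory.pop(0)               # evict oldest arrival
            memory.append(page)
    return faults

def lru_faults(refs, frames):
    """Count page faults when the least-recently-used page is evicted."""
    memory, faults = OrderedDict(), 0
    for page in refs:
        if page in memory:
            memory.move_to_end(page)        # refresh recency on a hit
        else:
            faults += 1
            if len(memory) == frames:
                memory.popitem(last=False)  # evict least recently used
            memory[page] = True
    return faults

refs = [7, 0, 1, 2, 0, 3, 0, 4, 2, 3, 0, 3, 2]
print(fifo_faults(refs, 3), lru_faults(refs, 3))  # 10 9
```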