
CS 152 Project

A Team of Five, and Two Processors Gift For Granny

Group: Sut-Hong (Erica) Chan Tun-I (Ada) Kok Celia La John Pham Sheng-Liang Song

TA: Victor Wen

Date: Dec 9, 99


Table of Contents

1) Introduction and Summary

(a) Feature summary of subprojects: (1) Multiprocessing

(2) Second-level Cache

(3) Stream Buffer

(4) Write Buffer

(b) Overall top-level block diagram of processor

(c) Performance summary for the final test programs

2) All Features Description

(a) (1) Multiprocessor Design

(2) Second-level Cache

(3) Stream Buffer

(4) Write Buffer

(b) Extra Feature

3) Performance Summary

a) Critical Path

i) Top 3 critical paths

ii) Latencies

b) Performance Analysis i) Compare with Lab6 processor ii) Statistics

iii) Explanations for better/worse performance

4) Testing Philosophy

Appendix: Block Summary, Test Programs, VHDL and Online Logs


1) (a) Feature Summary of Subprojects

In lab 6, we introduced the real memory, as opposed to the ideal one, and we know that memory accesses take up a lot of our CPU time. Therefore, we try to minimize memory accesses in this lab. We have chosen a number of subprojects that we expect will help to improve our processor performance.

(1) Multiprocessor

For the larger subproject, we have chosen to implement multiprocessing. The multiprocessor issues two instructions at one time so that we can theoretically go twice as fast. The caches from the two CPUs are shared with each other for coherency, with the side effect that they provide cache hits to one another.

(2) Second-level Cache

The second-level cache sits between the 1st-level cache and the DRAM. It has 4-word cache lines. Misses in the 1st-level cache result in accesses to the 2nd-level cache rather than to main memory.

We decided that the 2nd-level cache should interface only with the data cache, not the instruction cache. This is because instructions are sequential, so the stream buffer already helps to decrease the penalty for a miss in the instruction cache, whereas the data cache stores data from arbitrary memory locations.

(3) Stream Buffer

The stream buffer is placed between the first-level cache and the second-level cache. When our datapath requests an instruction from the first-level cache and misses, it looks for the instruction in the stream buffer. A miss in the stream buffer results in filling 4-word blocks from the second-level cache and reading 2-word blocks into the first-level cache.


(4) Write Buffer

The write buffer sits between the first-level cache and the arbiter. When a 2-word block is kicked out of the 1st-level cache, it is first placed in the write buffer before being written back to memory. This temporary storage is beneficial because the processor can begin executing other instructions while the memory controller takes care of writing the contents of the write buffer to memory. In this way, the pipeline doesn't need to be stalled and can do more useful work.

b) Overall top-level block diagram of the processor

(c) Performance summary

If we run the multiprocessor design on final_1.s, the total number of cycles is reduced from about 25,000 down to about 11,000. The hit rate also increases a lot, since the caches from the two processors combine to form a bigger one, which provides more space for keeping data and results in more hits. If we run the design with the write buffer on final_1.s, we see that it improves performance by 3%, saving a total of 869 cycles. This is lower than we had expected.

2) All Features Description

(1) Multiprocessor

We are planning to use two processors with one memory space. The problem with sharing the memory space while using write-back caches is that the caches may become inconsistent. Suppose the two caches are sharing a piece of data: when we write a value in one cache, we need to update the other, or else the two caches will hold inconsistent data. The caches we really have to worry about are the two data caches, because they are written to.

Most multiprocessor designs keep a "snoop" unit attached to each cache, which updates or invalidates entries that are dirtied by other caches. These snoop units look at the other caches and, when they have to, take over the caches they monitor to make an update or invalidation. To optimize the updates in our two-processor design, the snoop units are built directly into the caches. Our caches can essentially read or write two values at the same time; the two caches combine to act like one cache with two read/write ports.

If we only keep one copy of each entry across the two caches, then we never have to waste time making sure values are consistent. The tricky part is making sure that there are no duplicate copies of values in the two caches. This means:

- if the other cache has a hit, use its copy of the data rather than trying to read in our own copy
- when a value we need is being written back by the other cache, wait until the writeback is finished
- if the other cache will be reading a value that we want, wait for it to finish reading that value, and use only one copy
- if both caches miss on the same value, let one fetch the value while the other waits; this requires a special SNOOP ARBITRATION unit so that both caches don't wait

We also updated the monitor to check the following (a minimal sketch of the first check appears after this list):

- for each address, there should be only ONE hit between the caches; more than one hit means an entry is copied in more than one place
- a cache should only get a snoop stall in the idle state; if it has already begun its writeback and there is a conflict later on, then the other cache should be the one to wait
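As a sketch, assuming hypothetical signal names hit_a and hit_b for the hit lines of the two data caches (our actual monitor signals may differ), the one-hit check could be written as a VHDL assertion:

  library ieee;
  use ieee.std_logic_1164.all;

  entity one_hit_monitor is
    port (clk          : in std_logic;
          hit_a, hit_b : in std_logic);  -- hit lines of the two data caches
  end one_hit_monitor;

  architecture behavioral of one_hit_monitor is
  begin
    process (clk)
    begin
      if falling_edge(clk) then
        -- Each entry lives in exactly one cache, so at most one cache
        -- may hit on any snooped address.
        assert not (hit_a = '1' and hit_b = '1')
          report "coherence violation: entry duplicated in both caches"
          severity error;
      end if;
    end process;
  end behavioral;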

(2) Second-level Cache

The Process Logic

The process for the second-level cache can be broken into the following steps:
1) Respond to asynchronous inputs (the output response depends on the current state)
2) On clock falls, determine the next state and the initial outputs for that state

Here are the Input/Output signals for the Level 2 Cache; the same interface appears on both its Arbiter side and its Mem Ctrl side:

Inputs:  request, r_w, address, din
Outputs: wait, dout

Whenever one of the above input signals changes, the output signals are immediately set, depending on the current state. When the clock actually falls, the Level-2 Cache takes the values of the input signals at that moment, using them to determine the next state. After a transition into a new state, the output signals are initialized.
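A minimal VHDL sketch of this two-part style (the state names and the exact output logic are illustrative, not our actual implementation):

  library ieee;
  use ieee.std_logic_1164.all;

  entity l2_fsm_sketch is
    port (clk, request : in  std_logic;
          wait_out     : out std_logic);
  end l2_fsm_sketch;

  architecture behavioral of l2_fsm_sketch is
    type state_type is (IDLE, SERVICE);  -- illustrative states only
    signal state : state_type := IDLE;
  begin
    -- 1) asynchronous output response, depending on the current state
    outputs : process (state, request)
    begin
      if state = SERVICE or request = '1' then
        wait_out <= '1';
      else
        wait_out <= '0';
      end if;
    end process;

    -- 2) on clock falls, sample the inputs to determine the next state
    transitions : process (clk)
    begin
      if falling_edge(clk) then
        case state is
          when IDLE =>
            if request = '1' then state <= SERVICE; end if;
          when SERVICE =>
            state <= IDLE;  -- request serviced (illustrative)
        end case;
      end if;
    end process;
  end behavioral;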

The Cache Memory Architecture

The second-level cache is a 16-block fully-associative cache, with each block holding 4 words of data. Because it implements a write-back policy, each block needs a dirty bit as well as a valid bit. In the case where the arbiter's request misses in the cache and the replacement policy has chosen a dirty block, we need to write back that cache block as well as retrieve the requested data block. In order to reduce the stall time of the CPU, we also added a write buffer within the second-level cache.
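A sketch of one cache block under these parameters, assuming 32-bit addresses (with 16-byte blocks and full associativity, the tag is the upper 28 address bits):

  library ieee;
  use ieee.std_logic_1164.all;

  package l2_types is
    -- One block: 4 words (128 bits) of data, plus valid and dirty bits
    -- for the write-back policy, plus the tag.
    type cache_block is record
      valid : std_logic;
      dirty : std_logic;
      tag   : std_logic_vector(27 downto 0);   -- address(31 downto 4)
      data  : std_logic_vector(127 downto 0);  -- 4 words
    end record;

    -- 16 blocks, fully associative: every block is checked on a lookup.
    type cache_array is array (0 to 15) of cache_block;
  end package l2_types;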

[Block diagram: Arbiter <-> Level 2 Cache <-> Mem Ctrl]


Feature               Value
Cache type            Fully associative
Blocks                16
Block size            128 bits (4 words)
Write policy          Write-back
Replacement policy    Clock (has a counter which increments)
Extra features        Write buffer

Performance Analysis and Impact

We were having timing problems with the integration of the cache and the multiprocessor, so we were not able to collect total miss and hit data. Because we were not able to integrate it with the rest of the system, we could not see how much of an improvement it made in terms of decreasing the overall CPI. It should, theoretically, improve CPI and execution time.

Design Decisions

A lot of time was put into giving the second-level cache good debugging output. As a consequence, however, there was less time than was needed to debug the component and integrate it with the rest of the system.

What Works

The level-two cache currently works outside of the processors (i.e., connected to the Lab 7 Memory Controller, with the signals input manually). This test, which verifies the state transitions and component outputs, contains the following files:

sch\L2cache_test.1 (schematics 1-5)
cmd\L2cache_test.cmd - reads data from the DRAM into all 16 blocks of the level-2 cache
cmd\L2_dirtyhalf - writes to (and dirties) the 8 lower blocks in the cache
cmd\L2_dirtyhalf2 - dirties the upper half of the cache
cmd\L2_readner - tests the kickout stage (i.e., putting a block in the cache's write-buffer, and waiting to write to memory until the request's read has been serviced)
cmd\L2cache_test.out - output of the above simulation


State Diagram of Second-level cache


(3) Stream Buffer

The stream buffer is designed to have the same interface as the arbiter from the last lab. This makes our integration easier: we can just take out the arbiter and put in the stream buffer.

Designing the stream buffer is straightforward. There are three states to keep track of: IDLE, REQUEST and WAIT. A request that results in a miss puts the stream buffer in the REQUEST state. It then issues a request to the second-level cache, which in turn sends a wait signal to the stream buffer, moving it to the WAIT state. After the 4-word block is returned from the cache to the stream buffer, it returns to the IDLE state. (A minimal VHDL sketch follows.)
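A sketch of this three-state machine (signal names are illustrative; WAIT is a reserved word in VHDL, so the third state is spelled WAITING here):

  library ieee;
  use ieee.std_logic_1164.all;

  entity stream_buf_fsm is
    port (clk     : in  std_logic;
          miss    : in  std_logic;   -- request missed in the stream buffer
          l2_wait : in  std_logic;   -- wait signal from the second-level cache
          l2_req  : out std_logic);  -- request to the second-level cache
  end stream_buf_fsm;

  architecture behavioral of stream_buf_fsm is
    type state_type is (IDLE, REQUEST, WAITING);
    signal state : state_type := IDLE;
  begin
    l2_req <= '1' when state = REQUEST else '0';

    process (clk)
    begin
      if falling_edge(clk) then
        case state is
          when IDLE =>      -- wait for a missed request
            if miss = '1' then state <= REQUEST; end if;
          when REQUEST =>   -- second-level cache answers with its wait signal
            if l2_wait = '1' then state <= WAITING; end if;
          when WAITING =>   -- 4-word block has been returned
            if l2_wait = '0' then state <= IDLE; end if;
        end case;
      end if;
    end process;
  end behavioral;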

State Diagram of Stream buffer


(4) Write Buffer

The write buffer sits between the first-level cache and the arbiter. When a 2-word block is kicked out of the 1st-level cache, it is first placed in the write buffer before being written back to memory. This temporary storage is beneficial because the processor can begin executing other instructions while the memory controller takes care of writing the contents of the write buffer to memory. In this way, the pipeline doesn't need to be stalled and can do more useful work.

The write buffer has a complicated state machine, which took up most of our time in this lab. It has a total of eight states and a lot of inputs to keep track of. Refer to the state diagram for the inputs that trigger the transitions, and for how they affect the inputs to the memory controller depending on whether a read or write request should be serviced.

State Diagram of Write Buffer


b) Extra Features

Synchronization Mechanisms for the Multiprocessor

We have 16 special synchronization variables, mapped to addresses [-16, -1].

We build test&set on top of these synch variables. If one CPU wants to access shared data, it needs to grab a lock first. Grabbing a lock is designed as a test&set done with a load-word operation to one of the special addresses (-16 to -1). Once a CPU finishes its job, it can release the lock by storing 0 to the synchronization variable.

Pipeline control has already decoded the instruction and saved it in the pipeline register. We need to do further decoding on the address, since addresses (-16 to -1) have a completely different meaning: they are used as synchronization variables. They are not physically saved in DRAM (if they were, grabbing a lock would be too slow).

The 16 one-bit registers live inside the synchronization unit. Every lw/sw instruction goes through the synchronization unit, which checks whether it touches a synchronization variable. If so, the synch unit does the test&set (lw) or free-lock (sw) operation. Otherwise it does nothing at all and simply passes the request on to the memory access units (caches and DRAM).

If it is a test&set operation, the synch unit decodes the address lines and selects a single synch cell to handle the request. Like an SRAM unit, it decodes the address, selects the data from a cell, and then selects the indexed data to output. Here we do the same thing: after each synch cell returns its lock-reply information, we select the final result as the output.


How does a synch cell handle a lock request?

Here is a truth table of all possible inputs to a synch cell:

We define an lw operation as 0 and an sw operation as 1. opA is the operation requested from the EXE stage of CPU A; opB is the operation requested from the EXE stage of CPU B. enA means A is enabled, i.e., the request from CPU A is valid (a synch request).

The input value for the single cell is calculated from the opcode and enable signals (opA, opB, enA, enB).

(x means don't care; v means the value currently stored in the synch cell.)

For example, suppose CPU A and CPU B want to grab the lock at the same time. Who should get the lock? The synch cell handles it based on the table above; in this case, the input is (opA, opB, enA, enB) = (0,0,1,1), as marked by the arrow in the table.


The both-request-a-lock input, (0,0,1,1), feeds a small state machine that toggles between the CPUs, selecting one and giving it the lock. (This method gives a kind of fairness between the two CPUs.) A minimal sketch of such a cell follows.
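This sketch uses the signal names from the legend above, with hypothetical got_a/got_b grant outputs (our actual cell's interface may differ); the toggle implements the fairness just described:

  library ieee;
  use ieee.std_logic_1164.all;

  entity synch_cell is
    port (clk          : in  std_logic;
          op_a, op_b   : in  std_logic;   -- '0' = lw (test&set), '1' = sw (release)
          en_a, en_b   : in  std_logic;   -- request valid from each CPU
          got_a, got_b : out std_logic);  -- '1' = that CPU got the lock
  end synch_cell;

  architecture behavioral of synch_cell is
    signal lock   : std_logic := '0';  -- the one-bit lock value v
    signal toggle : std_logic := '0';  -- tie-breaker between the CPUs
  begin
    process (clk)
    begin
      if falling_edge(clk) then
        got_a <= '0';
        got_b <= '0';
        if (en_a = '1' and op_a = '1') or (en_b = '1' and op_b = '1') then
          lock <= '0';                   -- sw releases: store 0 into the cell
        elsif en_a = '1' and op_a = '0' and en_b = '1' and op_b = '0' then
          if lock = '0' then             -- both test&set at once: toggle picks
            lock <= '1';
            if toggle = '0' then got_a <= '1'; else got_b <= '1'; end if;
            toggle <= not toggle;
          end if;
        elsif en_a = '1' and op_a = '0' then
          if lock = '0' then lock <= '1'; got_a <= '1'; end if;
        elsif en_b = '1' and op_b = '0' then
          if lock = '0' then lock <= '1'; got_b <= '1'; end if;
        end if;
      end if;
    end process;
  end behavioral;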

DRAM Controller

In lab 6 we used a burst-mode DRAM controller that gets two words each time. For lab 7, we designed a memory controller that handles different cycle-time lengths. One trick we played is passing the cycle time as a generic argument to the memory controller units, the same method we use to set up a delay time for each part. We then interleave four DRAMs together to get the best bandwidth (128 bits, 4 words), since we have the second-level cache (4-word cache line), the write buffer (4 words), and the stream buffer (4 words).
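A minimal sketch of the generic trick (the entity and port names are illustrative):

  library ieee;
  use ieee.std_logic_1164.all;

  entity dram_ctrl is
    generic (cycle_time : time := 60 ns);  -- DRAM cycle time, set per instance
    port (clk, request : in  std_logic;
          wait_out     : out std_logic);
  end dram_ctrl;

Each of the four interleaved instances can then be given its own timing without changing the VHDL body, e.g.:

  slow_dram : entity work.dram_ctrl
    generic map (cycle_time => 100 ns)
    port map (clk => clk, request => req, wait_out => w);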

3) Performance Summary

a) Critical Path

In lab 6, our longest path was through the memory components, since a memory request propagated through all the middle layers of memory until it reached the DRAM; then the DRAM's wait signal was propagated all the way back to our datapath. If we had modified our memory components so that each component asserted a wait signal upon receiving a request signal, this path time would have decreased significantly, and the ALU would instead have been our critical path for Lab 6.

i) For lab 7, since memory access took up so much of our processing time, we decided to concentrate on the memory instead of the ALU time. Because of this, our critical path is now through the ALU in the Execution Stage. Our second-longest path is in the Memory Access Stage, on a cache miss: we have implemented so many levels of arbitration and caches that a signal going from the first-level cache to the DRAM is delayed at each level. Our third-longest path is the Decode Stage, since that was one of our critical paths in a previous lab and we have not done anything to speed up this stage.


ii) Latencies:
1) Hit in first-level cache: 13.7 ns
2) Miss in first-level cache, hit in second-level cache: 19.7 ns
3) Miss in both first- and second-level caches: 119.7 ns
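As a worked example of what these latencies imply for the average memory access time: taking the instruction-read miss rate of roughly 22% from the mp_single statistics below, and assuming (since we could not fully integrate the second-level cache) an illustrative second-level miss rate of 30%:

  AMAT = (1 - m1) * 13.7 + m1 * [ (1 - m2) * 19.7 + m2 * 119.7 ]
       = 0.78 * 13.7 + 0.22 * (0.70 * 19.7 + 0.30 * 119.7)
       ≈ 21.6 ns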

b) Performance Analysis

We have built a number of monitors to measure the number of stalls, the hits, and the total number of requests. After running the benchmark program final.s, we obtained the following measurements for our lab 6 processor with write-through cache and for our lab 7 multiprocessor with write-back cache.

datapath (Lab 6 processor with write-through policy):
Cycle: 28809  Code Stalls: 14639  Data Stalls: 7299  Hazard Stalls: 1033
Ram requests: 17219/28809 = 59.8%
Instructions: Read Hits: 5407/6871  Write Hits: 0/0
Data: Read Hits: 1188/1500  Write Hits: 0/502
total requests (instructions) = 5838, read hits = 4374, write hits = 0
# of lw & sw instructions = 1500, read hits = 1188, write hits = 0
Read Miss = 1776, Write Miss = 502

mp_single (write-back policy):
Cycle: 25271  Code Stalls: 12681  Data Stalls: 5719  Hazard Stalls: 1033
CPU::B cycle 25271 => data hazard...addu r0, r1, r0  Write: 0x00000000 At: R0
Instructions: Read Hits: 4560/5838  Write Hits: 0/0
Data: Read Hits: 1214/1500  Write Hits: 373/502
Ram requests: 16328/25271 = 64.61%
total requests (instructions) = 5838, read hits = 4560, write hits = 0
# of lw & sw instructions = 1500, read hits = 1214, write hits = 373/502
Read Miss = 1564, Write Miss = 129

mp_gfg (Lab 7 multiprocessor with write-back policy):
CPUA::: Cycle: 11816  Code Stalls: 3029  Data Stalls: 2383
CPUB::: Cycle: 11816  Code Stalls: 2941  Data Stalls: 4569
Ram requests: 8288/11816 = 70.1%
total # of misses < with one processor
total requests (instructions) = 9068, read hits = 8619, write hits = 0
# of lw & sw instructions = 2651, read hits = 2363, write hits = 375/504
Read Miss = 737, Write Miss = 129


Statistics

1. Measure of misses: The multiprocessor, with shared caches, helps a lot with reducing cache misses. We see about a 50% reduction in read misses, since the two processors share code and can find instructions in each other's cache. The number of write misses stays the same, since the processors don't share the same set of data.


2. Measure of stalls: One interesting observation is that the multiprocessor has more data stalls than code stalls. This is reasonable: the two processors share code (some subroutines) and can find instructions in each other's cache; however, they manipulate different sets of data, which may not be in common, hence resulting in more data stalls.


3. Measure of RAM Usage:

It is also interesting to look at the RAM usage of the multiprocessor. As expected, it goes up, because two processors, instead of one, are trying to access the memory.


4) Testing Philosophy

We used a bottom-up approach and tested piece by piece. We tried to keep the individual components with the same interfaces so as to make our integration easier. We also built a number of monitors for error checking.

Stream Buffer

To test the stream buffer, we tie the first-level cache to one of the stream buffer's interfaces. Since the second-level cache is not yet ready, we tie two memories to the stream buffer to provide the 128-bit interface. A command file is written which just keeps requesting instructions by assigning different values to the address vector. We load the mystery program from lab 5 into the memory so that we can check whether we are reading out the right instructions.

Second-level Cache

As we have always done, we test our design individually before hooking it up with other parts of the design. We've also specifically added a lot of debug signals to make the debugging process easier, and we've written a cache-dump command that displays the contents of the cache.

Write Buffer

The write buffer is tied to the 1st-level cache and the DRAM to make sure it is working. A command file is written to trigger the writing of cache entries to the write buffer and finally to memory. Things get a little tricky because, in the command file, we have to be careful to actually create scenarios where write hits are followed by read and write misses, causing data to be written back.

Multiprocessor

We ran the design first in single-processor mode with the lab 5 mystery program. Our improved design, with optimized arbitration and a write-back LRU cache, saves about 15% of the accesses in this lab over the original lab 6 processor.

We discovered a bug in the wait arbitration unit: it always gave signal "A" priority over signal "B", even in the middle of one of B's requests.


We updated the design to have two states, one where A has priority and one where B has priority, so that B won't get interrupted. These states change after a CPU's request to the DRAM has gone low, so while a cache is accessing the DRAM its request signal should always stay high, even when doing a series of requests like a write-back followed by a read. (A sketch of this two-state arbiter follows.)
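A minimal sketch of the fixed arbiter (the req/grant names are illustrative); priority flips only once the favored side's request has gone low, so an in-progress series of requests is never interrupted:

  library ieee;
  use ieee.std_logic_1164.all;

  entity wait_arbiter is
    port (clk              : in  std_logic;
          req_a, req_b     : in  std_logic;
          grant_a, grant_b : out std_logic);
  end wait_arbiter;

  architecture behavioral of wait_arbiter is
    signal a_has_priority : std_logic := '1';
  begin
    -- Combinational grants: the favored side wins a tie.
    grant_a <= req_a when (a_has_priority = '1' or req_b = '0') else '0';
    grant_b <= req_b when (a_has_priority = '0' or req_a = '0') else '0';

    process (clk)
    begin
      if falling_edge(clk) then
        -- Flip priority only after the favored CPU's request goes low, so a
        -- request held high across a write-back and a read keeps ownership.
        if a_has_priority = '1' and req_a = '0' then
          a_has_priority <= '0';
        elsif a_has_priority = '0' and req_b = '0' then
          a_has_priority <= '1';
        end if;
      end if;
    end process;
  end behavioral;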

We should have a monitor assert that a cache can never be interrupted by a SNOOP

WAIT signal while accessing the DRAM.

One problem that we noticed is that the CPU stalls one more cycle than it needs to on a

miss. Although the data is ready on cycle N, the CPU waits for it to get written into the

CACHE and doesn't read it until cycle N+1.

We finished unit testing of the multiprocessor cache, then made another copy of the datapath to simulate the multiprocessor CPU in single-processor mode.

The design "mp_single" successfully runs "lab5_mystery.s" without problems.

This is good: it means we didn't break anything. Results of running the multiprocessor on "final_1" from last year are the same as when running it on the lab 6 design, which is expected to work. While this test does not stress data sharing, does no synchronization (via our special addresses), and is not a good test of a multiprocessor, it helps us to be confident that our design is working OK. The synchronization was tested heavily by Sheng before integration into the datapath.

To make the debugging process easier, we have written a command file which dumps the cache entries to memory. In this way, we can tell whether the result differs from the expected output just by dumping the memory contents. This is helpful because, with a write-back cache, we don't write everything back to memory.