May 8, 2015 12:42 am    EE457 Final - Spring 2015    1 / 9    © Copyright 2015 Gandhi Puvvada

EE457 Final (~30%)

Closed-book Closed-notes Exam; No cheat sheets; Calculators are allowed.
Verilog Guides are not needed and are not allowed.
Smart phones, tablets (and any kind of computing/internet devices) are not allowed.

This is a Crowdmark exam. Please do not write on margins or on the backside.

Spring 2015
Instructor: Gandhi Puvvada
Saturday, 5/9/2015
07:00 PM - 10:00 PM (3 Hour 00 min. = 180 min)
Location: SGM124

Viterbi School of Engineering
University of Southern California

Ques#  Topic                                    Page#        Time      Points
1      FIFO and ROB labs                        2-3          40 min.   75
2      Branch Prediction                        4            30 min.   44
3      CMP, CMT, Cache Coherency                5-6          40 min.   71
4      Exceptions, Lab 7, pipelining general    7            20 min.   37
5      Tomasulo                                 8            20 min.   33
6      Virtual Memory                           9            25 min.   43
Total                                           Cover +8=9   175 min.  303
Perfect Score                                                          290

Student’s Last Name: _______________________________________

Student’s First Name: _______________________________________

Student’s DEN Bb username: ______________________________ @usc.edu

1 ( 75 points) 40 min. The FIFO and the ROB labs

1.1 FIFO:

1.1.1 We can use _____________ (n-bit / (n+1)-bit) pointers either in the single-clock FIFO or in the 2-clock FIFO, whereas we can use _______________ (n-bit / (n+1)-bit) pointers only in the _______________________ (single-clock / 2-clock) FIFO. Explain.
________________________________________________________________________________
________________________________________________________________________________

1.1.2 For a 32-location deep FIFO, since the real depth can vary from completely empty to completely full, we have ____________ (31 / 32 / 33) depth values.
The depth expression [(WP-RP) mod 32] (where WP and RP are 5-bit pointers) produces ____________ (adequate / inadequate / more than adequate) number of values because
________________________________________________________________________________
The depth expression __________________________ { [(WP-RP) mod 32] / [(WP-RP) mod 64] } (where WP and RP are 6-bit pointers) produces ____________ (adequate / inadequate / more than adequate) number of values because
________________________________________________________________________________
Show 6-bit example values (in binary) for WP and RP to illustrate your choice of mod 32 or mod 64 above and explain.
________________________________________________________________________________
________________________________________________________________________________

1.1.3 In a general application of a FIFO to delink the Producer and the Consumer, if the FIFO is a little smaller than the desired value, then the _____________ (producer / consumer / both) end(s) up waiting often when the FIFO becomes _____________ (full / empty). ____________________ (Similarly / However) in the case of our FIFO lab part 2 (trying to arrange the even and odd numbers alternately), if the FIFO is a little smaller than desired _______________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

1.1.4 A new EE457 TA, Mr. Bruin, simulated the design submitted by the student, Mr. Trojan, and captured the 6-bit RPSS (RP double-synchronized to Wclk) activity in a ModelSim waveform along with the 6-bit WP and the Wclk, and displayed the values of WP and RPSS in decimal as shown in the waveform figure. Since the RPSS was jumping as shown in that figure, he concluded that Mr. Trojan’s design is bad. Please advise Mr. Bruin.
________________________________________________________________________________
________________________________________________________________________________

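Points: 1.1.1: 6 pts; 1.1.2: 12 pts; 1.1.3: 12 pts; 1.1.4: 8 pts

[Figure for 1.1.4: ModelSim waveform with rows Wclk, RPSS, and WP; decimal values as extracted: "5922 23" and "195"]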
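Study aid (not part of the original exam): the depth expressions in 1.1.2 can be explored with a minimal Python sketch. The function names `fifo_depth` and `distinct_depths` are hypothetical, and the sketch assumes plain binary pointers that wrap modulo 2^(pointer width). It only counts how many distinct values each expression can report for a 32-location FIFO.

```python
def fifo_depth(wp, rp, ptr_bits):
    """Occupancy reported by the expression (WP - RP) mod 2**ptr_bits."""
    return (wp - rp) % (1 << ptr_bits)


def distinct_depths(ptr_bits, fifo_size):
    """Walk the FIFO through every occupancy from empty to full, for every
    read-pointer position, and collect the depth values that are reported."""
    modulus = 1 << ptr_bits
    seen = set()
    for rp in range(modulus):
        for occupancy in range(fifo_size + 1):
            wp = (rp + occupancy) % modulus  # pointer wraps like a counter
            seen.add(fifo_depth(wp, rp, ptr_bits))
    return seen


five_bit = distinct_depths(ptr_bits=5, fifo_size=32)
six_bit = distinct_depths(ptr_bits=6, fifo_size=32)
print(len(five_bit), len(six_bit))

# With 6-bit pointers a full FIFO is distinguishable from an empty one:
print(fifo_depth(0b100000, 0b000000, 6), fifo_depth(0b000000, 0b000000, 6))
```

Under these assumptions, comparing the two counts with the number of real depth values (empty through full) shows which pointer width leaves one occupancy indistinguishable from another.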

1.2 ROB:

1.2.1 ____________ (Mr. Trojan / Mr. Bruin) says that a ROB with 3 locations is too shallow (is not deep enough) for this design. Explain:
________________________________________________________________________________
The rob_full signal ________ (will / will not) stall the dispatch.
In our lab (in the test bench OoE_Divider_tb.v), we have 32 dividend-divisor pairs. So student A thought that a 32-deep ROB can handle the worst test data. His lab partner, student B, said, "but we do not have 32 single-dividers!" Do you agree with student A or B? _______. Explain:
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________

1.2.2 Irrespective of the test data, the order of dispatch is the same. T / F
Irrespective of the test data, the order of graduation is the same. T / F
Irrespective of the test data, the order of issue by the issue unit is the same. T / F
In our design, the ROB is 8 locations deep and the test data contains 32 dividend-divisor pairs.
Irrespective of the test data, the WP goes from 0 to 7, 4 times (0-7, 0-7, 0-7, 0-7, 0), and finally comes back to zero. T / F
Irrespective of the test data, the RP goes from 0 to 7, 4 times (0-7, 0-7, 0-7, 0-7, 0), and finally comes back to zero. T / F

1.2.3 In the ROB diagrams on the side, indicate the populated locations by shading ( ) them. If you have shaded more than 4 locations, explain how it is possible as there are only 4 single-dividers. ______________________________________________________________________________________________________________________________________

1.2.4 In our design, the WP and RP are ______ bits each and the rob_tag is ____ bits long. The ROB has ____ (0/1/2/3) Read-only ports, ____ (0/1/2/3) Write-only ports, and ____ (0/1/2/3) Read-Write ports.

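Points: 1.2.1: 12 pts; 1.2.2: 7.5+2.5 pts; 1.2.3: 9 pts; 1.2.4: 5+1 pts
bonus points = 4/10

[Figure for 1.2.3: two 8-location ROB diagrams, locations numbered 0 to 7, each with WP and RP pointer markers]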

2 ( 44 points) 30 min. Branch Prediction

2.1 _________ (Early / Late) branch ___________ (is likely to / will) cause more dependency stalls.
_________ (Early / Late) branch ___________ (is likely to / will) cause more branch penalty.
Branch penalty refers to the clocks lost due to __________________ (flushing by / stalling due to / forwarding to) _________________________ (a taken branch / an untaken branch / any branch).
If you are changing your earlier late branch design to an early branch design, it is _____________ (necessary / desirable) that you tell the compiler designer about this change so that
________________________________________________________________________________
________________________________________________________________________________

2.2 Branch direction prediction becomes more important in __________ (deeper / shallow) pipelines.
Branch direction prediction becomes more important in __________ (out-of-order / in-order) executing pipelines.

2.3 "Predict the future based on the past" for target address prediction in the IF stage works well with _________________________ (A only / B only / A as well as B / neither A nor B).
A. conditional branches such as beq $1, $2, Target
B. jr $31 acting as a return instruction at the end of a subroutine.
Explain:
________________________________________________________________________________
________________________________________________________________________________

2.4 A 2-bit branch direction predictor tries to improve a specific common branch usage situation. Explain _________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

2.5 No aliasing if you are predicting from _______ (IF / ID) stage, but aliasing is OK if you are predicting from _______ (IF / ID) stage. Out of the two below, ______________ (A / B / A and B / neither A nor B) can cause a serious drop in performance.
A. Predicting a non-branch instruction such as an ADD instruction as a taken branch and correcting later
B. Applying the prediction information of one branch to another branch

2.6 BPB (Branch Prediction Buffer) with depth = 2^K needs K bits to index it. The K bits are correctly taken from the PC in the _________ (Left / Right)-side design. Explain:
________________________________________________________________________________
________________________________________________________________________________

2.7 Our Lab 6 design has no branch prediction. It is equivalent to predicting always ______ (taken/not-taken).

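Points: 2.1: 12 pts; 2.2: 4 pts; 2.3: 8 pts; 2.4: 6 pts; 2.5: 4.5+1.5 pts; 2.6: 6 pts; 2.7: 2 pts

[Figure for 2.6: Left design: PC fields K | 30-K | 00, with the K-bit field indexing a BPB of 2^K entries. Right design: PC fields 30-K | K | 00, with the K-bit field indexing a BPB of 2^K entries.]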
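Study aid (not part of the original exam): the behavior of a 2-bit direction predictor mentioned in 2.4 can be sketched in Python. This is a minimal illustration under stated assumptions: the class and function names are hypothetical, the 2-bit predictor is modeled as a saturating counter, and the outcome pattern (a branch taken three times, then not taken, repeated) is an invented example rather than data from the course.

```python
class OneBitPredictor:
    """Predicts whatever the branch did last time."""

    def __init__(self, taken=True):
        self.taken = taken

    def predict(self):
        return self.taken

    def update(self, taken):
        self.taken = taken


class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not-taken,
    states 2 and 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor):
    """Run the predictor over a sequence of actual outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# Hypothetical outcome pattern: taken, taken, taken, not-taken, three times over.
pattern = [True, True, True, False] * 3
one_bit_misses = count_mispredictions(pattern, OneBitPredictor(taken=True))
two_bit_misses = count_mispredictions(pattern, TwoBitPredictor(state=3))
print(one_bit_misses, two_bit_misses)
```

Comparing the two miss counts on this pattern gives a concrete starting point for the explanation the question asks for.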

3 ( 35 + 36 = 71 points) 40 min. CMP, CMT, Cache Coherency

3.1 In the two CMP (Chip Multi Processor) organizations shown below, the shared L2 cache is shown "banked" on the left but not on the right. Explain. ____________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________________

3.1.1 If there are 8 banks of L2 cache in the left side organization, is it true that there can be 8 copies of a block in the 8 banks besides 8 copies in the 8 L1 caches? True / Not true
Explain:
________________________________________________________________________________
________________________________________________________________________________

3.2 If 4 threads in a multi-threaded core keep the common resources (I-cache, D-cache, ALU and other execution hardware) busy (i.e. fully utilized) and if you have a lot of extra silicon left over, you will use the extra silicon to ________________________ (A / B / A or B / neither A nor B).
A. Increase the number of threads in that core
B. Leave the number of threads per core at 4, but consider having many such 4-threaded cores
Explain:
________________________________________________________________________________
________________________________________________________________________________

3.3 MPI (Miss rate per instruction) is _______________ (almost always/usually) ________ (higher/lower) for L1 cache compared to L2 cache because ____ (L1/L2) is a subset of ____ (L1/L2) and _____ (L1/L2) is much larger.

3.4 Switching between threads in a fine-grain multi-threaded core wastes 1 or more clocks per switch. T / F
Switching between threads because of a data cache miss in one thread wastes 1 or more clocks per switch. T / F

3.5 Non-blocking caches _____________ (reduce/increase) processor stalls due to cache misses. They buffer the misses and ___________ (continue/wait) to serve other access requests from _____________ (same thread/different threads). MSHR, standing for ________________________________, buffers the missed cache access until it is served. Non-blocking cache is helpful in _______________________________ (instruction cache only/data cache only/both). Non-blocking caches have two controllers, such as CCU and SCU, so that one attends to the new requests while the other attends to the misses. T / F

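Points: 3.1: 8 pts; 3.1.1: 6 pts; 3.2: 6 pts; 3.3: 2.5+2.5 pts; 3.4: 4 pts; 3.5: 6 pts

[Figure for 3.1: Left organization: P0, P1, ..., P7, each with an L1$, connected through a Memory Interconnection Network to a shared (banked) L2 cache. Right organization: P0, P1, ..., P7, each with an L1$, connected through a Bus to a shared L2 cache (no banks).]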

3.5.1 Consider the idea of using a non-blocking data cache in the following cases and select M (for makes sense) or DM (for does not make sense).
(a) Simple in-order instruction executing pipelines of our lab 6 M / DM
(b) Our Tomasulo part 1 (single-threaded IoI-OoE-OoC) M / DM
(c) Our Tomasulo part 2 (single-threaded IoI-OoE-IoC) M / DM
(d) Intel HTT (an example is our Tomasulo Part 2 with 2 threads) M / DM
Explain the cases, if any, that you marked as DM.
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________

3.6 In super-scalar machines with 2 pipes, you _________ (can / cannot) send two instructions which are inter-dependent because
________________________________________________________________________________
________________________________________________________________________________

3.7 MOESI state encoding: Fill-up the encoding table on the side for the 5 states.

3.8 State transitions in the diagrams below follow A.I. (All Inclusive) rules. Y / N
State transitions in the diagrams below follow M.E. (Mutually Exclusive) rules. Y / N
State transitions in the diagrams below take place in one clock. Y / N
State transitions in the diagrams below are for ____________________________ (Write-Through cache only / Write-Back cache only / Either / Neither).

3.9 The word "Flush" in the diagrams below means helping a neighbor L1 cache. It is _________________ (wrong/wasteful) to flush to the MM also.

Mark appropriate state transitions in the EE457 design with either R/FMM (meaning replacement causing flush to main memory) or R/-- (meaning replacement causing no flush to main memory).
The R/FMM or R/-- markings for the EE557 design would be identical to the EE457 markings. T / F
We ______________ (wish to / do not wish to) defer (postpone) updating the MM as far in time as possible.

Points: 3.5.1: 8 pts; 3.6: 6 pts; 3.7: 5+1 pts; 3.8: 6 pts; 3.9: 10 pts

[Table for 3.7: columns State, Property, Code; to be filled in for the 5 states]

[Figure for 3.8 and 3.9: two copies of the MOESI state-transition diagram, labeled EE557 and EE457, with states M, O, E, S, and I. Transition labels include PrRd/--, PrRd(S)/, PrWr/--, PrWr/BusUpgr, PrWr/, BusRd/--, BusRd/Flush, BusRdX/--, BusRdX/Flush, and BusUpgr/--.]

Figure 5.12 State-transition diagram of a MOESI protocol. Copyright 2012 Michel Dubois, Murali Annavaram and Per Stenström

4 ( 37 points) 20 min. Exceptions, pipelining general

4.1 Exceptions: Briefly explain:
8000_0180: ________________________________________________________________________
Cause: ____________________________________________________________________________
EPC: ______________________________________________________________________________

4.2 In Lab7 P3 Subpart 4, we produced two STALL signals as declared on the side. ____________________ (STALL / STALL_combinational) produces a waveform easier to understand. Is it necessary to declare STALL_combinational as a wire?____________________________________________________________________________________________________________________________________________________

4.3 In RTL coding of a pipeline, forwarding muxes output in an "if" statement in a clocked always block is sometimes assigned using the blocking procedural assignment operator and sometimes the non-blocking. Explain using the extract of Lab 7 Part 1 figure on the side.________________________________________________________________________________________________________________________________________________________________________________________________________________________

4.4 On reset, in all except the ______ (IF/ID/EX/M/WB) stage, we make sure to start with a bubble.
Bubble in _____ stage is achieved through setting or resetting the wrist-band (flush) FF, whereas the bubbles in the rest of the three stages with names _____________ are achieved through clearing the control signal FFs of the ___________________ (data stationary / control stationary) method of control.

4.5 Stalling ____________ (partial / full) pipeline requires injection of bubbles, whereas stalling ____________ (partial / full) pipeline does not require injection of bubbles. We wish to solve data dependencies by _____________ (stalling / forwarding) rather than by _____________ (stalling / forwarding). In Tomasulo part 2, though there is no explicit stall signal,
________________________________________________________________________________
________________________________________________________________________________

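Points: 4.1: 9 pts; 4.2: 6 pts; 4.3: 8 pts; 4.4: 4+2 pts; 4.5: 8 pts

Declarations for 4.2:
reg STALL;
wire STALL_combinational;

[Figure for 4.3: extract of the Lab 7 Part 1 figure: registers XD, YD, and ZD; 2-to-1 muxes X_Mux, Y_Mux, and Z1_mux with inputs 0 and 1 and select signals X_FORW1, Y_FORW1, and Z_FORW1; an adder in stage EX1 with inputs A and B and output S, labeled XD + YD]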

5 ( 33 points) 20 min. Tomasulo

5.1 The instructions in the backend in Tomasulo ___________ (IoI-OoE-OoC/IoI-OoE-IoC) are in a virtual queue as they carry __________________________ (unique tokens from the Tag FIFO/ROB_Tag) which is basically the _______ (WP/RP) value at the time of dispatch.

5.2 When a mispredicted branch comes on CDB in the IoI-OoE-IoC design, instructions on the wrong path in ROB are removed by moving ______ (WP / RP / either / neither) in ROB to the ROB tag of the mispredicted branch. This _____________ (effectively flushes / does not flush) the mispredicted branch itself. __________________________ (Correctly predicted branch / Mispredicted branch / both / neither) join(s) ROB from CDB. When a ______________________ (correctly predicted branch / mispredicted branch / both / neither) rises to the top of ROB,
________________________________________________________________________________
________________________________________________________________________________ (state what it does (or they do)).

5.3 Memory disambiguation rules are simpler in ___________ (IoI-OoE-OoC / IoI-OoE-IoC) as we do not have to worry about ______________ (RAW / WAR / WAW / multiple of these (list them)) in this design. Assuming that we are not going to implement bypass counters associated with entries in LSQ (in a variant of the IoI-OoE-IoC design), state the disambiguation rules to be observed by a lw and a sw. Of course, each has to wait until its source register(s) is/are available and the effective address is calculated. What other rules do they need to observe before leaving LSQ?
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________

5.4 ROB search in IoI-OoE-IoC is expensive. Elaborate quantitatively: _____________________________________________________________________________________________________________________________________________________________________________

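Points: 5.1: 3+2 pts; 5.2: 10 pts; 5.3: 12 pts; 5.4: 6 pts

[Figure: side-by-side diagrams labeled "IoI - OoE - OoC design" and "IoI - OoE - IoC design", including a block labeled ROB]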

6 ( 43 points) 25 min. Virtual Memory:

6.1 A 48-entry 3-way set-associative TLB with partial contents (in hex) is shown on the side. In this virtual memory system, 64KB pages are used. The processor is a byte-addressable processor with 32-bit logical addresses and 32-bit data. To make it easy on your eyes (and mine too), I made the content of invalid entries blank. Altogether, 12 entries are valid. Only the left-most entry in set #0 is complete: TAG = ABC and PPFN = 0120. The rest of them are identical and are incomplete: TAG = AB? and PPFN = 012?.

1. Divide the Virtual Address into VPN and Page offset fields. Divide the VPN further into Tag and Set fields.
[Figure: 32-bit Virtual Address bar labeled VA31 ... VA[1:0]]

2. Similarly, divide the Physical Address into PPFN and Page offset fields.
[Figure: 32-bit Physical Address bar labeled PA31 ... PA[1:0]]

3. You are asked to fill in, to the extent possible, ABC for the TAG and/or 0120 for the PPFN as long as you do not violate any rule. Also try to have as many VPNs as possible bear consecutive numbers (meaning in one contiguous virtual address space) and similarly try to have as many PPFNs as possible bear consecutive numbers (meaning in one contiguous physical memory space). Explain your process briefly and tell us how many VPNs you made consecutive and how many PPFNs you made consecutive.
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________
________________________________________________________________________________

4. If the above TLB is all empty to start with, translations for how many consecutive virtual pages can we hold in this TLB? ___________. If we are unlucky, how early can we get a conflict, requiring replacement in the TLB? _______ Explain both.
________________________________________________________________________________
________________________________________________________________________________

5. How many comparators of what size are used in this TLB?
________________________________________________________________________________

6. If we change the mapping to fully associative mapping without changing the total number of entries in the TLB, how many comparators of what size are needed?
________________________________________________________________________________

7. Does TLB mapping affect the organization of the Page Table? Y / N
Does Page Table organization affect the TLB organization? Y / N

Points: items 1 and 2: 2+2 pts; item 3: 18 pts; item 4: 6 pts; item 5: 6 pts; item 6: 6 pts
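Study aid (not part of the original exam): the address-field arithmetic behind items 1 and 2 of 6.1 can be sketched in Python using only the parameters stated in the question (64KB pages, a 48-entry 3-way TLB, 32-bit addresses). The function names and the sample address 0xABC01234 are hypothetical, and the sketch assumes the set index is taken from the low-order bits of the VPN.

```python
PAGE_SIZE = 64 * 1024      # 64KB pages, from the question
TLB_ENTRIES = 48           # from the question
WAYS = 3                   # 3-way set-associative, from the question
ADDRESS_BITS = 32          # 32-bit virtual and physical addresses

offset_bits = PAGE_SIZE.bit_length() - 1    # log2(page size)
num_sets = TLB_ENTRIES // WAYS              # entries per way
set_bits = num_sets.bit_length() - 1        # log2(number of sets)
vpn_bits = ADDRESS_BITS - offset_bits
tag_bits = vpn_bits - set_bits              # assumes set index = low VPN bits


def split_virtual_address(va):
    """Return (tag, set_index, page_offset) for a 32-bit virtual address."""
    offset = va & (PAGE_SIZE - 1)
    vpn = va >> offset_bits
    set_index = vpn & (num_sets - 1)
    tag = vpn >> set_bits
    return tag, set_index, offset


def physical_address(ppfn, offset):
    """Concatenate a physical page frame number with the page offset."""
    return (ppfn << offset_bits) | offset


print(offset_bits, set_bits, tag_bits)
tag, set_index, offset = split_virtual_address(0xABC01234)  # hypothetical address
print(hex(tag), set_index, hex(offset))
print(hex(physical_address(0x0120, offset)))
```

With these assumptions, the hypothetical address falls in set #0 with tag ABC, which is consistent with the complete entry described in the question.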

Item 7: 2+1 pts

We enjoyed teaching EE457. Hope you liked the course and learned the subject well. Hope to see many of you in the summer EE560. Best wishes!
Gandhi, TAs: Jizhe, Sanmukh, Mentors: Madhusudhan, Minnu, HW Graders: Bhuvana, Yixian, Lab graders: Jun, Arnav