computer organization and design Chap1-5

7/24/2019 computer organization and design Chap1-5

1/52

Part II: Solut ions Guide


2/52


3/52

Part II: Solutions Guide

53

1.33

d

1.34

g

1.35

c

1.36

g

1.37

d

1.38

c

1.39

j

1.40

b

1.41

f

1.42

h

1.43

a

1.44

a

1.45

Time for

Time for

1.46

As discussed in section 1.4, die costs rise very fast with increasing die area. Con-sider a wafer with a large number of defects. It is quite likely that if the die area is verysmall, some dies will escape with no defects. On the other hand, if the die area is verylarge, it might be likely that every die has one or more defects. In general, then, die areagreatly affects yield (as the equations on page 48 indicate), and so we would expect thatdies from wafer B would cost much more than dies from wafer A.

1.47

The die area of the Pentium processor in Figure 1.16 is 91 mm

2

and it containsabout 3.3 million transistors, or roughly 36,000 per square millimeter. If we assume theperiod has an area of roughly .1 mm

2

, it would contain 3500 transistors (this is certainlya very rough estimate). Similar calculations with regard to Figure 1.26 and the Intel4004 result in 191 transistors per square millimeter or roughly 19 transistors.

1.48

We can write Dies per wafer = f

((Die area)

1

) and Yield = f

((Die area)

2

) and thusCost per die = f

((Die area)

3

). More formally, we can write:

1.49

No solution provided.

1.50

From the caption in Figure 1.16 we have 198 dies at 100% yield. If the defectdensity is 1 per square centimeter, then the yield is approximated by 1/((1 + 1 .91/2)

2

)= .47. Thus 198

.47 = 93 dies with a cost of $1000/93 = $10.75 per die.

1.51

Defects per area.

1

2--- revolution 1

2---= rev 1

5400------------ minutes

rev----------------------- 60 ondssec

minute----------------------- 5.56 ms=

12--- revolution

12---

= rev 17200------------ minutes

rev----------------------- 60 ondssecminute----------------------- 4.17 ms

=

Cost per die Cost per waferDies per wafer yield----------------------------------------------------------

=

Dies per wafer Wafer areaDie area----------------------------- =

Yield 11 Defect per area Die area 2 + ( )2 -------------------------------------------------------------------------------------------- =


4/52

54

Instructors Manual for Computer Organization and Design

1.52

1.53

1.54


1.55


1.56


1980 Die area 0.16 Yield 0.48Defect density 17.04

1992 Die area 0.97 Yield 0.48Defect density 1.98

1992 +1980 Improvement 8.62

Yield 11 Defects per area Die area 2 + ( )2

----------------------------------------------------------------------------------------------

=


5/52


55

2.1

For program 1, M2 is 2.0 (10/5) times as fast as M1. For program 2, M1 is 1.33 (4/3)times as fast as M2.

2.2

Since we know the number of instructions executed and the time it took to executethe instructions, we can easily calculate the number of instructions per second whilerunning program 1 as (200

10

6

)/10 = 20

10

6

for M1 and (160

10

6

)/5 = 32

10

6

forM2.

2.3

We know that Cycles per instruction = Cycles per second / Instructions per sec-ond. For M1 we thus have a CPI of 200

10

6

cycles per second / 20

10

6

instructionsper second = 10 cycles per instruction. For M2 we have 300/32 = 9.4 cycles per instruc-tion.

2.4

We are given the number of cycles per second and the number of seconds, so wecan calculate the number of required cycles for each machine. If we divide this by the

CPI well get the number of instructions. For M1, we have 3 seconds

200

10

6

cy-cles/second = 600

10

6

cycles per program / 10 cycles per instruction = 60

10

6

in-structions per program. For M2, we have 4 seconds

300

10

6

cycles/second = 1200

10

6

cycles per program / 9.4 cycles per instruction = 127.7

10

6

instructions per pro-gram.

2.5

M2 is twice as fast as M1, but it does not cost twice as much. M2 is clearly the ma-chine to purchase.

2.6

If we multiply the cost by the execution time, we are multiplying two quantities,for each of which smaller numbers are preferred. For this reason, cost times executiontime is a good metric, and we would choose the machine with a smaller value. In theexample, we get $10,000

10 seconds = 100,000 for M1 vs. $15,000

5 seconds = 75,000for M2, and thus M2 is the better choice. If we used cost divided by execution time andassume we choose the machine with the larger value, then a machine with a ridiculous-ly high cost would be chosen. This makes no sense. If we choose the machine with thesmaller value, then a machine with a ridiculously high execution time would be cho-sen. This too makes no sense.

2.7

We would define cost-effectiveness as performance divided by cost. This is essen-tially (1/Execution time)

(1/Cost), and in both cases larger numbers are more cost-effective when we multiply.

2.8

We can use the method in Exercise 2.7, but the execution time is the sum of the twoexecution times.

So M1 is slightly more cost-effective, specically 1.04 times more.

2

Solutions

Executions per second per dollar for M1 113 10,000----------------------------

1130,000-------------------

= =

Executions per second per dollar for M2 19 15,000-------------------------

1135,000-------------------

= =


6/52

56


2.9

We do this problem by finding the amount of time that program 2 can be run in anhour and using that for executions per second, the throughput measure.

With performance measured by throughput for program 2, machine M2 is = 1.2

times faster than M1. The cost-effectiveness of the machines is to be measured in unitsof throughput on program 2 per dollar, so

Cost-effectiveness of M1 = = 0.053

Cost-effectiveness of M2 = = 0.043

Thus, M1 is more cost-effective than M2. (Machine costs are from Exercise 2.5.)

2.10

For M1 the peak performance will be achieved with a sequence on instructionsof class A, which have a CPI of 1. The peak performance is thus 500 MIPS.

For M2, a mixture of A and B instructions, both of which have a CPI of 2, will achievethe peak performance, which is 375 MIPS.

2.11

Lets find the CPI for each machine first. , and

. Using , we get

the following: , and

.

M2 has a smaller execution time and is thus faster by the inverse ratio of the executiontime or 250/200 = 1.25.

2.12

M1 would be as fast if the clock rate were 1.25 higher, so 500

1.25 = 625 MHz.

2.13

Note: There is an error in Exercise 2.13 on page 92 in the text. The table entry for row c,column 3 (CPI on M2) should be 3 instead of 8. This will be corrected in the first reprint of the book. With the corrected value of 3, this solution is valid.Using C1, the CPI on M1 = 5.8and the CPI on M2 = 3.2. Because M1 has a clock rate twice as fast as that of M2, M1 is1.10 times as fast. Using C2, the CPI on M1 = 6.4 and the CPI on M2 = 2.9. M2 is

Executions of P2 per hour3600secondshour--------------------- 200

secondsExecution of P1-----------------------------------------

seconds

Execution of P2-----------------------------------------

------------------------------------------------------------------------------------------------

=

Executions of P2 per hour on M13600secon shour--------------------- 200 10

3-------------------------------------------------------------1600

3------------533= = =

Executions of P2 per hour on M23600secondshour--------------------- 200 5

4----------------------------------------------------------2600

4------------650= = =

650533---------

53310,000----------------

65015,000----------------

CPI for M1 1 2 3 4 + + + 4-------------------------------

2.5= =

CPI for M2 2 2 4 4 + + + 4-------------------------------

3.0= = CPU time Instruction count CPIClock rate-------------------------------------------------------------

=

CPU time for M1 Instruction count 2.5

500 MHz-----------------------------------------------------------Instruction count

200 million--------------------------------------------- = =

CPU time for M2 Instruction count 3750 MHz------------------------------------------------------- Instruction count250 million

---------------------------------------------

= =


7/52


57

(6.4/2)/2.9 = 1.10 times as fast. Using a third-party product, CPI on M1 = 5.4 and onM2 = 2.8. The third-party compiler is the superior product regardless of machine pur-chase. M1 is the machine to purchase using the third-party compiler, as it will be 1.04times faster for typical programs.

2.14

Let I = number of instructions in program and C = number of cycles in program.

The six subsets are {clock rate, C} {cycle time, C} {MIPS, I} {CPI, C, MIPS} {CPI, I, clockrate} {CPI, I, cycle time}. Note that in every case each subset has to have at least one rate{CPI, clock rate, cycle time, MIPS} and one absolute {C, I}.

2.15

. Lets find the CPI for MFP first:

; of course, the CPI for

MNFP is simply 2. So and

.

2.16

2.17

. So execution time is = 1.08 seconds, and execu-

tion time on MNFP is = 5.52 seconds.

2.18

CPI for Mbase = 2

0.4 + 3

0.25 + 3

0.25 + 5

0.1 = 2.8

CPI for Mopt = 2

0.4 + 2

0.25 + 3

0.25 + 4

0.1 = 2.45

2.19

MIPS for Mbase= 500/2.8 = 179. MIPS for Mopt = 600/2.45 = 245.

2.20

Since its the same architecture, we can compare the native MIPS ratings. Moptis faster by the ratio 245/179 = 1.4.

2.21

This problem can be done in one of two ways. Either find the new mix and adjustthe frequencies first or find the new (relative) instruction count and divide the CPI bythat. We use the latter.

.

So we can calculate CPI as

2.22

How must faster is Mcomp than Mbase?

Instruction class Frequency on MFP Count on MFP in millions Count on MNFP in millions

Floating point multiply 10% 30 900Floating point add 15% 45 900Floating point divide 5% 15 750Integer instructions 70% 210 210

Totals 100% 300 2760

MIPS Clock rateCPI 10

6

---------------------------=

CPI for MFP 0.1 6 0.15 4 0.05 20 0.7 2++ 3.6= =

MIPS for MFP 1000CPI------------278= =

MIPS for MNFP 1000CPI------------500= =

Execution time IC 10 6

MIPS---------------------= 300

278---------

2760500------------

Ratio of instructions 0.9 0.4 0.9 0.25 0.85 0.25 0.1 0.95+++ 0.81= =

CPI 2 0.4 0.9 3 0.25 0.9 3 0.25 0.85 5 0.1 0.95+++0.81---------------------------------------------------------------------------------------------------------------------------------------------------------

3.1= =

CPU time Mbase Clock rateIC CPI

--------------------------- Clock rateIC 2.8

---------------------------= =

CPU time Mcomp Clock rateIC 0.81

3.1

------------------------------------ Clock rateIC 2.5

---------------------------= =


8/52

58


So then

2.23

The CPI is different from either Mbase or Mcomp; find that first:

2.24

First, compute the performance growth after 6 and 8 months. After 6 months =1.034

6

= 1.22. After 8 months = 1.034

8

= 1.31. The best choice would be to implementeither Mboth or Mopt.

2.25


2.26

Total execution time of computer A is 1001 seconds; computer B, 110 seconds;computer C, 40 seconds. Computer C is fastest. Its 25 times faster than computer A and2.75 times faster than computer B.

2.27

We can just take the GM of the execution times and use the inverse.

, , and , so C isfastest.

2.28

A, B: B has the same performance as A. If we run program 2 once, how manytimes should we run program 1: , or x = 100. So the mix is 99%program 1, 1% program 2.

B, C: C is faster by the ratio of . Program 2 is run once, so we have

, x = 3.1 times. So the mix is 76% program 1 and 24% pro-gram 2.

A, C: C is also faster by 1.6 here. We use the same equation, but with the proper times:

x

+ 1000 = 1.6

(20

x

+ 20), x = 31.2. So the mix is 97% program 1 and 3% program 2.Note that the mix is very different in each case!

2.29

So B is fastest; it is 1.10 times faster than C and 5.0 times faster than A. For an equalnumber of executions of the programs, the ratio of total execution times A:B:C is1001:110:40, thus C is 2.75 times faster than B and 25 times faster than A.

Program Weight Comput er A Comput er B Comput er C

Program 1 (seconds) 10 1 10 20Program 2 (seconds) 1 1000 100 20Weighted AM 9.18 18.2 20

Performance MbothPerformance Mbase-----------------------------------------------------

CPU time MbaseCPU time Mboth---------------------------------------------

Clock rateIC 2.8

---------------------------

Clock rateIC 2.5

------------------------------------------------------ 2.82.5-------

1.12 = = = =

Mboth CPI 2 0.4 0.9

2 0.25 0.9

3 0.25 0.85

4 0.1 0.95

+

+

+

0.81---------------------------------------------------------------------------------------------------------------------------------------------------------

2.7= =

Performance MbothPerformance Mbase-----------------------------------------------------

CPU time MbaseCPU time Mboth---------------------------------------------

Clock rateIC 2.8

---------------------------

Clock rateIC 2.2

------------------------------------------------------ 2.8 600 MHz

2.2 500 MHz------------------------------------

1.5

= = = =

GM(A) 1000 32= = GM(B) 1000 32= = GM(C) 400 20= =

x 1000+ 10x 100+=

3220------

1.6=

10x 100+ 1.6 20x 20+( )=


9/52


59

2.30

Equal time on machine A:

This makes A the fastest.

Now with equal time on machine B:

Machine B is the fastest.

Comparing them to unweighted numbers, we notice that this weighting alwaysmakes the base machine fastest, and machine C second. The unweighted mean makesmachine C fastest (and is equivalent to equal time weighting on C).

2.31

Assume 100 instructions, then the number of cycles will be 90

4 + 10

12 = 480cycles. Of these, 120 are spent doing multiplication, and thus 25% of the time is spentdoing multiplication.

2.32

Unmodified for 100 instructions we are using 480 cycles, and if we improve multipli-cation it will only take 420 cycles. But the improvement increases the cycle time by 20%.Thus we should not perform the improvement as the original is 1.2(420)/480 = 1.05times faster than the improvement!

2.33


2.34


2.35


2.36


2.37


2.38

2.39

The harmonic mean of a set of rates,

where AM is the arithmetic mean of the corresponding execution times.


Program 1 (seconds) 1 1 10 20Program 2 (seconds) 1/ 1000 1000 100 20Weighted AM 2 10.1 20


Program 1 (seconds) 1 1 10 20Program 2 (seconds) 1/ 10 1000 100 20Weighted AM 91.8 18.2 20

Program Comput er A Comput er B Comput er C

1 10 1 0.52 0.1 1 5

HM n

1Rate i-------------

i = 1

n

!

---------------------- = n

Time

i

i = 1

n

!

----------------------- = 1

Time

ii = 1!

n

-----------------------

------------------------ = 1

1 n--- Time i

i = 1

n

!

---------------------------=

1AM----------

=


10/52

60


2.40


2.41


2.42


2.43


2.44

Using Amdahls law (or just common sense) we can determine the following:

Speedup if we improve only multiplication = 100 / (30 + 50 + 20/4) = 100/85 =1.18.

Speedup if we only improve memory access = 100 / (100 (50 50/2)) = 100/75= 1.33.

Speedup if both improvements are made = 100 / (30 + 50/2 + 20/4) = 100/60 =1.67.

2.45

The problem is solved algebraically and results in the equation

100/(Y + (100 X Y) + X/4) = 100/(X + (100 X Y) + Y/2)

where X = multiplication percentage and Y = memory percentage. Solving, we getmemory percentage = 1.5

multiplication percentage. Many examples thus exist, e.g.,multiplication = 20%, memory = 30%, other = 50%, or multiplication = 30%, memory =45%, other = 25%, etc.

2.46

Rewrite the execution time equation:

Speed-up Execution time before improvementExecution time after improvement-----------------------------------------------------------------------------------------------

=

Execution time after improvement Execution time affected by improvementAmount of improvement------------------------------------------------------------------------------------------------------------

Execution time unaffected+=

= Execution time affecte Amount of improvement Execution time unaffecte + Amount of improvement--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


11/52


61

Rewrite execution time affected by improvement as execution time before improve-ment x f

, where f is the fraction affected. Similarly execution time unaffected.

The denominator has two terms: the fraction improved (

f

) divided by the amount of the improvement and, the fraction unimproved (1

f

).

= Execution time before improvement f

Amount of improvement--------------------------------------------------------------------------------------------------------- Execution time before improvement 1 f ( )+

= Execution time before improvement f Amount of improvement--------------------------------------------------------------------------------------------------------- Execution time before improvement 1 f ( )+

= Execution time before improvement f Amount of improvement--------------------------------------------------------------------------------------------------------- Execution time before improvement 1 f ( )+

= f Amount of improvement------------------------------------------------------------------1 f

( )

+ " #

$ % Execution time before improvement

Speedup Execution time before improvement f

Amount of improvement------------------------------------------------------------------ 1

f

( )

+

" #$ %

Execution time before improvement

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

=

Speedup 1 f

Amount of improvement------------------------------------------------------------------1

f

( )

+

" #$ %

-----------------------------------------------------------------------------------------------

=


12/52

62


3.1

The program computes the sum of odd numbers up to the largest odd numbersmaller than or equal to n

, e.g., 1 + 3 + 5 + ... + n

(or n

1 if n

is even). There are manyalternative ways to express this summation. For example, an equally valid answer isthat the program calculates (ceiling(

n

/2))

2

.

3.2

The code determines the most frequent word appearing in the array and returns itin $v1

and its multiplicity in $v0

.

3.3

Ignoring the four instructions before the loops, we see that the outer loop (whichiterates 5000 times) has four instructions before the inner loop and six after in the worstcase. The cycles needed to execute these are 1 + 2 + 1 + 1 = 5 and 1 + 2 + 1 + 1 + 1 + 2 =8, for a total of 13 cycles per iteration, or 5000 13 for the outer loop. The inner looprequires 1 + 2 + 2 + 1 + 1 + 2 = 9 cycles per iteration and it repeats 5000 5000 times, fora total of 9 5000 500 cycles. The overall execution time is thus approximately (5000

13 + 9 5000 5000) / (500 10

6

) = .45 sec. Note that the execution time for the innerloop is really the only code of significance.

3.4 addi $t0,$t1,100 # register $t0 = $t1 + 100 3.5

The base address of x

, in binary, is 0000 0000 0011 1101 0000 1001 0000 0000, which

implies that we must use lui

:

lui $t1, 0000 0000 0011 1101ori $t1, $t1, 0000 1001 0000 0000lw $t2, 44($t1)add $t2, $t2, $t0sw $t2, 40($t1)

3.6

addi $v0,$zero,1 # Initialize to avoid counting zero word

loop: lw $v1,0($a0) # Read next word from sourceaddi $v0,$v0,1 # Increment count words copiedsw $v1,0($a1) # Write to destinationaddi $a0,$a0,4 # Advance pointer to next sourceaddi $a1,$a1,4 # Advance pointer to next destbne $v1,$zero,loop# Loop if the word copied &

zero

Bugs:

1. Count (

$v0

) is not initialized.

2. Zero word is counted. (1 and 2 xed by initializing $v0 to 1).

3. Source pointer ( $a0 ) incremented by 1, not 4.

4. Destination pointer ( $a1 ) incremented by 1, not 4.

3.7

3 Solutions

Inst ruct ion Format op rs r t immediat e

lw $v1,0($a0) I 35 4 3 0addi $v0,$v0,1 I 8 2 2 1sw $v1,0($a1) I 43 5 3 0addi $a0,$a0,1 I 8 4 4 1addi $a1,$a1,1 I 8 5 5 1bne $v1,$zero,loop I 5 3 0 20


13/52

Part II: Solutions Guide 63

3.8 count = 1;do {

temp = *source;count = count + 1;*destination = temp;source = source + 1;destination = destination + 1;

} while (temp != 0);

3.9 The C loop is

while (save[i] == k)i = i + j;

with i , j , and k corresponding to registers $s3 , $s4 , and $s5 and the base of the arraysave in $s6 . The assembly code given in the example is

Code beforeLoop: add $t1, $s3, $s3 # Temp reg $t1 = 2 * i

add $t1, $t1, $t1 # Temp reg $t1 = 4 * iadd $t1, $t1, $s6 # $t1 = address of save [i]lw $t0, 0($t1) # Temp reg $t0 = save[i]bne $t0, $s5, Exit # go to Exit if save[i] & kadd $s3, $s3, $s4 # i = i + jj Loop # go to Loop

Exit:

Number of instructions executed if save[i + m * j] does not equal k for m = 10 anddoes equal k for 0 ' m ' 9 is 10 7 + 5 = 75, which corresponds to 10 complete itera-tions of the loop plus a nal pass that goes to Exit at the bne instruction before updat-ing i . Straightforward rewriting to use at most one branch or jump in the loop yields

Code afteradd $t1, $s3, $s3 # Temp reg $t1 = 2 * iadd $t1, $t1, $t1 # Temp reg $t1 = 4 * iadd $t1, $t1, $s6 # $t1 = address of save[i]lw $t0, 0($t1) # Temp reg $t0 = save[i]bne $t0, $s5, Exit # go to Exit if save[i] & k

Loop: add $s3, $s3, $s4 # i = i + jadd $t1, $s3, $s3 # Temp reg $t1 = 2 * iadd $t1, $t1, $t1 # Temp reg $t1 = 4 * iadd $t1, $t1, $s6 # $t1 = address of save[i]lw $t0, 0($t1) # Temp reg $t0 = save[i]beq $t0, $s5, Loop # go to Loop if save[i] = k

Exit:The number of instructions executed by this new form of the loop is 5 + 10 6 = 65. If 4 j is computed before the loop, then further saving in the loop body is possible.

Code after further improvement add $t2, $s4, $s4 # Temp reg $t2 = 2 * jadd $t2, $t2, $t2 # Temp reg $t2 = 4 * jadd $t1, $s3, $s3 # Temp reg $t1 = 2 * iadd $t1, $t1, $t1 # Temp reg $t1 = 4 * i


14/52

64 Instructors Manual for Computer Organization and Design

add $t1, $t1, $s6 # $t1 = address of save[i]lw $t0, 0($t1) # Temp reg $t0 = save[i]bne $t0, $s5, Exit # go to Exit if save[i] & k

Loop: add $t1, $t1, $t2 # $t1 = address of save [i + m * j]lw $t0, 0($t1) # Temp reg $t0 = save[i]beq $t0, $s5, Loop # go to Loop if save[i] = k

Exit:

The number of instructions executed is now 7 + 10 3 = 37.

3.10

Note: In the solutions, we make use of the li instruction, which should be imple-mented as shown in rows 3 and 4.

3.11 The fragment of C code is

for (i=0; i= $t3) go to L slt $at, $t5, $t3

beq $at, $zero, L


15/52


In section 3.8 the instructions addi (add immediate) and slti (set less than immedi-ate) are introduced. These instructions can carry constants in their machine code rep-resentation, saving a load instruction and use of a register. Now the loop bodyaccesses array elements, performs the computation, and tests for termination:

Loop: add $t4, $a1, $t0 # Temp reg $t4 = address of b[i]lw $t5, 0($t4) # Temp reg $t5 = b[i]add $t6, $t5, $t1 # Temp reg $t6 = b[i] + c

This add instruction would be add $t6, $t5, $s0 if it were assumed that $s0 holdsthe value of c . Continuing the loop:

add $t7, $a0, $t0 # Temp reg $t7 = address of a[i]sw $t6, 0($t7) # a[i] = b[i] + cadd $t0, $t0, $t2 # i = i + 4slt $t8, $t0, $t3 # $t8 = 1 if $t0 < 401, i.e., i ' 100bne $t8, $zero, Loop # go to Loop if i ' 100

The number of instructions executed is 4 + 101 8 = 812. The number of data refer-ences made is 3 + 101 2 = 205.

3.12 The problem is that we are using PC-relative addressing, so if the address of there is too far away, we wont be able to use 16 bits to describe where it is relative tothe PC. One simple solution would be

here: bne $t1, $t2, skipj there

skip:...

there: add $t1, $t1, $t1

This will work as long as our program does not cross the 256-MB address boundarydescribed in the elaboration on page 150.

3.13 Let I be the number of instructions taken by gcc on the unmodified MIPS. Thisdecomposes into .48I arithmetic instructions, .33I data transfer instructions, .17I condi-tional branches, and .02I jumps. Using the CPIs given for each instruction class, we geta total of (.48 1.0 + .33 1.4 + .17 1.7 + .02 1.2) I cycles; if we call the unmodifiedmachines cycle time C seconds, then the time taken on the unmodified machine is (.48 1.0 + .33 1.4 + .17 1.7 + .02 1.2) I C seconds. Changing some fraction, f (namely.25), of the data transfer instructions into the autoincrement or autodecrement versionwill leave the number of cycles spent on data transfer instructions unchanged. How-ever, each of the .33 I f data transfer instructions that is changed corresponds to anarithmetic instruction that can be eliminated. So, there are now only (.48 (.33 f)) Iarithmetic instructions, and the modified machine, with its cycle time of 1.1 C sec-onds, will take ((.48 .33f) 1.0 + .33 1.4 + .17 1.7 + .02 1.2) I 1.1 C secondsto execute gcc. When f is .25, the unmodified machine is 2.8% faster than the modifiedone.3.14 From Figure 3.38, 33% of all instructions executed by spice are data access in-structions. Thus, for every 100 instructions there are 100 + 33 = 133 memory accesses:one to read each instruction and 33 to access data.

a. The percentage of all memory accesses that are for data = 33/133 = 25%.


16/52


17/52


3.19

Code size is 22 bytes, and memory bandwidth is 22 + 28 = 50 bytes.



Accumulator

Inst ruct ion Code byt es Dat a byt es

load b # Acc = b; 3 4add c # Acc += c; 3 4

store a # a = Acc; 3 4add c # Acc += c; 3 4store b # Acc = b; 3 4neg # Acc =- Acc; 1 0add a # Acc -= b; 3 4store d # d = Acc; 3 4

Total: 22 28

Stack


push b 3 4push c 3 4add 1 0dup 1 0pop a 3 4push c 3 4add 1 0dup 1 0pop b 3 4neg 1 0push a 3 4add 1 0

pop d 3 4 Total: 27 28

Memory-Memory


add a, b, c # a=b+c 7 12add b, a, c # b=a+c 7 12sub d, a, b # d=ab 7 12

Total: 21 36


18/52


19/52


3.25 Here is the C code for itoa , as taken from The C Programing Language by Ker-nighan and Ritchie:

void reverse( char *s ){

int c, i, j;

for( i = 0, j = strlen(s)1; i < j; i++, j ) {c=s[i];s[i]=s[j];s[j] = c;

}}

void itoa( int n, char *s ){

int i, sign;

if( ( sign = n ) < 0 )n = n;

i = 0;do {

s[i++] = n % 10 + '0';} while( ( n /= 10 ) > 0 );if( sign < 0 )

s[i++] = '';s[i] = '\0';reverse( s );

}}

The MIPS assembly code, along with a main routine to test it, might look somethinglike this:

.datahello: .ascii "\nEnter a number:"newln: .asciiz "\n"str: .space 32

.textreverse: # Expects string to

# reverse in $a0# s = i = $a0# j = $t2

addi $t2, $a0, 1 # j = s 1;lbu $t3, 1($t2) # while( *(j+1) )beqz $t3, end_strlen

strlen_loop:addi $t2, $t2, 1 # j++;lbu $t3, 1($t2)bnez $t3, strlen_loop


20/52


end_strlen: # now j =# &s[strlen(s)1]

bge $a0, $t2, end_reverse # while( i < j )# {

reverse_loop:lbu $t3, ($a0) # $t3 = *i;lbu $t4, ($t2) # $t4 = *j;sb $t3, ($t2) # *j = $t3;sb $t4, ($a0) # *i = $t4;addi $a0, $a0, 1 # i++;addi $t2, $t2, 1 # j;blt $a0, $t2, reverse_loop # }

end_reverse:jr $31

.globl itoa # $a0 = nitoa: addi $29, $29, 4 # $a1 = s

sw $31, 0($29)move $t0, $a0 # sign = n;move $t3, $a1 # $t3 = s;bgez $a0, non_neg # if( sign < 0 )sub $a0, $0, $a0 # n = n

non_neg:li $t2, 10

itoa_loop: # do {div $a0, $t2 # lo = n / 10;

# hi = n % 10;

mfhi $t1mflo $a0 # n /= 10;

addi $t1, $t1, 48 # $t1 =# '0' + n % 10;sb $t1, 0($a1) # *s = $t1;addi $a1, $a1, 1 # s++;bnez $a0, itoa_loop # } while( n );bgez $t0, non_neg2 # if( sign < 0 )

# {li $t1, '' # *s = '';sb $t1, 0($a1) # s++;addi $a1, $a1, 1 # }

non_neg2:sb $0, 0($a1)

move $a0, $t3jal reverse # reverse( s );lw $31, 0($29)addi $29, $29, 4jr $31

.globl main


21/52


22/52


__start:# Print the prompts

li $v0, 4 # print_str system service ...la $a0, prmpt1 # ... passing address of first promptsyscallli $v0, 4 # print_str system service ...la $a0, prmpt2 # ... passing address of 2nd promptsyscall

# Read n and call fib with resultli $v0, 5 # read_int system servicesyscallmove $a0, $v0 # $a0 = n = result of readjal fib # call fib(n)move $s1, $v0 # $s0 = fib(n)

# Print resultli $v0, 4 # print_str system service ...la $a0, descr # ... passing address of output descriptorsyscallli $v0, 1 # print_int system service ...move $a0, $s # ... passing argument fib(n)syscall

# Call system exitli $v0, 10syscall

# Algorithm for Fib(n):# if (n == 0) return 0# else if (n == 1) return 1# else return fib(n1) + fib(n2).## Register usage:

# $a0 = n (argument)# $t1 = fib(n1)# $t2 = fib(n2)# $v0 = 1 (for comparison)## Stack usage:# 1. push return address, n, before calling fib(n1)# 2. pop n# 3. push n, fib(n1), before calling fib(n2)# 4. pop fib(n1), n, return address

fib: bne $a0, $zero, fibne0 # if n == 0 ...

move $v0, $zero # ... return 0jr $31fibne0: # Assert: n != 0

li $v0, 1bne $a0, $v0, fibne1 # if n == 1 ...jr $31 # ... return 1

fibne1: # Assert: n > 1


23/52


24/52


__start:# Print the prompts

li $v0, 4 # print_str system service ...la $a0, prmpt1 # ... passing address of first promptsyscallli $v0, 4 # print_str system service ...la $a0, prmpt2 # ... passing address of 2nd promptsyscall

# Read n and call fib with resultli $v0, 5 # read_int system servicesyscallmove $a2, $v0 # $a2 = n = result of readli $a1, 0 # $a1 = fib(0)li $a0, 1 # $a0 = fib(1)jal fib # call fib(n)move $s1, $v0 # $s0 = fib(n)

# Print resultli $v0, 4 # print_str system service ...la $a0, descr # ... passing address of output

# descriptorsyscallli $v0, 1 # print_int system service ...move $a0, $s1 # ... passing argument fib(n)syscall

# Call system - exitli $v0, 10syscall

# Algorithm for Fib(a, b, count):# if (count == 0) return b# else return fib(a + b, a, count 1).

## Register usage:# $a0 = a = fib(n1)# $a1 = b = fib(n2)# $a2 = count (initially n, finally 0).# $t1 = temporary a + bfib: bne $a2, $zero, fibne0 # if count == 0 ...

move $v0, $a1 # ... return bjr $31

fibne0: # Assert: n != 0addi $a2, $a2, 1 # count = count 1add $t1, $a0, $a1 # $t1 = a + b

move $a1, $a0 # b = amove $a0, $t1 # a = a + old bj fib # tail call fib(a+b, a, count1)

3.28 No solution provided.


25/52


3.29

start:sbn temp, b, .+1 # Sets temp = b, always goes to next instruction

sbn a, temp, .+1 # Sets a = a temp = a (b) = a + b

3.30 There are a number of ways to do this, but this is perhaps the most concise andelegant:

sbn c, c, .+1 # c = 0;sbn tmp, tmp, .+1 # tmp = 0;

loop: sbn b, one, end # while (b > 0)sbn tmp, a, loop # c = a; /* always continue */

end: sbn c, tmp, .+1 # c = tmp; /* = a b */


26/52


4.1 In the manner analogous to that used in the example on page 214, the number512ten = 5x102 + 1x101 + 2x100. Given that 1 two = 100, 1010two = 101, and 110 0100 two = 102,we have

512ten = 5x 110 0100two + 1x1001two + 2x1two= 110 0100

110 0100110 0100110 0100110 0100

1010 1

+ 1110 0000 0000

The number 512 ten is positive, so the sign bit is 0, and sign extension yields the 32-bittwos complement result 512 ten = 0000 0000 0000 0000 0000 0010 0000 0000two .

4.2 This exercise can be solved with exactly the same approach used by the solutionto Exercise 4.1, with the sign and sign extension being 1 digits. Because memorizing the base 10 representations of powers of 2 is useful for many hardware and software tasks,another conversion technique is widely used.

Let N be the magnitude of the decimal number, x[i] be the i th bit of the binary repre-sentation of N, and m be the greatest integer such that 2m ' N. Then

for (i = m; i ' 0; i = i - 1) {if (2 i ' N) x[i] = 1;else x[i] = 0;N = N - 2 i ; }

4 Solutions


27/52


For 1023ten, N = 1023 512 ) x[9] = 1

511 256 ) x[8] = 1

255

128 ) x[7] = 1127

64 ) x[6] = 163

32 ) x[5] = 131

16 ) x[4] = 115

8 ) x[3] = 17

4 ) x[2] = 1

3 2 ) x[1] = 11

1 ) x[0] = 10 Done

So N = 0000 0000 0000 0000 0000 0011 1111 1111twoThus 1023 ten = 1111 1111 1111 1111 1111 1100 0000 0001two4.3 Using the method of the solution to either Exercise 4.1 or Exercise 4.2, 4,000,000 ten= 1111 1111 1100 0010 1111 0111 0000 0000two .

4.4 We could substitute into the formula at the bottom of page 213, (x31 x 2 31) + (x30x 230) + (x29 x 229) +...+ (x1 x 21) + (x0 x 20), to get the answer. Because this twos com-plement number has predominantly 1 digits, the formula will have many nonzeroterms to add. The negation of this number will have mostly 0 digits, so using the nega-tion shortcut in the Example on page 216 first and remembering the original sign willsave enough work in the formula to be a good strategy. Thus,

Negating 1111 1111 1111 1111 1111 1110 0000 1100twois 0000 0000 0000 0000 0000 0001 1111 0011two+ 1two= 0000 0000 0000 0000 0000 0001 1111 0100two

Then the nonzero terms of the formula are28 + 27 + 26 + 25 + 24 + 22 = 500

and the original sign is 1 meaning negative, so1111 1111 1111 1111 1111 1110 0000 1100 two = 500ten4.5 Here, negating first as in the solution for Exercise 4.4 really pays off. The negationis the all-zero string incremented by 1, yielding +1. Remembering the original sign isnegative, 1111 1111 1111 1111 1111 1111 1111 1111 two = 1ten.

4.6 Negating the twos complement representation gives1000 0000 0000 0000 0000 0000 0000 0001 which equals(1 x 231) + (1 x 20) = 2,147,483,648ten + 1ten

= 2,147,483,647ten

Recalling that the original twos complement number is positive,0111 1111 1111 1111 1111 1111 1111 1111 two = 2,147,483,647ten.


28/52


29/52


b.

c.

d.

opcode (6 bits) = 100011 = lw

rs (5 bits) = 11111 = 31

rt (5 bits) = 01111 = 15

address (16 bits) = 1100 0000 0000 0000

Since the address is negative we have to take its twos complement.

Therefore the instruction is lw 15, 16384(31) .

Notice that the address embedded within the 16-bit immediate eld is a byte addressunlike the constants embedded in PC-relative branch instructions where wordaddressing is used.

4.15

a. 0

b. 0

c. 0.0

d. sll $0,$0,04.16 Figure 4.54 shows 21% of the instructions as being lw. If 15% of these could takeadvantage of the new variant, that would be 3.2% of all instructions. Each of these pre-sumably now has an addu instruction (that adds the two registers values together) thatcould be eliminated. Thus roughly 3.2% of the instructions, namely those addu instruc-

A = 1000 1111 1110 1111 1100 0000 0000 0000= 8FEFC000= 8 X 167 + 15 166 + 14 + 165 + 15 164 + 12 163

= 2,147,483,648 + 251,658,240 + 14,680,064 + 983,040 + 49,152= 2,414,854,144

s = 1exponent = 0001 1111

= 25 1 = 31significand = 110 1111 1100 0000 0000 0000

(1)S (1 + significand) 2 exponent127 = 1x1.1101 1111 1x296

= 1 (1 + 13 161 + 15 162 + 29) 296

= 1.873 296

= 2.364 1029

Twos complement of address = 0100 0000 0000 0000address = 214

= 16384


30/52


tions, could be eliminated if the addition were now done as part of the lw. The savingsmay be a bit overestimated (slightly less than 3.2%) due to the fact that some existinginstructions may be sharing addu instructions. Thus we might not eliminate one addufor every changed lw.

4.17 Either the instruction sequence

addu $t2, $t3, $t4sltu $t2, $t2, $t4

oraddu $t2, $t3, $t4sltu $t2, $t2, $t3

work.

4.18 If overflow detection is not required, then

addu $t3, $t5, $t7sltu $t2, $t3, $t5

addu $t2, $t2, $t4addu $t2, $t2, $t6

is sufcient. If overow detection is desired, then useaddu $t3, $t5, $t7sltu $t2, $t3, $t5add $t2, $t2, $t4add $t2, $t2, $t6

If overow detection is desired, the last two addu instructions should be replaced byadd instructions.

4.19 To detect whether $s0 < $s1 , its tempting to subtract them and look at the sign

of the result. This idea is problematic, because if the subtraction results in an overflowan exception would occur! To overcome this, there are two possible methods: You cansubtract them as unsigned numbers (which never produces an exception) and thencheck to see whether overflow would have occurred (this is discussed in an elaborationon page 223). This method is acceptable, but it is lengthy and does more work than nec-essary. An alternative would be to check signs. Overflow can occur if $s0 and ( $s1 )share the same sign; i.e., if $s0 and $s1 differ in sign. But in that case, we dont need tosubtract them since the negative one is obviously the smaller! The solution inpseudocode would be

if ($s00) then$t0:=1

else if ($s0>0) and ($s1


31/52


4.21

sll $s1, $s0, 2addu $s1, $s0, $s1


4.23 The ALU-supported set less than ( slt ) uses just the sign bit. In this case, if we trya set less than operation using the values 7 ten and 6 ten we would get 7 > 6. This isclearly wrong. Modify the 32-bit ALU in Figure 4.11 on page 169 to handle slt correct-ly by factor in overflow in the decision.

If there is no overow, the calculation is done properly in Figure 4.17 and we simplyuse the sign bit (Result31). If there is overow, however, then the sign bit is wrong andwe need the inverse of the sign bit.

LessThan = Overflow* Result31


4.25

Overow Result 31 LessThan

0 0 00 1 11 0 11 1 0

10ten = 1010two

= 1.01two 23

Sign = 0 Significand = .01Single exponent = 3 + 127 = 130

Double exponent = 3 + 1023 = 1026

OverflowResult31 LessThan

10000000010

0 10000010 01000000000000000000000

1 8 23

0 0100000000000000000000000000000000000000000000000000

1 11 52

Single precision

Double precision


32/52


4.26

The solution is the same as that for Exercise 4.25, but with the fourth bit from the leftof the signicand changed to a 1.

4.27

4.28

2/3 = 1.01 21 Single exponent = 1 + 127 = 126


4.29

main( ){

float x;printf("> ");scanf("%f", &x);printf("%08lx\n", * (long *) &x);

}

10.5ten = 1010.1two = 1.0101two 23

Sign = 0 Significand = .0101

Single exponent = 3 + 127 = 130Double exponent = 3 + 1023 = 1026

0.1ten = 0.00011two = 1.10011two 24

Sign = 0 Significand= .10011Single exponent = 4 + 127 = 123


01111111011

0 0111101110011001100110011001100

8 23

0 1001100110011001100110011001100110011001100110011001

1 11 52

Single

Double

10011001100110011001101truncround

precision 1001100110011001100110011001100110011001100110011010truncround

precision

1

01111111110

1 0111111001010101010101010101010

8 23

10101010101010101010101010101010101010101010101010101

1 11 52

Single

Double

01010101010101010101011truncround

precision 0101010101010101010101010101010101010101010101010110truncround

precision

1


33/52


4.30

#include// assumes float and long are both 32-bit numbersmain( ){

float x;cout > x;cout


34/52


35/52


4.36

Therefore,

4.37

Therefore,

Note that the percentages given in Figure 4.54 for spice add up to 97%.4.38 No solution provided.


4.40

xor $s0, $s0, $s1xor $s1, $s0, $s1xor $s0, $s0, $s1

Inst ruct ion cat egory Average CPI Frequency (%)

Loads and stores 1.4 21 +12 =33Conditional branch 1.8 9 +8 +1 +1 =19

Jumps 1.2 1 +1 =2

Integer multiply 10.0 0Integer divide 30.0 0FP add and subtract 2.0 0FP multiply, single precision 4.0 0FP multiply, double precision 5.0 0FP divide, single precision 12.0 0FP divide, double precision 19.0 0Integer ALU other than multiply and divide(addu, addiu, and, andi, sll,lui, lb, sb, slt, slti, sltu,sltiu, sra, lh )

1.0 9 +17 +1 +2 +5 +2 +1+1 +2 +1 +1 +1 +2 +1 =46

CPI = 1.4 33% + 1.8 19% + 1.2 2% + 1.0 46%= 1.3

Inst ruct ion cat egory Average CPI Frequency (%)

Loads and stores 1.4 7 +2 +24 +9 =42Conditional branch 1.8 3 +2 +1 +1 +1 =8

Jumps 1.2 1 +1 =2Integer multiply 10.0 0Integer divide 30.0 0FP add and subtract 2.0 4 +3 =7FP multiply, single precision 4.0 0FP multiply, double precision 5.0 5FP divide, single precision 12.0 0FP divide, double precision 19.0 2Integer ALU other than multiply and divide(addu, addiu, subu, andi, sll,srl, lui, c.x.d, mtcl, mtc2,cut )

1.0 10 +1 +1 +1 +5 +1 +6+1 +2 +2 +1 =31

CPI = 1.4 42% + 1.8 8% + 1.2 2% + 2.0 7% + 5.0 5% + 19.0 2% + 1.0 31%= 1.8


36/52


37/52


4.43 Here is the equation:

Sum = (a b CarryIn) + (a b CarryIn) + (a b CarryIn) + (a b CarryIn)

4.44 C1 = c4, C2 = c8, C3 = c12, and C4 = c16.

c4 = G3,0 + (P3,0 c0).

c8 is given in the exercise.

c12 = G11,8 + (P11,8 G7,4) + (P11,8 P7,4 G3,0) + (P11,8 P7,4 P3,0 c0).

c16 = G15,12 + (P15,12 G11,8) + (P15,12 P11,8 G7,4)+ (P15,12 P11,8 P7,4 G3,0) + (P15,12 P11,8 P7,4 P3,0 c0).

4.45 The equations for c4, c8, and c12 are the same as those given in the solution toExercise 4.44. Using 16-bit adders means using another level of carry-lookahead logicto construct the 64-bit adder. The second level generate, G0 +, and propagate, P0 +, are

G0+ = G15,0 = G15,12 + P15,12 G11,8 + P15,12 P11,8 G7,4 + P15,12 P11,8 P7,4 G3,0

and

P0+ = P15,0 = P15,12 P11,8 P7,4 P3,0

Using G0 + and P0 + we can write c16 more compactly as

c16 = G15,0 + P15,0 c0

and

c32 = G31,16 + P31,16 c16c48 = G47,32 + P47, 32 c32c64 = G63,48 + P63,48 c48

A 64-bit adder diagram in the style of Figure 4.24 is shown on the following page.

Sum

CarryIn

a

b


38/52






4.50 The longest paths through the top (ripple carry) adder organization in Figure4.56 all start at input a0 or b0 and pass through seven full adders on the way to outputs4 or s5. There are many such paths, all with a time delay of 7 2T = 14T. The longestpaths through the bottom (carry save) adder all start at input b0, e0, f0, b1, e1, or f1 andproceed through six full adders to outputs s4 or s5. The time delay for this circuit isonly 6 2T = 12T.



c0 CarryIn

Result0 15ALU0

CarryIn

Result16 31

ALU1

CarryIn

Result32 47

ALU2

CarryIn

c64 CarryOut

Result48 63

ALU3

CarryIn

c16

c48

P63, 48G63, 48

a48, 63

Second-level carry-lookahead unit

b48, 63

b32, 47

a32, 47

P47, 32G47, 32

c32

P31, 16G31, 16

a16, 31

b16, 31

P15, 0G15, 0

a0, 15

b0, 15


39/52


4.53

One example of 6-bit operands that run faster when Booths algorithm looks at 3 bitsat a time is 21 ten 27ten = 567ten .

Two-bit Booths algorithm:

010101 = 21ten 011011 = 27ten 010101 10 string (always start with padding 0 to right of LSB)0000000 11 string, middle of a string of 1s, no operation

+ 01010100 01 string, add multiplicand 010101000 10 string, subtract multiplicand0000000000 11 string

+ 01010100000 01 string11111101011 twos complement with sign extension as needed00000000000 zero with sign extension shown00001010100 positive multiplicand with sign extension1110101100000000000000

+ 0101010000001000110111 = 567ten

Dont worry about the carry out of the MSB here; with additional sign extension forthe addends, the sum would correctly have an extended positive sign. Now, using the3-bit Booths algorithm:

010101 = 21ten 011011 = 27ten 010101 110 string (always start with padding 0 to right of LSB)

01010100 101 string, subtract the multiplicand+ 01010100000 011 string, add twice the multiplicand (i.e, shifted left 1 place)

11111101011 twos complement of multiplicand with sign extension11110101100 twos complement of multiplicand with sign extension

+ 01010100000 01000110111 = 567 ten

Cu rrent bi t s Prev. bi t s

ai + 1 ai ai 1 Operat ion Reason

0 0 0 None Middle of a string of 0s0 0 1 Add the multiplicand End of a string of 1s

0 1 0 Add the multiplicand A string of one 1, so subtract the multiplicand atposition i for the beginning of the string and add twicethe multiplicand (twice to align with position i+l) forthe end of the string; net result, add the multiplicand

0 1 1 Add twice themultiplicand

End of a string of 1s; must align add with 0 in positioni+1

1 0 0 Subtract twice themultiplicand

Beginning of a string of 1s; must subtract with 1 inposition i+1

1 0 1 Subtract themultiplicand

End of string of 1s, so add multiplicand, plusbeginning of a string of 1s, so subtract twice themultiplicand; net result is to subtract the multiplicand

1 1 0 Subtract themultiplicand

Beginning of a string of 1s

1 1 1 None Middle of a string of 1s


40/52


Using the 3-bit version gives only 3 addends to sum to get the product versus 6addends using the 2-bit algorithm.

Booths algorithm can be extended to look at any number of bits b at a time. Theamounts to add or subtract include all multiples of the multiplicand from 0 to 2 (b1).Thus, for b > 3 this means adding or subtracting values that are other than powers of

two multiples of the multiplicand. These values do not have a trivial shift left by thepower of 2 number of bit positions method of computation.

4.54

1. Subtract the Divisor register from the Remainderregister, and place the result in the left half of theremainder register

2a. Shift the Remainder reg-ister to the left and set thenew rightmost bit to 1

2b. Shift the Remainder reg-ister to the left and set newrightmost bit to 0

3a. Subtract Divisor to lefthalf of the Remainder

3b. Add Divisor to lefthalf of the Remainder

Done: Shift left half ofRemainder right 1 bit

Subtract Divisor

Done: Shift left half ofRemainder right 1 bit

Remainder >= 0Test Remainder

Remainder < 0

Yes: 32 repetitions32nd repetition? 32nd repetition?

No: < 32 repetitions No: < 32 repetitions


41/52


Note that the shaded boxes are necessary when Remainder < 0 to correct the redun-dant addition in step 3b should the last step end up in this branch of the ow chart. Asshown by the table, 7 divided by 2 is 3 with a remainder of 1.

4.55 Various rounding modes are possible. Round To Nearest is the rounding modethat is most familiar, with the exception of handling one special case. In the event thatthe number to round is exactly halfway between two values with the desired numberof digits, the Rounding to Nearest tie is broken by rounding to the value with an evenleast significant digit. This removes any consistent bias in the direction of roundinghalf the ties are broken by rounding up and half are broken by rounding down. Othermodes of rounding in IEEE 754 are Round Toward Positive Infinity, Round TowardNegative Infinity, and Round Toward Zero (commonly known as Truncation). Whenno digits to the right of the least significant digit are kept in the arithmetic circuitry,then right-shifting the significand to align the exponents results in rounding via Trun-cation by default.

With guard and round digits (assume Round to Nearest mode is chosen):

Extend with guard and round digits 9.5100 102

Shift smaller number to match larger exponent + 0.6420 102Add (circuit must hold extra high-order digit) 10.1520 102

Normalize the sum 1.0152 102 Round to three signicant digits 1.02 102

Without guard and round digits (Truncation occurs):

9.51 102Shift smaller number, truncate, keep only 3 digits + 0.64 102

Add (circuit must hold extra high-order digit) 10.15 102Normalize the sum (truncation occurs) 1.01 102

Truncation saves spaceno guard and rounding digits are neededand it savestimethe shifting by normalization yields the rounded result. The disadvantage isthe quality of the math, as shown by comparison with the more accurate resultobtained using the more space (circuitry) and time-intensive Round to Nearest mode.

It erat ion St ep Divisor Remainder

0 Initial Values 0010 0000 11101 1: Rem Rem Div 0010 1110 1110

2b: Rem 0 ) sll R, R0 =1 0010 0010 0011Done Shift Rem right 1 0010 0001 0011


42/52


43/52


1. Subtract the exponent of the divisor from thedividend

Overflow or

2. Divide the significands. Check for divide byzero

3. Normalize the product if necessary, shifting itright and incrementing the exponent

4. Round significand to the appropriate number of bits

Still normalized?

5. Set the sign of the product to positive if the signsof the original operands are the same. If they differmake the sign negative

underflow?Yes

NoInterrupt

No

Yes

Done


44/52


4.58 The guard bit is present to preserve the full precision (number of significant bits)of the significand during the operation. It can be shown that a single bit is sufficient for+, , , and / to preserve the precision of a result that can be expressed in the numberof bits available in the significand. If the result has more bits of precision than availablein the floating-point format, then rounding is used to yield an expressible approxima-tion (fits in the available number of bits) of the result.

The bits guard (g), round (r), and sticky (s) all participate in right shifts of the signi-cand to make the exponent of the smaller number match that of the larger number.Right shifting moves the LSB into g, g into r, and the OR of bit(s) shifted through r intos. During left shifts, s does not participate, zeros are shifted into r as needed, r shiftsinto g, and g shifts into the LSB.

The following tables list the actions needed for each rounding mode.

Round Toward Negative Innity:

Round Toward Positive Innity:

Truncate:

g r s Act ion

0 0 0 Exact result; significand correctly roundedX X 1 Inexact result, correctly rounded if Sum is positive, round by adding 1 to LSB

if Sum is negative (makes Sum more negative, moving toward negativeinfinity)

X 1 X Inexact result, correctly rounded if Sum is positive, round by adding 1 to LSBif Sum is negative (makes Sum more negative, moving toward negativeinfinity)

1 X X Inexact result, correctly rounded if Sum is positive, round by adding 1 to LSBif Sum is negative (makes Sum more negative, moving toward negativeinfinity)

g r s Act ion

0 0 0 Exact result; significand correctly roundedX X 1 Inexact result, correctly rounded if Sum is negative, round by adding 1 to

LSB if Sum is positive (makes Sum more positive, moving toward positiveinfinity)

X 1 X Inexact result, correctly rounded if Sum is negative, round by adding 1 toLSB if Sum is positive (makes Sum more positive, moving toward positiveinfinity)

1 X X Inexact result, correctly rounded if Sum is negative, round by adding 1 toLSB if Sum is positive (makes Sum more positive, moving toward positiveinfinity)

g r s Act ionX X X Significand correctly rounded


45/52


(Unbiased) Round to Nearest:

Note that unbiased round to nearest is the only mode that requires the value of theLSB. Neither unbiased round to nearest nor truncate predicate their action on the signof the signicand. Round to plus and minus innity are the only modes that requirethe sign bit.

Filling in the table given in the exercise:



Sum p g r s Act ion

X 0 0 0 Exact result; significand correctly roundedX 0 X 1 Inexact result, but significand is correctly roundedX 0 1 0 Inexact result, but significant is correctly rounded0 1 0 0 Tie case with even significand; significand correct1 1 0 0 Tie case with odd significand; round to nearest even by adding 1

to LSBX 1 1 X Inexact result, round to nearest by adding 1 to LSBX 1 X 1 Inexact result, round to nearest by adding 1 to LSB

Rounding mode Sum > = 0 Sum < 0

Toward Negative Infinity blank NAND(g,r,s) Toward Positive Infinity NAND(g,r,s) blank Truncate blank blank(Unbiased) Round to Nearest g (Sum p + r +s) g (Sum p +r +s)


46/52


5.1 If Regdst = 0, all R-format instructions will not work properly because we willspecify the wrong register to write. If ALUSrc = 0, then all I-format instructions except branch will not work because we will not be able to get the sign-extended 16 bits intothe ALU. If MemtoReg = 0, then loads will not work. If Zero = 0, the branch instructionwill never branch, even when it should.

5.2 If Regdst = 1, loads will not work (bad destination). If ALUSrc = 1, then all R-for-mat instructions as well as branch will not work properly because the second registerread will not get into the ALU. If MemtoReg = 1, then all R-format instructions will notwrite the correct data into the register file. If Zero = 1, all branches will be taken all of the time, even if they shouldnt be.

5.3 If RegDst = 0, all R-format instructions will be unable to write into the proper reg-ister (rd). If MemtoReg = 0 or IorD = 0, then loads will not work. If ALUSrcA = 0, noneof the instructions will work properly because they all require A to be operated on bythe ALU at some point.

5.4 If RegDst = 1, load instructions will not work. If MemtoReg = 1, all R-format in-structions will not work. If IorD = 1, no instructions will work because the PC will nev-er be used to get an instruction. If ALUSrcA = 1, the program counter will not beproperly incremented by 4 and essentially all instructions will not work.

5.5 No additions to the datapath are required. A new row should be added to the truthtable in Figure 5.20. The new control is similar to load word because we want to use theALU to add the immediate to a register (and thus RegDst = 0, ALUSrc = 1, ALUOp =00). The new control is also similar to an R-format instruction, because we want towrite the result of the ALU into a register (and thus MemtoReg = 0, RegWrite = 1) andof course we arent branching or using memory (Branch = 0, MemRead = 0, MemWrite

= 0).5.6 We already have a way to change the PC based on the specified address (using thedatapath for the jump instruction), but well need a way to put PC+4 into register $ra(31), and this will require changing the datapath. We can expand the multiplexorcontrolled by RegDst to include 31 as a new input. We can expand the multiplexor con-trolled by MemToReg to have PC+4 as an input. Because we expand these multiplex-ors, the entire columns for RegDst and MemtoReg must change appropriately in thetruth table. The jal instruction doesnt use the ALU, so ALUSrc and ALUOp can bedont cares. Well have Jump = 1, RegWrite = 1, and we arent branching or using mem-ory (Branch = 0, MemRead = 0, MemWrite = 0).

5.7 One possible solution is to add a new control signal called Invzero that selectswhether Zero or inverted Zero is an input to the AND gate used for choosing what the

new PC should be (thus a new multiplexor is introduced). The new control signalInvzero would be a dont care whenever the control signal Branch is zero. Many othersolutions are possible.

5 Solutions


47/52


Note to instructors: Sometimes different solutions submitted by students will in fact beidentical (e.g., one student draws a multiplexor, the other draws gates, but in fact they both implemented it the same wayjust at different levels of abstraction). You maywant to look for solutions that you can use to emphasize this point to your students.For example, an alternative solution would be to use Zero as an input to a new multi-plexor that selects between Branch and a new control signal Bnequal. In this case,there would be no dont cares because if we arent branching, both Branch andBnequal need to be zero. Some students may present this exact solution in gate forminstead of in mux form.

5.8 No changes are needed in the datapath. The new variant is just like lw except thatthe ALU will use the Read data 2 input instead of the sign-extended immediate. Of course the instruction format will need to change: the register write will need to bespecified by the rd field instead of the rt field. However, all needed paths are alreadyin place for the sake of R-format instructions. To modify the control, we simply needto add a new row to the existing truth table. For the new instruction, RegDst = 1,ALUSrc = 0, MemtoReg = 1, RegWrite = 1, MemtoReg = 1, MemWrite = 0, ALUop = 00.

5.9 There are no temporary registers that can be made to hold intermediate (tempo-rary) data. There are no intermediate edges to clock them. There is no state machine tokeep track of whether rs has been updated or both rs and rt have been updated. Inessence, we cant order things.

5.10 Looking at Figure 5.20, we see that MemtoReg and MemRead are identical exceptfor sw and beq , for which MemtoReg is a dont care. Thus, the modification will workfor the single-cycle datapath. The modification will also work on the multiple-cycledatapath assuming that the finite state machine is changed so that MemRead isasserted whenever MemtoReg is.

5.11 Using Figure 5.20, we discover that MemtoReg could be replaced by ALUSrc,RegDst could be replaced by ALUOp1, and either Branch or ALUOp0 could be re-placed in favor of the other (their signals are identical). Note that in reality there wouldlikely be additional rows present in the truth table to support other instructions, and it

is quite likely that no control signals could be eliminated.5.12 The key is recognizing that we no longer have to go through the ALU and thento memory. We would not want to add zero using the ALU, instead we want to providea path directly from the Read data 1 output of the register file to the read/write addresslines of the memory (assuming the instruction format does not change). The output of the ALU would no longer connect to memory. The control does not need to change, butsome of the control signals now are dont cares. Assuming we are not implementingaddi or addiu , it is possible to remove AluSrc and the multiplexor it controls by having just the data from Read data 2 going into the ALU. This results in additional optimiza-tions to ALU control.


48/52


5.13 The cycle time goes down, from 8 ns to 6 ns, because we no longer need to gothrough the ALU to calculate an address. Consider executing 10 instructions, one of which is a load word with an offset. Previously this would have taken 10 8 = 80 ns.Now we require an additional addi instruction to compute the offset, and it takes (10+ 1) 6 = 66 ns. So clearly we can tolerate more than 10%. Algebraically, 8 I = 6 (X + 1)I, where I is the number of instructions and X is the percentage we are solvingfor. Thus, X = .33 and we can tolerate 33% of the instructions being loads and storeswith offsets. This shouldnt be surprising, because we sped up the cycle time by a fac-tor of 8/6 = 1.33.

5.14 The key is to understand that it is the length of the longest path in the combina-tional logic that is determining the cycle time. Essentially, we compute the length of thelongest path for each instruction and then must take the one with maximum value. Atpresent, the lw instruction is providing the longest path of length 8 ns.

a. The changes do not increase any paths beyond current maximums. Thus thecycle time is still 8 ns.

b. Consider the beq instruction. We now have a path which requires 10 ns to com-pute the branch address (X + Y = 10 ns), and this is maximum. Thus, the cycletime increases to 10 ns.

c. You might be tempted to again conclude that the branch instruction will requireX + Y = 9 ns for branch address computation. This is not quite correct, however, because the second adder (requiring Y time) has two inputs, one of which is notavailable until after the instruction is read (the 16 immediate bits), and this takes2 ns. Thus, the actual maximum length path is again 10 ns, and the cycle time is10 ns.

5.15 The existing datapath is sufficient. The execution steps for addi would be

Instruction fetch (unchanged)

Instruction decode and register fetch (unchanged)

Execution: ALUOut = A + sign-extend (IR[15-0]);

Instruction Completion: Reg[IR[20-16]] = ALUOut

The rst three steps are identical to those performed for memory access instructions,so we can use them (we must add Op = addi to the transition from state 1 to state 2).The fourth step is different and not like any existing step; thus we must add a newstate to our nite state machine (i.e., state 10), a transition from state 2 to state 10 if Op=addi, and a transition from state 10 back to state 0. In state 10 we have RegDst =0, RegWrite, MemtoReg = 0.

5.16 We need to change Dispatch 1 and Dispatch 2 and introduce a new entry(perhaps inserted after SW2) with a label of ADDI2 and a register control of WriteALU rt, which is similar to Write ALU except that it uses the rt field instead of rd.You could argue that further changes should be made to existing labels to make thingseasier to understand.


49/52


50/52


5.21 The instruction will require using the ALU twice. A multiplexor should be addedto permit the ALU to use the previous result produced by the ALU (stored in ALUOut)as an input for the second addition. Note that changes will also be needed for registerread. The instruction can be implemented in 5 cycles by reading the third registersource during the cycle in which the first addition is taking place. Clearly a multiplexorand a new control signal will be needed for the inputs to one of the register read ports.

5.22 The rt field of the jump register instruction contains 0. So, along with rs we read$0. To get rs to the output of the ALU, we can perform a dummy addition of rs + $0.We already have a path from ALUOut to the PC. We can use this path for loading thePC with rs. The finite state machine will need a new state for when the opcode is 000000and the function code is 001000. The current condition for entering state 6 needs to bemodified to make sure the function code is not 001000. In the new state, we want theappropriate combination of states 0 and 6: ALUSrcA = 1, ALUSrcB = 00, PCWrite, PC-Source = 00, and of course we make sure an addition is performed with ALUOp = 00.

5.23 Many solutions are possible. In all of them, a multiplexor will be needed as wellas a new control signal (e.g., RegRead) to select which register is going to be read (i.e.,using IR[25-11] or IR[20-16] ). One simple solution is simply to add a write signalto A and break up state 1 into two states, in which A and B are read. It is possible toavoid adding the write signal to A if B is read first. Then A is read and RegRead is heldstable (because A always writes). Alternatively, you could decide to read A first be-cause it may be needed to calculate an address. You could then postpone reading B un-til state 2 and avoid adding an extra cycle for the load and store instructions. An extracycle would be needed for the branch and R-type instructions.

5.24 The instruction mix is

Inst ruct ion cl ass Fr equency

Loads 21%Stores 12%R-type 46%

Jump/ branch 21%


51/52


The CPIs for each machine by instruction class are

So the original machine M1 is fastest for this mix. M3 would be the fastest for aninstruction mix that contained only loads.

5.25 The gcc data from Chapter 4 has the following instruction mix:

Here are the CPIs:

The 750 MHz is faster by .

Inst ruct ion class M1 CPIs M2 CPIs M3 CPIs

Loads 5 4 3Stores 4 4 3R-type 4 3 3

Jump/branch 3 3 3Effective CPI 4.00 3.33 3

Inst ruct ion cl ass Fr equency

Loads 21%Stores 12%R-type 46%

Jump/branch 21%

Inst ruct ion class CPI on 500-MHz machine CPI on 750-MHz machine

Loads 5 6Stores 4 5R-type 4 4

Jump/branch 3 3

MIPS M15004.00---------- 125.0= =

MIPS M24003.33----------

120.1= =

MIPS M32503.0---------

83.3= =

CPI 500 MHz 0.21 5 0.12 4 0.46 4 0.21 3 + + + 4.00= =

CPI 750 MHz 0.21 6 0.12 5 0.46 4 0.21 3 + + + 4.33= =

7504.33----------

5004.00-------------------- 173.21125.00---------------- 1.39= =


52/52

Documents

computer organization and design Chap1-5