IEEE Carl Hamacher

IEEE TRANSACTIONS ON COMPUTERS, VOL. c-22, NO. 2, FEBRUARY 1973

An Augmented Iterative Array ror

High-Speed Binary Division

MAURUS CAPPA AND V. CARL HAMACHER

Abstract-An augmented iterative array for binary division(IAD), is described. It uses carry-save reduction and carry-look-ahead principles to achieve high speed. Logic cost and speed com-parisons with two other design techniques are presented. An 8-bitprototype model that operates in under 500 ns has been built fromcommercially available high-speed MSI TTL integrated circuits toverify the feasibility of the IAD scheme.

Index Terms-Division network, iterative array, nonrestoringdivision.

I. INTRODUCTION

LARGE general-purpose digital computers requirehigh-speed arithmetic units. There are numerousspecial-purpose digital processors, such as fast

Fourier transform machines, that also have this require-ment. High-speed addition [1] and multiplication [2 ]-[4] have been reasonably thoroughly studied. The itera-tive logic cell has been stressed as an important designfactor from a fabrication and diagnostic standpoint.The purpose of this paper is to present the design of

a new high-speed iteratively structured binary divisionarray named IAD (iterative array divider). The logiccost and speed of this design is compared to a basicdivision array similar to those proposed in [5], [6], andto another recent high-speed scheme [7 ]. The construc-tion of an 8-bit prototype divider using standard high-speed MSI TTL integrated circuits is briefly describedin a final section.

II. BASIC DIVISION ARRAYAn iterative array which implements the conventional

nonrestoring division algorithm is presented here. It isessentially the same as arrays proposed by others [5],[6] and is included here as the starting point from whichtwo design changes lead to the much faster IAD pre-sented in Section III.

In binary division, successively right-shifted versionsof the divisor are subtracted from or added to thedividend and resulting partial remainders. The sign ofthe partial remainder determines the quotient bit andfurther, in nonrestoring division, determines whether toadd or subtract the shifted divisor in the next cycle. The

Manuscript received May 8, 1972; revised August 28, 1972. Thiswork was supported in part by the scholarship program and GranitA-5192 of the National Research Council of Canada.

M. Cappa is with Collins Radio Company, Toronto, Ont., Can-ada.

V. C. Hamacher is with the Departments of Electrical Engineer-ing and Computer Science, University of Toronto, Toronto, Ont.,Canada.

basic step of the nonrestoring algorithm can be con-cisely stated with the help of the following notation:

dividend: A = A O . A 1A 1... AN

divisor: D = DO. D1D2. . DN(partial) remainder: R = RO. R1R2. . RN

quotient: Q = QO. Q1Q2 ... QN.

The binary point is assumed to lie between the 0 and1 subscripted components of each vector. The operandsare assumed to be positive, normalized fractions, so thatAo=Do=0 and A1=Di=1. Since 1.<A, D<1, the quo-tient is positive and lies in the range 2 < Q < 2. The (par-tial) remainder R is a signed fraction, and RQ is the signbit with R being represented in 2's complement form.An N+1-bit quotient can then be generated from an

N-bit divisor and dividend by the following (nonrestor-ing) binary division algorithm.DO FOR i= 0 to N.

Step 1:1 Generate new partial remainder: if Q1-l= 1,R*-R-D; if Qi-l=O, R÷-R+D.

Step 2: Quotient bit: set Qi=Ro.Step 3:2 Shift partial remainder: R -2(R).

A two-dimensional array of iteratively structuredlogic can implement the nonrestoring algorithm di-rectly. Basically, the array consists of rows of carry-propagate adders with one controlled input per bit. SeeFig. 1(a) and (b) for an example with a 4-bit divisor andan 8-bit dividend. Each logic cell consists of a full adderand an EXCLUSIVE-OR gate. The EXCLUSIVE-OR gate con-trols the divisor input to the full adder. The control sig-nal S determines whether an add or subtract is to beperformed. Subtraction is performed in 2's complementform by forming the l's complement of the divisor andforcing a carry into the low-order bit position. Theoperation of the array is mostly self-explanatory. The4-bit divisor and double-length 8-bit dividend are intro-duced at the top and right edges of the array. SinceAo=Do=O and Ai=D1=1, the upper left three cellscould have been eliminated provided that all input-out-put connections are sent directly through the originallocations of the cells. A 5-bit quotient is developed at theleft of the array. Note that each quotient bit is passed onto the next row as the control signal S and that this

1For i=0, R=A =dividend and formally Qi =1.2 If a douible-length dividend is provided, its right half is brought

into RN a bit at a time by the applicatioin of Step 3.

172

CAPPA AND HAMACHER: ARRAY FOR BINARY DIVISION

(a)

(b)Fig. 1. Basic division array. (a) Typical cell. (b) Example:

8-bit dividend, 4-bit divisor.

signal is the low-order carry into the rightmost cell. Thereason that the carry out signal from the left cell of eachrow is equal to the quotient bit for that position is a

consequence of the use of 2's complement number repre-

sentation and the nonrestoring algorithm.This is a reasonable operational interpretation of the

use of the array in a floating-point arithmetic unit wherethe input operands are normalized fractions. However,the array can also be interpreted as performing integerdivision in which a double-length dividend (in this case

8 bits) is introduced at the top and right diagonal of thearray. Bit positions A8, D4, and Q4 in Fig. 1(b) then havethe binary weighting of unity, and the divisor should belarger than the left half of the dividend.

III. HIGH-SPEED DIVISION ARRAY (IAD)The technique used in the IAD [8 ] is best understood

as two rather major modifications of the implementa-tion of the nonrestoring division algorithm described inSection II. The basic idea of the design is to eliminatethe carry ripple time, which is proportional to N, alongeach row of the array. First, the partial remainder R isnot actually developed in each row of the array, but isrepresented by two binary vectors S and C which, ifadded, would produce the correct partial remainder atthat row level. Second, a single carry-lookahead subnet-work is used to determine from the S and C vectors whatthe carry into the sign bit would be, facilitating the de-termination of the sign of the resulting partial re-

mainder, the quotient bit for that row level, and the con-

trol (add or subtract divisor) to the next row. The twovectors S and C are the result of a 3-to-2 carry-save re-

duction on the previous row's S and C vectors and theproper version of the shifted divisor (add or subtract).

Subtraction of the divisor is again implemented using2's complement addition as in the basic array. Thecarry-save reduction operation is achieved in only onecell delay, independent of the length of the operands.The speed of the lookahead network, divisor true orcomplement selection, and the single-cell delay forcarry-save reduction are the basic factors determiningthe delay per bit of the quotient Q being developed.The notation introduced in Section II can be used to

describe the IAD technique. Vectors S and C are thesum and carry vectors, respectively, resulting from a3-to-2 carry-save addition reduction. The IAD algo-rithm is then as follows.DO FOR i =O to N.

Step 1:3 Generate the two-vector partial remainder:if Qil=1, S', C'<-S+C-D; if Qi-l=0, S', C'<-S+C+D.

Step 2: Lookahead: CL = G1+P1G2+P1P2G3++P1 ... PPN2GvGT where Gj= S'jC'j (generate func-tion) and Pj = Sj'+ C'j (propagate function).

Step 3: Partial remainder sign: Ro=So(DC'0o(DCLand quotient bit: Qi = Ro.

Step 4:4 Shift partial remainder: S, C-2(S', C').

The general Step 1, formulated above, can be illus-trated by the following operation:

SO .S1S2 ...SN 1AN+iCO . C1C2 ... .0 Qi-1

+ Do. D1D2 ... DN-1DNSO' *S1'S2' .**SN-1SN

Co' Cl'C2'. . . CN-I 0

CL determines lookahead bit

Q= Ro. (1)

This nonrestoring algorithm which uses lookaheadand carry-save addition reduction to speed up thedivision process can be implemented best in an aug-mented array using three types of logic cells and a minoramount of additional logic. The logic cells are illustratedin Fig. 2(a) and the array schematic for the same sizeoperands as in Fig. 1 is shown in Fig. 2(b). The CLA cellincorporates logic to determine the carry into the signposition of the partial remainder. This logic is depen-dent on the word length N, and for long word lengths(N. 10), two levels of lookahead must be used in orderto minimize fan-in. A maximum fan-in of .8 has beenassumed for the analysis of performance in Section IV.

Since the form of the lookahead network CLA variesas a function of wordlength N, it is not strictly techni-cally correct to call IAD an iterative array in the ac-

3For i=O: S=A =dividend, Q-1=1 =CN, and C=O . * 1.I After the shift, the low-order bits of a double-length dividend

can be introduced into the vacated Sv position, one bit at a time, andCN can be set to Qi to accomplish the 2's complementing functionon D.

173

IEEE TRANSACTIONS ON COMPUTERS, FEBRUARY 1973

Dt SCt

,Ce,<C \-IV P S'C' Di

24

° 22 t

Id0

'T 6

< 4-

az 2-

(a)

tn

0

+

0U

/BASICARRAY

IAD

BREAKPOINT

4 8 16 32OPERAND LENGTH (bits)

(a)

(b)Fig. 2. IAD system. (a) The typical cells. (b) Example:

8-bit dividend, 4-bit divisor.

cepted sense of that term which usually means a singlefixed-cell design repeated in a regular pattern to ac-commodate any operand length. That is why we haveadded the modifier "augmented" at the beginning. Theleft-edge sign cells S and the EXCLUSIVE-OR gates areanother departure from strict uiniformity.

It should be noted at this point that simply addinglookahead alone to the basic array of Section II doesnot achieve the same results as the IAD. Much morelookahead circuitry would be needed to generate allpartial remainder bits quickly in each row to achievethe same speed as in the IAD. It is the combination ofcarry-save reduction and sign-bit lookahead thatachieves the high speed at reasonable cost.The S cells are in the sign position of the partial

remainder and compute the binary sum of Do, So, COfrom operation (1), in the FULL ADDER. This sum is thenadded (in the EXCLUSIVE-OR gate) to the carry outputCo' from the 3-to-2 reduction in the adjacent A cell.Finally, the result is added to the carry-lookahead bitCL in the EXCLUSIVE-OR gate at the output of the CLAcell, the answer being Ro, which is Qi for the ith row.The quotient bit Qi is actually derived from the inverteroutput in the S cell of the next row.The A cells which constitute the main body of the

array perform the 3-to-2 carry-save reduction calledfor in Step 1 of the IAD algorithm and illustrated in (1).The functions P (propagate) and G (generate) are alsocomputed in the A cell.

IV. IAD PERFORMANCEIt is important to note that the IAD presented in

Section III results in division times that essentiallyincrease only linearly with word length. The trend is

OPERAND LENGTH (bits)

(b)Fig. 3. Comparison of IAD and basic array. (a) Speed of IAD

versus basic array. (b) Cost of IAD versus basic array.

actually proportional to N log N where the log N termresults from the lookahead function, with the maximumgate fan-in determining the base of the logarithm. Thedivision time in the basic design of Section II increasesproportional to N2 due to the carry propagation alongeach row.The two designs will be compared with respect to

speed and cost for word lengths from 4 bits to 32 bits.From 10 bits up, two levels of lookahead are needed inthe CLA cells of the IAD to maintain the worst casefan-in at 8 inputs. Speed is evaluated in terms of thenumber of gate delays (each gate having an assigneddelay of one unit) and cost in terms of gates plus inputs.All synthesis is evaluated in terms of AND, OR, and NOTgate implementations of the various cell functions. Eventhough this may be only a rough gauge of actual delaywhen integrated circuits are used, it is valid as a com-parative tool when considering alternate designs. Worstcase delays for a full N+1-bit quotient are as follows.Basic array: T=2N2+7N+5, where the delay per bitis 3 for the EXCLUSIVE-OR gate plus 2(N+1) for carryripple; and IAD: T=12(N+1) for N<9, or 14(N+1)for 10<N<32 where the delay per bit is 7 for the A cellplus 2 (or 4) for the CLA cell plus 3 for the EXCLUSIVE-OR gate. It should be noted that a functionally equiva-lent A cell having lower delay but more gates could bedesigned. Similarly, the CLA cell and the EXCLUSIVE-ORgate could be combined to reduce overall delay, again atincreased cost.Approximate speed and cost curves are plotted in

Fig. 3(a) and (b). Note that at N= 16 bits, the IAD isabout 2.6 times faster than the basic array and about22 percent more expensive; while at N= 32 bits, thefigures are five times faster and 25 percent more expen-

174

CAPPA AND HAMACHER: ARRAY FOR BINARY DIVISION

sive. The number of gates and inputs involved foreither array at the larger values of N is clearly onlyfeasible if the arrays are implemented using reasonablylarge-scale integration of sublogic blocks.

It is also instructive to compare the IAD design pro-

cedure with another recent high-speed divider designtechnique reported by Stefanelli [7]. Working with hisbest design, procedure four, it is easy to see that N quo-

tient bits are computed in a delay of approximately

N(rm + 2T,,± r8) + INTbdivider converter

where Tm, Tr,, and rT are the delays through the multi-plier gate, parallel counter, and subtractor of [7, fig. 10]and rb is the delay through each box of the converter of[7, fig. 11]. These circuits are actually ternary, althoughthe external inputs and outputs (divisor, dividend, andquotient) are binary. For comparison to the IAD, it isreasonable to assume at least three units of delay foreach of rTp, T., and rb; this being the delay through a

standard two-level binary synthesis of each function as-

suming binary-coded ternary representation, and as-

suming input complements are not available. One unitof delay is assigned to rm since it corresponds to an

AND-gate function. Therefore, total delay through theStefanelli divider (procedure four) is approximately13N compared to 14N for IAD. A cost comparison be-tween these two dividers is somewhat more difficultbecause no detailed synthesis has been specified in theStefanelli paper. In any event, a reasonably straight-forward line of reasoning leads to the conclusion that a

comparison of the array costs can be approximated bya comparison of the cost of the A cell used in the IADwith the cost of a four-input two-output ternary parallelcounter (Stefanelli) since these two modules are thebasic components in the main body of the respectivearrays. The A cell of the IAD should be substantiallycheaper than any implementation of the Stefanellicounter.

It should also be noted that the size of the Stefanellicounter is actually a slowly varying function of N,which is difficult to determine, and the above size (fourinputs, two outputs) is for the case of N= 12. For largerN, a slightly larger counter would be required.A comment should be made about fan-out problems.

They certainly exist in all three designs discussed in thepreceding, for the larger values of N, but their elimina-tion would affect the cost and speed of each design to

about the same degree, so that the parameters com-

puted above remain a valid indication of the relativemerits of the various design procedures.

V. CONSTRUCTION OF AN 8-BIT DIVIDER

In order to verify the practicality of the IAD designprocedure, a prototype divider was built from com-

mercially available integrated circuits. In the prototype,

a 16-bit dividend is divided by an 8-bit divisor yieldingan 8-bit quotient. The total network is mounted on

eight standard cards, holding a total of 136 IC's, eachcard generating one quotient bit. The IC technologyused was SSI and MSI high-speed TTL with a basicgate delay of about 6 ns. The worst case delay perquotient bit was computed to be 65 ns implying that alleight quotient bits could be generated in about 512 ns.Experimentally, worst case delays of under 500 ns weremeasured. Since we have demonstrated that speed isalmost linear with operand length, it is feasible to con-struct combinational division circuits from currentlyavailable high-speed TTL IC's that operate withinabout a memory cycle (1 ,us) in a 12- or 16-bit computer,assuming that a word-length accuracy quotient is gen-erated. Whether or not the cost can be justified clearlydepends on the application.

REFERENCES[1] 1. Flores, The Logic of Computer Arithmetic. Englewood Cliffs,

N. J.: Prentice-Hall, 1963.[2] F. C. Hennie, Finite-State Models for Logical Machines. New

York: Wiley, 1968, ch. 10.[3] A. Habibi and P. A. Wintz, "Fast multipliers," IEEE Trans.

Comput. (Short Notes), vol. C-19, pp. 153-157, Feb. 1970.[4] S. D. Pezaris, "A 40-ns 17-bit by 17-bit array multiplier," IEEE

Trans. Comput. (Short Notes), vol. C-20, pp. 442-447, Apr. 1971.[5] J. C. Majithia, "Nonrestoring binary division using a cellular

array," Electron. Lett., vol. 6, pp. 303-304, 1970.[6] H. H. Guild, "Some cellular logic arrays for non-restoring binary

division," Radio Electron. Eng., vol. 39, pp. 345-348, 1970.[7] R. Stefanelli, "A suggestion for a high-speed parallel binary

divider," IEEE Trans. Comput., vol. C-21, pp. 42-55, Jan. 1972.[8] M. Cappa, "Cellular iterative arrays for multiplication and divi-

sion, " M.S. thesis, Dep. Elec. Eng., Univ. Toronto, Ont., Canada,Oct. 1971.

Maurus Cappa was born in Barile, Italy, onDecember 22, 1947. He received the B.A.Sc.and the M.A.Sc. degrees in electrical engineer-ing from the University of Toronto, Toronto,Ont., Canada, in 1970 and 1971, respectively.His gradutate work in the Comptuter Group ofElectrical Engineering at the University ofToronto was sponsored by a National Re-search Council Scholarship.

He is currently with Collins Radio Com-pany, Toronto, Ont., Canada, where he is in-

volved in the design of frequency synthesizers for use in militarytransceivers.

V. Carl Hamacher (S'66-M'68) was born inLondon, Ont., Canada, on September 28,1939. He received the B.A.Sc. degree in engi-oneering physics from the University of Water-loo, Waterloo, Ont., Canada, in 1963, the

g M.Sc. degree in electrical engineering fromQueen's University, Kingston, Ont., Canada,in 1965, and the Ph.D. degree in electricalengineering from Syracuse University, Syra-cuse, N. Y., in 1968.

He was appointed Assistant Professor inboth Electrical Engineering and Computer Science at the Universityof Toronto, Toronto, Ont., Canada, in 1968, and is currenitly an As-sociate Professor in those departments. His research interests includeautomata studies, iterative array structures. and computer organiza-tion.

Dr. Hamacher is a member of the Association for ComputingMachinery and Sigma Xi.

175

Documents

IEEE Carl Hamacher