
BIT-SERIAL MULTIPLIER USING VERILOG HDL

A

Mini Project Report

Submitted in the Partial Fulfillment of the

Requirements

for the Award of the Degree of

BACHELOR OF TECHNOLOGY

IN

ELECTRONICS AND COMMUNICATION ENGINEERING

Submitted

By

K.BHARGAV 11885A0401

P.DEVSINGH 11885A0404

Under the Guidance of

Mr. S. RAJENDAR

Associate Professor

Department of ECE

Department of Electronics and Communication Engineering

VARDHAMAN COLLEGE OF ENGINEERING (AUTONOMOUS)

(Approved by AICTE, Affiliated to JNTUH & Accredited by NBA)

2013-14


VARDHAMAN COLLEGE OF ENGINEERING (AUTONOMOUS)

Estd. 1999, Shamshabad, Hyderabad – 501218

Kacharam (V), Shamshabad (M), Ranga Reddy (Dist.) – 501 218, Hyderabad, A.P. Ph: 08413-253335, 253201, Fax: 08413-253482, www.vardhaman.org

Department of Electronics and Communication Engineering

CERTIFICATE

This is to certify that the mini project work entitled “Bit-Serial Multiplier Using Verilog HDL”, carried out by Mr. K. Bhargav, Roll Number 11885A0401, and Mr. P. Devsingh, Roll Number 11885A0404, is submitted to the Department of Electronics and Communication Engineering in partial fulfillment of the requirements for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering during the year 2013-14.

Name & Signature of the Supervisor

Mr. S. Rajendar

Associate Professor

Name & Signature of the HOD

Dr. J. V. R. Ravindra

Head, ECE


ACKNOWLEDGEMENTS

The satisfaction that accompanies the successful completion of any task would be incomplete without the mention of the people who made it possible, whose constant guidance and encouragement crown all efforts with success.

I express my heartfelt thanks to Mr. S. Rajendar, Associate Professor and mini project supervisor, for his suggestions in selecting and carrying out the in-depth study of the topic. His valuable guidance, encouragement and critical reviews really helped to shape this report to perfection.

I wish to express my deep sense of gratitude to Dr. J. V. R. Ravindra, Head of the Department, for his able guidance and useful suggestions, which helped me in completing the mini project on time.

I also owe my special thanks to our Director, Prof. L. V. N. Prasad, for his constant support and encouragement, and for having provided all the facilities.

Finally, thanks to all my family members and friends for their continuous support and enthusiastic help.

K.Bhargav 11885A0401

P.Devsingh 11885A0404


ABSTRACT

Bit-serial arithmetic is attractive in view of its smaller pin count, reduced wire length, and lower floor space requirements in VLSI. In fact, the compactness of the design may allow us to run a bit-serial multiplier at a clock rate high enough to make the unit almost competitive with much more complex designs with regard to speed. In addition, in certain application contexts inputs are supplied bit-serially anyway. In such a case, using a parallel multiplier would be quite wasteful, since the parallelism may not lead to any speed benefit. Furthermore, in applications that call for a large number of independent multiplications, multiple bit-serial multipliers may be more cost-effective than a complex highly pipelined unit.

Bit-serial multipliers can be designed as systolic arrays: synchronous arrays of processing elements that are interconnected by only short, local wires, thus allowing very high clock rates. This report begins by introducing a semisystolic multiplier, so named because its design involves broadcasting a single bit of the multiplier x to a number of circuit elements, thus violating the “short, local wires” requirement of pure systolic design.


CONTENTS

Acknowledgements
Abstract
List of Figures

1 INTRODUCTION
1.1 The Context of Computer Arithmetic
1.2 What is computer arithmetic?
1.3 Multiplication
1.4 Organization of the report

2 VLSI
2.1 Introduction
2.2 What is VLSI?
2.2.1 History of Scale Integration
2.3 Advantages of ICs over discrete components
2.4 VLSI and Systems
2.5 Applications of VLSI
2.6 Conclusion

3 VERILOG HDL
3.1 Introduction
3.2 Major Capabilities
3.3 Synthesis
3.4 Conclusion

4 BIT-SERIAL MULTIPLIER
4.1 Multiplier
4.2 Background
4.2.1 Binary Multiplication
4.2.2 Hardware Multipliers
4.2.3 Array Multipliers
4.3 Variations in Multipliers
4.4 Bit-serial Multipliers

5 IMPLEMENTATION
5.1 Tools Used
5.2 Coding Steps
5.3 Simulation steps
5.4 Full adder code
5.5 Full adder flowchart
5.6 Full adder testbench
5.7 Bit-serial multiplier algorithm
5.8 Bit-serial multiplier code
5.9 Full adder waveform
5.10 Bit-serial multiplier testbench
5.11 Bit-serial multiplier waveforms

6 CONCLUSIONS

REFERENCES


LIST OF FIGURES

3.1 Mixed-level modeling
3.2 Synthesis process
3.3 Typical design process
4.1 Basic multiplication data flow
4.2 Two rows of an array multiplier
4.3 Data flow through a pipelined array multiplier
4.4 Bit-serial multiplier; 4x4 multiplication in 8 clock cycles
4.5 Bit-serial multiplier design in dot notation
5.1 Project directory structure
5.2 Simulation window
5.3 Waveform window
5.4 Full adder flowchart
5.5 Bit-serial multiplier flowchart
5.6 Full adder output waveforms
5.7 Bit-serial multiplier input/output waveforms
5.8 Bit-serial multiplier with intermediate waveforms


CHAPTER 1

INTRODUCTION

1.1 The Context of Computer Arithmetic

Advances in computer architecture over the past two decades have allowed the

performance of digital computer hardware to continue its exponential growth, despite

increasing technological difficulty in speed improvement at the circuit level. This

phenomenal rate of growth, which is expected to continue in the near future, would not

have been possible without theoretical insights, experimental research, and tool-building

efforts that have helped transform computer architecture from an art into one of the most

quantitative branches of computer science and engineering. Better understanding of the

various forms of concurrency and the development of a reasonably efficient and user-friendly programming model have been key enablers of this success story.

The downside of exponentially rising processor performance is an unprecedented

increase in hardware and software complexity. The trend toward greater complexity is not

only at odds with testability and verifiability but also hampers adaptability, performance

tuning, and evaluation of the various trade-offs, all of which contribute to soaring

development costs. A key challenge facing current and future computer designers is to

reverse this trend by removing layer after layer of complexity, opting instead for clean,

robust, and easily certifiable designs, while continuing to try to devise novel methods for

gaining performance and ease-of-use benefits from simpler circuits that can be readily

adapted to application requirements.

In the computer designers’ quest for user-friendliness, compactness, simplicity,

high performance, low cost, and low power, computer arithmetic plays a key role. It is

one of the oldest subfields of computer architecture. The bulk of hardware in early digital
computers resided in the accumulator and other arithmetic/logic circuits. Thus, first-

generation computer designers were motivated to simplify and share hardware to the

extent possible and to carry out detailed cost-performance analyses before proposing a

design. Many of the ingenious design methods that we use today have their roots in the

bulky, power-hungry machines of 30-50 years ago.

In fact, computer arithmetic has been so successful that it has, at times, become

transparent. Arithmetic circuits are no longer dominant in terms of complexity; registers,

memory and memory management, instruction issue logic, and pipeline control have

become the dominant consumers of chip area in today’s processors. Correctness and high performance of arithmetic circuits are routinely expected, and episodes such as the Intel


Pentium division bug are indeed rare.

The preceding context is changing for several reasons. First, at very high clock

rates, the interfaces between arithmetic circuits and the rest of the processor become

critical. Arithmetic units can no longer be designed and verified in isolation. Rather, an

integrated design optimization is required, which makes the development even more

complex and costly. Second, optimizing arithmetic circuits to meet design goals by taking

advantage of the strengths of new technologies, and making them tolerant to the

weaknesses, requires a reexamination of existing design paradigms. Finally, incorporation

of higher-level arithmetic primitives into hardware makes the design, optimization, and

verification efforts highly complex and interrelated.

This is why computer arithmetic is alive and well today. Designers and

researchers in this area produce novel structures with amazing regularity. Carry-lookahead adders are a case in point. We used to think, in the not so distant past,

that we knew all there was to know about carry-lookahead fast adders. Yet, new designs,

improvements, and optimizations are still appearing. The ANSI/IEEE standard floating-

point format has removed many of the concerns with compatibility and error control in

floating-point computations, thus resulting in new designs and products with mass-market

appeal. Given the arithmetic-intensive nature of many novel application areas (such as

encryption, error checking, and multimedia), computer arithmetic will continue to thrive

for years to come.

1.2 What is computer arithmetic?

A sequence of events, begun in late 1994 and extending into 1995, embarrassed

the world’s largest computer chip manufacturer and put the normally dry subject of

computer arithmetic on the front pages of major newspapers. The events were rooted in

the work of Thomas Nicely, a mathematician at the Lynchburg College in Virginia, who

is interested in twin primes (consecutive odd numbers such as 29 and 31 that are both

prime). Nicely’s work involves the distribution of twin primes and, particularly, the sum

of their reciprocals S = 1/5 + 1/7 + 1/11 + 1/13 + 1/17 + 1/19 + 1/29 + 1/31 + ··· + 1/p + 1/(p + 2) + ···. While it is known that the infinite sum S has a finite value, no one knows what the value is.

Nicely was using several different computers for his work and in March 1994

added a machine based on the Intel Pentium processor to his collection. Soon he began

noticing inconsistencies in his calculations and was able to trace them back to the values

computed for 1 / p and 1 / (p + 2) on the Pentium processor. At first, he suspected his own

programs, the compiler, and the operating system, but by October, he became convinced


that the Intel Pentium chip was at fault. This suspicion was confirmed by several other

researchers following a barrage of e-mail exchanges and postings on the Internet. The

diagnosis finally came from Tim Coe, an engineer at Vitesse Semiconductor. Coe built a

model of Pentium’s floating-point division hardware based on the radix-4 SRT algorithm

and came up with an example that produces the worst-case error. Using double-precision floating-point computation, the ratio c = 4,195,835 / 3,145,727 = 1.333 820 44… is computed as 1.333 739 06 on the Pentium. This latter result is accurate to only 14 bits; the error is even larger than that of single-precision floating-point arithmetic and more than 10 orders of magnitude worse than what is expected of double-precision computation.

The rest, as they say, is history. Intel at first dismissed the severity of the problem

and admitted only a “subtle flaw,” with a probability of 1 in 9 billion, or once in 27,000

years for the average spreadsheet user, of leading to computational errors. It nevertheless

published a “white paper” that described the bug and its potential consequences and

announced a replacement policy for the defective chips based on “customer need”; that is,

customers had to show that they were doing a lot of mathematical calculations to get a

free replacement. Under heavy criticism from customers, manufacturers using the

Pentium chip in their products, and the on-line community, Intel later revised its policy to

no-questions-asked replacement.

Whereas supercomputing, microchips, computer networks, advanced applications

(particularly chess-playing programs), and many other aspects of computer technology

have made the news regularly in recent years, the Intel Pentium bug was the first instance

of arithmetic (or anything inside the CPU for that matter) becoming front-page news.

While this can be interpreted as a sign of pedantic dryness, it is more likely an indicator

of stunning technological success. Glaring software failures have come to be routine

events in our information-based society, but hardware bugs are rare and newsworthy.

Within the hardware realm, we will be dealing with both general-purpose

arithmetic/logic units (ALUs), of the type found in many commercially available

processors, and special-purpose structures for solving specific application problems. The

differences in the two areas are minor as far as the arithmetic algorithms are concerned.

However, in view of the specific technological constraints, production volumes, and

performance criteria, hardware implementations tend to be quite different. General-

purpose processor chips that are mass-produced have highly optimized custom designs.

Implementations of low-volume, special-purpose systems, on the other hand, typically

rely on semicustom and off-the-shelf components. However, when critical and strict

requirements, such as extreme speed, very low power consumption, and miniature size,


preclude the use of semicustom or off-the-shelf components, the much higher cost of a

custom design may be justified even for a special-purpose system.

1.3 Multiplication

Multiplication (often denoted by the cross symbol “×”, or by the absence of a symbol) is the third basic mathematical operation of arithmetic, the others being addition, subtraction and division (division is counted fourth because it requires multiplication to be defined). The multiplication of two whole numbers is equivalent to adding one of them to itself as many times as the value of the other one; for example, 3 multiplied by 4 (often said as “3 times 4”) can be calculated by adding 4 copies of 3 together: 3 × 4 = 3 + 3 + 3 + 3 = 12. Here 3 and 4 are the “factors” and 12 is the “product”. One of the main properties of multiplication is that the result does not depend on which factor is repeatedly added (the commutative property); 3 multiplied by 4 can also be calculated by adding 3 copies of 4 together: 3 × 4 = 4 + 4 + 4 = 12. The multiplication of integers (including negative numbers), rational numbers (fractions) and real numbers is defined by a systematic generalization of this basic definition.

Multiplication can also be visualized as counting objects arranged in a rectangle (for whole numbers) or as finding the area of a rectangle whose sides have given lengths. The area of a rectangle does not depend on which side is measured first, which again illustrates the commutative property. In general, multiplying two measurements gives a new type of quantity, depending on the measurements. For instance, 2.5 meters × 4.5 meters = 11.25 square meters, and 11 meters/second × 9 seconds = 99 meters. The inverse operation of multiplication is division. For example, since 4 multiplied by 3 equals 12, then 12 divided by 3 equals 4. Multiplication by 3, followed by division by 3, yields the original number (since the division of a number other than 0 by itself equals 1). Multiplication is also defined for other types of numbers, such as complex numbers, and for more abstract constructs, like matrices. For these more abstract constructs, the order in which the operands are multiplied sometimes does matter.

Multiplication, often realized by k cycles of shifting and adding, is a heavily used arithmetic operation that figures prominently in signal processing and scientific applications. In this part, after examining shift/add multiplication schemes and their various implementations, we note that there are but two ways to speed up the underlying multioperand addition: reducing the number of operands to be added leads to high-radix multipliers, and devising hardware multioperand adders that minimize the latency and/or maximize the throughput leads to tree and array multipliers. Of course, speed is not the only criterion of interest. Cost, VLSI area, and pin limitations favor bit-serial designs, while the desire to use available building blocks leads to designs based on additive multiply modules. Finally, the special case of squaring is of interest as it leads to considerable simplification.
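The following behavioral Verilog sketch illustrates the k-cycle shift/add scheme just described; it is only an illustration (the module and all signal names are invented here, and it is not the bit-serial design implemented in Chapter 5):

module shift_add_mult #(parameter WIDTH = 4) (
  input clk, start,                 // start is a one-cycle load pulse
  input [WIDTH-1:0] a, x,           // multiplicand and multiplier
  output reg [2*WIDTH-1:0] p,       // product accumulator
  output reg done
);
  reg [WIDTH-1:0] xr;               // remaining multiplier bits
  integer cnt;                      // cycle counter

  always @(posedge clk) begin
    if (start) begin                // load operands, clear product
      p <= 0; xr <= x; cnt <= 0; done <= 0;
    end else if (!done) begin
      if (xr[0])                    // add a shifted copy of the
        p <= p + (a << cnt);        // multiplicand for each 1 bit
      xr <= xr >> 1;                // consume one multiplier bit
      cnt <= cnt + 1;
      if (cnt == WIDTH-1) done <= 1;
    end
  end
endmodule

A 4 × 4 multiplication thus takes four add/shift cycles; the high-radix, tree, and array multipliers mentioned above are all ways of speeding up this underlying multioperand addition.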

1.4 Organization of the report

This report starts with an introduction to computer arithmetic and then introduces multiplication. It then explains the implementation of one such design, the bit-serial multiplier.

Chapter 1: Introduction – This chapter explains the importance of computer arithmetic and multiplication in computations.

Chapter 2: VLSI – This chapter focuses on VLSI and its evolution, along with its applications and advantages.

Chapter 3: Verilog HDL – This chapter explains how HDLs reduce the design cycle in VLSI and how automation enables faster implementation.

Chapter 4: Bit-serial multiplier – This chapter describes multipliers and their types, and explains why the bit-serial multiplier is useful.

Chapter 5: Implementation – This chapter explains the implementation flow of the bit-serial multiplier, its Verilog code, and the output waveforms.

Chapter 6: Conclusions – This chapter summarizes the bit-serial multiplier and its future improvements.


CHAPTER 2

VLSI

2.1 Introduction

Very-large-scale integration (VLSI) is the process of creating integrated

circuits by combining thousands of transistor-based circuits into a single chip. VLSI

began in the 1970s when complex semiconductor and communication technologies

were being developed. The microprocessor is a VLSI device. The term is no longer as

common as it once was, as chips have increased in complexity into the hundreds of

millions of transistors.

The first semiconductor chips held one transistor each. Subsequent advances

added more and more transistors, and, as a consequence, more individual functions or

systems were integrated over time. The first integrated circuits held only a few

devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it

possible to fabricate one or more logic gates on a single device. Now known

retrospectively as "small-scale integration" (SSI), improvements in technique led to

devices with hundreds of logic gates, known as large-scale integration (LSI), i.e.

systems with at least a thousand logic gates. Current technology has moved far past

this mark and today's microprocessors have many millions of gates and hundreds of

millions of individual transistors.

At one time, there was an effort to name and calibrate various levels of large-

scale integration above VLSI. Terms like Ultra-large-scale Integration (ULSI) were

used. But the huge number of gates and transistors available on common devices has

rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of

integration are no longer in widespread use. Even VLSI is now somewhat quaint,

given the common assumption that all microprocessors are VLSI or better.

As of early 2008, billion-transistor processors are commercially available, an

example of which is Intel's Montecito Itanium chip. This is expected to become more

commonplace as semiconductor fabrication moves from the current generation of 65 nm

processes to the next 45 nm generations (while experiencing new challenges such as

increased variation across process corners).

This microprocessor is unique in that its 1.4 billion transistors,
capable of a teraflop of performance, are almost entirely dedicated to logic (Itanium's

transistor count is largely due to the 24MB L3 cache). Current designs, as opposed to

the earliest devices, use extensive design automation and automated logic synthesis to


lay out the transistors, enabling higher levels of complexity in the resulting logic

functionality. Certain high-performance logic blocks like the SRAM cell, however, are

still designed by hand to ensure the highest efficiency (sometimes by bending or breaking established design rules to obtain the last bit of performance by trading off stability).

2.2 What is VLSI?

VLSI stands for "Very Large Scale Integration". This is the field which involves

packing more and more logic devices into smaller and smaller areas.

• Simply put, an integrated circuit is many transistors on one chip.

• Design/manufacturing of extremely small, complex circuitry using modified semiconductor material.

• An integrated circuit (IC) may contain millions of transistors, each a few micrometers in size.

• Applications are wide-ranging: most electronic logic devices.

2.2.1 History of Scale Integration

• late 40s: Transistor invented at Bell Labs

• late 50s: First IC (JK-FF by Jack Kilby at TI)

• early 60s: Small Scale Integration (SSI) – 10s of transistors on a chip

• late 60s: Medium Scale Integration (MSI) – 100s of transistors on a chip

• early 70s: Large Scale Integration (LSI) – 1000s of transistors on a chip

• early 80s: VLSI – 10,000s of transistors on a chip (later 100,000s and now 1,000,000s)

• Ultra LSI (ULSI) is sometimes used for 1,000,000s

2.3 Advantages of ICs over discrete components

While we will concentrate on integrated circuits, the properties of integrated circuits (what we can and cannot efficiently put in an integrated circuit) largely determine the architecture of the entire system. Integrated circuits improve system characteristics in several critical ways. ICs have three key advantages over digital circuits built from discrete components:

Size: Integrated circuits are much smaller; both transistors and wires are shrunk to micrometer sizes, compared to the millimeter or centimeter scales of discrete


components. Small size leads to advantages in speed and power consumption, since

smaller components have smaller parasitic resistances, capacitances, and inductances.

Speed: Signals can be switched between logic 0 and logic 1 much quicker within a chip

than they can between chips. Communication within a chip can occur hundreds of

times faster than communication between chips on a printed circuit board. The high speed of circuits on-chip is due to their small size; smaller components and wires have smaller parasitic capacitances to slow down the signal.

Power consumption: Logic operations within a chip also take much less power. Once

again, lower power consumption is largely due to the small size of circuits on the chip; smaller parasitic capacitances and resistances require less power to drive them.

2.4 VLSI and Systems

These advantages of integrated circuits translate into advantages at the system

level:

Smaller physical size: Smallness is often an advantage in itself; consider portable

televisions or handheld cellular telephones.

Lower power consumption: Replacing a handful of standard parts with a single chip

reduces total power consumption. Reducing power consumption has a ripple effect on

the rest of the system: a smaller, cheaper power supply can be used; since less power

consumption means less heat, a fan may no longer be necessary; and a simpler cabinet with less electromagnetic shielding may be feasible, too.

Reduced cost: Reducing the number of components, the power supply requirements,

cabinet costs, and so on, will inevitably reduce system cost. The ripple effect of

integration is such that the cost of a system built from custom ICs can be less, even

though the individual ICs cost more than the standard parts they replace.


Understanding why integrated circuit technology has such profound influence

on the design of digital systems requires understanding both the technology of IC

manufacturing and the economics of ICs and digital systems.

2.5 Applications of VLSI

Electronic systems now perform a wide variety of tasks in daily life. Electronic

systems in some cases have replaced mechanisms that operated mechanically,

hydraulically, or by other means; electronics are usually smaller, more flexible, and

easier to service. In other cases electronic systems have created totally new applications.


Electronic systems perform a variety of tasks, some of them visible, some more hidden.

Electronic systems in cars operate stereo systems and displays; they also control fuel

injection systems, adjust suspensions to varying terrain, and perform the control

functions required for anti-lock braking (ABS) systems.

• Digital electronics compress and decompress video, even at high-definition data

rates, on-the-fly in consumer electronics.

• Low-cost terminals for Web browsing still require sophisticated electronics,

despite their dedicated function.

• Personal computers and workstations provide word-processing, financial analysis,

and games. Computers include both central processing units (CPUs) and special-

purpose hardware for disk access, faster screen display, etc.

• Medical electronic systems measure bodily functions and perform complex

processing algorithms to warn about unusual conditions. The availability of these

complex systems, far from overwhelming consumers, only creates demand for

even more complex systems.

2.6 Conclusion

The growing sophistication of applications continually pushes the design and

manufacturing of integrated circuits and electronic systems to new levels of complexity.

And perhaps the most amazing characteristic of this collection of systems is its variety: as systems become more complex, we build not a few general-purpose computers but

an ever wider range of special-purpose systems. Our ability to do so is a testament to

our growing mastery of both integrated circuit manufacturing and design, but the

increasing demands of customers continue to test the limits of design and

manufacturing.


CHAPTER 3

VERILOG HDL

3.1 Introduction

Verilog HDL is a hardware description language that can be used to model a

digital system at many levels of abstraction ranging from the algorithmic-level to the

gate-level to the switch-level. The complexity of the digital system being modeled

could vary from that of a simple gate to a complete electronic digital system, or

anything in between. The digital system can be described hierarchically and timing

can be explicitly modeled within the same description.

The Verilog HDL language includes capabilities to describe the behavioral

nature of a design, the dataflow nature of a design, a design's structural composition,

delays and a waveform generation mechanism including aspects of response monitoring

and verification, all modeled using one single language. In addition, the language

provides a programming language interface through which the internals of a design can

be accessed during simulation including the control of a simulation run.

The language not only defines the syntax but also defines very clear simulation

semantics for each language construct. Therefore, models written in this language

can be verified using a Verilog simulator. The language inherits many of its operator

symbols and constructs from the C programming language. Verilog HDL provides an

extensive range of modeling capabilities, some of which are quite difficult to

comprehend initially. However, a core subset of the language is quite easy to learn and

use. This is sufficient to model most applications.

The Verilog HDL language was first developed by Gateway Design Automation in 1983 as a hardware modeling language for their simulator product. At that time, it was a proprietary language. Because of the popularity of the simulator product, Verilog HDL gained acceptance as a usable and practical language by a number of designers. In an effort to increase the popularity of the language, it was placed in the public domain in 1990.

Open Verilog International (OVI) was formed to promote Verilog. In 1992 OVI

decided to pursue standardization of Verilog HDL as an IEEE standard. This effort was

successful and the language became an IEEE standard in 1995. The complete standard is

described in the Verilog Hardware Description Language Reference Manual. The standard is called IEEE Std 1364-1995.


3.2 Major Capabilities

Listed below are the major capabilities of the Verilog hardware description language:

• Primitive logic gates, such as and, or and nand, are built into the language.

• Flexibility of creating a user-defined primitive (UDP). Such a primitive could

either be a combinational logic primitive or a sequential logic primitive.

• Switch-level modeling primitives, such as pmos and nmos, are also built into the language.

• A design can be modeled in three different styles or in a mixed style. These styles are: behavioral style, modeled using procedural constructs; dataflow style, modeled using continuous assignments; and structural style, modeled using gate and module instantiations.

• There are two data types in Verilog HDL: the net data type and the register data type. The net type represents a physical connection between structural elements, while a register type represents an abstract data storage element.

• Figure 3.1 shows the mixed-level modeling capability of Verilog HDL; that is, in one design, each module may be modeled at a different level.

Figure 3.1: Mixed-level modeling

• Verilog HDL also has built-in logic operators such as & (bitwise AND) and | (bitwise OR).

• High-level programming language constructs such as conditionals, case

statements, and loops are available in the language.

• Notions of concurrency and time can be explicitly modeled.

• Powerful file read and write capabilities are provided.

• The language is non-deterministic under certain situations, that is, a model may

produce different results on different simulators; for example, the ordering of

events on an event queue is not defined by the standard.
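As a small illustration of the three modeling styles listed above (a sketch with invented module names, not drawn from the project code), the same 2-to-1 multiplexer can be described behaviorally, in dataflow style, and structurally:

// Behavioral style: procedural construct inside an always block
module mux2_beh(output reg y, input a, b, sel);
  always @(a, b, sel)
    if (sel) y = b;
    else     y = a;
endmodule

// Dataflow style: continuous assignment
module mux2_df(output y, input a, b, sel);
  assign y = sel ? b : a;
endmodule

// Structural style: built-in gate instantiations
module mux2_str(output y, input a, b, sel);
  wire nsel, w0, w1;
  not g0(nsel, sel);
  and g1(w0, a, nsel);
  and g2(w1, b, sel);
  or  g3(y, w0, w1);
endmodule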


3.3 Synthesis

Synthesis is the process of constructing a gate-level netlist from a register-transfer-level model of a circuit described in Verilog HDL. Figure 3.2 shows such a process. A synthesis system may, as an intermediate step, generate a netlist that is

comprised of register-transfer level blocks such as flip-flops, arithmetic-logic-units,

and multiplexers, interconnected by wires. In such a case, a second program called the

RTL module builder is necessary. The purpose of this builder is to build, or acquire

from a library of predefined components, each of the required RTL blocks in the user-

specified target technology.

Figure 3.2: Synthesis process

The above figure shows the basic elements of Verilog HDL and the elements

used in hardware. A mapping mechanism or a construction mechanism has to be

provided that translates the Verilog HDL elements into their corresponding hardware

elements, as shown in Figure 3.3.
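As a rough sketch of the kind of register-transfer-level description such a synthesis system handles (the accumulator below is invented for illustration), an if/else inside a clocked always block maps naturally onto an RTL netlist of a multiplexer, an adder, and a bank of flip-flops:

module accum(output reg [7:0] q, input [7:0] d, input load, clk);
  // The if/else becomes a multiplexer, the + an adder, and the
  // clocked assignment a bank of eight flip-flops.
  always @(posedge clk)
    if (load)
      q <= d;       // mux selects the load value
    else
      q <= q + d;   // accumulate into the register
endmodule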

3.4 Conclusion

The Verilog HDL language includes capabilities to describe the behavioral nature of a design, the dataflow nature of a design, a design's structural composition, delays, and a waveform generation mechanism including aspects of response monitoring and verification, all modeled using one single language. The language not only defines the syntax but also defines very clear simulation semantics for each language construct. Therefore, models written in this language can be verified using a Verilog simulator.


Figure 3.3: Typical design process


CHAPTER 4

BIT-SERIAL MULTIPLIER

4.1 Multiplier

Multipliers are key components of many high performance systems such as FIR

filters, microprocessors, digital signal processors, etc. A system’s performance is

generally determined by the performance of the multiplier, because the multiplier is generally the slowest element in the system. Furthermore, it is generally the most area-

consuming. Hence, optimizing the speed and area of the multiplier is a major design

issue. However, area and speed are usually conflicting constraints, so improving speed results mostly in larger areas. As a result, a whole spectrum of multipliers with different area-speed trade-offs has been designed, ranging from fully serial to fully parallel processing. In between are digit-serial multipliers, where single digits consisting of several bits are operated on.

These multipliers have moderate performance in both speed and area. However, existing

digit-serial multipliers have been plagued by complicated switching systems and/or

irregularities in design. Radix-2^n multipliers, which operate on digits in a parallel fashion instead of bits, bring the pipelining to the digit level and avoid most of the above

problems. They were introduced by M. K. Ibrahim in 1993. These structures are iterative

and modular. The pipelining done at the digit level brings the benefit of constant

operation speed irrespective of the size of the multiplier. The clock speed is only determined by the digit size, which is already fixed before the design is implemented.

The growing market for fast floating-point co-processors, digital signal processing

chips, and graphics processors has created a demand for high speed, area-efficient

multipliers. Current architectures range from small, low-performance shift and add

multipliers, to large, high performance array and tree multipliers. Conventional linear

array multipliers achieve high performance in a regular structure, but require large

amounts of silicon. Tree structures achieve even higher performance than linear arrays

but the tree interconnection is more complex and less regular, making them even larger

than linear arrays. Ideally, one would want the speed benefits of a tree structure, the

regularity of an array multiplier, and the small size of a shift and add multiplier.

4.2 Background

Webster’s dictionary defines multiplication as “a mathematical operation that at

its simplest is an abbreviated process of adding an integer to itself a specified number of

times”. A number (multiplicand) is added to itself a number of times as specified by


another number (multiplier) to form a result (product). In elementary school, students

learn to multiply by placing the multiplicand on top of the multiplier. The multiplicand is

then multiplied by each digit of the multiplier beginning with the rightmost, Least

Significant Digit (LSD). Intermediate results (partial-products) are placed one atop the

other, offset by one digit to align digits of the same weight. The final product is

determined by summation of all the partial-products. Although most people think of

multiplication only in base 10, this technique applies equally to any base, including

binary. Figure 4.1 shows the data flow for the basic multiplication technique just described. Each black dot represents a single digit.

Figure 4.1: Basic Multiplication Data flow

4.2.1 Binary Multiplication

In the binary number system the digits, called bits, are limited to the set {0, 1}. The result of multiplying any binary number by a single binary bit is either 0 or the original number. This makes forming the intermediate partial-products simple and efficient. Summing these partial-products is the time-consuming task for binary multipliers. One

logical approach is to form the partial-products one at a time and sum them as they are

generated. Often implemented by software on processors that do not have a hardware

multiplier, this technique works fine, but is slow because at least one machine cycle is

required to sum each additional partial-product.

For applications where this approach does not provide enough performance,

multipliers can be implemented directly in hardware.
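As a worked illustration (using the same operand values as the testbench in Chapter 5), multiplying a = 1101 (13) by x = 1001 (9) forms one partial product per multiplier bit, each offset one position to the left of the previous one:

        1 1 0 1          multiplicand a = 13
      x 1 0 0 1          multiplier x = 9
      ---------
        1 1 0 1          bit x0 = 1: copy of a
      0 0 0 0            bit x1 = 0: zero
    0 0 0 0              bit x2 = 0: zero
  1 1 0 1                bit x3 = 1: copy of a
  -------------
  1 1 1 0 1 0 1          product = 117

Summing the four partial products gives 1110101 (117), the result that the bit-serial design of Chapter 4 produces one bit at a time.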

4.2.2 Hardware Multipliers

Direct hardware implementations of shift and add multipliers can increase

performance over software synthesis, but are still quite slow. The reason is that as each

additional partial-product is summed, a carry must be propagated from the least

significant bit (LSB) to the most significant bit (MSB). This carry propagation is time


consuming, and must be repeated for each partial product to be summed.

One method to increase multiplier performance is by using encoding techniques to

reduce the number of partial products to be summed. Just such a technique was first

proposed by Booth. The original Booth's algorithm skips over contiguous strings of 1's by using the property that 2^n + 2^(n-1) + 2^(n-2) + … + 2^(n-m) = 2^(n+1) - 2^(n-m). Although Booth's algorithm produces at most N/2 encoded partial products from an N-bit operand,

the number of partial products produced varies. This has caused designers to use modified

versions of Booth’s algorithm for hardware multipliers. Modified 2-bit Booth encoding

halves the number of partial products to be summed.

Since the resulting encoded partial-products can then be summed using any suitable method, modified 2-bit Booth encoding is used on most modern floating-point chips [LU88, MCA86]. A few designers have even turned to modified 3-bit Booth encoding, which reduces the number of partial products to be summed by a factor of three [BEN89]. The problem with 3-bit encoding is that the carry-propagate addition required to form the 3X multiples often overshadows the potential gains of 3-bit Booth encoding.
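As a quick illustration of this property, the string of 1's in 00111 (= 4 + 2 + 1 = 7) can be recoded as 01000 - 00001 (= 8 - 1), replacing three additions by one addition and one subtraction.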

To achieve even higher performance advanced hardware multiplier architectures

search for faster and more efficient methods for summing the partial-products. Most

increase performance by eliminating the time consuming carry propagate additions. To

accomplish this, they sum the partial-products in a redundant number representation. The

advantage of a redundant representation is that two numbers, or partial-products, can be

added together without propagating a carry across the entire width of the number. Many

redundant number representations are possible. One commonly used representation is

known as carry-save form. In this redundant representation two bits, known as the carry

and sum, are used to represent each bit position. When two numbers in carry-save form

are added together any carries that result are never propagated more than one bit position.

This makes adding two numbers in carry-save form much faster than adding two normal

binary numbers where a carry may propagate. One common method that has been

developed for summing rows of partial products using a carry-save representation is the

array multiplier.
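To make the carry-save idea concrete, the following short Verilog sketch (illustrative only, not part of the project code; the module and signal names are invented) implements one row of such 3:2 compression, producing a sum word and a carry word in which each carry moves only one bit position to the left:

module csa #(parameter N = 8) (
  input  [N-1:0] x, y, z,   // three operands to be reduced
  output [N-1:0] sum,       // per-bit sums, no carry propagation
  output [N-1:0] carry      // per-bit carries, shifted left one place
);
  assign sum   = x ^ y ^ z;
  assign carry = ((x & y) | (y & z) | (z & x)) << 1; // top carry dropped in this sketch
endmodule

Because the sum and carry words are formed independently at every bit position, the delay of such a row is that of a single full adder, regardless of the operand width N.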

4.2.3 Array Multipliers

Conventional linear array multipliers consist of rows of carry-save adders (CSA).

A portion of an array multiplier with the associated routing can be seen in Figure 4.2.


Figure 4.2: Two Rows of an Array Multiplier

In a linear array multiplier, as the data propagates down through the array, each

row of CSA’s adds one additional partial-product to the partial sum. Since the

intermediate partial sum is kept in a redundant, carry-save form there is no carry

propagation. This means that the delay of an array multiplier is only dependent upon the

depth of the array, and is independent of the partial-product width. Linear array

multipliers are also regular, consisting of replicated rows of CSA’s. Their high

performance and regular structure have perpetuated the use of array multipliers for VLSI

math co-processors and special purpose DSP chips.

The biggest problem with full linear array multipliers is that they are very large.

As operand sizes increase, linear arrays grow in size at a rate equal to the square of the

operand size. This is because the number of rows in the array is equal to the length of the

multiplier, with the width of each row equal to the width of multiplicand. The large size

of full arrays typically prohibits their use, except for small operand sizes, or on special

purpose math chips where a major portion of the silicon area can be assigned to the

multiplier array.

Another problem with array multipliers is that the hardware is underutilized. As

the sum is propagated down through the array, each row of CSA’s computes a result only

once, when the active computation front passes that row. Thus, the hardware is doing

useful work only a very small percentage of the time. This low hardware utilization in

conventional linear array multipliers makes performance gains possible through increased

efficiency. For example, by overlapping calculations, pipelining can achieve a large gain in throughput. Figure 4.3 shows a full array pipelined after each row of CSA's. Once the partial sum has passed the first row of CSA's, represented by the shaded row of CSA's in


cycle 1, a subsequent multiply can be started on the next cycle. In cycle 2, the first partial sum has passed to the second row of CSA's, and the second multiply, represented by the cross-hatched row of CSA's, has begun. Although pipelining a full array can greatly increase throughput, both the size and latency are increased due to the additional latches. While high throughput is desirable, for general-purpose computers size and latency tend to be more important; thus, fully pipelined linear array multipliers are seldom found.

Figure 4.3: Data Flow through a Pipelined Array Multiplier

4.3 Variations in Multipliers

We do not always synthesize our multipliers from scratch but may desire, or be

required, to use building blocks such as adders, small multipliers, or lookup tables.

Furthermore, limited chip area and/or pin availability may dictate the use of bit-serial

designs. In this chapter, we discuss such variations and also deal with modular

multipliers, the special case of squaring, and multiply-accumulators.

• Divide-and-Conquer Designs

• Additive Multiply Modules

• Bit-Serial Multipliers

• Modular Multipliers

• The Special Case of Squaring

• Combined Multiply-Add Units


4.4 Bit-serial Multipliers

Bit-serial arithmetic is attractive in view of its smaller pin count, reduced wire

length, and lower floor space requirements in VLSI. In fact, the compactness of the

design may allow us to run a bit-serial multiplier at a clock rate high enough to make the

unit almost competitive with much more complex designs with regard to speed. In

addition, in certain application contexts inputs are supplied bit-serially anyway. In such a

case, using a parallel multiplier would be quite wasteful, since the parallelism may not

lead to any speed benefit. Furthermore, in applications that call for a large number of

independent multiplications, multiple bit-serial multipliers may be more cost-effective

than a complex highly pipelined unit.

Figure 4.4: Bit-serial multiplier; 4x4 multiplication in 8 clock cycles

Bit-serial multipliers can be designed as systolic arrays: synchronous arrays of

processing elements that are interconnected by only short, local wires thus allowing very

high clock rates. Let us begin by introducing a semisystolic multiplier, so named because

its design involves broadcasting a single bit of the multiplier x to a number of circuit

elements, thus violating the “short, local wires” requirement of pure systolic design.

Figure 4.4 shows a semisystolic 4 x 4 multiplier. The multiplicand a is supplied in

parallel from above and the multiplier x is supplied bit-serially from the right, with its

least significant bit arriving first. Each bit x_i of the multiplier is multiplied by a and the


result added to the cumulative partial product, kept in carry-save form in the carry and

sum latches. The carry bit stays in its current position, while the sum bit is passed on to

the neighboring cell on the right. This corresponds to shifting the partial product to the

right before the next addition step (normally the sum bit would stay put and the carry bit

would be shifted to the left). Bits of the result emerge serially from the right as they

become available.

A k-bit unsigned multiplier x must be padded with k zeros to allow the carries to

propagate to the output, yielding the correct 2k-bit product. Thus, the semisystolic

multiplier of Figure 4.4 can perform one k x k unsigned integer multiplication every 2k

clock cycles. If k-bit fractions need to be multiplied, the first k output bits are discarded

or used to properly round the most significant k bits.

To make the multiplier of Figure 4.4 fully systolic, we must remove the

broadcasting of the multiplier bits. This can be accomplished by a process known as

systolic retiming, which is briefly explained below.

Consider a synchronous (clocked) circuit, with each line between two functional

parts having an integral number of unit delays (possibly 0). Then, if we cut the circuit into

two parts CL and CR, we can delay (advance) all the signals going in one direction and

advance (delay) the ones going in the opposite direction by the same amount without

affecting the correct functioning or external timing relations of the circuit. Of course, the

primary inputs and outputs to the two parts CL and CR must be correspondingly advanced or delayed, too.

For the retiming to be possible, all the signals that are advanced by d must have

had original delays of d or more (negative delays are not allowed). Note that all the

signals going into CL have been delayed by d time units. Thus, CL will work as before,

except that everything, including output production, occurs d time units later than before

retiming. Advancing the outputs by d time units will keep the external view of the circuit

unchanged.

We apply the preceding process to the multiplier circuit of Figure 4.4 in three

successive steps corresponding to cuts 1, 2, and 3, each time delaying the left-moving

signal by one unit and advancing the right-moving signal by one unit. Verifying that the

retimed multiplier works correctly is left as an exercise. This new version of our multiplier does not have the fan-out problem of the design in Figure 4.4, but it suffers

from long signal propagation delay through the four FAs in each clock cycle, leading to

inferior operating speed. Note that the culprits are zero-delay lines that lead to signal

propagation through multiple circuit elements.


One way of avoiding zero-delay lines in our design is to begin by doubling all the

delays in Figure 4.4. This is done by simply replacing each of the sum and carry flip-flops

with two cascaded flip-flops before retiming is applied. Since the circuit is now operating

at half its original speed, the multiplier x must also be applied on alternate clock cycles.

The resulting design is fully systolic, inasmuch as signals move only between adjacent

cells in each clock cycle. However, twice as many cycles are needed.

The easiest way to derive a multiplier with both inputs entering bit-serially is to

allow k clock ticks for the multiplicand bits to be put into place in a shift register and then

use the design of Figure 4.4 to compute the product. This increases the total delay by k

cycles.

Figure 4.5 uses dot notation to show the justification for the bit-serial multiplier design above, depicting the meanings of the various partial operands and results.

Figure 4.5: Bit-serial multiplier design in dot notation


CHAPTER 5

IMPLEMENTATION

5.1 Tools Used

1) PC installed with the Linux operating system

2) Installed Cadence tools:

• ncvlog – for compiling the code and checking for errors

• ncverilog – for executing the code

• SimVision – for viewing waveforms

5.2 Coding Steps

1) Create the directory structure for the project as shown below.

Figure 5.1: Project directory structure

2) Write the RTL code in a text file and save it with a .v extension in the RTL directory.

3) Write the code for the testbench and store it in the TB directory.

5.3 Simulation steps

The commands that are used in Cadence for the execution are:

1) Initially, mount the server using “mount -a”.

2) Go to the C shell environment with the command “csh”.

3) Source the environment file with the command “source /root/cshrc”.

4) Next, change to the cadence_digital_labs directory:

#cd .../../cadence_digital_labs/

5) Then check the file for errors with the command “ncvlog ../rtl/filename.v -mess”.

6) Then execute the file using “ncverilog +access+rwc ../rtl/filename.v ../tb/file_tb.v +nctimescale+1ns/1ps” (rwc – read, write and connectivity access; GUI – graphical user interface).

7) After running the program, open the simulation window with the command “simvision &”.


Figure 5.2: Simulation window

8) After the simulation, the waveforms are shown in the waveform window.

Figure 5.3: Waveform window

5.4 Full adder code

module fulladder(output reg cout, sum, input a, b, cin, rst);

// Combinational full adder with a clear input. Computing sum and carry
// in a single always block avoids the race of driving the same regs
// from two separate always blocks, as in the original listing.
always @(a, b, cin, rst)
  if (rst)
    {cout, sum} = 2'b00;       // clear the outputs while rst is asserted
  else
    {cout, sum} = a + b + cin; // sum and carry-out of the three inputs

endmodule


5.5 Full adder flowchart

Figure 5.4: Full adder flowchart

5.6 Full adder testbench

module full_adder_tb;

wire cout, sum;
reg a, b, cin, rst;
parameter period = 10; // missing in the original listing, but needed by #(period/2)

// dut
fulladder fa(cout, sum, a, b, cin, rst);

initial
begin
  #2 rst = 1'b1;
  #(period/2) rst = 1'b0;
  a = 1'b1;
  b = 1'b0;
  cin = 1'b1;  // expect sum = 0, cout = 1
  #5 a = 1'b0;
  b = 1'b1;
  cin = 1'b1;  // expect sum = 0, cout = 1
  #5 $finish;  // let the last vector settle before finishing
end

endmodule
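For the two stimulus vectors above, (a, b, cin) = (1, 0, 1) and (0, 1, 1), the inputs sum to 2 in both cases, so the waveforms of Figure 5.6 should show sum = 0 and cout = 1 in both intervals.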


5.7 Bit-serial multiplier algorithm

Figure 5.5: Bit-Serial multiplier flowchart

5.8 Bit-serial multiplier code

module serial_mult(output product, input [3:0] a, input b, clk, rst);

wire s1, s2, s3;
reg s1o, s2o, s3o;        // latches for sums at the various stages
wire c0, c1, c2, c3;
reg c0o, c1o, c2o, c3o;   // latches for carries at the various stages
wire a3o, a2o, a1o, a0o;  // multiplicand bits gated by the serial multiplier bit
reg s;                    // constant 0 fed into the last stage

// chain of full adders holding the partial product in carry-save form;
// each sum bit moves one position to the right per clock cycle
fulladder fa0(c0, product, a0o, s1o, c0o, rst);
fulladder fa1(c1, s1, a1o, s2o, c1o, rst);
fulladder fa2(c2, s2, a2o, s3o, c2o, rst);
fulladder fa3(c3, s3, a3o, s, c3o, rst);

// partial-product generation: AND each bit of a with the serial bit b
and n0(a0o, a[0], b);
and n1(a1o, a[1], b);
and n2(a2o, a[2], b);
and n3(a3o, a[3], b);

always @(posedge clk, posedge rst)
begin
  if (rst)
    begin // clear all state
      s   <= 1'b0;
      c0o <= 1'b0; c1o <= 1'b0; c2o <= 1'b0; c3o <= 1'b0;
      s1o <= 1'b0; s2o <= 1'b0; s3o <= 1'b0;
    end
  else // latch all sums and carries for the next cycle
    begin
      c0o <= c0; c1o <= c1; c2o <= c2; c3o <= c3;
      s1o <= s1; s2o <= s2; s3o <= s3;
    end
end

endmodule

5.9 Full adder waveform

Figure 5.6: Full adder output waveforms

5.10 Bit-serial multiplier testbench

module serial_mult_tb;

reg [3:0] a;
reg b;
wire product;
reg clk, rst;
parameter period = 10;

serial_mult dut(product, a, b, clk, rst); // dut

// clock: toggle every half period for a 10 ns clock
// (the original toggled every full period, giving a 2x period clock)
initial clk = 0;
always #(period/2) clk = ~clk;

initial
begin
  #2 rst = 1'b1;
  #(period/2) rst = 1'b0;
  a = 4'b1101;            // multiplicand = 13
  b = 1;                  // multiplier bits, LSB first: 1, 0, 0, 1 (x = 9)
  @(posedge clk) b = 0;
  @(posedge clk) b = 0;
  @(posedge clk) b = 1;
  @(posedge clk) b = 0;   // four padding zeros flush the carries out
  @(posedge clk) b = 0;
  @(posedge clk) b = 0;
  @(posedge clk) b = 0;
  #period $finish;
end

endmodule
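With a = 4'b1101 (13) and the serial multiplier stream 1, 0, 0, 1 (x = 9, least significant bit first) followed by four padding zeros, the expected product is 13 × 9 = 117 = 8'b01110101, emerging least significant bit first on the product output over the eight clock cycles; compare the worked example in Section 4.2.1 and the waveforms of Figures 5.7 and 5.8.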

5.11 Bit-serial multiplier waveforms

Figure 5.7: Bit serial multiplier input/output waveforms

Figure 5.8: Bit serial multiplier with intermediate waveforms


CHAPTER 6

CONCLUSIONS

Multipliers play an important role in today's digital signal processing and various other applications. With advances in technology, many researchers have tried, and are trying, to design multipliers that offer one or more of the following design targets: high speed, low power consumption, and regularity of layout (and hence less area), or even a combination of these in one multiplier, making them suitable for various high-speed, low-power and compact VLSI implementations. The most common multiplication method is the "add and shift" algorithm. In parallel multipliers, the number of partial products to be added is the main parameter that determines the performance of the multiplier. To reduce the number of partial products to be added, the Modified Booth algorithm is one of the most popular choices. To achieve speed improvements, the Wallace tree algorithm can be used to reduce the number of sequential adding stages. Further, by combining the Modified Booth algorithm and the Wallace tree technique, we can obtain the advantages of both algorithms in one multiplier. However, with increasing parallelism, the number of shifts between the partial products and intermediate sums to be added increases, which may result in reduced speed, an increase in silicon area due to irregularity of structure, and increased power consumption due to the additional interconnect resulting from complex routing. On the other hand, "serial-parallel" multipliers compromise speed to achieve better area and power consumption. The selection of a parallel or serial multiplier ultimately depends on the nature of the application.

A key challenge facing current and future computer designers is to reverse the

trend by removing layer after layer of complexity, opting instead for clean, robust, and

easily certifiable designs, while continuing to try to devise novel methods for gaining

performance and ease-of-use benefits from simpler circuits that can be readily adapted to

application requirements.

Bit-serial multipliers are one way of achieving this.


REFERENCES

[1] Behrooz Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2009.

[2] Sadiq M. Sait and Gerhard Beckoff, “A Novel Technique for Fast Multiplication,” Proc. IEEE Fourteenth Annual International Phoenix Conference on Computers and Communications, pp. 109–114, 1995.

[3] C. Ghest, “Multiplying Made Easy for Digital Assemblies,” Electronics, Vol. 44, pp. 56–61, November 22, 1971.

[4] P. Ienne and M. A. Viredaz, “Bit-Serial Multipliers and Squarers,” IEEE Trans. Computers, Vol. 43, No. 12, pp. 1445–1450, 1994.

[5] Samir Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis, Prentice Hall, 2003.