Digital Systems Design 2

Digital Systems Design 2

Programmable Logic and Storage Devices Chapter 8: “Advanced Digital Design with the Verilog HDL”, Michael D. Ciletti.Memory, CPLDs and FPGAsChapter 10: “Digital Design Principles and Practices”, John F. Wakerly, Prentice Hall, 2001, Third Edition

Programmable Logic and Storage Devices With advancement of hardware technology:

Density Complexity SizeOf field-programmable gate arrays (FPGAs), it provides an

attractive and cost-efficient alternative to semi-custom application specific integrated circuits (ASICs).

The opportunity to realize large circuits in FPGAs has created pressure for a change in the method by which circuits are designed for FPGA-based applications: Schematic entry tools can be productive and efficient when

designs are small. Trend is toward larger and larger designs targeted for FPGAs.

Thus, language-based design methodology has become essential to FPGA-based design flows.

Programmable Logic and Storage Devices Technologies available for implementing digital circuits range from:

Standard Integrated Circuits (ICs) used in low-density/low-performance applications,

To Cell-based and full-custom ICs for high-density/high-performance circuits. Standard ICs:

Can be manufactured cheaply, Implement very limited, basic functionality at low levels of integration.

Customized ICs Implement specialized functionality with a high level of integration Have a small market Creates inventory risk because the quantities that could be sold do not warrant the

expense of their development and production. Programmable Logic Devices:

Between two extremes of density and performance that characterize standard parts and full-custom circuits.

Born out of necessity created by two conflicting realities: Large, dense, high-performance circuits cannot be build practically or economically

from discrete devices Dedicated ICS cannot be produces economically to satisfy a diversity of low-volume

applciations.

Programmable Logic and Storage Devices

PLDs: Read-Only Memory’s (ROM) Programmable Logic Arrays (PLA) Programmable Array Logic (PAL) Complex Programmable Logic Devices

(CPLD) Field Programmable Gate Arrays (FPGA),

and Mask-Programmable Gate Arrays

(MPGA).

Programmable Logic Devices For most up-to-date PLDs see:

www.e-insite.net/ednmag PLDs have

a fixed architecture Functionality is programmed for a specific application Programming is done by:

Manufacturer - mask-programmable logic devices (MPLD)

End-User – field-programmable logic devices (FPLD). Three basic characteristics distinguish PLDs from each

other:1. An architecture of identical basic functional units2. A programmable interconnection fabric, and3. A programming technology.

Programmable Logic Devices The first type of PLDs considered has the AND-OR plane structure shown in

the figure. This type of architecture is used to implement ROMs, PLAs, and PALs. It implements Boolean expressions in Sum of Products (SOP) form:

AND plane forms product terms selectively from the inputs, and OR plane forms outputs from sums of selected product terms.

A programmable interconnect fabric joins the two planes, so that the outputs implement sum-of-product expressions of the inputs.

Whether and how a plane can be programmed determines the particular type of PLD that is implemented by the overall structure.

AND Plane OR PlaneInputs Outputs

Product Terms

AND-OR plane structure of a programmable logic device

Storage Devices The architecture used to implement PLDs lends itself

to implementation of storage devices. Storage Devices can be:

Read-Only, or Random Access

depending on whether the contents of a memory cell can be written during normal operation of the device.

ROM (read-only memory) is a device programmed to hold certain contents, which remain unchanged during operation and after power is removed from the device.

RAM (random-access memory) in contrast its contents can be changed during operation, and they vanish when the power is removed.

Read-Only Memory (ROM) Read-Only Memory

(ROM) A 2n x b ROM consists of

an addressable array of semiconductor memory cells organized as 2n words of b bits each.

ROM Interface: n – inputs defining

address lines. b – outputs called bit

lines. ROM is non-volatile

memory. It’s content is preserved even if no power is applied.

Read-Only Memory (ROM)

AND-OR planes for a ROM:

AddressDecoder

(Nonprogrammable)AND Plane

OR PlaneMemory Array

2n x b

b – Outputs(bit lines)

2n Minterms (Word lines) formed from inputs

A(0)A(1)

A(i)

A(n-1)

D(b-1) D(i) D(0)

Using ROM for “Random” Combinational Functions ROM can be used to perform any combinational function. ROM will

actually store for each input bit-pattern (input address) the corresponding output bit-pattern.

Example: Truth table for a 3-input, 4-output combinational logic function.

Inputs Outputs

A2 A1 A0 D3 D2 D1 D0

0 0 0 1 1 1 0

0 0 1 1 1 0 1

0 1 0 1 0 1 1

0 1 1 0 1 1 1

1 0 0 0 0 0 1

1 0 1 0 0 1 0

1 1 0 0 1 0 0

1 1 1 1 0 0 0

Equivalent 2-to-4 decoder with output-polarity control

Using ROM for “Random” Combinational Functions Another example that can be built with ROM is

unsigned binary multiplication. Typical realization of a 4x4 multiplier requires to high

number of product terms (36) to obtain one pass multiplier through a conventional PLD’s AND-OR array.

With ROM one can realize the function with one pass through a 28 x 8 (256X8) ROM.

Contents of a ROM are normally specified by a file that contains one entry for every address in the ROM.

The nice think about ROM-based design is that one can usually write a simple program in a high-level language to calculate what should be stored in the ROM.

Two-dimensional decoding Suppose that one wants to build a 128 x 1 ROM.

Straight forward solution will require a 7-to-128 decoder: 128 7-input NAND gates, 14 buffers and inverters with a fanout of 64 each.

ROMs with a 1M bits or more are available commercially and they do not use linear structure for decoder – which would require a 20-to-1,048,576 decoders.

The structure used is called two-dimensional decoding. This structure enables reduction of the decoder size to something on the order of the square root of the number of addresses. The basic idea in two-dimensional decoding is to arrange the

ROM cells in an array that is as close as possible to square. In the next illustration a possible internal structure for a 128x1

ROM is depicted.

Two-dimensional decoding

Two-dimensional decoding As can be seen, two-dimensional decoding allows a 128x1

ROM to be built with a 3-to-8 decoder and a 16-input multiplexer (whose complexity is comparable to that of a 4-to-16 decoder).

A 1Mx1 Rom could be built with a 10-to-1024 decoder and 1024-input multiplexer. A lot simpler than the one dimensional alternative.

Additional benefit to reduction of decoding complexity is that two-dimensional decoding has one other benefit –- it leads to a chip whose physical dimensions are close to square -- important for chip fabrication and packaging.

In ROMs with multiple data outputs the storage arrays corresponding to each data output may be made narrower in order to achieve an overall chip layout that is closer to square. For example, the next figure shows the possible layout of a 32K x 8 ROM.

Possible layout of a 32K x 8 ROM

Commercial ROM Types A modern ROM is fabricated as a single IC chip; one that

stores 4M bits with a price under $5. Various methods are used to “program” the information

stored in a ROM:

Mask Programmable ROMs. Manufacturer has to be provided with the ROM content in order

to create one or more customized masks to manufacture ROMs with the required pattern.

ROM manufacturers impose a mask charge of several thousand dollars for the customized aspects of mask-ROM production. Because of mask charges and the four-week delay typically required to obtain programmed chips, mask ROMs are normally used today only in very high-volume applications.

For a low-volume applications there are more cost-effective choices, discussed next.

Commercial ROM Types Programmable read-only memory (PROM)

Similar to a mask ROM, except that the customer may store data values (program the PROM) in just a few minutes.

PROM is manufactured with all of its diodes or transistors “connected”. This corresponds to having all desired bits at a particular value (typically “1”). The PROM programmer can be used to set desired bits to the opposite value.

In bipolar PROMs this is done by vaporizing tiny fusible links inside the PROM corresponding to each bit.

A link is vaporized by selecting it using the PROM’s address and data lines, and then applying a high-voltage pulse (10-30V) to the device through a special input pin.

Early reliability problems with vaporized links technology were solved and reliable fusible-link technology is used now days not only in bipolar PROMs but also in the bipolar PLD circuits.

Commercial ROM Types Erasable programmable read-only memory (EPROM):

EPROM is programmable just like PROM. In addition it also can be “erased” to all 1s-state by exposing it

to ultra-violet light. EPROM uses a different technology called “floating-gate MOS”. EPROM manufacturers guarantee that a properly programmed

bit will retain 70% of its charge for at least 10 years even if the part is stored at 125o C.

Insulating material surrounding the floating gate becomes slightly conductive if it is exposed to ultraviolet light with a certain wavelength which provides for the EPROM content to be erased.

Most common application of EPROMs is to store programs in microprocessor systems.

EPROMs are typically used during development. ROMs and PROMs are used once the program is finalized because usually they cost less than EPROMs of similar capacity.

Commercial ROM Types Electrically Erasable Programmable Read-Only

Memory (EEPROM). It is like and EPROM except that individual stored bits

may be erased electrically. Floating gates in an EEPROM are surrounded by a

much thinner insulating layer and can be erased by applying a voltage of the opposite polarity as the charging voltage to the non-floating gate.

Large EEPROMs (1M bit and larger) allow erasing only in fixed-size blocks, typically 128-512 Kbits (16-64 Kbytes) at a time. These memories are typically called flash EPROMs or flash memories.

EEPROM can be reprogrammed only a limited number of times (Insulating layer wares off).

Logic Symbols for standard EPROMs in 28-pin dual in-line packages.

ROM Applications In addition to the most common application of ROMs

for program storage in microprocessor systems, there are many other applications that can provide a low-cost realization of a complex or “random” combinational logic function.

Example of Voice Signals: When an analog voice signal enters a typical

telephone systems, it is sampled 8,000 times per second and converted into a sequence of 8-bit bytes representing the analog signal at each sampling point.

This example will show how ROM-based circuits can easily deal with this highly encoded information.

Coding Voice Samples The simplest 8-bit encoding of the sign and

amplitude of an analog signal would be an 8-bit integer in the two’s complement or signed-magnitude system.

8-bit linear encoding yields a dynamic range of only 28 = 256 different values.

This corresponds to a dynamic range in signal power of 20*log(256)≈48dB.

By comparison, compact audio disks use a 16-bit linear encoding with a theoretical dynamic range of 20*log(216)≈96dB

Coding of Voice Samples North American telephone network uses an 8-bit compounded encoding

called μ–law PCM (pulse code modulation). The next figure shows the format of an 8-bit coded byte: a sort of floating

point representation containing sign (S), exponent (E) and mantissa (M) fields.

The analog value V represented by a byte in this format is given by the formula: V = (1-2s)*[(2E)*(2M+33)-33]

An analog signal represented in this format can range from -8159*k to +8159*k, where k is arbitrary scale factor.

The range of the signals is 2*8159 and the smallest difference that can be represented is only 2 (when E=0), so the dynamic range is 20*log(8159) ≈78dB.

sign exponent mantissa

7 6 5 4 3 2 1 0S E M

Coding of Voice Samples In many types of phone connections voice signal is

purposely attenuated by a few decibels to make things work better.

Given a μ–law PCM byte, a digital attenuator must produce a different PCM byte that represents the original analog signal multiplied by a specified attenuation factor.

One way to build a digital attenuator is shown in the next figure.

Each block in the figure can be build with perhaps a dozen MSI chips or a CPLD or FPGA

μ-law to linear

decoder

8

14x14

multiplier

14

14linear to

μ-lawencoder

814

Coding of Voice Samples Alternative realization of digital

attenuator can be done using a single inexpensive 8kx8 ROM instead.

This ROM can apply any of 32 different attenuation factors to a μ–law input byte.

High order-address bits select a table, and the low order address bits select an entry.

Digital Conference Circuit In the analog telephone network, it is easy to make a conference

connection between three or more parties: Just connect the analog phone wires together and you get an analog

summing junction. In the digital network, digital conference circuit must include a

digital adder that produces output samples corresponding to the sums of the input samples.

We have seen how to create binary adders for 8-bit operands. However, binary adders cannot process μ–law PCM bytes directly. The 8-bit μ–law PCM bytes must be converted to 14-bit linear format, The signals then can be added, Resulting signal must then be converted to 8-bit μ–law PCM as in

previous example. Again, one could create a complex adder or alternatively the same

function be performed by a single 64K x 8 ROM. The ROM has 16 address inputs accommodating two 8-bit μ–law PCM

operands. For each pair of operand values, the corresponding ROM address

contains the pre-computed 8-bit μ–law PCM sum.

ROM-based Designs (Advantages) Previous two examples illustrate many advantages of building complex

combinational functions with ROMs. Most complex functions:

Are generally difficult to design with a custom digital logic ROM realization of those functions is alternatively straight forward.

For a moderately complex function, a ROM-based circuit is usually faster than a circuit using multiple SSI/MSI devices and PLDs, and often faster than an FPGA or custom LSI chip in a comparable technology.

The program that generates the ROM contents can easily be structured to handle unusual or undefined cases that would require additional hardware in any other designs. For example adder function of the previous example can easily handle out-of-range sums.

A ROM’s function is easily modified just by changing the stored pattern, usually without changing any external connections. For example, the PCM attenuator and adder ROM’s in the previous example can be changed to use 8-bit A–law PCM, the standard digital voice coding in Europe.

The prices of ROMs and other structured logic devices are always dropping, making them more economical and their densities are always increasing, expanding the scope of problems that can be solved with a single chip.

ROM-based Designs (Disadvantages) For a simple to moderately complex

functions, a ROM-based circuit may cost more, consume more power, or run slower

then a circuit using a few SSI/MSI devices and PLDs or small FPGA.

For functions more than 20 inputs, a ROM-based circuit is impractical because of the limit on ROM sizes that are available. For example, one wouldn’t build a 16-bit adder in ROM – it would require billions and billions of bits.

Complex Programmable Logic Devices Since their inception years ago, programmable logic

devices have been very flexible workhorses of digital design.

As IC technology advanced, there was naturally great interest in creating larger PLD architectures to take advantage of increased chip density. The question is why didn’t manufacturers just scale the existing architectures?

For example, if DRAM densities increased by a factor of 64 over the last 10 years, why couldn't manufactures scale the 16V8 (16 input signals and its complements, and a number of 16-variable product terms) to create a “128V64”? Such device would have 64 I/O pins, and some number (say 8) of 128-variable product terms for each of its 128 logic macro-cells.

Complex Programmable Logic Devices

This new chip “128V64” could combine the functions of a larger collection of 16V8 and offer terrific performance and flexibility using any input in any output function?

This new chip would be very flexible but it would not have a good performance.

How to expand PLD architecture? Increase # of inputs and outputs in a conventional

PLD? E.g., 16V8 --> 20V8 --> 22V10. Why not --> 32V16 --> 128V64 ?

Problems: n times the number of inputs and outputs requires n2

as much chip area -- too costly logic gets slower as number of inputs to AND array

increases Solution: multiple PLDs with a relatively small

programmable interconnect. Less general than a single large PLD, but can use

software “fitter” to partition into smaller PLD blocks.

CPLDs vs. FPGAs

CPLD architecture:

Small number of largish PLDs (e.g., “36V18”) on a single chip

Programmable interconnect between PLDs

CPLDs vs. FPGAs

FPGAarchitecture

Much larger number of smaller programmable logic blocks.

Embedded in a sea of lots and lots of programmable interconnect.

CPLD families Identical individual PLD blocks (Xilinx “FBs”)

replicated in different family members. Different number of PLD blocks Different number of I/O pins

Many CPLDs have fewer I/O pins than macrocells “Buried” Macrocells -- provide needed logic terms

internally but these outputs are not connected externally.

IC package size dictates # of I/O pins but not the total # of macrocells.

Typical CPLD families have devices with differing resources in the same IC package.

Xilinx XC9500 CPLD Family The xilinx XC9500 series is a family of CPLDs with a

similar architecture but varying number of external input/output pins and internal PLDs (which Xilinx calls function blocks – FBs).

Each internal PLD has 36 inputs and 18 macrocells and outputs and might be called “36V18”.

As shown in the table in the next slide, devices in the family are named according to the number of macrocells they contain.

The smallest has 2 FBs and 36 macrocells, and The largest has 16 FBs and 288 macrocells.

Xilinx CPLDs Notice overlap in resource availability in a particular package.

Xilinx CPLDs Another feature of this family is that a given chip, such as

XCC95108 is available in several different packages. This is important not only to accommodate different manufacturing practices but also to provide some choice and potential savings in the number of external I/O pins provided. In most applications, it is not necessary for all internal signal of a state machine or subsystem to be visible to and used by the rest of the system.

Thus, even though the XC95108 has 108 macrocells, the outputs of at most 69 of them can be connected externally in the 84-pin PLCC version of the device. In fact many of the 69 I/O pins would typically be used for inputs, in

which case even fewer outputs would be visible externally. Note that the remaining macrocell outputs are still quite usable

internally, since they can be hooked up internally through the CPLD’s programmable interconnect.

Macrocells whose outputs are usable only internally are sometimes called buried macrocells.

Xilinx 9500-family CPLD architecture

Xilinx 9500-family CPLD architecture I/O pins can be used as input, output or bidirectional

pins according to the device’s programming. Special purpose pins:

GSK – global clock GSR – global set/reset GTS – global three-state controls;

one of these signals can be selected in each macrocell to output enable the corresponding output driver when the macrocell’s output is hooked up to an external I/O pin.

Only 4 FB’s are shown in the previous schematic diagram, however, XC9500 architecture scales to accommodate 16 Fbs in th XC95288.

Xilinx 9500-family CPLD architecture Regardless of the specific family member, each FB

programmably receives 36 signals from the switch matrix.

The inputs to the switch matrix are the 18 macrocell outputs from each of the FBs and the external inputs from the I/O pins.

Each FB also has 18 outputs that run “under” the switch matrix as shown in the previous figure connecting to the I/O blocks. These are merely the output-enable signals for the

I/O block output drivers; They are used when the FB macrocell’s output is

hooked up to an external I/O pin.

9500-family function blocks (FBs) architecture

18 macrocells per FB 36 inputs per FB (partitioning challenge, but also

reason for relatively compact size of FBs) Macrocell outputs can go to I/O cells or back into

switch matrix to be routed to this or other FBs.

9500-family function blocks (FBs) architecture

The basic XC9500 FB programmable AND array has just 90 product terms.

However, it also has product-term allocation. This mechanism allows a macrocell’s

unused product terms to be used by other nearby macrocells in the same FB.

Next slide depicts a logic diagram of the XC9500 product-term allocator and macrocell.

9500-series macrocell (18 per FB)

Up to 5 product terms

Programmableinversion or XORproduct term

Global clock or product-term clock

Set control

Reset control

OE control

9500-series product-term allocator

programmablesteeringelements

Share terms from above and below

9500-series I/O block Analog controls in addition to

logic ones:1. Slew-rate control. The rise and

fall time of the output signals - can be set to be fast or slow.

2. Pull-up resistor. When enabled, pull-up resistor prevents output pins from floating as the CPLD is powered up. Useful if the outputs are used to drive active-low enable inputs of other logic that is not supposed to be enabled during power up.

3. User-programmable ground. This feature reallocated an I/O pin be ground pin and not a signal pin. Extra ground pins are needed to handle the high dynamic currents that flow when multiple outputs switch simultaneously.

Switch matrix for XC95108 Could be anything from a

limited set of multiplexers to a full crossbar.

Multiplexer -- small, fast, but difficult fitting

Crossbar -- easy fitting but large and slow

Finding a complete set of connections through a sparse switch matrix is NP-complete problem.

For each different CPLD-based design, a set of switch-matrix connections must be found be “fitter” software.

Typically this software together with overall CPLD design are part of manufacturers “secret sauce”

FPGAs Historically, FPGA architectures and companies began around the

same time as CPLDs. Xilinx launched the world’s first commercial FPGA in 1985, with the

vintage XC2000 device family. XC3000 and XC4000 families soon followed, setting the stage for

today’s Spartan and Virtex device families. Each evolution of devices brought improvements in density,

performance, voltage levels, pin counts, and functionality. Thus XC4000, Spartan and Spartan/XL devices have the same basic

architecture. FPGAs are closer to “programmable ASICs” -- large emphasis on

interconnection routing Timing is difficult to predict -- multiple hops vs. the fixed delay of a

CPLD’s switch matrix. But more “scalable” to large sizes.

FPGA programmable logic blocks have only a few inputs and 1 or 2 flip-flops, but there are a lot more of them compared to the number of macrocells in a CPLD.

General FPGA chip architecture

a.k.a. CLB --“configurable logicblock”

Xilinx 4000-series FPGAs

FPGA specsmanship

Two flip-flops per CLB, plus two per I/O cell.

25 “gates” per CLB if used for logic. 32 bits of RAM per CLB if not used for

logic. All of this is valid only if your design

has a “perfect fit”.

Configurable Logic Block (CLB)

CLB function generators (F, G, H) Use RAM to store a truth table

F, G: 4 inputs, 16 bits of RAM each H: 3 inputs, 8 bits of RAM:

16x2 dual port RAM or 32X1 single port RAM.

RAM is loaded from an external PROM at system initialization.

Broad capability using F, G, and H: Any 2 funcs of 4 vars, plus a func of 3 vars Any func of 5 vars Any func of 4 vars, plus some funcs of 6 vars Some funcs of 9 vars, including parity and 4-bit

cascadable equality checking

Dedicated Fast Carry and Borrow Logic The F and G function generators of the XC4000 family have:

separate dedicated logic for fast carry and borrow generation, with dedicated routing to link the extra signal to the function

generator in the adjacent CLB. One function generator (F) can be used to add a0+b0, and Second function generator (G) can generate a1+b1. The fast carry will forward the carry to the next CLB above

or below.

Fast carry and borrow logic increases the efficiency performance of adders, subtractors, accumulators, comparators, and counters.

CLB input and output connections -- buried in the sea of interconnect

XC4000 Interconnect Resources Three types of general-purpose interconnect:

1. Single-length lines,2. Double-length lines, and3. Long lines

A grid of horizontal and vertical single-length lines connect an array of switch boxes.

Switch boxes provide a reduced number of connections between signal paths within each box (not a full crossbar switch).

XC4000 Interconnect Resources In the XC4000 there is a rich set of connections between single-

length lines and the CLB inputs and outputs. Capabilities for nearest-neighbor and across-the-chip connection

between CLBs. Two “single” groups are optimized for flexible connectivity between

adjacent blocks without the small number of unidirectional limitation of wires in the “Direct Connect” groups.

With “single” wires it is possible to connect a CLB to another that’s more than one hop away, but they would have to go through a programmable switch for each hop which adds delay.

Wires in the “Double” groups travel past two CLBs before hitting a switch, so they provide shorter delays for longer connections.

The “Long” groups of wires do not go through any programmable switches at all: instead, they travel all the way across or down a row or column and are driven by three-state drivers near the CLB.

Detail connections controlled byRAM bits

Programmable Switch Matrix (PSM) Each diamond in the shaded area indicating PSM:

Is a programmable switch element (SPE) that can connect any line to any other as shown in the next slide under (b).

While the PSM is essential, using it has a price – signals incur a small delay each time they hop through a PSE.

High-quality FPGA fitter software searches for not just any CLB placement and wire connections that work.

The “placement and routing” tool spends a lot of time trying to optimize device performance by finding a placement that allows short connections, and then routing the connections themselves.

Programmable Switch Matrixprogrammable switch element

turning the corner, etc.

The fitter’s job Partition logic functions into CLBs Arrange the CLBs Interconnect the CLBs Minimize the number of CLBs used Minimize the size and delay of interconnect

used Work with constraints

“Locked” I/O pins Critical-path delays Setup and hold times of storage elements

I/O blocks

Problems common to CPLDs and FPGAs Pin locking

Small changes, and certainly large ones, can cause the fitter to pick a different allocation of I/O blocks and pinout.

Locking too early may make the resulting circuit slower or not fit at all.

Running out of resources Design may “blow up” if it doesn’t all fit on a

single device. On-chip interconnect resources are much richer

than off-chip; e.g., barrel-shifter example. Larger devices are exponentially more

expensive.

Documents

Digital Systems Design 2