Optimization and Modeling of FPGA Circuitry in Advanced Process Technology
by
Charles Chiasson
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2013 by Charles Chiasson
Abstract
Optimization and Modeling of FPGA Circuitry in Advanced Process Technology
Charles Chiasson
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2013
We develop a new fully-automated transistor sizing tool for FPGAs that features area, delay and wire
load modeling enhancements over prior work to improve its accuracy in advanced process nodes. We then
use this tool to investigate a number of FPGA circuit design related questions in a 22nm process. We
find that building FPGAs out of transmission gates instead of the currently dominant pass-transistors,
whose performance and reliability are degrading with technology scaling, yields FPGAs that are 15%
larger but are 10-25% faster depending on the allowable level of “gate boosting”. We also show that
transmission gate FPGAs with a separate power supply for their gate terminal enable a low-voltage
FPGA with 50% less power and good delay. Finally, we show that, at a possible cost in routability,
restricting the portion of a routing channel that can be accessed by a logic block input can improve delay
by 17%.
Acknowledgements
First, I would like to express my sincerest gratitude to my supervisor Vaughn Betz for his guidance
and motivation, for his technical help and for the tidbits of wisdom that he shared with me, knowingly
or unknowingly, over the past two years. I learned so much in so little time and cannot imagine having
had a better mentor.
I also extend thanks to the other graduate students in Vaughn Betz’s research group for all their
help and support. Also, thanks for the lunch outings, the coffee breaks and the squash matches, among
other things, that provided those much needed distractions.
I would like to thank the Natural Sciences and Engineering Research Council of Canada, Altera Cor-
poration and the University of Toronto for their financial support. Thanks also go to CMC Microsystems
for providing the CAD tools used throughout this research. I would also like to thank David Lewis from
Altera Corporation for the insightful discussions.
Finally, thanks must undoubtedly go to my parents for nurturing my inherent desire to know why,
for making me love the smell of new books, and simply, for being the best parents a kid could ask for.
All of my accomplishments are certainly due to them.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Logic Block Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Routing Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Commercial BLE Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 FPGA Architecture Assessment Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 FPGA Circuit Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 SRAM cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Routing Multiplexers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3 Lookup Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Modeling of FPGA Circuitry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Area Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Delay Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Automated Transistor Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 COFFE: Automated Optimization of FPGA Circuitry 17
3.1 Introduction to COFFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Circuit Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Area Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Delay Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5.1 Non-Linearity of Transistor Resistance and Capacitance . . . . . . . . . . . . . . . 26
3.5.2 Topology Dependence of Transistor Resistance . . . . . . . . . . . . . . . . . . . . 26
3.6 Wire Load Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7 Transistor Sizing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7.1 Divide-and-Conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7.2 Pre-Determined P/N Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7.3 Detailed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.8 Impact of Improved Wire Load Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8.1 Base Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8.2 Target Process Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.9 Integration of COFFE with VPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Efficient FPGA Circuitry 35
4.1 Fcout for Single-Driver Routing and Multiple BLE Outputs . . . . . . . . . . . . . . . . . 35
4.2 Transmission Gate FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Pass-Transistor Scaling Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Replacing Pass-Transistors with Transmission Gates . . . . . . . . . . . . . . . . . 37
4.2.3 Gate-Boosting Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.6 Area and Delay Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Separating VDD and VG for Low-Power FPGAs . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Track-Access Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Conclusions and Future Work 49
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A N-well Sharing Sample Layout 51
B FPGA Circuitry Schematics 53
C Detailed Transistor Sizing Results 56
D Area and Delay Breakdown 61
Bibliography 62
List of Tables
3.1 COFFE’s expected input architecture parameters. . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Resistance of a 4× minimum-width NMOS transistor for different circuit topologies (Fig-
ure 3.9) and switching-thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Rise-fall re-balancing and the effect of M on COFFE’s transistor sizing solutions (example). 31
3.4 Base architecture parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Subcircuit count per tile for base architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Metal layer data used by COFFE for all circuit design investigations (ITRS [19]). . . . . . 33
3.7 Impact of wire loading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 Effect of Fcout on channel width and switch block multiplexers. . . . . . . . . . . . . . . . 36
4.2 Area and delay for different Fcout values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Pass-transistor and transmission gate FPGA tile area for different levels of gate boosting. 42
4.4 Switch block multiplexer transistor sizes for PT and TG implementations for different
levels of gate boosting (see Figure 4.2 for transistor labels). Note that with the exception
of P/N ratios, COFFE uses integer granularity. . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Pass-transistor and transmission gate FPGA critical path delay for different levels of gate
boosting (VTR benchmarks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.6 Pass-transistor and transmission gate FPGA area-delay product for different levels of gate
boosting (VTR benchmarks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Pass-transistor and transmission gate FPGA relative dynamic power for different levels
of gate boosting (VTR benchmarks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.8 Effect of cluster output track-access locality on area and delay. Input track-access span
is set to 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 Effect of cluster input track-access locality on area and delay. Output track-access span
is set to 0.25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
C.1 Lookup table transistor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
C.2 Switch block multiplexer transistor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
C.3 Connection block multiplexer transistor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . 57
C.4 Local routing multiplexer transistor sizes. Note: we do not give a size for buf2 of the local
routing multiplexer as it is replaced by the LUT input driver of Figure B.2. . . . . . . . . 57
C.5 BLE output to local interconnect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
C.6 BLE output to general routing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
C.7 Flip-flop and register selection multiplexer transistor sizes. . . . . . . . . . . . . . . . . . . 58
C.8 LUT input driver A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
C.9 LUT input driver B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
C.10 LUT input driver C with register feedback multiplexer (Figure B.3). . . . . . . . . . . . . 59
C.11 LUT input driver D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
C.12 LUT input driver E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
C.13 LUT input driver F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
D.1 Tile area breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
D.2 Critical path delay breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures
1.1 Architecture exploration with manual (a) and automated (b) transistor-level design. . . . 2
2.1 Tile-based FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Basic logic element (BLE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Logic cluster architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Routing segment lengths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Multi-driver and single-driver routing architectures. . . . . . . . . . . . . . . . . . . . . . . 8
2.6 FPGA architecture assessment methodology with VPR. . . . . . . . . . . . . . . . . . . . 9
2.7 Six transistor SRAM cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Different 8:1 pass-transistor multiplexer topologies. . . . . . . . . . . . . . . . . . . . . . . 11
2.9 Multiplexer followed by two-stage buffer with PMOS level-restorer. . . . . . . . . . . . . . 11
2.10 Fully encoded MUX tree 3-LUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.11 Minimum-width transistor area model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.12 Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c). Note:
Although not shown in the figure for simplicity, parallel diffusions must be connected
together. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 FPGA design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 COFFE’s supported tile architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 COFFE’s routing multiplexer circuit topologies. . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Fully encoded MUX tree 6-LUT with internal re-buffering (partial view). . . . . . . . . . 22
3.5 Static transmission gate-based master-slave register. . . . . . . . . . . . . . . . . . . . . . 22
3.6 Transistor area prediction accuracy of original (Eq. 2.2) and improved (Eq. 3.1) area
models against TSMC 65nm layouts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7 Combining diffusion widening and parallel diffusion regions yields denser layouts (c). . . . 24
3.8 A switch-level model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.9 Circuits used to measure transistor resistance. . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.10 Inverter NMOS and PMOS resistivity vs. transistor width. . . . . . . . . . . . . . . . . . 27
3.11 COFFE’s transistor sizing algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 VDD and VTh scaling trends [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Generic two-level routing multiplexer with two-stage buffer implemented with pass-transistors
(a) and transmission gates (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Effect of different gate boosting strategies on transmission gate switch block multiplexer
delay (VDD = 0.8V ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 CAD flow for each FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Tile area and critical path delay breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Critical path delay for pass-transistor (PT) and transmission gate (TG) FPGAs for dif-
ferent VDD and VG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.7 Dynamic power for pass-transistor (PT) and transmission gate (TG) FPGAs for different
VDD and VG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.8 Power-delay product for pass-transistor (PT) and transmission gate (TG) FPGAs for dif-
ferent VDD and VG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.9 Cluster output wire load for different locality. . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.10 Cluster input wire load for different locality. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
A.1 A single-level 4:1 pass-transistor multiplexer with two-stage buffer and level restorer. . . . 51
A.2 Sample multiplexer layout with N-well sharing. . . . . . . . . . . . . . . . . . . . . . . . . 52
B.1 6-LUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
B.2 LUT input driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
B.3 LUT input driver with register feedback multiplexer. . . . . . . . . . . . . . . . . . . . . . 54
B.4 Two-level multiplexer used for switch block, connection block and local routing multiplexers. 54
B.5 2:1 multiplexer used for BLE outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
B.6 Flip-flop with register input selection multiplexer. . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter 1
Introduction
1.1 Motivation
The design and fabrication of modern digital integrated circuits cost tens to hundreds of millions of
dollars and require large teams of engineers and years of effort. Indeed, the cost of developing a new
20nm chip has been estimated to be as high as $160 million USD [1]. While this may be acceptable for
high-volume applications, it can be a significant burden for lower-volume designs, often preventing them
from being fabricated in the latest process technologies. Instead of being fabricated as a custom chip
such as a standard cell-based application-specific integrated circuit (ASIC) or a full custom design, a dig-
ital design can be implemented in a field-programmable gate array (FPGA). FPGAs are pre-fabricated,
programmable devices into which one can implement any arbitrary digital design in a matter of seconds.
Therefore, FPGAs are an attractive alternative to ASICs or full custom designs because they allow the
high non-recurring engineering costs and lengthy design times associated with semiconductor manufacturing
to be completely avoided. However, FPGAs are less attractive for digital designs that demand high density,
high performance, or low power: it has been shown that FPGAs require 35× more silicon area, are 4× slower
and consume 14× more dynamic power than ASICs [25]. Accordingly, minimizing
the FPGA-to-ASIC gap, that is, making FPGAs as efficient as possible such that they become a
competitive implementation medium for all types of applications, is one of the primary drivers of FPGA
research for both academic researchers and commercial FPGA manufacturers.
The area, performance and power characteristics of an FPGA can be optimized at two main lev-
els: architecture and transistor-level design. The architecture of an FPGA is defined by a number of
parameters that describe the style and flexibility of its soft-logic blocks, dedicated hard-blocks and in-
terconnect. Finding an architecture that meets specific design goals and constraints involves setting
these architectural parameters to specific values. However, these parameters interact in complex ways to
produce area, delay and power trade-offs that are very difficult to quantify through analytical methods.
For that reason, finding the right architectural parameter values is usually accomplished experimentally
with automated architecture exploration tools such as VPR [7].
For any architecture, there are a number of different transistor-level implementations. Transistor-
level design consists of choosing circuit topologies for an architecture as well as sizing the transistors of
those circuits. Both circuit topology selection and transistor sizing provide opportunities to optimize the
area, delay and power of the architecture. In prior FPGA research work, transistor-level design was often
Chapter 1. Introduction 2
Manual transistor-level design
Evaluate architecture
Change architecture parameters
Initial architecture parameters
(a) Manual transistor-level design.
Automated transistor-level design
Evaluate architecture
Change architecture parameters
Initial architecture parameters
(b) Automated transistor-level design.
Figure 1.1: Architecture exploration with manual (a) and automated (b) transistor-level design.
performed manually, making it a task that required a significant amount of time and effort. This often
had a negative impact on the architecture exploration flow, which would proceed as follows. Manual
transistor-level design would be performed on some initial architecture. Then, this architecture would be
assessed with an architecture exploration tool such as VPR. Based on the results of the assessment, the
architecture parameters would be adjusted and the evaluation process would be repeated. Ideally, one
would then re-optimize the transistor-level design to match the new architecture parameters. However,
since manual transistor-level design was such a time and effort intensive task, this step would often be
skipped. It was assumed that transistor sizes obtained with a previous architecture still applied to the
new architecture and this new architecture was then evaluated without re-optimizing its transistor-level
design. This architecture exploration flow is illustrated in Figure 1.1a. The new architecture could likely
be made more efficient if its transistor sizes were re-optimized. As well, the detailed impact of new
wire loads as the architecture and its area change has often not been rigorously modeled, possibly
leading to inaccurate architecture conclusions. In an environment where FPGAs need to be as efficient
as possible to compete with ASICs, new architectures should be evaluated in their most efficient state.
It follows that re-optimizing the transistor sizes as the FPGA architecture is changed provides a more
thorough design space exploration and should yield more efficient FPGAs.
Automating the transistor-level design of FPGAs enables such frequent re-optimization (Figure 1.1b).
In addition, an automated transistor-level design tool facilitates investigations relating to efficient FPGA
circuitry. For example, an automated transistor-level design tool could be used to explore the impact of
different circuit topologies or the impact of different layout choices on the area, delay and power of an
FPGA.
This thesis consists of two parts. In the first, we develop COFFE (Circuit Optimization For FPGA
Exploration), a new fully-automated transistor sizing tool for FPGAs. Although an FPGA-specific
transistor sizing tool has been developed in prior work [24], we have made significant improvements that
are necessary in advanced process nodes. In the second part of this thesis, we use COFFE to investigate
a number of circuit design related questions in advanced process technology.
1.2 Thesis Organization
This thesis is organized as follows. Chapter 2 provides background information on FPGA architecture,
circuit design, modeling and optimization. Chapter 3 describes COFFE, a fully-automated transistor
sizing tool for FPGAs developed as part of this thesis, as well as our area and delay modeling enhance-
ments. A number of FPGA circuit design investigations are performed with COFFE in Chapter 4.
Finally, Chapter 5 concludes this thesis and suggests future work.
Chapter 2
Background
This thesis is focused on the transistor-level design of SRAM-based FPGAs and related computer-aided
design (CAD) tools. We develop a fully-automated transistor sizing tool for FPGAs in Chapter 3 and
use it to investigate a number of FPGA circuit design related questions in Chapter 4. This chapter
provides relevant background material. First, we review FPGA architecture and the standard FPGA
architecture assessment methodology. Then, we describe common practices in FPGA circuit design as
well as commonly used area and delay modeling techniques for these circuits. Finally, we review prior
work on automated transistor sizing.
2.1 FPGA Architecture
An FPGA consists of an array of tiles that can each implement a small amount of logic and routing.
Horizontal and vertical routing channels run on top of the tiles and allow them to be stitched together to
perform larger functions. Figure 2.1 illustrates FPGA tile architecture at a high-level. A logic block (LB)
supplies the tile’s logic functionality. Connection blocks (CBs) provide connectivity between logic block
inputs and routing channels. A switch block (SB) connects logic block outputs to routing channels and
provides connectivity between wires within the routing channels. One replicates this basic tile to obtain
a complete FPGA. Although Figure 2.1 shows logic and switching functions as distinct sub-blocks, an
interleaved layout is more realistic and is what we assume throughout this work.
The FPGA architecture described in the previous paragraph represents a generic soft-logic-based
FPGA. Modern FPGAs are more heterogeneous. That is, in addition to general purpose soft-logic
blocks, they also contain dedicated hard-blocks such as multipliers, block memories or even embedded
processors [36, 51, 4, 38]. In this work, we focus on the architecture and circuit design of the soft-logic
portion as it still forms the backbone of an FPGA and typically accounts for a large fraction of its area1
and critical path delay as shown in Section 4.2.6. However, since hard-blocks are an important part
of modern FPGA architectures, all our VPR [7] experiments are performed with architecture files that
contain multipliers and block memories along with our soft-logic blocks. We use the same multiplier and
block memory designs across all our VPR experiments, and hence they are constant and do not affect
the conclusions of our soft-logic investigations.
1In [50], it was reported that the core area of the largest Stratix III FPGA consists of ∼72% soft-logic and associated programmable routing; the other 28% being block memory and multipliers.
Chapter 2. Background 5
LB
SBCB
FPGA Tile
Routing Channel
CB
Figure 2.1: Tile-based FPGA.
K-LUT FF
Figure 2.2: Basic logic element (BLE).
2.1.1 Logic Block Architecture
Most FPGAs are built around the idea of using lookup tables (LUTs) to implement logic functions. A
K-input LUT can implement any combinational logic function of K inputs. Since digital designs are
rarely purely combinational, the basic logic element (BLE) of an FPGA consists of a K-LUT and a
flip-flop (FF) that both feed a 2:1 multiplexer which allows the output of the BLE to be driven by either
the LUT output or the FF output as illustrated in Figure 2.2 [7].
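As a concrete illustration of this behaviour, the sketch below models a BLE in Python: a K-LUT is simply a table of 2^K configuration bits indexed by its K inputs, and a 2:1 multiplexer selects either the combinational LUT output or the flip-flop output. The class and its names are illustrative, not code from this thesis.

```python
# Minimal behavioural model of a BLE: a K-LUT (2^K configuration bits
# indexed by the K inputs) feeding a 2:1 output mux that picks either
# the LUT output or the flip-flop output. Names are illustrative.

class BLE:
    def __init__(self, lut_bits, use_ff=False):
        self.lut_bits = lut_bits  # truth table of length 2^K
        self.use_ff = use_ff      # 2:1 mux select: registered output?
        self.ff = 0               # flip-flop state

    def lut(self, inputs):
        # The K input bits (MSB first) form the index into the table.
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.lut_bits[index]

    def clock(self, inputs):
        self.ff = self.lut(inputs)  # register the LUT output

    def output(self, inputs):
        return self.ff if self.use_ff else self.lut(inputs)

# A 2-LUT programmed as XOR: truth table entries for inputs 00,01,10,11.
xor_ble = BLE(lut_bits=[0, 1, 1, 0])
print(xor_ble.output([1, 0]))  # 1
```

Programming the FPGA amounts to loading `lut_bits` (and the mux select) from SRAM configuration cells; any 2-input function is obtained by changing only the table contents.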
Although an FPGA logic block could consist of a single BLE, it is much more common to group
several BLEs together in the same logic block to form a locally interconnected logic cluster as this fast
local interconnect can improve performance and save general routing area [7, 2]. The number of inputs
to a LUT (K) and the number of BLEs in a logic cluster (N) are two important architectural parameters
affecting the area and performance of an FPGA. Ahmed and Rose showed in [2] that K = 4 to 6 and
N = 4 to 10 are good choices in terms of area-delay product. Modern commercial architectures use
comparable values for N and K (Virtex 7: K=6, N=8 [51] and Stratix V: K=6, N=10 [35]).
As illustrated in Figure 2.3, a logic cluster’s local interconnect consists of two types of wires: local
feedback wires and cluster input wires. There are typically N local feedback wires in a cluster; one for
each BLE. Often, many BLEs in a cluster will share common inputs. Accordingly, the number of inputs
to a cluster (I) is less than the number of distinct BLE inputs in a cluster (i.e. N ×K). It was shown
in [2] that (2.1) is a good estimate of the number of inputs required to achieve 98% LUT utilization.
I = (K/2)(N + 1)     (2.1)
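Applying (2.1) to a few representative (K, N) pairs makes the input sharing concrete (these are illustrative calculations of the formula, not results from [2]):

```python
def cluster_inputs(K, N):
    """Estimated cluster inputs for ~98% LUT utilization, Eq. (2.1)."""
    return (K * (N + 1)) // 2

# For K = 6, N = 10 the estimate is well below the N*K = 60 distinct
# BLE inputs, reflecting input sharing among the BLEs in a cluster.
print(cluster_inputs(6, 10))  # 33
print(cluster_inputs(4, 4))   # 10
```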
BLE with internal details shown
BLE
BLE
Total of N BLEs
Local feedback
wires
I cluster inputs
K local routing MUXes
per BLE
K-LUT FF
N BLE outputs
Figure 2.3: Logic cluster architecture.
Local routing multiplexers connect multiple local interconnect wires to each BLE input. These
multiplexers are generally sparsely populated [29]. That is, BLE inputs can be connected to only a
fraction of the wires in the local interconnect; we refer to this fraction as Fclocal. Sparsely populating
the local routing multiplexers reduces their size and thus saves area. In [29], it was shown that reducing
Fclocal from 1.0 to 0.5 reduces area by 10% with no degradation in critical path delay. However, as
recommended by [29], between 2 to 5 spare cluster inputs should be added to (2.1) when sparsely
populating the local routing multiplexers to maintain routability.
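Under these conventions, the size of each local routing multiplexer follows directly from the cluster parameters: the local interconnect contains I cluster input wires plus N feedback wires, and each BLE input multiplexer connects to a fraction Fclocal of them. A hedged sketch (the helper name and example numbers are illustrative, not taken from [29]):

```python
def local_mux_size(I, N, fc_local):
    """Inputs per local routing mux under sparse population fc_local."""
    # I cluster input wires + N local feedback wires form the local
    # interconnect; each BLE input mux sees only a fraction of them.
    return round(fc_local * (I + N))

# Halving fc_local halves the mux size; per [29], fc_local = 0.5 saves
# ~10% area with no critical-path delay penalty (spare inputs aside).
print(local_mux_size(I=33, N=10, fc_local=1.0))  # 43
print(local_mux_size(I=33, N=10, fc_local=0.5))  # 22
```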
2.1.2 Routing Architecture
Logic blocks are interconnected by programmable routing channels that run horizontally and vertically
on top of a tile (Figure 2.1). The number of tracks in a routing channel is referred to as its width (W).
In this work, we assume that the width of horizontal and vertical routing channels are equivalent, but
it is possible for them to be different. For example, the horizontal routing channels on Stratix FPGAs
are wider than the vertical channels due to the rectangular layout of their logic blocks [34].
A routing track is composed of wire segments that span one or more tiles. The length (L) of a routing
segment specifies the number of tiles that it spans. For example, Figure 2.4 shows a routing channel
that consists of four tracks of L = 2 wire segments and four tracks of L = 4 wire segments. Note that
staggering the start points of wire segments as in Figure 2.4 is necessary for a tile-based layout as it
ensures that all tiles remain identical [8].
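The staggering requirement can be sketched concretely: for length-L segments, a group of L tracks starts its segments at offsets 0 through L-1, so every tile sees exactly one segment start per group and all tiles remain identical (an illustrative model, not thesis code):

```python
def segment_starts(L, num_tiles):
    """Tile indices where a new length-L segment begins, per track.

    Track t's segments start at tiles t, t+L, t+2L, ..., so across the
    L staggered tracks every tile sees exactly one segment start point,
    which keeps all tiles identical for a tile-based layout.
    """
    return {t: list(range(t, num_tiles, L)) for t in range(L)}

starts = segment_starts(L=4, num_tiles=8)
print(starts)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
# Every tile index 0..7 appears exactly once across the four tracks.
```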
A horizontal and a vertical routing channel intersect at each tile. The set of programmable switches
that allow connections to be made between routing tracks at this intersection is called a switch block
(SB in Figure 2.1). Switch block flexibility (FS) specifies the number tracks to which any track can
connect in a switch block. An FS of 3, where each horizontal track connects to another horizontal
track and two vertical tracks (and vice-versa), is common [49]. The specific tracks to which each track
connects is determined by the switch block pattern [7, 37] as well as the routing driver architecture. In
FPGA tiles
Length 2 wire segments
Length 4 wire segments
Figure 2.4: Routing segment lengths.
a multi-driver routing architecture (Figure 2.5a), a wire can be driven by multiple tri-state drivers at
multiple points along its length. In contrast, in a single-driver routing architecture (Figure 2.5b), a wire
can only be driven by a single multiplexer-based driver usually placed at one end of the wire. Figures
2.5a and 2.5b also show that logic block outputs connect to the routing tracks differently based on the
routing driver architecture. That is, multi-driver architectures connect logic block outputs directly to the
routing wires while single-driver architectures connect logic block outputs to the routing wires through
switch-block multiplexers. Although multi-driver routing architectures have been widely used in the
past [7, 2], single-driver routing has become the dominant routing architecture style in both academic
research [28, 27, 24] and commercial FPGAs [34, 33]. In this work, we focus on single-driver routing
architectures. In [28], Lemieux et al. found that FPGAs with single-driver routing had 9% lower delay
and were 25% smaller than FPGAs with multi-driver routing.
Connection block multiplexers connect multiple routing tracks to each logic block input (see Figure
2.5). The number of tracks that can connect to each logic block input is called the connection block
input flexibility (Fcin). Similarly, the number of routing wires that each logic block output can connect to
is given by the connection block output flexibility (Fcout). Reducing Fcin from W to 0.2W as the logic
cluster size increases from N = 1 to 20 and using an Fcout of W/N were found to be good choices in [7].
These interconnect flexibility values have generally been used as rules of thumb in subsequent FPGA
research.
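These rules of thumb translate directly into connection block multiplexer sizes; a hedged sketch (the function name and the example channel width are illustrative, not values from [7]):

```python
def connection_flexibility(W, N):
    """Rule-of-thumb connection block flexibilities following [7].

    Fcin ~ 0.2*W tracks per logic block input (for large clusters),
    Fcout ~ W/N routing wires reachable per logic block output.
    """
    fc_in = max(1, round(0.2 * W))
    fc_out = max(1, round(W / N))
    return fc_in, fc_out

# Example: a 320-track channel and N = 10 cluster (illustrative).
print(connection_flexibility(W=320, N=10))  # (64, 32)
```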
2.1.3 Commercial BLE Architectures
The BLEs of modern commercial FPGAs are much more complex than the commonly used academic BLE
described in Section 2.1.1 (Figure 2.2). Instead of a single K-LUT, some modern FPGA architectures
[33, 35, 51] use fracturable LUTs, which are LUTs that can be configured as one large LUT or multiple
smaller LUTs. For example, the Stratix V fracturable 6-LUT can be split into two 5-LUTs or four
4-LUTs provided that the functions being mapped to these LUTs meet certain constraints [35]. Modern
BLEs also commonly support configuring LUTs as memories (LUTRAM) or shift registers and usually
contain hard arithmetic carry logic [35, 52]. However, to keep the scope of this work tractable, we only
consider regular K-LUTs, which are still relevant, and we do not consider carry logic as current academic
CAD tools do not fully support this functionality.
The commonly used academic BLE shown in Figure 2.2 has a very limited ability to use both the
lookup table and flip-flop together. Modern commercial BLEs include additional 2:1 multiplexers to
allow the lookup table and flip-flop to be used in concert in many more ways [3, 52]. These extra
multiplexers are included in our designs and will be described in more detail in Section 3.2.
Chapter 2. Background 8
Figure 2.5: Multi-driver (a) and single-driver (b) routing architectures.
Figure 2.6: FPGA architecture assessment methodology with VPR.
2.2 FPGA Architecture Assessment Methodology
The quality of an FPGA in terms of area, performance and power consumption is a function of the
architectural parameters described in Section 2.1. These architecture parameters interact in complex
ways; hence determining the best choice for each parameter is a challenging task. Although there has
been some work towards developing analytical models to evaluate FPGA architectures [46, 26, 16],
the standard architecture assessment procedure used by both commercial FPGA manufacturers and
academic researchers is an experimental one that consists of implementing benchmark circuits on a
candidate architecture in order to evaluate its area, delay and power.
Figure 2.6 shows the standard academic CAD flow used to evaluate FPGA architectures [7]. The
CAD flow proceeds as follows. Benchmark circuits are first synthesized and mapped into lookup tables
(LUTs), flip-flops (FF) and hard-blocks (multipliers and block memories) based on a description of the
architecture. LUTs and FFs are then packed into clusters in a manner that attempts to keep related
LUTs and FFs in the same cluster such that connections between them can be routed through the logic
cluster’s fast local interconnect. Next, each cluster is placed into a specific logic block on the FPGA
that minimizes both the delay and the wire length of connections between logic clusters as much as
possible. Once all logic clusters have been placed, connections between logic blocks are routed through
the FPGA’s general purpose interconnect. The routing algorithm tries to minimize the benchmark
circuit’s critical path delay, while using the least amount of routing resources possible. Finally, timing
analysis is performed to determine the implemented benchmark circuit’s critical path delay and area is
calculated based on tile area and the number of logic blocks required by the placement.
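The structure of this flow can be sketched in a few lines. The stages below are trivial stand-ins with an invented cost model, not real CAD algorithms; only the pack-then-analyze structure of the flow comes from the text.

```python
# Toy sketch of the evaluation flow of Figure 2.6. Stages are invented
# stand-ins: real packing, placement and routing are far more involved.
def pack(primitives, n):
    """Greedily group LUT/FF primitives into clusters of at most n."""
    return [primitives[i:i + n] for i in range(0, len(primitives), n)]

def evaluate(primitives, n, tile_area, delay_per_hop):
    clusters = pack(primitives, n)
    # Stand-in "analysis": area scales with the logic blocks used, and the
    # critical path is modeled as crossing every cluster once.
    area = tile_area * len(clusters)
    critical_path_delay = delay_per_hop * len(clusters)
    return area, critical_path_delay

# 25 primitives packed into clusters of N = 10 -> 3 logic blocks.
area, delay = evaluate(list(range(25)), n=10, tile_area=5000.0,
                       delay_per_hop=1.0)
print(area, delay)  # 15000.0 3.0
```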
The packing, placement and routing phases of the flow of Figure 2.6 are performed by VPR [7].
Since many of the algorithms used by VPR are timing-based, the VPR architecture file must describe
Figure 2.7: Six transistor SRAM cell.
the delays through the lookup tables, routing multiplexers and any other circuitry that makes up the
FPGA. The delay of these circuits depends on the circuit topologies used, as well as the transistor sizing
of the FPGA circuitry. Consequently, evaluating an FPGA architecture requires first completing its
transistor-level design.
2.3 FPGA Circuit Design
As mentioned in Section 2.1, we only consider soft-logic-based FPGAs with single-driver routing architectures in this thesis. Soft-logic FPGA architectures consist entirely of SRAM cells, routing multiplexers,
lookup tables and flip-flops. This section describes commonly used circuit topologies and circuit design
practices for these structures.
2.3.1 SRAM cells
An FPGA typically contains millions of memory bits used to configure routing multiplexers and store
lookup table logic functions. Because there are so many of them, a key design goal for these memory
bits is small area. Stability is also important, as state flipping would cause problems such as incor-
rectly configured routing multiplexers. A six transistor SRAM cell (Figure 2.7) has been the standard
implementation in FPGA research [7] as it achieves both design goals reasonably well.
2.3.2 Routing Multiplexers
Routing multiplexers account for a large fraction of the area and delay of an FPGA. Consequently,
it is crucial to choose a circuit implementation that is as efficient as possible. There are a number
of approaches that can be taken to build a multiplexer but most commercial FPGAs and almost all
academic FPGA studies use an NMOS pass-transistor-based approach because each switch requires only
one transistor, minimizing area. Figure 2.8 shows three of the most commonly used pass-transistor
multiplexer topologies. Each multiplexer style possesses a different area-delay tradeoff that is a function
of the number of multiplexer inputs [27, 9]. For example, since it has just one pass-transistor on the
signal path, a 1-level multiplexer is generally faster than a 2-level multiplexer. But, for a large number
of inputs, a 1-level multiplexer requires more SRAM cells than a 2-level multiplexer and can thus have
larger area. Furthermore, if the number of inputs is large enough, a 1-level multiplexer could even
become slower than a 2-level multiplexer due to a greater number of transistors loading the output node.
Figure 2.8: Different 8:1 pass-transistor multiplexer topologies: (a) tree MUX, (b) 1-level MUX, (c) 2-level MUX.
Figure 2.9: Multiplexer followed by two-stage buffer with PMOS level-restorer.
It was shown in [9] that a 2-level multiplexer generally yields a lower area-delay product than a 1-level
or tree multiplexer. Commercial FPGAs also commonly use 2-level multiplexers [33].
Although they are beneficial in terms of area, pass-transistors have an important disadvantage: they
are incapable of passing a full logic-high voltage. That is, their output voltage saturates at approximately
VG − VTh where VG is the gate voltage and VTh is the threshold voltage of the transistor. In FPGA
circuitry, the output of a pass-transistor-based routing multiplexer is typically driven by a multi-stage
buffer [7, 30, 33]. Static power dissipation in these buffers caused by the reduced voltage swing of pass-
transistors has long been a cause for concern [7]. To mitigate this problem, gate boosting [7] (applying
a voltage larger than the supply voltage (VDD) on the pass-transistor gate) and PMOS level-restorers
[30, 33] have been used to help pull pass-transistor output voltages up to VDD. Figure 2.9 shows a
routing multiplexer followed by a two-stage buffer equipped with a PMOS level-restorer.
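The degraded logic-high and the effect of gate boosting can be illustrated with a quick numeric sketch. All voltage values below are illustrative assumptions, not figures from this thesis or from any particular process.

```python
# An NMOS pass-transistor's output high saturates near VG - VTh.
# Voltages are assumed example values (in volts).
def pass_transistor_vout_high(v_gate, v_th, v_in):
    """Highest voltage an NMOS pass-transistor can pass for input v_in."""
    return min(v_in, v_gate - v_th)

VDD, VTH = 0.8, 0.3
# Unboosted gate: the output is stuck roughly one threshold below VDD.
print(pass_transistor_vout_high(VDD, VTH, VDD))        # about 0.5 V
# Gate boosting (gate driven above VDD) recovers the full swing.
print(pass_transistor_vout_high(VDD + 0.3, VTH, VDD))  # about 0.8 V
```

The reduced swing in the first case is what leaves the first buffer stage partially on, causing the static power concern noted above.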
Figure 2.10: Fully encoded MUX tree 3-LUT.
2.3.3 Lookup Tables
Like routing multiplexers, lookup tables also use pass-transistor-based multiplexer circuitry, but the
multiplexer input and control connectivity is reversed. In a lookup table, SRAM cells connect to the
inputs of the multiplexer and hold the logic function's truth table, while the gates of the multiplexer
are controlled by the lookup table inputs. Consequently, lookup tables are generally implemented as
fully-encoded multiplexer trees, such that each level of the tree can be connected to a LUT input [7].
Figure 2.10 shows a 3-input fully encoded multiplexer tree lookup table followed by a two-stage buffer.
2.3.4 Flip-Flops
Flip-flops are generally implemented as standard master-slave registers [7]. However, some commercial
FPGAs use flip-flops that are more advanced. For example, Altera's Stratix V FPGAs use flip-flops
based on pulse latches and configurable pulse width generators to improve performance [35].
2.4 Modeling of FPGA Circuitry
Evaluating an FPGA architecture with the assessment methodology described in Section 2.2 requires that
we develop models that allow us to estimate the area and delay of FPGA circuitry because fabricating
an integrated circuit for each architecture to measure area and delay is obviously not practical. In this
section, we describe commonly used area and delay modeling approaches for FPGAs. These models are
also useful for transistor sizing, which we will discuss in Section 2.5.
Figure 2.11: Minimum-width transistor area model.
2.4.1 Area Modeling
Creating a complete layout is the best way to determine the exact area of an FPGA. However, this
process is much too time consuming when multiple designs need to be explored. A variety of different
approaches have been used to more quickly estimate area such as counting transistors or counting SRAM
cells, but the most widely used in FPGA research is the minimum-width transistor area model introduced
in [7]. In this model, layout area is expressed in units of minimum-width transistor areas. A minimum-
width transistor is defined as the smallest possible contactable transistor for a specific process technology
and one minimum-width transistor area is the area of this transistor plus the spacing to neighboring
transistors as shown in Figure 2.11. Unlike area models that simply count transistors or SRAM cells, the
minimum-width transistor area model provides an actual estimate of layout area. This is an important
distinction because as well as being more accurate, actual layout area estimates enable better estimates
of wire loads since wire lengths are layout dependent.
Transistors in FPGA circuitry often require more drive strength than that provided by a minimum-
width transistor. A transistor’s drive strength can be increased by either widening its diffusion region
(Figure 2.12b) or by adding parallel diffusion regions (Figure 2.12c). Consequently, increasing a transistor's drive strength increases its area. The widely-used area model of [7] estimates the layout area
of a transistor with drive strength x, in units of minimum-width transistor areas, with (2.2), which was
obtained by averaging the layout areas that result from either widening the diffusion region or adding
parallel diffusion regions to increase drive strength.
Area(x) = 0.5 + 0.5x (2.2)
Then, [7] calculates the area of an FPGA subcircuit by simply summing the areas of all the transistors
in that subcircuit. Note from (2.2) that doubling a transistor's drive strength does not double its area.
This is due to the fact that increasing a transistor’s drive strength only increases certain transistor
dimensions while others remain constant. For example, the spacing to neighboring transistors remains
the same regardless of a transistor’s drive strength.
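Eq. (2.2) and the per-subcircuit summation it feeds are simple enough to sketch directly; the drive strengths in the example are arbitrary.

```python
# Minimum-width transistor area model of [7], Eq. (2.2): a transistor of
# drive strength x occupies 0.5 + 0.5*x minimum-width transistor areas.
def transistor_area(x):
    """Layout area in minimum-width transistor areas, Eq. (2.2)."""
    return 0.5 + 0.5 * x

def subcircuit_area(drive_strengths):
    """Subcircuit area = sum of the areas of all its transistors."""
    return sum(transistor_area(x) for x in drive_strengths)

print(transistor_area(1))             # 1.0 (a minimum-width transistor)
print(transistor_area(2))             # 1.5 (doubling drive != doubling area)
print(subcircuit_area([1, 1, 4, 8]))  # 9.0
```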
Figure 2.12: Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c). Note: Although not shown in the figure for simplicity, parallel diffusions must be connected together.
2.4.2 Delay Modeling
Time-domain circuit simulators such as HSPICE are generally the most accurate way to estimate the
delay of a circuit. However, time-domain simulation can be computationally intensive, making it impractical when a large number of delay measurements need to be obtained. For example, the timing
analysis phase of the architecture assessment flow described in Section 2.2 involves measuring delay for
the thousands of nets in a benchmark circuit; performing time-domain simulation for each one would lead
to prohibitively long runtimes. Instead, previous FPGA research work has typically modeled wires and
transistors as linear resistances and capacitances, such that a transistor-based circuit can be modeled as
an RC-tree network [22, 7, 24]. The delay of this network can then be estimated with the Elmore [15] or
the Penfield-Rubinstein [20] delay models, which are much quicker than time-domain simulations. With
the Elmore delay model, the delay T_D of a path is given by:

T_D = Σ_{i ∈ path} R_i · C(subtree_i)    (2.3)

where R_i is the equivalent resistance of element i along the path and C(subtree_i) is the total downstream capacitance rooted at element i.
An enhanced version of the Elmore delay model was proposed in [39]. Since it is more difficult to
model a buffer as a simple RC circuit due to the buffer’s intrinsic delay, [39] combines the Elmore delay
model with a common model of buffer delay where a buffer is modeled as a constant delay and a resistor.
This approach maps well to FPGA circuitry, which consists mostly of pass-transistors and buffers, and
was adopted as the delay model for VPR in [7]. With this model, the delay T_D of a path is given by:

T_D = Σ_{i ∈ path} (R_i · C(subtree_i) + T_buf,i)    (2.4)

where T_buf,i is the buffer's intrinsic delay if element i is a buffer, or 0 otherwise [7].
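This model reduces to a short computation over the elements of a path. The sketch below evaluates Eq. (2.4) on an invented two-element path; the resistance, capacitance and buffer-delay values are illustrative, not taken from this thesis.

```python
# VPR-style delay model of Eq. (2.4): for each element i on the path, add
# R_i times the total downstream capacitance C(subtree_i), plus the
# intrinsic delay T_buf,i if the element is a buffer.
def path_delay(path):
    """path: list of (R_i, C_subtree_i, T_buf_i) tuples."""
    return sum(r * c + t_buf for r, c, t_buf in path)

# Invented example: a pass-transistor (R = 2 kOhm) driving 10 fF, then a
# buffer (intrinsic delay 20 ps, output R = 1 kOhm) driving a 30 fF load.
path = [
    (2e3, 10e-15, 0.0),     # pass-transistor: Elmore term only (T_buf = 0)
    (1e3, 30e-15, 20e-12),  # buffer: Elmore term plus intrinsic delay
]
print(path_delay(path))  # about 70 ps
```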
2.5 Automated Transistor Sizing
Transistor sizing is a well-studied problem that consists of improving a circuit’s performance by increasing
the sizes of its transistors and thus provides yet another level, in addition to architecture and circuit
design, at which the area and delay characteristics of an FPGA can be adjusted. The transistor sizing
optimization problem is usually formulated in one of three ways:
1. Minimize some function of area and delay.
2. Minimize area subject to a delay constraint.
3. Minimize delay subject to an area constraint.
There has been much prior work on automated transistor sizing for custom circuitry. Fishburn and
Dunlop showed in [17] that modeling transistors as linear resistances and capacitances and calculating the
delay of the resulting RC circuits with the Elmore [15] or the Penfield-Rubinstein [20] delay model (i.e.
(2.3)) allows the transistor sizing problem to be formulated as a convex optimization problem, which
guarantees that any local minimum is the global minimum. With this useful property, [17] develops
TILOS, a transistor sizing tool for custom circuits based on a heuristic method that iteratively identifies
a circuit’s critical path and increases transistor sizes on that path until all timing constraints are met.
Despite the convexity of the problem, TILOS’s heuristic is such that it can terminate with a sub-
optimal solution [45]. Algorithms guaranteeing the optimal solution through convex optimization [44]
or mathematical relaxation techniques [10, 47] have subsequently been proposed but these algorithms,
along with TILOS, all suffer from their reliance on linear device models and the Elmore delay, which have
long been known to be inaccurate [40, 21]. To enhance accuracy, at the cost of increased computational
complexity, some transistor sizing algorithms have used time-domain simulation to obtain delay estimates
[14, 13].
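The TILOS-style critical-path heuristic can be sketched as a small loop. The one-line delay model below (stage delay inversely proportional to transistor size) and all the numbers are invented for illustration; TILOS itself uses sensitivity-based sizing on real RC models and is considerably more sophisticated.

```python
# Toy sizing loop in the spirit of TILOS [17]: repeatedly recompute stage
# delays, find the slowest stage, and upsize the transistor driving it
# until the path meets its timing constraint.
def tilos_like_sizing(intrinsic_delays, t_constraint, step=1.5, max_iters=100):
    sizes = [1.0] * len(intrinsic_delays)
    for _ in range(max_iters):
        delays = [d / s for d, s in zip(intrinsic_delays, sizes)]
        if sum(delays) <= t_constraint:   # timing constraint met
            return sizes
        worst = max(range(len(delays)), key=delays.__getitem__)
        sizes[worst] *= step              # upsize the critical stage
    return sizes

sizes = tilos_like_sizing([4.0, 2.0, 1.0], t_constraint=4.0)
print(sizes)  # the slowest stage receives the most upsizing
```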
The programmability of FPGAs adds unique features to the transistor sizing problem which makes
FPGA-specific transistor sizing tools valuable. Kuon and Rose proposed such a tool in [24]. Their FPGA
transistor sizing approach differs from transistor sizing algorithms for custom circuits because it deals
with two features unique to FPGAs. The first of these features is repetition. As described in
Section 2.1, an FPGA consists of an array of tiles. Since these tiles are all identical, transistor-level design
only needs to be performed for one of them. This design can then be replicated to obtain a complete
FPGA. Similar design space reductions can be found within a tile. For example, a switch block can
include over 100 logically equivalent multiplexers whose transistor-level design should be kept identical.
Consequently, only ∼80 unique transistors need to be sized when designing an FPGA’s soft-logic despite
there being billions of transistors on the chip, which is in contrast to transistor sizing for custom circuits
where the whole chip must be considered. This reduced design space makes HSPICE-based optimization
practical for FPGAs, but as we show in Section 3.7, we must still search this space intelligently to keep
runtime reasonable.
The second feature unique to FPGA transistor sizing is the undefined critical path. Because they are
programmable, FPGAs have application-dependent critical paths which implies that at design time, there
is no clear critical path to optimize for delay. To deal with this issue, [24] optimizes a representative
path that contains one of each type of FPGA subcircuit (LUTs, MUXes, etc.). Delay is taken as a
weighted sum of the delay of each subcircuit and the weighting scheme is chosen based on the frequency
with which each subcircuit was found on the critical paths of placed and routed benchmark circuits.
Optimizing a representative critical path still presents a huge design space which Kuon and Rose tackle
with a two-phased algorithm that consists of an exploratory phase that utilizes linear device models and
a TILOS-like transistor sizing heuristic to keep CPU times reasonable, followed by an HSPICE-based
fine-tuning phase that adjusts the transistor sizes to account for the inaccuracies of linear models.
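The weighted-sum delay metric of [24] amounts to the following short computation; the subcircuit names, delays and weights below are invented placeholders, not the weights used in that work.

```python
# Representative-path delay in the style of [24]: a weighted sum of
# subcircuit delays, with weights reflecting how often each subcircuit
# type appears on the critical paths of placed-and-routed benchmarks.
def representative_delay(delays, weights):
    return sum(weights[name] * d for name, d in delays.items())

# Invented example values (seconds and dimensionless weights).
delays = {"lut": 150e-12, "local_mux": 60e-12, "sb_mux": 90e-12}
weights = {"lut": 0.4, "local_mux": 0.2, "sb_mux": 0.4}
print(representative_delay(delays, weights))  # weighted path delay
```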
In [46], Smith et al. present a method that enables the rapid and concurrent optimization of high-
level architecture parameters and transistor sizes for FPGAs through the use of analytic architecture
models, linear device models and a convex optimization-based transistor sizing algorithm. They show
that this concurrent optimization can have a significant impact on architectural conclusions versus a
separate optimization.
Chapter 3

COFFE: Automated Optimization of FPGA Circuitry
When developing a new chip, FPGA architects are faced with two main tasks: choosing an architecture
for their FPGA and performing the transistor-level design of that architecture. As described in Section
2.2, choosing an architecture is typically accomplished experimentally with architecture exploration tools
such as VPR [7]. By implementing benchmark circuits on a proposed FPGA, these tools allow architects
to evaluate the area, delay and power impact of various architectural choices. Then, based on their
observations, architects can select an FPGA architecture that meets their design goals and constraints.
Transistor-level design consists of selecting circuit topologies for the various subcircuits that im-
plement the chosen architecture, as well as sizing the transistors of those subcircuits. Transistor-level
design is an essential precursor to the evaluation of an architecture because it provides accurate area,
delay and power estimates of the underlying FPGA circuitry; these estimates are required inputs to
the architecture exploration tools. Transistor sizing also provides an additional opportunity to tune the
area, delay and power of an FPGA. Therefore, developing a new FPGA is an iterative process that
involves performing the transistor-level design of various architectures before evaluating them through
synthesis, placement and routing experiments. This interdependence between architecture exploration
and transistor-level design necessitates automated design tools if high-quality results are to be obtained
in reasonable amounts of time.
In this chapter, we describe COFFE (Circuit Optimization For FPGA Exploration), a fully-automated
transistor sizing tool for FPGAs that we developed as part of this thesis. COFFE enables the design
flow detailed above by providing area, delay and power estimates of properly sized FPGA circuitry.
COFFE also enables design exploration of FPGA circuitry and we will use COFFE in such a capacity
in Chapter 4. Although COFFE solves the same problem as Kuon and Rose’s FPGA transistor sizing
tool [24] (see Section 2.5), we have made significant improvements which are necessary for FPGAs in
advanced process nodes; these improvements will be described in the following sections.
3.1 Introduction to COFFE
Figure 3.1 shows the FPGA design flow we wish to enable with COFFE. COFFE is used to perform
transistor-level optimization for some architecture of interest, thus producing accurate area and delay
Figure 3.1: FPGA design flow.
estimates for the subcircuits of this architecture (LUTs, routing multiplexers, etc.). These estimates are
used by VPR to evaluate the architecture through place and route experiments. Based on the results
of the assessment, the architecture parameters are adjusted and sent back to COFFE to begin a new
iteration of optimization and evaluation.
COFFE’s circuit optimizer makes area and performance trade-offs through transistor sizing. Like
[24], COFFE’s optimization objective is of the form AreabDelayc thus allowing for different area and
performance tradeoffs by varying b and c. Creating a complete layout is the most accurate way to obtain
the area and delay measurements needed during transistor sizing. However, for the iterative design flow
of Figure 3.1, this approach is impractical as layout is a very time consuming task. Instead, COFFE
estimates area with an improved version of the minimum-width transistor area model (see Section 3.4)
and measures delay with HSPICE simulations. Although previous FPGA transistor sizing tools have
used linearized models of transistors to measure delay during certain phases of the optimization, we
show in Section 3.5 that such models are highly inaccurate for the fine-grained transistor-level design we
wish to undertake in advanced process nodes such as the 22nm process we use in this work.
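The Area^b · Delay^c objective can be sketched as a simple cost function. The two design points below are invented to show how the exponents steer the trade-off; they are not results from this work.

```python
# COFFE-style optimization objective: cost = Area^b * Delay^c.
# Varying b and c trades off area against performance.
def cost(area, delay, b=1.0, c=1.0):
    return (area ** b) * (delay ** c)

a1, d1 = 1000.0, 1.0   # smaller but slower design point (arbitrary units)
a2, d2 = 1300.0, 0.8   # larger but faster design point

print(cost(a1, d1) < cost(a2, d2))            # True: area-delay product
                                              # favors the smaller design
print(cost(a1, d1, c=4) < cost(a2, d2, c=4))  # False: weighting delay
                                              # heavily favors the faster one
```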
COFFE automatically generates the SPICE netlists required for delay measurement based on the
input architecture parameters and the circuit topologies described in Sections 3.2 and 3.3 respectively.
These netlists are parametrized such that COFFE’s circuit optimizer can change the transistor sizes
by simply changing a transistor size parameter list. To obtain meaningful delays, COFFE is careful to
ensure that these netlists include realistic transistor and wire loading. Transistor loads are relatively
easy to determine based on architectural parameters and circuit topologies. Wire loads, on the other
hand, are layout dependent making them more difficult to determine since the exact layout is not known.
COFFE estimates wire loads with the model described in Section 3.6.
3.2 Architecture
Figure 3.2 shows the tile architecture that COFFE supports in its designs and Table 3.1 lists the archi-
tecture parameters that COFFE expects as inputs. Parameters listed in the top portion of Table 3.1
Figure 3.2: COFFE's supported tile architecture.
are commonly used in FPGA research and were described in Sections 2.1.1 and 2.1.2. COFFE supports
routing wires of any length (L) but currently, all routing wires in a channel must be of the same length.
That is, architectures with multiple wire segment lengths, such as an architecture with both L = 4 and
L = 8 wire segments, are not supported. Note that COFFE uses directional, single-driver routing wires.
Parameters listed in the bottom portion of Table 3.1 are new and help COFFE describe a more flexible
BLE than the commonly used academic BLE shown in Figure 2.2. The COFFE BLE still consists of
a K-input LUT and FF but supports optionally including additional 2-input multiplexers to allow the
LUT and FF to be used simultaneously in many more ways. These extra 2:1 MUXes can potentially
help improve density and speed and are similar to the ones used in Stratix [34]. The new multiplexers
and their use are described below.
1. Register feedback multiplexer. An FF output driving a LUT input is a common occurrence in placed
and routed benchmark circuits. With the BLE of Figure 2.2, this requires driving the FF output
onto the cluster’s local interconnect and connecting this signal to a LUT input in another BLE.
Including a register feedback multiplexer (MUX-A in Figure 3.2) on a LUT input allows the FF
output to directly drive a LUT input in the same BLE, which saves routing resources in addition
to providing a faster connection. COFFE’s Rfb parameter allows optionally including a register
feedback multiplexer on a LUT input. Each LUT input has an Rfb parameter.
2. FF input selection multiplexer. The register feedback multiplexer (MUX-A) is made more useful
by also including a FF input selection multiplexer (MUX-B). With this multiplexer, the FF can
accept its input either from the LUT output or from a BLE input. For the “FF output driving
a LUT input” example described previously, MUX-B enables first registering a BLE input signal
before connecting it to a LUT input via MUX-A, all in the same BLE. The FF input selection
Table 3.1: COFFE's expected input architecture parameters.

Parameter   Description
K           LUT size
N           Cluster size
I           Number of cluster inputs
Fcin        Cluster input connection flexibility
Fcout       Cluster output connection flexibility
W           Routing channel width
L           Wire segment length
FS          Switch block flexibility
Fclocal     Local interconnect to BLE input connection flexibility

Rfb         Register feedback per LUT input (on/off)
Rsel        Register input select (LUT only / LUT and BLE input)
Ofb         Number of local feedback outputs per BLE
Or          Number of general routing outputs per BLE
multiplexer is also useful for another reason: it enables the use of the LUT and FF of the same BLE
for two completely unrelated functions. COFFE’s Rsel parameter allows the optional inclusion of
the FF input selection multiplexer.
3. BLE output multiplexers. Using the LUT and FF of a BLE for two unrelated functions requires
that we can drive both the LUT output and FF output onto the local interconnect or general
routing at the same time. Prior academic work has assumed one output per BLE, but COFFE
adopts a more general model with BLEs that support variable numbers of local feedback and
general routing outputs, as specified by the Ofb and Or parameters respectively. All BLE outputs
can be driven by either the LUT output or the FF output (MUX-C and MUX-D).
COFFE uses a two-sided architecture, which means logic block inputs and outputs can only access
the two routing channels (one vertical and one horizontal) which run over top of the tile, as shown in
Figure 2.1. Implementing the routing wires over the logic in this way leads to a more efficient layout
and is the common commercial practice. Four-sided architectures (capable of accessing 2 vertical and 2
horizontal channels) have often been assumed in prior work but are less realistic since such architectures
would be difficult to lay out when the routing wires run over the logic. We have performed VPR
experiments with the architecture described in Section 3.8.1 (N = 10, K = 6) and found that using
the more realistic two-sided architecture results in a 3-4% critical path delay increase and 8-9% routed
wire length increase over a four-sided architecture. Note from Figure 3.2 that COFFE does not include
track buffers on connection block multiplexer inputs, which have often been used in academic work [7],
because these buffers are difficult to lay out and are not used in modern commercial architectures.
3.3 Circuit Topologies
COFFE uses a two-level multiplexer topology as shown in Figure 3.3a for all multiplexers except for the
2:1 multiplexers inside the BLE, which are implemented with a single level and a shared SRAM bit as
shown in Figure 3.3b. An important parameter in the design of two-level multiplexers is the size of each
level. If S1 and S2 are the sizes of the first and second levels respectively, any combination of S1 and S2
Figure 3.3: COFFE's multiplexer circuit topologies: (a) routing multiplexer topology, (b) 2:1 multiplexer topology.
such that S1 × S2 = MUXsize is a possible MUX topology. Since SRAM cells typically occupy 35-40%
of tile area (as shown in Section 4.2.6), we choose a multiplexer topology that minimizes the number
of SRAM cells required by having S1 ≈ S2. The output of each multiplexer is driven by a two-stage
buffer, enabling it to drive the frequently large downstream capacitance. Note from Figure 3.3 that
COFFE includes PMOS level restorers to help pull degraded pass-transistor outputs to VDD.
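The S1 ≈ S2 choice can be illustrated with a small search over the possible splits. This is a sketch of the selection rule only, not COFFE's actual code.

```python
import math

# Any split with S1 * S2 >= mux_size is a valid two-level topology;
# S1 ~ S2 ~ sqrt(mux_size) minimizes the S1 + S2 SRAM cells needed
# (versus mux_size cells for a one-hot 1-level multiplexer).
def two_level_split(mux_size):
    """Return (S1, S2) with S1*S2 >= mux_size minimizing S1 + S2."""
    best = None
    for s1 in range(1, mux_size + 1):
        s2 = math.ceil(mux_size / s1)
        if best is None or s1 + s2 < sum(best):
            best = (s1, s2)
    return best

print(two_level_split(16))  # (4, 4): 8 SRAM cells vs 16 for a 1-level mux
```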
COFFE implements LUTs as fully encoded multiplexer trees. Since the delay of a chain of pass-transistors increases quadratically with chain length, COFFE supports internal re-buffering along the chain.
Figure 3.4 shows a portion of a 6-LUT with internal re-buffering after 3 levels of pass-transistors. We
also experimented with an internal re-buffering topology that consisted of placing two separate inverters
along the pass-transistor chain: one after 2 pass-transistors and one after 4 pass-transistors. However,
due to the degraded output of pass-transistors, this topology requires a PMOS level-restorer at each in-
verter and skewed P/N ratios, making the two-stage buffer topology shown in Figure 3.4 more efficient.
The architectures that we use in this work are all 6-LUT architectures and are based on the circuit topology shown in Figure 3.4. Isolation inverters between the SRAM cells and the pass-transistor tree are
also included in COFFE’s LUT topologies. These inverters improve robustness by isolating the SRAM
cells from the constantly switching pass-transistor tree, thus preventing capacitive noise injection, which
could upset an SRAM cell. As well, the isolation inverters improve speed since they are larger than the
inverters inside the SRAM cell and can better drive a signal down the pass-transistor tree. Each level of
the pass-transistor tree is controlled by a LUT input driver. Each LUT input driver will have to drive
a different load and so COFFE sizes each one distinctly.
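The quadratic growth that motivates re-buffering follows directly from the Elmore model of Section 2.4.2: an n-stage chain of identical RC sections has delay R·C·n(n+1)/2. The resistance, capacitance and buffer-delay values below are invented to make the comparison concrete.

```python
# Elmore delay of a chain of n identical RC stages grows quadratically
# with n, so splitting a 6-deep pass-transistor chain into two 3-deep
# chains joined by a buffer can reduce delay.
def rc_chain_delay(n, r, c):
    """Elmore delay of n identical RC stages: sum_{i=1..n} i*R*C."""
    return r * c * n * (n + 1) / 2

R, C, T_BUF = 2e3, 2e-15, 10e-12          # assumed example values
unbuffered = rc_chain_delay(6, R, C)       # one 6-deep chain
rebuffered = 2 * rc_chain_delay(3, R, C) + T_BUF  # two 3-deep chains + buffer
print(unbuffered > rebuffered)  # True for these values
```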
Finally, similar to [7], COFFE implements flip-flops as a static transmission gate-based master-slave
register as shown in Figure 3.5. As we will show in Section 4.2.6, the impact of the flip-flops on critical
path delay and tile area is relatively small. The former is due to the fact that a timing path can consist
of at most two flip-flops: one at the beginning and one at the end. The latter is due to the fact that flip-
flop circuitry is inherently small compared to lookup tables and far outnumbered by routing multiplexer
circuitry. Consequently, we did not explore different flip-flop implementations.
Chapter 3. COFFE: Automated Optimization of FPGA Circuitry 22
Figure 3.4: Fully encoded MUX tree 6-LUT with internal re-buffering (partial view).
Figure 3.5: Static transmission gate-based master-slave register.
3.4 Area Modeling
The most accurate way to determine the area of an FPGA is to create a complete layout. However,
as described in Section 3.1, designing an FPGA is an iterative process, making layout of each iteration
impractical. FPGAs are generally transistor area limited [7]. Hence, a fast-to-compute but accurate
estimate of transistor layout area is needed. In this work, we use the minimum-width transistor area
model described in Section 2.4.1 but we make a number of improvements to enhance its accuracy in
advanced process nodes.
Figure 3.6 shows that (2.2) over-predicts transistor area by as much as 143% compared to our manual
layouts with TSMC’s 65nm layout rules (which were the most advanced layout rules to which we had
access). In [24], the constants in (2.2) were adjusted to match more advanced process rules but its
area estimates for our 65nm layouts are still inaccurate (Figure 3.6). Consequently, we develop a new
version of the minimum-width transistor area model whose accuracy is improved in several ways. First,
we assume reasonably square layouts. Recall from Section 2.4.1 that, to obtain (2.2), [7] averages the
Figure 3.6: Transistor area prediction accuracy of original (Eq. 2.2) and improved (Eq. 3.1) area models against TSMC 65nm layouts.
layout areas that result from either widening the diffusion region or adding parallel diffusion regions to
increase drive strength. For large transistors, however, both approaches yield layouts with very high
aspect-ratios (see Figures 3.7a and 3.7b). We found that smaller area can be obtained by keeping a
reasonably square transistor layout, which is accomplished by combining both diffusion widening and
parallel diffusion regions to increase a transistor’s drive strength as in Figure 3.7c. Therefore, our manual
layouts in Figure 3.6 use square layouts.
Second, we develop a new transistor area equation tailored towards more advanced process tech-
nologies by using a least-square fit of our 65nm layout areas versus drive-strengths to obtain area as a
function of drive-strength.
Area(x) = 0.447 + 0.128x + 0.391√x    (3.1)
Figure 3.6 shows that (3.1) predicts transistor area with much more accuracy than (2.2). We make
two further enhancements to the model to better estimate the layout density of different structures. The
area model described thus far does not account for the fact that in a design with both NMOS and PMOS
transistors, extra spacing is required for N-wells. It would be pessimistic to assume that each PMOS
transistor is in a separate well as the amount of N-well spacing required can be reduced by placing
multiple PMOS transistors in the same well. Although it is difficult to predict how much well-sharing
is possible in a given layout, our sample layouts suggest that well sharing can reduce the per-transistor
well spacing required by approximately 75% (see Appendix A for sample layout). With this estimate,
we derive the following equation to calculate the area of transistors requiring N-well spacing.
Area(x) = 0.518 + 0.127x + 0.428√x    (3.2)
COFFE calculates the area of NMOS pass-transistors with (3.1) and the area of CMOS transistors
(a) 15× minimum drive strength achieved through diffusion widening.
(b) 15× minimum drive strength achieved through parallel diffusion regions.
(c) 15× minimum drive strength achieved through both diffusion widening and parallel diffusion regions.
Figure 3.7: Combining diffusion widening and parallel diffusion regions yields denser layouts (c).
Figure 3.8: A switch-level model.
(e.g. inverters and transmission gates) with (3.2). We find that accounting for N-well spacing increases
our tile area estimates by ∼2% for a pass-transistor based FPGA.
Finally, despite the fact that 6 small transistors are required per SRAM cell, we use an area of 4
minimum-width transistors because a denser, more optimized layout is typical for such a frequently used
cell.
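The complete area model of this section can be condensed into a short estimator. The function below is a sketch reproducing (3.1), (3.2) and the 4-transistor SRAM rule; the `needs_nwell` flag is our own shorthand for COFFE's distinction between NMOS-only and CMOS structures:

```python
import math

def transistor_area(drive_strength, needs_nwell=False):
    """Area, in minimum-width transistor areas, of a transistor of the
    given drive strength (multiples of a minimum-width device).

    Uses Eq. (3.1) for NMOS-only structures such as pass-transistors,
    and Eq. (3.2) for CMOS structures that require N-well spacing
    (e.g. inverters and transmission gates)."""
    x = drive_strength
    if needs_nwell:
        return 0.518 + 0.127 * x + 0.428 * math.sqrt(x)
    return 0.447 + 0.128 * x + 0.391 * math.sqrt(x)

# A 6T SRAM cell is counted as only 4 minimum-width transistor areas
# because such a frequently used cell gets a denser custom layout.
SRAM_CELL_AREA = 4.0
```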
3.5 Delay Modeling
We define switch-level modeling as the characterization of complex, non-linear MOSFET transistors into
a set of linear resistances and capacitances (Figure 3.8). Although they are less accurate at modeling
transistor behavior than circuit simulators like SPICE, switch-level models are often used for delay
estimation [40, 17, 24, 46] because the delay of the equivalent RC circuits can be computed with the
Elmore [15] or the Penfield-Rubinstein [20] delay models which are much quicker than the time-domain
simulations required to measure delay with SPICE. In addition, [17] showed that when transistors are
treated as linear resistances and capacitances, the transistor sizing problem can be formulated as a
convex optimization problem, thus guaranteeing that a local minimum is always the global minimum.
The resistive and capacitive behavior of a transistor is influenced by a variety of factors such as
its operating-point, its size and the shape of the input waveform. Therefore, switch-level models are
most accurate when estimating delay for circuits that exhibit a high degree of regularity (e.g. a circuit
composed of a few basic gates with a limited number of P/N ratios) because many transistors will
experience similar operating conditions. Different resistance and capacitance values (Req, Cgate and
Cdiff ) can be used for each group of transistors experiencing similar operating conditions to construct
a reasonably accurate switch-level model.
FPGA circuit design consists of custom, fine-grained transistor-level design which can lead to a large
variation in transistor operating conditions. Using PTM 22nm HP device models [42] and HSPICE, we
experimented with switch-level modeling for some of the circuit topologies commonly used in FPGAs.
In the following sections, we highlight some of the reasons why we found that switch-level models were
not suitable for our purposes.
3.5.1 Non-Linearity of Transistor Resistance and Capacitance
We use a chain of five loaded inverters (Figure 3.9a) to find the equivalent switching resistance for the
NMOS and PMOS of an inverter. Using a large Cload to minimize the effects of transistor capacitances,
Figure 3.9: Circuits used to measure transistor resistance.
we simulate this circuit with HSPICE for several transistor widths and measure the rise and fall times of
the third inverter in the chain (to avoid end-effects). The rise time, trise, is measured as the time it takes
for the inverter output to rise from 0V to VDD/2 and the fall time, tfall, the time it takes for the output
to fall from VDD to VDD/2. Delay measurement in both cases starts when the input of the inverter is
at VDD/2. With the rise and fall times, NMOS and PMOS switching resistances can be computed as
RN = tfall/Cload and RP = trise/Cload. As shown in Figure 3.10, our experiments show that transistor
resistance varies with transistor width, particularly for smaller transistors. We found the same to be
true for transistor capacitance. This implies that transistor resistance and capacitance are non-linear
functions of transistor width and as a result, an accurate switch-level model would require a table of
pre-computed resistances and capacitances for many different transistor widths.
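The resistance extraction itself is straightforward once the transition times are known: it applies RN = tfall/Cload and RP = trise/Cload. The numeric values in the example below are illustrative stand-ins for HSPICE measurements:

```python
def switching_resistances(t_rise, t_fall, c_load):
    """Equivalent switching resistances from measured transition times.

    With a load capacitance large enough to dominate the node,
    R_P = t_rise / C_load and R_N = t_fall / C_load, where both times
    are measured against a VDD/2 threshold as described above."""
    return t_rise / c_load, t_fall / c_load

# Illustrative numbers only: a 3.8 ns fall time into a 1 pF load
# corresponds to an NMOS switching resistance of 3.8 kOhm.
r_p, r_n = switching_resistances(t_rise=8.0e-9, t_fall=3.8e-9, c_load=1.0e-12)
```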
3.5.2 Topology Dependence of Transistor Resistance
The switching resistance of an NMOS pass-transistor, a key building block for FPGA circuitry, is different
than that of an NMOS in an inverter. Furthermore, the resistance of a pass-transistor is different
during rising and falling transitions due to the NMOS’s inability to propagate a full rising transition.
Using HSPICE simulations, we measure the resistances of an NMOS pass-transistor by charging and
discharging a large capacitor through a single pass-transistor (Figures 3.9b and 3.9c). Again, trise and
tfall are measured from VDD/2.
In Table 3.2, we compare the rising and falling resistance of a pass-transistor to the resistance of
an NMOS in an inverter for a 4× minimum-width transistor. We can clearly see that the resistance
of the NMOS in the inverter (3.8k) is different from the rising (13.7k) and falling (1.9k) resistances of
a pass-transistor. The very large rising resistance is caused by the pass-transistor’s degraded output
voltage. It is possible to achieve more balanced rising and falling resistances by measuring trise and
Figure 3.10: Inverter NMOS and PMOS resistivity vs. transistor width.
Table 3.2: Resistance of a 4× minimum-width NMOS transistor for different circuit topologies (Figure 3.9) and switching-thresholds.

Circuit Topology           Transition Type  Switching-Threshold  Resistance (kΩ)
Chain of 5 inverters       fall             VDD/2                3.8
Single pass-transistor     fall             VDD/2                1.9
Single pass-transistor     rise             VDD/2                13.7
Single pass-transistor     fall             VDD/3                2.7
Single pass-transistor     rise             VDD/3                2.2
2 series pass-transistors  fall             VDD/3                2.8
2 series pass-transistors  rise             VDD/3                3.3
tfall at VDD/3 instead of VDD/2, which, in terms of circuit design, corresponds to lowering the switching-
thresholds of downstream inverters by skewing their P/N ratios. As shown in Table 3.2, at VDD/3 the
rising and falling resistances of a pass-transistor are 2.2k and 2.7k respectively. Table 3.2 also shows
that the resistance of an NMOS in a chain of 2 series connected pass-transistors (Figures 3.9d and 3.9e)
is different from both the single pass-transistor and the inverter.
The results of Table 3.2 demonstrate that the custom pass-transistor based topologies of FPGA
circuitry do not lend themselves well to switch-level modeling. Not only does resistance depend on
circuit topology, it also depends on the switching-threshold of downstream inverters and on transistor
dimensions (Section 3.5.1). The complexity of a switch-level model sufficiently accurate for the type
of fine-grained transistor level design we wish to undertake is impractical so we rely solely on circuit
simulation to estimate delay.
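Table 3.2 can be read as the seed of what a switch-level model for this circuitry would require: a lookup table keyed by topology, transition direction and downstream switching-threshold, repeated (per Section 3.5.1) for every transistor width of interest. A minimal sketch using the 4× minimum-width values from the table:

```python
# Switching resistance (kOhm) of a 4x minimum-width NMOS, taken from
# Table 3.2. Keys are (topology, transition, switching threshold). A
# usable switch-level model would need such a table for every transistor
# width as well, which is why we abandon this approach in favour of
# direct circuit simulation.
NMOS_4X_RESISTANCE_KOHM = {
    ("inverter_chain", "fall", "VDD/2"): 3.8,
    ("single_pass_transistor", "fall", "VDD/2"): 1.9,
    ("single_pass_transistor", "rise", "VDD/2"): 13.7,
    ("single_pass_transistor", "fall", "VDD/3"): 2.7,
    ("single_pass_transistor", "rise", "VDD/3"): 2.2,
    ("two_series_pass_transistors", "fall", "VDD/3"): 2.8,
    ("two_series_pass_transistors", "rise", "VDD/3"): 3.3,
}
```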
3.6 Wire Load Modeling
Past FPGA transistor sizing efforts have often only accounted for the loading effects of long wires such
as the general routing wires or the cluster local interconnect wires (i.e. the cluster input wires and local
feedback wires of Figure 3.2). In reality, an FPGA contains much more metal wiring. Ignoring this extra
metal is increasingly problematic, as the delay impact of wires is becoming ever more important with
shrinking feature sizes [18]. Accordingly, COFFE models all wire loading, even including the relatively
short metal connecting two transistors inside a multiplexer. COFFE estimates wire lengths based on
area estimates obtained with the model of Section 3.4 along with the following set of general layout
assumptions.
1. The layout of a sub-block (e.g. a multiplexer, a BLE, a logic block, etc.) is assumed to be square
such that its width is equal to its height.
2. The length of a wire that broadcasts a signal across a sub-block is equal to the width (which equals
the height) of that sub-block.
3. The length of a point-to-point wire between two sub-blocks is equal to 1/4 the sum of the width
of both sub-blocks.
For example, cluster local interconnect wires are broadcast wires with a single source and many
destinations so they span the height of a logic block. Wires that connect two inverters together inside a
buffer are point-to-point wires; they span 1/4 the width of each inverter.
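Under the layout assumptions above, wire length estimation reduces to simple geometry on the area estimates of Section 3.4. A sketch (the helper names below are ours, not COFFE's):

```python
import math

def broadcast_wire_length(block_area):
    """A broadcast wire spans the full width of its sub-block, and a
    sub-block is assumed square, so its width is sqrt(area)."""
    return math.sqrt(block_area)

def point_to_point_wire_length(area_a, area_b):
    """A point-to-point wire spans 1/4 of the summed widths of the two
    sub-blocks it connects."""
    return 0.25 * (math.sqrt(area_a) + math.sqrt(area_b))
```

For example, a broadcast wire across a 16 µm² logic block would be estimated at 4 µm, while a point-to-point wire between two such blocks would span 2 µm.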
The resistance and capacitance of a wire are obtained from its length estimate, as well as its metal
layer. The resistance and capacitance values of each metal layer are inputs to COFFE. COFFE imple-
ments all wires in the same metal layer, with the exception of general routing wires, which are placed in
a separate metal layer. Placing the general routing wires on their own metal layer allows a wider wire
pitch to be used when computing their resistance and capacitance; these long wires benefit from the
resulting lower resistance. To speed up SPICE simulation, rather
than modeling a wire as a distributed RC transmission line, COFFE includes the equivalent π-model in
the generated SPICE netlists.
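Converting a length estimate into netlist elements is then a matter of scaling the per-unit-length values of the chosen metal layer and splitting the capacitance between the two ends of a π-model. A sketch, using the intermediate-layer values of Table 3.6:

```python
def wire_pi_model(length_um, r_per_um, c_per_um):
    """Lumped pi-model of a wire: the full wire resistance in the
    middle, with half the total capacitance at each end. This replaces
    a distributed RC line to speed up SPICE simulation."""
    r = length_um * r_per_um
    c = length_um * c_per_um
    return {"R": r, "C_near": c / 2, "C_far": c / 2}

# A 10 um wire on the intermediate layer (54.825 Ohm/um, 0.175 fF/um):
pi = wire_pi_model(10.0, r_per_um=54.825, c_per_um=0.175)
```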
3.7 Transistor Sizing Algorithm
When transistors are treated as linear resistances and capacitances, the transistor sizing problem can be
formulated as a convex optimization problem [17]. Such a formulation has the highly useful property that
there is only one minimum: the global minimum. Past transistor sizing algorithms have exploited this
fact by either making a series of local optimizations in hopes of eventually reaching the global minimum
[17, 24] or by making use of mathematical programming techniques [44, 10, 47, 46]. In Section 3.5, we
showed that it is very difficult to obtain linear models of transistors that are sufficiently accurate for
the fine-grained transistor-level design of FPGA circuitry in advanced process nodes. Instead, we chose
to use HSPICE simulations to measure delay, which produces more accurate delay estimates, but also
makes the shape of the optimization space more ambiguous. Therefore, COFFE takes a more exhaustive
optimization approach and searches for a minimal cost solution by simulating many possible transistor
sizing combinations over a range of transistor sizes. Exhaustively searching the entire optimization space
in this way would lead to prohibitively long runtimes: there are ∼80 unique transistors to size in one
FPGA tile, and sweeping each transistor over ∼10 sizes would require ∼10^80 HSPICE simulations.
COFFE uses two techniques to confront this problem: divide-and-conquer and pre-determined P/N
ratios.
3.7.1 Divide-and-Conquer
COFFE reduces the transistor sizing combinations to examine by splitting the FPGA into loosely coupled
subcircuits S1, S2, ..., SN and sizing each Sj individually. This divide-and-conquer approach reduces
the search space but requires iteration to account for changes in loading. More specifically, since subcir-
cuits are usually loaded by other subcircuits, changing the transistor sizes of one subcircuit will change
the load on another. Because of this, COFFE performs multiple FPGA sizing iterations in which it sizes
each subcircuit Sj once with the loading coming from the last sizing of the other subcircuits. FPGA
sizing iterations are performed until no reduction in cost is achieved (implying loading has stabilized)
or until a maximum number of iterations have been completed. In our experience, COFFE finds a
transistor sizing solution after 2-4 iterations.
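The FPGA sizing iteration loop can be sketched as follows; `size_subcircuit` and `total_cost` are hypothetical callbacks standing in for COFFE's per-subcircuit search and cost evaluation:

```python
def size_fpga(subcircuits, size_subcircuit, total_cost, max_iterations=4):
    """Divide-and-conquer outer loop: re-size every subcircuit in turn
    (each sized against the current sizes of the others), then repeat
    until the overall cost stops improving -- implying that loading has
    stabilized -- or a maximum iteration count is reached."""
    best_cost = float("inf")
    for _ in range(max_iterations):
        for subcircuit in subcircuits:
            size_subcircuit(subcircuit)
        cost = total_cost()
        if cost >= best_cost:
            break  # no further cost reduction
        best_cost = cost
    return best_cost
```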
3.7.2 Pre-Determined P/N Ratios
COFFE further reduces the number of transistor sizing combinations evaluated by using pre-determined
P/N ratios to size the NMOS and PMOS transistors of inverters as a unit instead of as individual
transistors. Although COFFE chooses these P/N ratios intelligently, they may not always be the best
choice for every transistor sizing combination. Therefore, COFFE later re-computes the P/N ratios
of the best candidate transistor sizing solutions, ensuring that the P/N ratios of the final solution are
tailored to it.
3.7.3 Detailed Algorithm
The steps listed below describe how COFFE sizes a subcircuit (the step numbers match the labels in
Figure 3.11). To give a concrete example, we will assume that the subcircuit being sized (Sj) is a
two-level multiplexer such as the one shown in Figure 3.3a.
1. First, COFFE selects transistor sizing ranges for subcircuit Sj based on the initial sizes of transis-
tors in that subcircuit. These initial transistor sizes originate from one of two places. On the first
FPGA sizing iteration, the initial transistor sizes are their starting sizes, which are hard-coded in
COFFE and were chosen based on designer intuition (however, since COFFE sweeps transistor
sizes over many possible values, we believe that the transistor sizing algorithm is not very sensitive
to the starting sizes of transistors). On subsequent FPGA sizing iterations, the initial sizes are the
sizes obtained on the previous FPGA sizing iteration (i.e. the previous transistor sizing solution).
To explore the impact of both growing and shrinking transistor sizes, COFFE chooses transistor
sizing ranges that place the initial transistor sizes near the center. For example, if transistor lvl1
has an initial size of 6, COFFE would choose a transistor sizing range of 1→ 10 for this transistor
(assuming the size of this transistor is swept over 10 integer values). Note that, by default, COFFE
uses integer granularity when sweeping transistor sizes.
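A sketch of this range-selection rule (our paraphrase; the number of values per range and the minimum size are configuration details that may differ in COFFE):

```python
def sizing_range(initial_size, n_values=10, min_size=1):
    """Integer sizing range that places the initial transistor size near
    the center, clipped so no transistor is swept below the minimum
    size. E.g. an initial size of 6 gives the range 1..10; an initial
    size at the minimum cannot be centered and the range simply starts
    at the minimum."""
    low = max(min_size, initial_size - n_values // 2)
    return list(range(low, low + n_values))
```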
Figure 3.11: COFFE's transistor sizing algorithm.
2. With the transistor sizing ranges chosen in step 1, COFFE creates a set {Sj,i | i = 1, 2, ..., K} of transistor
sizing combinations to explore. For example, there are four sizeable elements in the subcircuit of
Figure 3.3a: lvl1, lvl2, buf1 and buf2. If each transistor sizing range consists of 10 values, the set
Sj,i would consist of all 10,000 possible transistor sizing combinations in these ranges.
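Enumerating this set is a Cartesian product over the per-element sizing ranges, which the sketch below builds with itertools.product; the element names are those of the Figure 3.3a example:

```python
import itertools

def sizing_combinations(ranges):
    """Cartesian product of per-element sizing ranges: the set of all
    transistor sizing combinations explored for one subcircuit."""
    names = sorted(ranges)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(ranges[n] for n in names))]

# Four sizeable elements with 10 values each -> 10,000 combinations.
combos = sizing_combinations({
    "lvl1": range(1, 11), "lvl2": range(1, 11),
    "buf1": range(1, 11), "buf2": range(1, 11),
})
```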
3. Recall that, to reduce the search space, COFFE sizes the NMOS and PMOS of inverters as a unit
based on pre-determined P/N ratios. COFFE computes these P/N ratios by equalizing the rise
and fall times of the inverters of subcircuit Sj (buf1 and buf2) for a mid-range transistor sizing
combination. Although the transistor sizes in this mid-range combination can sometimes be the
same as the initial transistor sizes of step 1, they can also be different. For example, if the initial
size of transistor buf1 was 1, COFFE would choose a transistor sizing range of 1 to 5 as it is
impossible to make buf1 smaller than 1. Consequently, COFFE would use a size of 3 for buf1
when computing these P/N ratios.
4. Using the P/N ratios of step 3, COFFE calculates area (Aj,i) and measures delay (Tj,i) for each
transistor sizing combination (Sj,i) of the set created in step 2. Since the rise and fall times (trise
and tfall) will not remain perfectly balanced as we evaluate different transistor sizing combinations,
COFFE uses the average of the rise and fall times in this step (tavg = (trise + tfall)/2) because
Table 3.3: Rise-fall re-balancing and the effect of M on COFFE's transistor sizing solutions (example).

(a) Six top-ranked transistor sizing combinations before re-balancing.

Sizing Combination  Cost   tfall (ps)  trise (ps)
A                   1.000  161         163
B                   1.006  157         162
C                   1.010  168         165
D                   1.013  189         179
E                   1.017  163         164
F                   1.021  196         181

(b) Re-balancing the top-ranked solution (M = 1).

Sizing Combination  Cost   tfall (ps)  trise (ps)
A                   0.983  156         156
B                   1.006  157         162
C                   1.010  168         165
D                   1.013  189         179
E                   1.017  163         164
F                   1.021  196         181

(c) Re-balancing the 5 top-ranked solutions (M = 5).

Sizing Combination  Cost   tfall (ps)  trise (ps)
B                   0.982  153         153
A                   0.983  156         156
E                   0.987  157         157
C                   0.995  161         161
D                   1.052  181         180
F                   1.021  196         181
in step 5, COFFE will re-balance the P/N ratios of candidate transistor sizing solutions and this
re-balancing typically makes the final delay closer to tavg than to max(trise, tfall). With its area
(Aj,i) and delay (Tj,i), COFFE computes the cost (Cj,i) of each transistor sizing combination (Sj,i)
based on the desired optimization objective.
5. COFFE sorts the transistor sizing combinations based on cost from lowest to highest and then
re-balances the rise and fall times on a user-specified M number of top-ranked transistor sizing
combinations. Since this re-balancing changes the area and delay of these M top-ranked transistor
sizing combinations, their area (Aj,i), delay (Tj,i) and cost (Cj,i) are updated. This final rise-fall
re-balancing may re-order the final ranking. This is exemplified in Table 3.3 where Table 3.3a
shows the six top-ranked transistor sizing combinations for our example subcircuit. Notice how
the rise and fall times are not all perfectly balanced due to the use of pre-computed P/N ratios.
Tables 3.3b and 3.3c show what happens to the final ranking when COFFE re-balances the rise and
fall times for M = 1 and M = 5 respectively. Two important observations must be made. First,
re-balancing the rise and fall times generally yields a more efficient solution. Second, re-balancing
many top-ranked transistor sizing solutions can change the final ranking. Note from Tables 3.3b
and 3.3c that after re-balancing, the delay of our example subcircuit (Figure 3.3a) is often smaller
than both the trise and tfall of Table 3.3a. This happens because there are two inverters in series
in this subcircuit and re-balancing the P/N ratios of both of them can improve both the rising and
falling transition times.
6. After re-balancing the rise and fall times on the M top-ranked transistor sizing combinations,
COFFE chooses the minimum cost solution Sj,sol from those M transistor sizing combinations.
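Steps 4 through 6 amount to rank-by-average-delay, re-balance the top M, then re-rank. The sketch below captures that control flow, with `evaluate` and `rebalance` as hypothetical stand-ins for the HSPICE-based measurements:

```python
def pick_best_combination(combos, evaluate, rebalance, cost, m=5):
    """Steps 4-6: rank all sizing combinations by cost computed from
    the average of rise and fall times, re-balance the P/N ratios of
    the top M candidates, then pick the cheapest re-balanced one.

    `evaluate` and `rebalance` each return (area, t_rise, t_fall)."""
    def provisional_cost(s):
        area, t_rise, t_fall = evaluate(s)
        return cost(area, (t_rise + t_fall) / 2)

    ranked = sorted(combos, key=provisional_cost)
    rebalanced = []
    for s in ranked[:m]:
        area, t_rise, t_fall = rebalance(s)
        # After re-balancing, rise and fall times are equalized,
        # so the worst-case delay is what matters.
        rebalanced.append((cost(area, max(t_rise, t_fall)), s))
    return min(rebalanced, key=lambda pair: pair[0])[1]
```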
7. Once the best cost sizing combination Sj,sol has been selected, COFFE checks if Sj,sol is on one
of the boundaries of the transistor sizing ranges of step 1. For example, if the sizing range of
transistor lvl1 was 1 → 10, and Sj,sol specifies that lvl1 = 10, then Sj,sol is on the boundary of
lvl1’s sizing range. If this is the case, we may not have explored a large enough size range as lvl1
may benefit from being made bigger than 10. Consequently, when Sj,sol is on a transistor sizing
range boundary, COFFE adjusts the transistor sizing ranges around Sj,sol (e.g. lvl1’s new sizing
range would be 5 → 14) and the process is repeated (return to step 2) until a solution that is
contained entirely within the sizing ranges is found, implying that we have searched a large enough
size range.
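The boundary check and range adjustment of step 7 can be sketched as follows (our paraphrase of the rule, using the lvl1 = 10 example from the text):

```python
def on_range_boundary(solution, ranges):
    """A best solution lying on the edge of any sizing range suggests
    that the search window may have been too small."""
    return any(solution[name] in (r[0], r[-1]) for name, r in ranges.items())

def recenter_range(best_size, n_values=10, min_size=1):
    """New sizing range centered on the boundary solution: lvl1 = 10
    with an old range of 1..10 yields a new range of 5..14."""
    low = max(min_size, best_size - n_values // 2)
    return list(range(low, low + n_values))
```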
With divide-and-conquer and inverter rise-fall balancing, we reduce the number of transistor sizing
combinations to examine from ∼10^80 to the much more tractable ∼3 × 12 × 10^4. That is, for ∼12
subcircuits containing ∼4 sizeable items (transistors or inverters), we try ∼10 possible sizes per sizeable
item, or ∼10^4 combinations per subcircuit. This is done ∼3 times to account for changes in loading
(i.e. the FPGA sizing iterations).
Total runtime is ∼4h for M = 1 or ∼10h for M = 5 on a single Intel Xeon E5-1620 3.6GHz processor
core.
3.8 Impact of Improved Wire Load Modeling
In this section, we examine the impact of improved wire load modeling on the area and delay of an
FPGA by using COFFE to perform transistor sizing under different wire loading scenarios. Section
3.8.1 describes the architecture that we will use to perform this analysis. This architecture will also be
the base architecture for our circuit design investigations in Chapter 4. Similarly, Section 3.8.2 describes
our target process technology for both this section and Chapter 4.
3.8.1 Base Architecture
For our circuit design investigations to be of value, it is important that we conduct our experiments on
FPGA architectures that are known to be good and that are relevant to current commercial FPGAs. We
use COFFE to perform the transistor-level design for all our experiments and thus our architecture style
matches the one described in Section 3.2. Unless stated otherwise, we use the architecture parameters
listed in Table 3.4. Most of these parameters were selected based on results from prior work or common
commercial practices (see Section 2.1). The number of cluster inputs (I) is set to 40 based on (2.1) plus 7
spare cluster inputs which are required to keep the sparsely populated local interconnect (Fclocal = 0.5)
highly routable [29]. Our routing channel width is set to W = 320 by adding 30% more routing tracks
to the minimum channel width required to route our biggest benchmark circuit. We chose this channel
width because some of the wire load estimates used during transistor sizing depend on the absolute channel width, and
it is common in commercial FPGAs to choose a channel width sufficient to route even difficult circuits.
Since the architecture we use is fairly different from prior work in terms of logic cluster outputs (e.g.
two outputs per BLE and single-driver routing wires), Fcout is determined experimentally. In Section
4.1, we show that for this architecture, an Fcout = 0.025W produces an FPGA with the best area-delay
product. Table 3.5 details the subcircuits per tile for this architecture.
Table 3.4: Base architecture parameters.
Parameter  Value   Parameter  Value
K          6       Fs         3
N          10      Fclocal    0.5
I          40      Rfb        "on" for LUT-input C,
Fcin       0.2                "off" for all other LUT inputs
Fcout      0.025   Rsel       LUT & BLE input
W          320     Ofb        1
L          4       Or         2
Table 3.5: Subcircuit count per tile for base architecture.
Subcircuit                     Size  Count
Local routing multiplexers     25:1  60
Connection block multiplexers  64:1  40
Switch block multiplexers      10:1  160
BLEs                           –     10
SRAM cells                     –     3050
Table 3.6: Metal layer data used by COFFE for all circuit design investigations (ITRS [19]).
Metal Layer   Half-Pitch  Aspect Ratio  R (Ω/µm)  C (fF/µm)
Intermediate  24nm        1.9           54.825    0.175
Semi-global   48nm        2.12          7.862     0.215
Global        96nm        2.34          1.131     0.250
3.8.2 Target Process Technology
We wish to investigate a number of circuit design questions in advanced process technology. Therefore,
we use PTM 22nm HP predictive SPICE models [42] for all our investigations. The nominal supply
voltage in this process is VDD = 0.8V . We extract wire resistance and capacitance per unit length for
a 22nm process from ITRS 2011 [19]. The metal stack extracted from ITRS is shown in Table 3.6.
We implement all wires in the intermediate layer (minimum width and spacing) except for the general
routing wires which we implement in the semi-global layer (2× minimum width and spacing) for its lower
resistance.
3.8.3 Results
We begin by using COFFE to size the transistors of our FPGA without including the loading effects of
any wires. COFFE’s optimization objective is set to minimize the product of tile area and representative
path delay and we re-balance the rise and fall times of the 5 top-ranked transistor sizing combinations
(M = 5). The resulting tile area and representative path delay are shown in the first row of Table 3.7.
Then, we gradually add groups of wires to our FPGA, re-sizing its transistors after every addition. As
shown in Table 3.7, each time we add wires, we observe an increase in delay as well as an increase in tile
area because COFFE chooses larger transistor sizes in an effort to cope with the extra wire loading.
Table 3.7: Impact of wire loading.
Wire load                                                                Tile area (µm²)  Delay (ps)
No wires                                                                 836              58
Inter-cluster routing only                                               899              79
Inter-cluster routing & cluster local interconnect                       905              85
Inter-cluster routing, cluster local interconnect & logic-to-routing(a)  919              98
All wires                                                                938              112

(a) We use an input track-access span of 0.5 and an output track-access span of 0.25 for logic-to-routing wires in this section. See Section 4.4 for the definition of track-access span.
Table 3.7 clearly shows that it is important to account for the effects of more than just the inter-
cluster routing wires. In fact, 24% of the delay comes from two groups of wires that have often been
overlooked in prior work: logic-to-routing wires and smaller wires like those inside multiplexers and
lookup tables (which are included in the All wires row of Table 3.7). The logic-to-routing wires are those
that connect specific routing tracks to cluster inputs (through connection block multiplexers) as well
as cluster outputs to specific routing tracks (through switch block multiplexers) and they can span a
significant fraction of a tile. We study the impact of the lengths of these wires in more detail in Section
4.4.
3.9 Integration of COFFE with VPR
As illustrated in Figure 3.1, using COFFE with VPR enables more thorough architecture investigations.
To obtain the most accurate investigations, VPR’s area and delay models need to be aligned with the
enhancements made by COFFE. Therefore, the following code changes were made to VPR and will
appear in version 8.0.
1. VPR can calculate transistor area with (3.1), the improved area estimation equation, instead of
(2.2), the original equation. An option at the top of rr_graph_area.c, which must be set before
compilation, allows one to select which equation to use. VPR uses (3.1) by default.
2. VPR now assumes an area of 4 minimum-width transistors for SRAM cells instead of 6 (const
float trans_sram_bit = 4. in rr_graph_area.c).
3. A new option at the top of rr_graph_area.c allows VPR to optionally include track buffers in its
area calculations instead of always including them. By default, they are not included.
4. Similarly, an option allows VPR to optionally include track buffers in its loading and delay models.
The area and delay outputs of COFFE can be used to create VPR architecture files for architecture
exploration. In Section 4.4, we created a number of VPR architecture files for our circuit design exper-
iments. These architecture files are available at: http://www.eecg.utoronto.ca/~vaughn/software.
html. All architecture files describe an identical architecture (the base architecture of Section 3.8.1),
but they have different area and delays due to different circuit designs (e.g. switch type (pass-transistor
or transmission gate) and voltage levels (supply voltages and gate voltages)). These architecture files
are highly commented such that they are self-documenting.
Chapter 4
Efficient FPGA Circuitry
Circuit design is a crucially important part of obtaining efficient FPGAs. While the architecture of
an FPGA defines the style and flexibility of its resources, the circuit design of those resources is what
defines the area, delay and power of the architecture. In Chapter 3, we described COFFE, an automated
FPGA transistor sizing tool that produces the accurate area, delay and power estimates of properly sized
FPGA circuitry needed during architecture exploration. However, COFFE can be used in a different
role. An automated transistor sizing tool also enables design space exploration of FPGA circuitry as
well as investigations relating to the interaction between architecture and circuit design. In this chapter,
we use COFFE to explore a number of such circuit design related questions. Unless stated otherwise,
the experiments in this chapter use the base architecture described in Section 3.8.1 and the process
technology data described in Section 3.8.2.
4.1 Fcout for Single-Driver Routing and Multiple BLE Outputs
Previous work has shown that Fcout = W/N is an appropriate cluster output pin flexibility [7], which
for our architecture would lead to Fcout = 0.1W . However, our cluster output architecture differs
significantly from that of [7]: it has two outputs per BLE and single-driver routing wires. Therefore, we
re-investigate cluster output pin flexibility. The area tradeoffs are as follows. Smaller Fcout values lead
to smaller switch block multiplexers as there are fewer connections from the cluster outputs to routing
wires. However, larger channels are needed due to poorer routability, leading to a larger number of
switch block multiplexers. The delay tradeoffs are similar. Smaller values of Fcout reduce loading and
lead to faster cluster outputs but might lead to circuitous routing.
We use VPR to place and route the MCNC benchmark circuits [53] on three architectures with differ-
ent values of Fcout to find an equally routable channel width for each architecture (i.e. same W/Wmin
where Wmin is the average minimum channel width required to successfully route the benchmarks).
Table 4.1 shows the number of switch block multiplexers per tile required for each architecture as well as
the size of these multiplexers. We use COFFE to size the transistors of all three architectures of Table
4.1. Then, we place and route the MCNC benchmark circuits for each architecture using the channel
widths of Table 4.1 to obtain the tile areas and critical path delays shown in Table 4.2. Based on these
results, we find that Fcout = 0.025W gives the best area-delay product for our N = 10, K = 6 and
Fcin = 0.2W architecture. We did not explore values of Fcout smaller than 0.025 because very low
Table 4.1: Effect of Fcout on channel width and switch block multiplexers.
Fcout    W     # of SB MUXes per Tile    SB MUX Size
0.250    288   144                       19:1
0.100    296   148                       13:1
0.025    320   160                       10:1
Table 4.2: Area and delay for different Fcout values.
Fcout    W     Tile Area (µm2)    Critical Path Delay (ns)    Area-Delay Product
0.250    288   936                7.96                        1.00
0.100    296   891                7.83                        0.95
0.025    320   873                7.84                        0.92
cluster output pin connectivities can be problematic for some channel widths. That is, if Fcout is too
low, it might not be possible to drive all starting wires in an adjacent switch block with a logic block
output. For the architecture that we use in this work, an Fcout = 0.025 results in each starting wire in a
switch block being driven by a single logic block output. Since single-driver routing reduces the portion
of a routing channel that can be accessed by logic cluster outputs to W/L, it seems intuitive that Fcout
should be lower than it is for architectures with tri-state driver routing [7] where the whole channel is
accessible. Also, prior work has not modeled Fcout dependent wire loading in detail, while COFFE lets
us take this detailed interaction into account in our architecture study. For all other experiments in this
work, we use an Fcout = 0.025W .
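The area-delay products in Table 4.2 follow from multiplying tile area by critical path delay and normalizing to the Fcout = 0.250 row. The sketch below reproduces that arithmetic; small differences in the second decimal can arise from rounding of the tabulated averages.

```python
# Sketch: normalized area-delay products for the three Fcout candidates,
# using the tile areas (um^2) and critical path delays (ns) of Table 4.2.
# Values are normalized to the Fcout = 0.250 row, as in the table.

results = {
    0.250: (936, 7.96),  # (tile area, critical path delay)
    0.100: (891, 7.83),
    0.025: (873, 7.84),
}

base = results[0.250][0] * results[0.250][1]
ad_product = {fc: (a * d) / base for fc, (a, d) in results.items()}

best = min(ad_product, key=ad_product.get)
print(best)  # Fcout = 0.025 gives the lowest area-delay product
```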
4.2 Transmission Gate FPGAs
4.2.1 Pass-Transistor Scaling Challenges
Although they are incapable of fully passing logic-high voltages, NMOS pass-transistors have been widely
used in commercial and academic FPGA circuitry due to the very small switch they enable. To benefit
from the area advantage of pass-transistors without suffering from an excessive amount of static power
dissipation due to their reduced voltage swing, FPGA circuitry typically includes a combination of
gate boosting and PMOS level-restorers to help pull degraded pass-transistor output voltages up to
VDD (see Section 2.3.2). However, as process technology scales, pass-transistor output voltages become
increasingly degraded due to the voltage scaling trends illustrated in Figure 4.1. That is, as process
technology scales, VDD is continually scaled down to reduce dynamic power dissipation and to keep
electric fields on shrinking feature sizes within acceptable bounds. To maintain good performance, VTh
is also scaled down, though at a slower rate than VDD to keep leakage current from growing too large
[6]. Recall from Section 2.3.2 that the logic-high output of a pass-transistor is degraded by a voltage
equal to VTh. Therefore, as VTh becomes an increasing fraction of VDD (Figure 4.1 shows that VTh/VDD
rose from ∼0.2 to ∼0.4 between 1997 and 2009), pass-transistor output voltages become increasingly
degraded. For the 22nm process we use in this work for example, the output of a non-gate boosted
pass-transistor switches only between 0V and 0.55V, whereas VDD is 0.8V. In addition, the slew rate
of the rising waveform above 0.45V is very slow. Consequently, the inverter sensing this signal (whose input
Figure 4.1: VDD and VTh scaling trends [12].
can remain near VDD/2 for some time) can experience a high short-circuit current and a slow switching
speed. Furthermore, recent work has shown that pass-transistor based FPGAs are very sensitive to
aging induced by positive bias temperature instability which has become larger with the new high-k gate
dielectrics [23, 5].
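As a rough numeric sketch of this degradation (the 0.25V effective threshold below is inferred from the 0.8V/0.55V figures above, not taken from the PTM model card):

```python
# First-order sketch of pass-transistor logic-high degradation: the output
# can rise to at most VG - VTh(effective). With no gate boosting (VG = VDD)
# in the 22nm process used here, a 0.8V supply yields a ~0.55V output,
# implying an effective threshold (with body effect) of about 0.25V.
# The 0.25V value is an inference from the text, not a model-card number.

def pass_transistor_vout_max(vg: float, vth_eff: float) -> float:
    """Maximum logic-high voltage at a pass-transistor output."""
    return vg - vth_eff

VDD, VTH_EFF = 0.8, 0.25
print(round(pass_transistor_vout_max(VDD, VTH_EFF), 2))        # 0.55 (degraded)
print(round(pass_transistor_vout_max(VDD + 0.2, VTH_EFF), 2))  # 0.75 with 0.2V of gate boosting
```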
To increase the pass-transistor output voltage, one can apply larger amounts of gate boosting, but
this poses a reliability risk as larger VGS values accelerate device aging. Furthermore, the latest high-k
gate processes do not offer a “mid-oxide” thickness transistor; such transistors were available in 90nm
through 40nm conventional oxide processes to give a reduced gate leakage transistor option to designers
[48]. These mid-oxide thickness transistors were excellent pass-gates as their thicker oxide allowed a high
level of gate boosting without compromising reliability. With PMOS level-restorers the issue is one of
robustness. A VTh that is a larger fraction of VDD means it takes longer for level-restorers to turn on
(which increases short-circuit currents) or, in the extreme case, they might not turn on at all. Reliability
concerns, a higher susceptibility to device aging, performance degradation and increasing short-circuit
power dissipation make the pass-transistor an increasingly less desirable switch.
4.2.2 Replacing Pass-Transistors with Transmission Gates
One solution to the pass-transistor scaling problem detailed in the previous section is to stop using
pass-transistors entirely and instead build FPGAs out of CMOS transmission gates. Transmission gates,
which consist of an NMOS transistor and a PMOS transistor in parallel, are capable of passing a full
rail-to-rail voltage swing, making them more robust than pass-transistors at low VDD. Although the
idea of using transmission gates in certain parts of the FPGA circuitry is not new [41, 27], building
FPGAs entirely out of transmission gates has typically been avoided because the area of a transmission
gate-based FPGA would be significantly larger than that of a pass-transistor-based FPGA. However,
there is no prior work that quantifies how much larger a transmission gate FPGA would be, and since
transmission gates are faster than pass-transistors due to their full voltage swing, it is unclear where in
the area-delay optimization space a fully transmission gate-based FPGA would fall in relation to a fully
pass-transistor-based FPGA. In this work, we locate them both in advanced process technology (with
PTM 22nm high-performance models [42]) by designing each type of FPGA from scratch, complete with
architectural design, circuit design and detailed transistor sizing.
To ensure our comparison is accurate, we choose an identical architecture for both our pass-transistor
and transmission gate FPGAs; specifically, the one described in Section 3.8.1. Since we use COFFE to
perform the transistor-level design of each FPGA, the circuit topologies for our pass-transistor FPGAs
are those described in Section 3.3. COFFE uses similar circuit topologies for our transmission gate
FPGAs but replaces pass-transistors with transmission gates. Figure 4.2b shows a transmission gate
implementation of a generic two-level routing multiplexer. Note that transmission gate routing multi-
plexers do not include PMOS level-restorers because transmission gates can pass a full voltage swing.
Our transmission gate 6-LUT topology is similar to the pass-transistor topology shown in Figure 3.4
but pass-transistors are replaced with transmission gates and the level-restoring PMOS transistors are
removed.
4.2.3 Gate-Boosting Strategy
Commercial FPGAs have often used a voltage greater than VDD on the gates of pass-transistors, which
is often called gate boosting. In addition to bringing the degraded logic-high outputs of pass-transistors
closer to VDD to mitigate static power dissipation concerns (see Section 2.3.2), gate boosting improves
performance. That is, the more VG is boosted above VDD, the faster a pass-transistor circuit will become
due to faster and larger swinging pass-transistor outputs. Therefore, a thorough comparison of pass-
transistor and transmission gate FPGAs should include an analysis of the effect of gate boosting on both
switch types because even though transmission gate FPGAs do not suffer from degraded outputs, gate
boosting will improve their performance.
Gate boosting a routing multiplexer is achieved by connecting the SRAM cells to separate power
and/or ground rails (VSRAM+ and VSRAM− in Figure 4.2). Setting VSRAM+ above VDD will effectively
apply a higher voltage to the gates of NMOS transistors inside the multiplexer (provided the cell contains
a logic-high value). In addition to increasing VSRAM+, transmission gate FPGAs can set VSRAM− below
0V to improve PMOS transistor performance. Since SRAM cells only switch at configuration time, gate
boosting a routing multiplexer does not increase dynamic power consumption and high-VTh, low-leakage
transistors can be used in the SRAM cells to minimize static power consumption (their speed is not
important). Through HSPICE simulation, we found that boosting the voltage by 200mV on an SRAM
cell built from PTM 22nm low-power transistors increased its static leakage by 3.6×. However, the
SRAM contribution to the chip-wide static power consumption remained below 2mW1.
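The chip-level estimate in the accompanying footnote can be sketched as:

```python
# Sketch of the footnoted chip-level SRAM leakage estimate: boosting
# VSRAM+ from 0.8V to 1.0V raises a tile's total SRAM leakage from
# 14.06nA to 51.08nA (a 3.6x increase), but even across a large FPGA
# (~36,000 tiles [4]) the static power stays under 2mW.

leak_nominal_per_tile = 14.06e-9   # A, VSRAM+ = 0.8V
leak_boosted_per_tile = 51.08e-9   # A, VSRAM+ = 1.0V
tiles = 36_000
vsram_plus = 1.0                   # V

increase = leak_boosted_per_tile / leak_nominal_per_tile
chip_static_power = leak_boosted_per_tile * tiles * vsram_plus

print(f"{increase:.1f}x leakage increase")            # 3.6x leakage increase
print(f"{chip_static_power * 1e3:.2f} mW chip-wide")  # 1.84 mW chip-wide
```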
Gate boosting a lookup table can be accomplished by running the LUT input drivers (see Figure
3.4) at a higher voltage. This would increase dynamic power consumption as the LUT inputs toggle
frequently during device operation and would also require level-converters on the time critical LUT input
path. Gate boosting lookup tables is thus both more complex and less beneficial than gate boosting
routing multiplexers. Therefore, we do not gate boost lookup tables.
Too much gate boosting will cause faster aging by accelerating time-dependent dielectric breakdown
1Increasing VSRAM+ from 0.8V to 1.0V increased a tile's total SRAM leakage from 14.06nA to 51.08nA. Large modern FPGAs can have as many as 36,000 tiles [4]. Therefore, total chip leakage is 51.08nA × 1.0V × 36,000 = 0.0018W.
(a) Pass-transistor implementation of a two-level routing multiplexer, showing the SRAM cell details (VSRAM+, VSRAM−, WL, BL), the two multiplexer levels (transistors lvl1 and lvl2), the level-restorer and the two-stage output buffer (buf1 and buf2).
(b) Transmission gate implementation of a two-level routing multiplexer, with the same structure but no level-restorer.
Figure 4.2: Generic two-level routing multiplexer with two-stage buffer implemented with pass-transistors (a) and transmission gates (b).
Figure 4.3: Effect of different gate boosting strategies on transmission gate switch block multiplexer delay (VDD = 0.8V). The three panels plot delay reduction (%) when only the NMOS transistor is gate boosted, when only the PMOS transistor is gate boosted, and when both are; bars of the same color correspond to the same SRAM overstress (0.1V to 0.4V).
and bias-temperature instability or could even destroy the transistor. It is difficult to determine exactly
how much gate boosting is tolerable without compromising reliability, particularly for newer processes,
making it an active topic of investigation [32, 11]. Thus, since it is unclear exactly how much gate
boosting is safe for a 22nm process, we sweep the gate voltage over three values (VDD, VDD + 0.1V and
VDD + 0.2V ) thus providing a general indication of the effect of gate boosting from which a safe gate
boosting level can be chosen.
One final question must be answered before we have a complete gate boosting strategy: how do we
gate boost the transmission gates? A transmission gate can be gate boosted by applying a voltage larger
than VDD on the gate of the NMOS transistor, by applying a voltage smaller than 0V on the gate of
the PMOS transistor or by a mixture of both. It isn’t immediately clear which one of these options
is best. Therefore, we experiment with different levels of gate boosting on our completely optimized,
non-gate boosted, transmission gate FPGA design. Figure 4.3 shows the delay reductions observed in
the switch block multiplexers; results for other multiplexers follow the same trend. Gate boosting only
the NMOS transistor (leftmost bar graph) results in almost twice the delay reduction that is obtained
when only the PMOS transistor is gate boosted and results in nearly the same amount of delay reduction
obtained when both transistors are gate boosted. Therefore, we choose to only gate boost the NMOS
transistors of transmission gates since the additional delay reduction achieved by also gate boosting the
PMOS transistors probably does not merit the creation of a new supply plane. As well, some transistors
in the configuration SRAMs will be subjected to a voltage difference of VSRAM+ − VSRAM−. Hence,
simultaneously gate boosting both NMOS and PMOS transistors by some voltage increases the reliability
risk versus gate boosting only the NMOS transistors by that voltage. Bars of the same color in Figure
4.3 have the same stress on the SRAM cells.
Thus, our gate boosting strategy is the following. We gate boost the NMOS transistors of routing
multiplexers (but not LUTs) for both pass-transistor and transmission gate FPGAs. This is accomplished
by increasing the VSRAM+ voltage, which we sweep over three values: VDD, VDD+0.1V and VDD+0.2V .
Figure 4.4: CAD flow for each FPGA. In the transistor-level design stage, COFFE v0.1 (driven by HSPICE with PTM 22nm models, together with its area and wire models) takes the architecture and circuit design and produces transistor sizes, delay per subcircuit and power per subcircuit. In the measurement stage, these feed tile area calculations, VPR architecture files (used to place and route benchmarks with VPR, yielding critical path delay and subcircuit usage counts) and power calculations, producing tile area, critical path delay and power.
4.2.4 Methodology
Our goal is to examine how the area, delay and power of transmission gate FPGAs compare to that of
pass-transistor FPGAs. In the previous section, we chose a gate boosting strategy that involves sweeping
the gate voltage over three values. Consequently, we will look at six different FPGAs representing
three levels of gate boosting for both pass-transistor and transmission gate switches. All six FPGAs
have identical architectural parameters (described in Section 3.8.1) but they differ in circuit design.
Throughout the remainder of this section, we refer to these FPGAs as implementations. Figure 4.4
shows the CAD flow used to obtain tile area, critical path delay and dynamic power for each FPGA
implementation.
To obtain a fair comparison, we must first size the transistors of each FPGA implementation such
that all are optimized for a common objective (top portion of Figure 4.4). Note that it is important to
perform transistor sizing for each level of gate boosting because gate boosting affects the voltage-transfer
characteristics of the circuits. We use COFFE2 to size the transistors of our six FPGA implementa-
tions. The optimization objective is set to minimize area-delay product and we optimize each subcircuit
individually (local optimization). When transistor sizing is complete, COFFE yields the final transistor
sizes along with the delay and dynamic power of each subcircuit for each FPGA implementation. Dy-
namic power is obtained for each subcircuit by using HSPICE to measure the average current required
to propagate a rising and a falling transition through the subcircuit and then multiplying it by VDD.
2This work was performed with an earlier version of COFFE than the one described in Chapter 3. In this version, COFFE did not account for the extra area needed for N-well spacing. That is, the area model consisted of only Equation 3.1. For our transmission gate FPGAs, this difference in area modeling implicitly assumed that the extra PMOS transistors can be placed in existing N-wells. If this is not possible and additional N-wells are required, we estimate that transmission gate FPGA area would increase by at most 7%, which would not significantly affect our overall conclusions.
Table 4.3: Pass-transistor and transmission gate FPGA tile area for different levels of gate boosting.
VG            Pass-transistor Tile Area (µm2)    Transmission Gate Tile Area (µm2)    TG/PT
VDD           875                                1006                                 15.0%
VDD + 0.1V    873                                1010                                 15.7%
VDD + 0.2V    887                                1015                                 14.5%
Table 4.4: Switch block multiplexer transistor sizes for PT and TG implementations for different levels of gate boosting (see Figure 4.2 for transistor labels). Note that with the exception of P/N ratios, COFFE uses integer granularity.

Type    VG           lvl1 (PMOS/NMOS)    lvl2 (PMOS/NMOS)    buf1 (PMOS/NMOS)    buf2 (PMOS/NMOS)
PT      VDD + 0.0    -/3                 -/3                 3/3.2               20.7/11
PT      VDD + 0.1    -/2                 -/2                 7.0/3               31.6/12
PT      VDD + 0.2    -/2                 -/2                 12.6/3              37.5/14
TG      VDD + 0.0    1/1                 1/1                 3/4.3               35.7/21
TG      VDD + 0.1    1/1                 1/1                 4/4.3               39.9/19
TG      VDD + 0.2    1/1                 1/1                 5.6/4               44.9/19
Once all FPGA implementations have been optimized, tile area, critical path delay and power can
be measured (bottom portion of Figure 4.4). Tile area is obtained by first calculating the area of each
FPGA subcircuit based on the final transistor sizes obtained from COFFE. Then, the subcircuit areas
are multiplied by the number of subcircuits in a tile (Table 3.5) and summed to obtain total area. A
VPR architecture file is created for each of the six FPGA implementations with the delay-per-subcircuit
values obtained from COFFE. Critical path delay is measured experimentally with VPR by placing and
routing MCNC [53] and VTR [43] benchmarks on each FPGA for five different placement seeds. To
compute relative total power, we multiply the power-per-subcircuit numbers by the average number
of times each subcircuit is used in VPR placed and routed benchmarks. Since we are only interested
in a relative power comparison between our six FPGA implementations, we do not need to perform a
functional simulation to obtain toggle activities as we expect them to be the same across implementations
except for very slight glitch changes due to small variations in timing.
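The measurement-stage aggregation described above amounts to two weighted sums, sketched below with illustrative placeholder numbers (not the actual Table 3.5 counts or COFFE outputs):

```python
# Sketch of the two weighted sums used in the measurement stage:
# tile area = sum(area per subcircuit x count per tile), and
# relative power = sum(power per subcircuit x average usage count from
# placed-and-routed benchmarks). All numbers below are illustrative
# placeholders, not the real Table 3.5 counts or COFFE outputs.

subcircuit_area = {"sb_mux": 2.1, "cb_mux": 1.4, "lut": 18.0}   # um^2 each
count_per_tile = {"sb_mux": 160, "cb_mux": 64, "lut": 10}

tile_area = sum(subcircuit_area[s] * count_per_tile[s] for s in subcircuit_area)

power_per_use = {"sb_mux": 1.0, "cb_mux": 0.7, "lut": 3.2}      # relative units
avg_usage = {"sb_mux": 5200, "cb_mux": 2100, "lut": 900}        # from VPR routing

relative_power = sum(power_per_use[s] * avg_usage[s] for s in power_per_use)

print(tile_area, relative_power)
```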
4.2.5 Results
Table 4.3 shows the tile area for pass-transistor (PT) and transmission gate (TG) FPGAs with differ-
ent levels of gate boosting. VDD is 0.8V , the nominal VDD for this process, throughout this section.
The results indicate that transmission gate FPGAs are approximately 15% larger than pass-transistor
FPGAs. At first glance, a 15% tile area increase may seem surprisingly small as building FPGAs out
of transmission gates instead of pass-transistors doubles the number of transistors per switch in the
routing multiplexers and lookup tables, which make up a large fraction of an FPGA tile. There are
three contributing factors to this modest area increase. First, although transmission gate FPGAs re-
quire two transistors for each switch instead of just one, the area of each switch is not doubled due to
differences in transistor sizing. That is, the area of a transmission gate is usually equal to 2 minimum-
width transistor areas, as COFFE usually finds that minimum-size transmission gates (minimum size
Table 4.5: Pass-transistor and transmission gate FPGA critical path delay for different levels of gate boosting (VTR benchmarks).

VG            Pass-transistor Delay (ns)    Transmission Gate Delay (ns)    TG/PT
VDD           23.3                          17.4                            -25.4%
VDD + 0.1V    18.9                          15.8                            -16.3%
VDD + 0.2V    16.2                          14.4                            -10.7%
Table 4.6: Pass-transistor and transmission gate FPGA area-delay product for different levels of gate boosting (VTR benchmarks).

VG            Pass-transistor    Transmission Gate    TG/PT
VDD           1.00               0.86                 -14%
VDD + 0.1V    0.81               0.78                 -3%
VDD + 0.2V    0.70               0.72                 2%
Table 4.7: Pass-transistor and transmission gate FPGA relative dynamic power for different levels of gate boosting (VTR benchmarks).

VG            Pass-transistor    Transmission Gate    TG/PT
VDD           1.00               1.04                 3.8%
VDD + 0.1V    0.99               1.05                 6.4%
VDD + 0.2V    1.02               1.06                 4.4%
NMOS and minimum size PMOS) are most efficient while the area of a pass-transistor is usually 1.26
to 1.51 minimum-width transistor areas, as COFFE usually sizes pass-transistors 2 to 3 times larger
than the minimum size (see Appendix C). Second, SRAM cell area is constant for both types of FPGAs
and accounts for a large fraction of tile area (see Section 4.2.6). Finally, transmission gate FPGA area
benefits from the removal of PMOS level-restorers, which are required in pass-transistor FPGAs but not
in transmission gate FPGAs.
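A back-of-the-envelope sketch of the first factor, assuming that equation (3.1) takes the diffusion-sharing form area(x) = 0.447 + 0.128x + 0.391√x (an assumption made here for illustration; the exact coefficients are given in Chapter 3), reproduces the 1.26 and 1.51 figures quoted above:

```python
import math

# Sketch of the minimum-width-transistor-area accounting behind the ~15%
# figure. ASSUMPTION: equation (3.1) has the diffusion-sharing form
# area(x) = 0.447 + 0.128x + 0.391*sqrt(x), in minimum-width transistor
# areas, with x = drive strength in minimum widths. This form reproduces
# the 1.26 and 1.51 areas quoted for 2x and 3x pass-transistors.

def trans_area(x: float) -> float:
    return 0.447 + 0.128 * x + 0.391 * math.sqrt(x)

pt_2x, pt_3x = trans_area(2), trans_area(3)  # typical pass-transistor sizes
tg_min = 2 * trans_area(1)                   # NMOS + PMOS, both minimum size
print(round(pt_2x, 2), round(pt_3x, 2), round(tg_min, 2))  # 1.26 1.51 1.93
```

So a minimum-size transmission gate costs roughly 2 minimum-width transistor areas, while a typical 2x-3x pass-transistor already costs 1.26 to 1.51, which is why doubling the transistor count does not double the switch area.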
Gate boosting does not significantly affect tile area. In general, we noticed that as the level of gate
boosting is increased on pass-transistor FPGAs, our transistor sizing tool tends to reduce pass-transistor
sizes but increases buffer sizes resulting in an FPGA that has similar tile area but reduced delay. Due
to their larger area, our transistor sizing tool almost always chooses minimum sized transmission gates.
The buffers in transmission gate FPGAs are larger than those of pass-transistor FPGAs due to more
transistor and wire loading. The P/N ratios of buffers are also different for different levels of gate
boosting, as the signal swings at the buffer inputs are changing. Table 4.4 shows the transistor sizes
for a switch block multiplexer in units of minimum contactable transistor width (45nm in this 22nm
process). Transistor sizes for all subcircuits are given in Appendix C.
Table 4.5 shows average critical path delay for all 6 FPGA designs for the VTR benchmark set
(MCNC benchmarks yielded similar results and hence results are not shown). The results show that,
with no gate boosting, transmission gate FPGAs are 25% faster than pass-transistor FPGAs. As the
(a) Tile area breakdown: SB MUX 31.3%, CB MUX 22.0%, Local MUX 16.4%, FF 1.2%, Cluster Output 2.2%, LUT 26.9%.
(b) Critical path delay breakdown: SB MUX 36.7%, CB MUX 14.5%, Local MUX 16.2%, FF 0.4%, Cluster Output 4.7%, LUT 25.4%, Other 2.0%.
Figure 4.5: Tile area and critical path delay breakdown.
level of gate boosting is increased, the delay gap is reduced but transmission gate FPGAs remain faster.
The higher speed with transmission gates is due to the increased multiplexer output voltage swing and
the fact that we now have two switch transistors in parallel, providing lower resistance. The resistance of
transmission gates is further reduced in advanced processes because highly strained silicon has narrowed
the gap between PMOS and NMOS carrier mobility. For example, in the 22nm process we use, PMOS
transistor drive strength is 66% of the NMOS transistor drive strength for the same width. In older
process nodes (e.g. 0.35µm), PMOS transistor drive strength was only 37% that of NMOS transistors
[7].
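A first-order resistor model (our illustration, not an analysis from the thesis) shows why the narrowing PMOS/NMOS drive gap matters for transmission gates:

```python
# First-order sketch (not from the thesis) of why strained silicon helps
# transmission gates: modeling each device as a resistor inversely
# proportional to its drive strength, the parallel NMOS+PMOS resistance
# falls as PMOS drive approaches NMOS drive.

def tg_resistance_vs_nmos(pmos_to_nmos_drive: float) -> float:
    """TG on-resistance relative to an NMOS-only switch of the same width."""
    r_n = 1.0
    r_p = 1.0 / pmos_to_nmos_drive
    return (r_n * r_p) / (r_n + r_p)   # parallel combination

print(tg_resistance_vs_nmos(0.66))  # 22nm: PMOS drive 66% of NMOS -> ~0.60
print(tg_resistance_vs_nmos(0.37))  # 0.35um: 37% of NMOS -> ~0.73
```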
The area-delay product for each FPGA design is given in Table 4.6. With no gate boosting, transmis-
sion gate FPGAs have an area-delay product that is 14% lower than pass-transistor FPGAs. However,
given the right amount of gate boosting (in this case somewhere between +0.1V and +0.2V), pass-
transistor FPGAs eventually become more efficient than transmission gate FPGAs.
Table 4.7 shows dynamic power, normalized to the non-gate boosted pass-transistor FPGA imple-
mentation. Transmission gate FPGAs consume slightly more power than pass-transistor FPGAs. This
is likely due to their larger tile area. The small decrease in power consumption experienced by pass-
transistor FPGAs with 0.1V of gate boosting is due to reduced short-circuit current. With 0.2V of gate
boosting however, the gains from reduced short-circuit current are lost due to the power increase from
higher voltage swings in the internals of the pass-transistor multiplexers.
4.2.6 Area and Delay Breakdown
Figure 4.5a shows the area contributions of different FPGA subcircuits for our pass-transistor FPGA
with 0.1V of gate boosting. Approximately 28% of the area is devoted to BLEs (LUT + FF) leaving 72%
of the area to routing. This number is lower than the 90% routing area commonly quoted in academic
work (e.g. [31]), but is higher than the commercial Stratix V architecture where routing area is said
to account for only 50% of tile area [35]. This discrepancy could be due to our architecture having
fewer features than commercial architectures (e.g adders, more complex FFs, LUTRAM, etc.). SRAM
cells account for 40% of tile area for pass-transistor FPGAs and 35% of tile area for transmission gate
FPGAs.
Figure 4.6: Critical path delay (ns) versus VDD for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG (curves: PT and TG, each with VG = VDD and with VG fixed at 0.8V).
The critical path contributions for our pass-transistor FPGA with 0.1V of gate boosting are shown
in Figure 4.5b. Approximately 26% of the critical path delay comes from the BLEs, 72% comes from
the routing and 2% comes from hard multipliers and block memory (where we use Stratix IV-like delay
values). The area and delay breakdowns for our other pass-transistor and transmission gate FPGAs
follow the same trends and are given in Appendix D.
4.3 Separating VDD and VG for Low-Power FPGAs
An FPGA that employs adaptive voltage scaling can trade delay for power by using an operating VDD
that is lower than its nominal supply voltage (VDDn). To reduce the delay penalty without adversely
affecting power, the resulting low-power FPGA can mimic the concept of gate boosting by lowering VDD
but not VG. What is particularly interesting about decoupling VDD and VG in this way is the fact that,
as long as VG <= VDDn, “gate boosting” low-power FPGAs does not pose a reliability risk as it does
for FPGAs running at VDDn where any amount of gate boosting results in VG > VDDn.
We explore the idea of adaptive voltage scaling with different VDD and VG on our non-gate boosted
pass-transistor and transmission gate FPGA implementations from Section 4.2 (that have been fully
optimized for VDD = 0.8V ) by experimenting with two low-power FPGA schemes. In the first, VDD and
VG are kept equal and are both lowered below 0.8V to produce a low-power mode. In the second, VG
is maintained at 0.8V and only VDD is lowered, resulting in a “gate boosted” low-power mode. Figures
4.6 and 4.7 show critical path delay and dynamic power (normalized to the pass-transistor FPGA with
VDD = VG = 0.8V ) respectively for both schemes. The results show that lowering VDD and VG to
0.6V results in a 2× power reduction for both pass-transistor and transmission gate FPGAs but a 6× and 2.5× increase in delay, respectively. However, if we maintain VG at 0.8V when VDD is lowered
to 0.6V, pass-transistor and transmission gate FPGA delays improve by 65% and 18% respectively at
no additional power cost. Clearly pass-transistor FPGAs are a very poor choice for low-power if gate
voltages are not maintained at VDDn.
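As a sanity check (standard αCV²f reasoning, not the thesis's HSPICE measurements), lowering VDD from 0.8V to 0.6V cuts switching energy per transition by (0.6/0.8)² ≈ 0.56, consistent with the roughly 2× power reduction observed:

```python
# Sanity check using standard alpha*C*V^2*f reasoning (our illustration,
# not the thesis's HSPICE measurements): lowering VDD from 0.8V to 0.6V
# cuts dynamic energy per transition by (0.6/0.8)^2 ~ 0.56, i.e. close to
# the ~2x power reduction reported, before accounting for frequency or
# short-circuit effects.

v_nominal, v_low = 0.8, 0.6
energy_ratio = (v_low / v_nominal) ** 2
print(round(energy_ratio, 3), round(1 / energy_ratio, 2))  # 0.562 1.78
```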
Figure 4.7: Normalized dynamic power versus VDD for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG (curves: PT and TG, each with VG = VDD and with VG fixed at 0.8V).
Figure 4.8: Power-delay product versus VDD for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG (curves: PT and TG, each with VG = VDD and with VG fixed at 0.8V).
Figure 4.8 shows that decoupling VDD and VG for low-power FPGAs is very beneficial. If we maintain
VG at 0.8V, the VDD yielding minimal power-delay product shifts from 0.8V to 0.7V where we experience
a 25% power reduction. In addition, the results indicate that transmission gate FPGAs always achieve
lower power-delay product than pass-transistor FPGAs in the low-power regime with a 26% advantage
at 0.6V.
Figure 4.9: Cluster output wire load for different locality. The wire load of cluster output A spans half a tile (locality), while that of cluster output B spans a full tile (no locality); both drive switch block multiplexers in the routing channel.
Figure 4.10: Cluster input wire load for different locality. The multiplexer input wire load for cluster input A can span up to half a tile, while that for cluster input B can span up to a full tile.
4.4 Track-Access Locality
In Section 3.8, we showed that wire loading at the logic-to-routing interface has a considerable impact
on delay. Prior academic work has implicitly assumed that logic cluster pins can access all the routing
tracks in an adjacent channel but has not considered the large logic-to-routing wire loading that this
creates. It is possible to reduce this wire load by imposing limits on the lengths of logic-to-routing wires.
We refer to this concept as track-access locality and we define track-access span as the portion of a routing
channel that can be accessed by a logic cluster input or output. A large span implies little locality and
vice-versa. Figure 4.9 illustrates this concept for logic cluster outputs. In the figure, output A can only
reach half of the routing tracks in a channel (the 50% physically close to it) while output B can reach
all of them. Output A has a track-access span of 1/2; output B has a track-access span of 1. Clearly,
output B has twice as much wire load as output A. Thus, output A is faster than output B. Figure 4.10
illustrates the same concept as it applies to logic cluster inputs. The wire loading associated with cluster
inputs comes from the wires required to connect routing tracks to the connection block multiplexers.
This wire loading is seen by the routing wire drivers and will tend to slow down the general routing
wires.
Track-access locality should not be confused with Fcin and Fcout. While Fcin and Fcout specify the
Table 4.8: Effect of cluster output track-access locality on area and delay. Input track-access span is set to 0.5.

Cluster Output Track-Access Span    Tile Area (µm2)    Delay (ps)    Area-Delay Product
1.00                                959                113           1.08
0.75                                934                115           1.07
0.50                                930                114           1.06
0.25                                938                112           1.05
Table 4.9: Effect of cluster input track-access locality on area and delay. Output track-access span is set to 0.25.

Cluster Input Track-Access Span    Tile Area (µm2)    Delay (ps)    Area-Delay Product
1.00                               952                127           1.21
0.75                               953                117           1.11
0.50                               938                112           1.05
0.25                               955                105           1.00
number of tracks to which a logic cluster input or output will connect, track-access span specifies the
fraction of the routing channel in which the logic cluster input or output is authorized to make these
connections. Consequently, track-access span defines an upper bound on Fcin and Fcout. For example,
if the cluster input track-access span is set to 0.25, Fcin cannot be made larger than 0.25, because when
Fcin = 0.25, a cluster input pin already connects to every routing track to which it has access.
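To make this relationship concrete, the following Python sketch (hypothetical helper functions, not part of COFFE; the window is approximated as centered on the pin) computes the tracks a pin may access for a given span and shows how the span caps the number of Fc connections:

```python
def accessible_tracks(pin_track, span, channel_width):
    """Window of track indices a pin may access: the fraction `span`
    of the channel that is physically closest to the pin."""
    window = max(1, round(span * channel_width))
    # Center the window on the pin, clamped to stay inside the channel.
    start = min(max(pin_track - window // 2, 0), channel_width - window)
    return range(start, start + window)

def fc_connections(span, channel_width, fc):
    """Tracks a pin actually connects to: Fc * W, capped at span * W."""
    return min(round(fc * channel_width), round(span * channel_width))

W = 100
# Span 0.5: the pin can reach half of the channel's tracks.
assert len(accessible_tracks(pin_track=10, span=0.5, channel_width=W)) == 50
# With span 0.25, even a requested Fcin of 0.4 saturates at 0.25 * W:
# the pin already connects to every track it has access to.
assert fc_connections(span=0.25, channel_width=W, fc=0.4) == 25
```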
We use COFFE to size the transistors of the FPGA architecture described in Section 3.8.1 for
different degrees of track-access locality. Table 4.8 shows the effect of cluster output locality on tile
area and representative path delay while Table 4.9 shows results for cluster input locality. The results
suggest that reducing the input track-access span can lead to a large reduction in loading (∼17% delay
reduction for a span of 0.25). The effect is smaller for cluster outputs, but we still observe a small reduction
in overall area-delay product. Although track-access locality seems beneficial from a delay perspective,
it could have a negative impact on routability since increasing locality could reduce the interconnect
flexibility. It follows that the ideal track-access span will likely also depend on the values of Fcin and
Fcout. For example, for our Fcin = 0.2 and Fcout = 0.025 architecture, cluster outputs may be better
suited to high locality because they connect to relatively few routing multiplexers due to
a low Fcout value. A detailed analysis of these tradeoffs was not performed in this work but merits
future research. VPR currently does not support track-access locality – it spreads switches across the
entire routing channel. Therefore, VPR code changes would also be necessary to investigate track-access
locality thoroughly. When used with an architecture exploration tool such as VPR, COFFE enables
a thorough evaluation of such architectural issues which combine changes in connectivity, loading and
transistor sizing.
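The normalized area-delay products in Tables 4.8 and 4.9 can be re-derived from their area and delay columns. The short check below (plain Python; values transcribed from Table 4.9, normalized to the best case of span 0.25) reproduces the reported products:

```python
# Input track-access span -> (tile area in um^2, delay in ps), Table 4.9.
rows = {
    1.00: (952, 127),
    0.75: (953, 117),
    0.50: (938, 112),
    0.25: (955, 105),
}
# Normalize each area * delay product to the smallest product.
best = min(a * d for a, d in rows.values())
norm = {s: round(a * d / best, 2) for s, (a, d) in rows.items()}
assert norm == {1.00: 1.21, 0.75: 1.11, 0.50: 1.05, 0.25: 1.00}
```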
Chapter 5
Conclusions and Future Work
5.1 Summary
The transistor-level design of an FPGA has a large impact on its area, delay and power characteristics.
High-quality transistor-level design is thus important to obtain efficient FPGAs. Transistor-level design is also essential to thorough architecture exploration, where multiple candidate architectures must each be designed at the transistor level before they can be evaluated through placement and routing experiments. Automated transistor-level design tools are therefore invaluable for creating highly efficient FPGA architecture implementations in reasonable amounts of time.
In this thesis, we developed COFFE, a new fully-automated transistor sizing tool for FPGAs1. We
showed that for fine-grained transistor-level design in advanced process nodes, modeling transistors as
linear resistances and capacitances as in previous transistor sizing tools is highly inaccurate. For that
reason, COFFE maintains all circuit non-linearities by relying exclusively on HSPICE simulations to
measure delay. COFFE estimates area with a version of the minimum-width transistor area model to
which we’ve made a number of improvements to enhance its accuracy in advanced process nodes. We
showed that only accounting for the loading effects of long wires as has been done in prior work can lead
to delay under-predictions of 24%. To ensure realistic transistor sizing, COFFE automatically models all
wire loads, without requiring manual layout. These enhanced models have an important architectural
impact: they favor larger transistors in FPGA lookup tables and multiplexers.
In the second part of this thesis, we used COFFE to investigate a number of FPGA circuit design
related questions. First, we re-investigated logic block output pin flexibility (Fcout) as this is an architec-
tural question that hasn’t been fully investigated for single-driver routing architectures and multi-output
BLEs. We found that for an N = 10, K = 6 architecture, an Fcout = 0.025W yields an FPGA with
better area delay product than the Fcout = 0.1W recommended by prior work.
Second, we compared the area, delay and power of transmission gate-based FPGAs to those of
pass-transistor FPGAs in 22nm process technology as pass-transistor performance and reliability have
been degrading with technology scaling. We showed that transmission gate FPGAs consume 15% more
area than pass-transistor FPGAs but are 25%, 16% and 10% faster for 0V, 0.1V, and 0.2V of gate
boosting respectively. In terms of area-delay product, transmission gate FPGAs are 14% better than
pass-transistor FPGAs without gate boosting but 2% worse with 0.2V of gate boosting. Clearly, if gate boosting is not permitted, building FPGAs out of transmission gates is the better choice. However, given enough gate boosting, pass-transistor FPGAs are still more efficient. Even if 0.2V of gate boosting is safe, however, a case can be made for transmission gate FPGAs due to the reliability concerns associated with pass-transistors in advanced process technology, as they incur only a 2% area-delay product increase and a 5% power increase.

Footnote 1: COFFE is available online at: http://www.eecg.utoronto.ca/~vaughn/software.html
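The area-delay tradeoff quoted above can be sanity-checked to first order from the individual area and delay figures (the thesis numbers come from full COFFE runs, so small rounding differences against the quoted 14% and 2% are expected):

```python
# Transmission gate FPGAs: 15% larger, and 25% (no boosting) or
# 10% (0.2V boosting) faster than pass-transistor FPGAs.
area_ratio = 1.15
ad_no_boost = area_ratio * (1 - 0.25)  # ~0.86: about 14% better area-delay
ad_boost_02 = area_ratio * (1 - 0.10)  # ~1.04: slightly worse area-delay
assert abs(ad_no_boost - 0.8625) < 1e-9
assert abs(ad_boost_02 - 1.035) < 1e-9
```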
Third, we explored the idea of using separate VDD and VSRAM+ voltages for low-power FPGAs. We
found that maintaining VSRAM+ at the nominal VDD of 0.8V when lowering VDD to 0.6V to reduce power
helps reduce the delay penalty associated with low-power operation. We also found that transmission
gate FPGAs always have a better power-delay product than pass-transistor FPGAs. Therefore, if low-
VDD operation is desired, transmission gate FPGAs that maintain VSRAM+ at the nominal VDD yield
the best power-delay product.
Finally, we investigated a new architectural question concerning the wire loading at the interface
between routing channels and logic blocks. We found that, at a possible cost in routability, restricting
the portion of a routing channel that can be accessed by a logic block input can reduce delay by up to
17%.
5.2 Future Work
COFFE enables three categories of future work. The first involves using COFFE alongside VPR to
perform architecture exploration in advanced process nodes. Investigations into the best values of architecture parameters such as lookup table size and logic cluster size have been performed before. However, as process technology scales, the characteristics of the underlying circuitry may change,
causing the answer to these architectural questions to change as well. COFFE makes re-visiting these
questions much easier as it provides automated transistor-level design for each architecture. Architectural
extensions to COFFE also fall into this category of future work. For example, COFFE could be extended
to support fracturable LUTs and carry chains.
The second category of future work enabled by COFFE consists of circuit design investigations. In
this thesis, we investigated the area, delay and power impact of replacing pass-transistors in FPGAs with
transmission gates in conventional 22nm process technology. It would be interesting to also investigate
the impact of using FinFETs to build both pass-transistor and transmission gate FPGAs. Future work
could also include circuit topology investigations such as optimal internal buffer placement in LUTs and
efficient multiplexer topologies.
The final category of future work consists of exploring the interactions between architecture and
circuit design. For example, our circuit-level investigation of track-access locality found that reducing the portion of a routing channel that can be accessed by a logic block input is beneficial for delay. However, this could also have an architectural implication: reduced
routability. Therefore, this is an architecture and circuit design interaction that merits a more thorough
investigation through modifications to VPR to examine the impact of track-access locality on routability.
Appendix A
N-well Sharing Sample Layout
Figure A.2 shows how pass-transistor multiplexers such as that of Figure A.1 can be laid out to efficiently
share N-wells. In this sample layout, the PMOS transistors of two multiplexers share an N-well by being
placed in a vertical strip that is two transistors wide. This two-transistor-wide strip of PMOS transistors
can be made taller as needed to add more multiplexers to the layout. Thus, the layout of Figure A.2 is
such that transistors requiring N-well spacing only require it on one side, which effectively reduces the
amount of N-well spacing required by 75% compared to a layout where each transistor is in a separate
well. Note that the layout of Figure A.2 is a simplified representation, as it assumes minimum-width
transistors and does not show metal layout details.
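The 75% figure follows from simple side-counting. The sketch below is an illustrative model (not derived from actual design rules): it assumes a transistor in its own well pays N-well spacing on all four sides, while a transistor in the shared two-wide strip pays it on one side only:

```python
# Per-transistor N-well spacing cost, counted in sides that need spacing.
spacing_sides_isolated = 4  # each transistor in its own separate well
spacing_sides_shared = 1    # shared two-wide strip: spacing on one side only
reduction = 1 - spacing_sides_shared / spacing_sides_isolated
assert reduction == 0.75    # the 75% reduction quoted in the text
```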
Figure A.1: A single-level 4:1 pass-transistor multiplexer with two-stage buffer and level restorer. (Schematic labels: inputs in1-in4, output out, pass transistors q1-q9, internal nodes n1 and n2.)
Figure A.2: Sample multiplexer layout with N-well sharing. (Layout of the pass-transistor 4:1 single-level MUX with two-stage buffer and level-restorer of Figure A.1; the legend marks the N-well, NMOS and PMOS regions; transistors q1-q9 and nodes n1, n2 correspond to Figure A.1.)
Appendix B
FPGA Circuitry Schematics
This appendix gives pass-transistor circuit schematics for all FPGA subcircuits designed in this thesis.
Figure B.1: 6-LUT. (Schematic labels: inputs IN_A-IN_F fed by LUT input drivers; SRAM cells; pass-transistor levels lvl1-lvl6; buffers buf1-buf5.)
Figure B.2: LUT input driver. (Schematic labels: buffers buf1-buf3.)
Figure B.3: LUT input driver with register feedback multiplexer. (Schematic labels: register feedback multiplexer lvl1 selecting between the FF output and the local routing multiplexer output; SRAM cell; level-restorer; buffers buf1-buf5.)
Figure B.4: Two-level multiplexer used for switch block, connection block and local routing multiplexers. (Schematic labels: first- and second-level pass transistors lvl1 and lvl2; SRAM cells; level-restorer; two-stage buffer buf1/buf2; output out.)
Figure B.5: 2:1 multiplexer used for BLE outputs. (Schematic labels: pass transistors lvl1; SRAM cell; level-restorer; two-stage buffer buf1/buf2; output out.)
Figure B.6: Flip-flop with register input selection multiplexer. (Schematic labels: input select MUX lvl1; master-slave register with buffers buf1-buf6, clock transistors clk1/clk2, and set/reset transistors.)
Appendix C
Detailed Transistor Sizing Results
This appendix gives transistor sizes for all subcircuits of our pass-transistor (PT) and transmission gate
(TG) FPGAs for different levels of gate boosting. See Appendix B for corresponding schematics with
transistor labels.
Table C.1: Lookup table transistor sizes.

Type  VG          buf1          lvl1          lvl2          lvl3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   5.5    2      −      3      −      4      −      3
PT    VDD + 0.1   5.5    2      −      3      −      4      −      3
PT    VDD + 0.2   5.5    2      −      3      −      4      −      3
TG    VDD + 0.0   2      2.7    1      1      1      1      1      1
TG    VDD + 0.1   2      2.7    1      1      1      1      1      1
TG    VDD + 0.2   2      2.7    1      1      1      1      1      1

Type  VG          buf2          buf3          lvl4          lvl5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   1      4.7    6.1    2      −      9      −      10
PT    VDD + 0.1   1      4.7    6.1    2      −      9      −      10
PT    VDD + 0.2   1      4.7    6.1    2      −      9      −      10
TG    VDD + 0.0   3.4    2      8.0    7      6      6      4      4
TG    VDD + 0.1   3.4    2      8.0    7      6      6      4      4
TG    VDD + 0.2   3.4    2      8.0    7      6      6      4      4

Type  VG          lvl6          buf4          buf5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      7      3      14.0   9.9    4
PT    VDD + 0.1   −      7      3      13.9   10.0   4
PT    VDD + 0.2   −      7      3      13.3   10.0   4
TG    VDD + 0.0   3      3      7.8    7      9.0    6
TG    VDD + 0.1   3      3      7.1    7      8.5    6
TG    VDD + 0.2   3      3      7.7    7      9.6    7
Table C.2: Switch block multiplexer transistor sizes.

Type  VG          lvl1          lvl2          buf1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      3      −      3      3      3.2    20.7   11
PT    VDD + 0.1   −      2      −      2      7.0    3      31.6   12
PT    VDD + 0.2   −      2      −      2      12.6   3      37.5   14
TG    VDD + 0.0   1      1      1      1      3      4.3    35.7   21
TG    VDD + 0.1   1      1      1      1      4      4.3    39.9   19
TG    VDD + 0.2   1      1      1      1      5.6    4      44.9   19
Table C.3: Connection block multiplexer transistor sizes.

Type  VG          lvl1          lvl2          buf1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      2      −      3      3      3.5    17.8   4
PT    VDD + 0.1   −      2      −      3      4.9    2      16.0   3
PT    VDD + 0.2   −      2      −      3      10.6   2      16.4   3
TG    VDD + 0.0   1      1      1      1      3      4.6    14.0   6
TG    VDD + 0.1   1      1      1      1      3      3.1    14.1   5
TG    VDD + 0.2   1      1      1      1      3.2    2      16.2   5
Table C.4: Local routing multiplexer transistor sizes. Note: we do not give a size for buf2 of the local routing multiplexer, as it is replaced by the LUT input driver of Figure B.2.

Type  VG          lvl1          lvl2          buf1
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      2      −      3      2      3.9
PT    VDD + 0.1   −      2      −      3      4.0    2
PT    VDD + 0.2   −      2      −      2      8.2    2
TG    VDD + 0.0   1      1      1      1      1      1.1
TG    VDD + 0.1   1      1      1      1      2      2.2
TG    VDD + 0.2   1      1      1      1      2.5    2
Table C.5: BLE output to local interconnect.

Type  VG          lvl1          buf1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      3      2      4.6    16.6   4
PT    VDD + 0.1   −      2      2.6    2      10.0   3
PT    VDD + 0.2   −      2      4.2    2      12.6   4
TG    VDD + 0.0   1      1      1.6    1      6.8    4
TG    VDD + 0.1   1      1      2      2.4    9.0    5
TG    VDD + 0.2   1      1      1.5    1      7.2    4
Table C.6: BLE output to general routing.

Type  VG          lvl1          buf1          buf2 (a)
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      3      3      6.3    32.6   20
PT    VDD + 0.1   −      3      5.0    5      39.1   23
PT    VDD + 0.2   −      2      5.1    4      39.9   24
TG    VDD + 0.0   1      1      4.4    3      41.4   27
TG    VDD + 0.1   1      1      4      4.8    41.5   28
TG    VDD + 0.2   1      1      4.2    4      41.9   28

(a) These transistor sizes were generated with an older version of COFFE. COFFE currently sizes the BLE output driver (buf2) smaller than shown in this table. For example, for a pass-transistor FPGA with VG = VDD + 0.2V, buf2 has NMOS = 4 and PMOS = 9.9.
Table C.7: Flip-flop and register selection multiplexer transistor sizes.

Type  VG          lvl1          buf1          clk1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      5      3.5    3      1      1      4      3
PT    VDD + 0.1   −      5      8.2    3      1      1      4      3
PT    VDD + 0.2   −      5      12.6   3      1      1      4      3
TG    VDD + 0.0   2      2      3      3.2    1      1      4      3
TG    VDD + 0.1   2      2      3      3.7    1      1      4      3
TG    VDD + 0.2   2      2      3.0    3      1      1      4      3

Type  VG          buf3          clk2          buf4          buf5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   1.3    1      1      1      1.3    1      1.3    1
PT    VDD + 0.1   1.3    1      1      1      1.3    1      1.3    1
PT    VDD + 0.2   1.3    1      1      1      1.3    1      1.3    1
TG    VDD + 0.0   1.3    1      1      1      1.3    1      1.3    1
TG    VDD + 0.1   1.3    1      1      1      1.3    1      1.3    1
TG    VDD + 0.2   1.3    1      1      1      1.3    1      1.3    1

Type  VG          buf6          set           reset
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   9.4    4      1      1      1      1
PT    VDD + 0.1   9.7    4      1      1      1      1
PT    VDD + 0.2   9.0    4      1      1      1      1
TG    VDD + 0.0   8.0    4      1      1      1      1
TG    VDD + 0.1   7.8    5      1      1      1      1
TG    VDD + 0.2   7.9    5      1      1      1      1
Table C.8: LUT input driver A.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   6.6    3      2.0    1      3      3.3
PT    VDD + 0.1   6.6    3      2.0    1      3      3.3
PT    VDD + 0.2   7.8    4      2.0    1      4      4.0
TG    VDD + 0.0   5.8    4      1.5    1      5.9    4
TG    VDD + 0.1   5.8    4      1.5    1      5.9    4
TG    VDD + 0.2   5.8    4      1.5    1      5.9    4
Table C.9: LUT input driver B.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.5    2      1.9    1      2      2.1
PT    VDD + 0.1   4.5    2      1.9    1      2      2.1
PT    VDD + 0.2   8.1    4      2.0    1      5.0    4
TG    VDD + 0.0   4.4    3      1.6    1      4.8    3
TG    VDD + 0.1   4.4    3      1.6    1      4.8    3
TG    VDD + 0.2   4.4    3      1.6    1      4.8    3
Table C.10: LUT input driver C with register feedback multiplexer (Figure B.3).

Type  VG          buf1          lvl1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   8.6    4      −      3      2      6.3
PT    VDD + 0.1   7.0    3      −      2      2      2.4
PT    VDD + 0.2   7.9    3      −      2      2.2    2
TG    VDD + 0.0   6.1    4      1      1      1.2    1
TG    VDD + 0.1   7.2    4      1      1      2      2.11
TG    VDD + 0.2   8.2    4      1      1      2.2    2

Type  VG          buf3          buf4          buf5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.0    2      1.8    1      2.1    2
PT    VDD + 0.1   3.6    2      2.2    1      2.2    2
PT    VDD + 0.2   3.6    2      2.2    1      2.2    2
TG    VDD + 0.0   2.8    2      1.5    1      3.3    2
TG    VDD + 0.1   3.0    2      1.4    1      2.3    2
TG    VDD + 0.2   3.0    2      1.5    1      3.3    2
Table C.11: LUT input driver D.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.7    2      1.9    1      2      2.2
PT    VDD + 0.1   4.7    2      1.9    1      2      2.2
PT    VDD + 0.2   8.1    4      2.0    1      5.0    4
TG    VDD + 0.0   5.8    4      1.5    1      6.2    4
TG    VDD + 0.1   5.8    4      1.5    1      6.2    4
TG    VDD + 0.2   5.8    4      1.5    1      6.2    4
Table C.12: LUT input driver E.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.7    2      1.9    1      2      2.4
PT    VDD + 0.1   4.7    2      1.9    1      2      2.4
PT    VDD + 0.2   6.1    3      2.0    1      3.6    3
TG    VDD + 0.0   4.6    3      1.6    1      4.9    3
TG    VDD + 0.1   4.6    3      1.6    1      4.9    3
TG    VDD + 0.2   4.6    3      1.6    1      4.9    3
Table C.13: LUT input driver F.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.3    2      1.9    1      2      2.2
PT    VDD + 0.1   4.3    2      1.9    1      2      2.2
PT    VDD + 0.2   6.2    3      1.9    1      3.6    3
TG    VDD + 0.0   4.7    3      1.5    1      4.9    3
TG    VDD + 0.1   4.7    3      1.5    1      4.9    3
TG    VDD + 0.2   4.7    3      1.5    1      4.9    3
Appendix D
Area and Delay Breakdown
Table D.1: Tile area breakdown.

Type  VG          SB MUX   CB MUX   Local MUX   LUT     FF     BLE Output
PT    VDD + 0.0   31.5%    22.0%    16.4%       26.9%   1.1%   2.1%
PT    VDD + 0.1   31.3%    22.0%    16.4%       26.9%   1.2%   2.2%
PT    VDD + 0.2   32.1%    21.8%    16.1%       26.8%   1.1%   2.1%
TG    VDD + 0.0   31.5%    24.3%    17.2%       24.1%   1.0%   1.9%
TG    VDD + 0.1   31.7%    24.2%    17.2%       24.0%   1.0%   1.9%
TG    VDD + 0.2   32.0%    24.1%    17.2%       23.8%   1.0%   1.9%
Table D.2: Critical path delay breakdown.

Type  VG          SB MUX   CB MUX   Local MUX   LUT     FF     BLE Output   Other
PT    VDD + 0.0   37.1%    18.4%    17.0%       20.0%   0.4%   5.1%         1.9%
PT    VDD + 0.1   36.7%    14.5%    16.2%       25.4%   0.4%   4.7%         2.0%
PT    VDD + 0.2   36.2%    12.9%    13.4%       29.4%   0.5%   4.8%         2.8%
TG    VDD + 0.0   41.9%    16.2%    14.8%       20.5%   0.4%   3.8%         2.3%
TG    VDD + 0.1   41.5%    15.4%    12.8%       22.7%   0.5%   4.2%         3.0%
TG    VDD + 0.2   40.1%    13.7%    12.5%       26.1%   0.5%   3.9%         3.3%