Optimization and Modeling of FPGA Circuitry in Advanced Process Technology
by
Charles Chiasson
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2013 by Charles Chiasson
Abstract
Optimization and Modeling of FPGA Circuitry in Advanced Process Technology
Charles Chiasson
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2013
We develop a new fully-automated transistor sizing tool for FPGAs that features area, delay and wire
load modeling enhancements over prior work to improve its accuracy in advanced process nodes. We then
use this tool to investigate a number of FPGA circuit design related questions in a 22nm process. We
find that building FPGAs out of transmission gates instead of the currently dominant pass-transistors,
whose performance and reliability are degrading with technology scaling, yields FPGAs that are 15%
larger but are 10-25% faster depending on the allowable level of “gate boosting”. We also show that
transmission gate FPGAs with a separate power supply for their gate terminal enable a low-voltage
FPGA with 50% less power and good delay. Finally, we show that, at a possible cost in routability,
restricting the portion of a routing channel that can be accessed by a logic block input can improve delay
by 17%.
Acknowledgements
First, I would like to express my sincerest gratitude to my supervisor Vaughn Betz for his guidance
and motivation, for his technical help and for the tidbits of wisdom that he shared with me, knowingly
or unknowingly, over the past two years. I learned so much in so little time and cannot imagine having
had a better mentor.
I also extend thanks to the other graduate students in Vaughn Betz’s research group for all their
help and support. Also, thanks for the lunch outings, the coffee breaks and the squash matches, among
other things, that provided those much needed distractions.
I would like to thank the Natural Sciences and Engineering Research Council of Canada, Altera Cor-
poration and the University of Toronto for their financial support. Thanks also go to CMC Microsystems
for providing the CAD tools used throughout this research. I would also like to thank David Lewis from
Altera Corporation for the insightful discussions.
Finally, thanks must undoubtedly go to my parents for nurturing my inherent desire to know why,
for making me love the smell of new books, and simply, for being the best parents a kid could ask for.
All of my accomplishments are certainly due to them.
Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 FPGA Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Logic Block Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 Routing Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Commercial BLE Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 FPGA Architecture Assessment Methodology . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 FPGA Circuit Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.1 SRAM cells . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.2 Routing Multiplexers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.3 Lookup Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.4 Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4 Modeling of FPGA Circuitry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Area Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 Delay Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Automated Transistor Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 COFFE: Automated Optimization of FPGA Circuitry 17
3.1 Introduction to COFFE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 Circuit Topologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Area Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.5 Delay Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.5.1 Non-Linearity of Transistor Resistance and Capacitance . . . . . . . . . . . . . . . 26
3.5.2 Topology Dependence of Transistor Resistance . . . . . . . . . . . . . . . . . . . . 26
3.6 Wire Load Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7 Transistor Sizing Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.7.1 Divide-and-Conquer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7.2 Pre-Determined P/N Ratios . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.7.3 Detailed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.8 Impact of Improved Wire Load Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8.1 Base Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.8.2 Target Process Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.9 Integration of COFFE with VPR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4 Efficient FPGA Circuitry 35
4.1 Fcout for Single-Driver Routing and Multiple BLE Outputs . . . . . . . . . . . . . . . . . 35
4.2 Transmission Gate FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Pass-Transistor Scaling Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.2 Replacing Pass-Transistors with Transmission Gates . . . . . . . . . . . . . . . . . 37
4.2.3 Gate-Boosting Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.4 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.6 Area and Delay Breakdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3 Separating VDD and VG for Low-Power FPGAs . . . . . . . . . . . . . . . . . . . . . . . . 45
4.4 Track-Access Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 Conclusions and Future Work 49
5.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A N-well Sharing Sample Layout 51
B FPGA Circuitry Schematics 53
C Detailed Transistor Sizing Results 56
D Area and Delay Breakdown 61
Bibliography 62
List of Tables
3.1 COFFE’s expected input architecture parameters. . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Resistance of a 4× minimum-width NMOS transistor for different circuit topologies (Fig-
ure 3.9) and switching-thresholds. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Rise-fall re-balancing and the effect of M on COFFE’s transistor sizing solutions (example). 31
3.4 Base architecture parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.5 Subcircuit count per tile for base architecture. . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.6 Metal layer data used by COFFE for all circuit design investigations (ITRS [19]). . . . . . 33
3.7 Impact of wire loading. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.1 Effect of Fcout on channel width and switch block multiplexers. . . . . . . . . . . . . . . . 36
4.2 Area and delay for different Fcout values. . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.3 Pass-transistor and transmission gate FPGA tile area for different levels of gate boosting. 42
4.4 Switch block multiplexer transistor sizes for PT and TG implementations for different
levels of gate boosting (see Figure 4.2 for transistor labels). Note that with the exception
of P/N ratios, COFFE uses integer granularity. . . . . . . . . . . . . . . . . . . . . . . . . 42
4.5 Pass-transistor and transmission gate FPGA critical path delay for different levels of gate
boosting (VTR benchmarks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.6 Pass-transistor and transmission gate FPGA area-delay product for different levels of gate
boosting (VTR benchmarks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.7 Pass-transistor and transmission gate FPGA relative dynamic power for different levels
of gate boosting (VTR benchmarks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.8 Effect of cluster output track-access locality on area and delay. Input track-access span
is set to 0.5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 Effect of cluster input track-access locality on area and delay. Output track-access span
is set to 0.25. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
C.1 Lookup table transistor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
C.2 Switch block multiplexer transistor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
C.3 Connection block multiplexer transistor sizes. . . . . . . . . . . . . . . . . . . . . . . . . . 57
C.4 Local routing multiplexer transistor sizes. Note: we do not give a size for buf2 of the local
routing multiplexer as it is replaced by the LUT input driver of Figure B.2. . . . . . . . . 57
C.5 BLE output to local interconnect. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
C.6 BLE output to general routing. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
C.7 Flip-flop and register selection multiplexer transistor sizes. . . . . . . . . . . . . . . . . . . 58
C.8 LUT input driver A. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
C.9 LUT input driver B. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
C.10 LUT input driver C with register feedback multiplexer (Figure B.3). . . . . . . . . . . . . 59
C.11 LUT input driver D. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
C.12 LUT input driver E. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
C.13 LUT input driver F. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
D.1 Tile area breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
D.2 Critical path delay breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
List of Figures
1.1 Architecture exploration with manual (a) and automated (b) transistor-level design. . . . 2
2.1 Tile-based FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Basic logic element (BLE). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Logic cluster architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 Routing segment lengths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.5 Multi-driver and single-driver routing architectures. . . . . . . . . . . . . . . . . . . . . . . 8
2.6 FPGA architecture assessment methodology with VPR. . . . . . . . . . . . . . . . . . . . 9
2.7 Six transistor SRAM cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.8 Different 8:1 pass-transistor multiplexer topologies. . . . . . . . . . . . . . . . . . . . . . . 11
2.9 Multiplexer followed by two-stage buffer with PMOS level-restorer. . . . . . . . . . . . . . 11
2.10 Fully encoded MUX tree 3-LUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.11 Minimum-width transistor area model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.12 Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c). Note:
Although not shown in the figure for simplicity, parallel diffusions must be connected
together. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 FPGA design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 COFFE’s supported tile architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3 COFFE’s routing multiplexer circuit topologies. . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Fully encoded MUX tree 6-LUT with internal re-buffering (partial view). . . . . . . . . . 22
3.5 Static transmission gate-based master-slave register. . . . . . . . . . . . . . . . . . . . . . 22
3.6 Transistor area prediction accuracy of original (Eq. 2.2) and improved (Eq. 3.1) area
models against TSMC 65nm layouts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.7 Combining diffusion widening and parallel diffusion regions yields denser layouts (c). . . . 24
3.8 A switch-level model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.9 Circuits used to measure transistor resistance. . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.10 Inverter NMOS and PMOS resistivity vs. transistor width. . . . . . . . . . . . . . . . . . 27
3.11 COFFE’s transistor sizing algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 VDD and VTh scaling trends [12]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Generic two-level routing multiplexer with two-stage buffer implemented with pass-transistors
(a) and transmission gates (b). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Effect of different gate boosting strategies on transmission gate switch block multiplexer
delay (VDD = 0.8V ). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 CAD flow for each FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.5 Tile area and critical path delay breakdown. . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.6 Critical path delay for pass-transistor (PT) and transmission gate (TG) FPGAs for dif-
ferent VDD and VG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.7 Dynamic power for pass-transistor (PT) and transmission gate (TG) FPGAs for different
VDD and VG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.8 Power-delay product for pass-transistor (PT) and transmission gate (TG) FPGAs for dif-
ferent VDD and VG. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.9 Cluster output wire load for different locality. . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.10 Cluster input wire load for different locality. . . . . . . . . . . . . . . . . . . . . . . . . . . 47
A.1 A single-level 4:1 pass-transistor multiplexer with two-stage buffer and level restorer. . . . 51
A.2 Sample multiplexer layout with N-well sharing. . . . . . . . . . . . . . . . . . . . . . . . . 52
B.1 6-LUT. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
B.2 LUT input driver. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
B.3 LUT input driver with register feedback multiplexer. . . . . . . . . . . . . . . . . . . . . . 54
B.4 Two-level multiplexer used for switch block, connection block and local routing multiplexers. 54
B.5 2:1 multiplexer used for BLE outputs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
B.6 Flip-flop with register input selection multiplexer. . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter 1
Introduction
1.1 Motivation
The design and fabrication of modern digital integrated circuits cost tens to hundreds of millions of
dollars and require large teams of engineers and years of effort. Indeed, the cost of developing a new
20nm chip has been estimated to be as high as $160 million USD [1]. While this may be acceptable for
high-volume applications, it can be a significant burden for lower-volume designs, often preventing them
from being fabricated in the latest process technologies. Instead of being fabricated as a custom chip
such as a standard cell-based application-specific integrated circuit (ASIC) or a full custom design, a dig-
ital design can be implemented in a field-programmable gate array (FPGA). FPGAs are pre-fabricated,
programmable devices into which one can implement any arbitrary digital design in a matter of seconds.
Therefore, FPGAs are an attractive alternative to ASICs or full custom designs because they allow the
high non-recurring engineering costs and lengthy design times associated with semiconductor manufacturing
to be completely avoided. However, FPGAs are less attractive for digital designs that demand high density,
high performance, or low power: it has been shown that FPGAs require 35× more silicon area, are 4× slower
and consume 14× more dynamic power than ASICs [25]. Accordingly, minimizing
the FPGA-to-ASIC gap, that is, making FPGAs as efficient as possible such that they become a
competitive implementation medium for all types of applications, is one of the primary drivers of FPGA
research for both academic researchers and commercial FPGA manufacturers.
The area, performance and power characteristics of an FPGA can be optimized at two main lev-
els: architecture and transistor-level design. The architecture of an FPGA is defined by a number of
parameters that describe the style and flexibility of its soft-logic blocks, dedicated hard-blocks and in-
terconnect. Finding an architecture that meets specific design goals and constraints involves setting
these architectural parameters to specific values. However, these parameters interact in complex ways to
produce area, delay and power trade-offs that are very difficult to quantify through analytical methods.
For that reason, finding the right architectural parameter values is usually accomplished experimentally
with automated architecture exploration tools such as VPR [7].
For any architecture, there are a number of different transistor-level implementations. Transistor-
level design consists of choosing circuit topologies for an architecture as well as sizing the transistors of
those circuits. Both circuit topology selection and transistor sizing provide opportunities to optimize the
area, delay and power of the architecture. In prior FPGA research work, transistor-level design was often
Chapter 1. Introduction 2
Manual transistor-level design
Evaluate architecture
Change architecture parameters
Initial architecture parameters
(a) Manual transistor-level design.
Automated transistor-level design
Evaluate architecture
Change architecture parameters
Initial architecture parameters
(b) Automated transistor-level design.
Figure 1.1: Architecture exploration with manual (a) and automated (b) transistor-level design.
performed manually, making it a task that required a significant amount of time and effort. This often
had a negative impact on the architecture exploration flow, which would proceed as follows. Manual
transistor-level design would be performed on some initial architecture. Then, this architecture would be
assessed with an architecture exploration tool such as VPR. Based on the results of the assessment, the
architecture parameters would be adjusted and the evaluation process would be repeated. Ideally, one
would then re-optimize the transistor-level design to match the new architecture parameters. However,
since manual transistor-level design was such a time and effort intensive task, this step would often be
skipped. It was assumed that transistor sizes obtained with a previous architecture still applied to the
new architecture and this new architecture was then evaluated without re-optimizing its transistor-level
design. This architecture exploration flow is illustrated in Figure 1.1a. The new architecture could likely
be made more efficient if its transistor sizes were re-optimized. As well, the detailed impact of new
wire loads as the architecture and its area change has often not been rigorously modeled, possibly
leading to inaccurate architecture conclusions. In an environment where FPGAs need to be as efficient
as possible to compete with ASICs, new architectures should be evaluated in their most efficient state.
It follows that re-optimizing the transistor sizes as the FPGA architecture is changed provides a more
thorough design space exploration and should yield more efficient FPGAs.
Automating the transistor-level design of FPGAs enables such frequent re-optimization (Figure 1.1b).
In addition, an automated transistor-level design tool facilitates investigations relating to efficient FPGA
circuitry. For example, an automated transistor-level design tool could be used to explore the impact of
different circuit topologies or the impact of different layout choices on the area, delay and power of an
FPGA.
This thesis consists of two parts. In the first, we develop COFFE (Circuit Optimization For FPGA
Exploration), a new fully-automated transistor sizing tool for FPGAs. Although an FPGA-specific
transistor sizing tool has been developed in prior work [24], we have made significant improvements that
are necessary in advanced process nodes. In the second part of this thesis, we use COFFE to investigate
a number of circuit design related questions in advanced process technology.
1.2 Thesis Organization
This thesis is organized as follows. Chapter 2 provides background information on FPGA architecture,
circuit design, modeling and optimization. Chapter 3 describes COFFE, a fully-automated transistor
sizing tool for FPGAs developed as part of this thesis, as well as our area and delay modeling enhance-
ments. A number of FPGA circuit design investigations are performed with COFFE in Chapter 4.
Finally, Chapter 5 concludes this thesis and suggests future work.
Chapter 2
Background
This thesis is focused on the transistor-level design of SRAM-based FPGAs and related computer-aided
design (CAD) tools. We develop a fully-automated transistor sizing tool for FPGAs in Chapter 3 and
use it to investigate a number of FPGA circuit design related questions in Chapter 4. This chapter
provides relevant background material. First, we review FPGA architecture and the standard FPGA
architecture assessment methodology. Then, we describe common practices in FPGA circuit design as
well as commonly used area and delay modeling techniques for these circuits. Finally, we review prior
work on automated transistor sizing.
2.1 FPGA Architecture
An FPGA consists of an array of tiles that can each implement a small amount of logic and routing.
Horizontal and vertical routing channels run on top of the tiles and allow them to be stitched together to
perform larger functions. Figure 2.1 illustrates FPGA tile architecture at a high-level. A logic block (LB)
supplies the tile’s logic functionality. Connection blocks (CBs) provide connectivity between logic block
inputs and routing channels. A switch block (SB) connects logic block outputs to routing channels and
provides connectivity between wires within the routing channels. One replicates this basic tile to obtain
a complete FPGA. Although Figure 2.1 shows logic and switching functions as distinct sub-blocks, an
interleaved layout is more realistic and is what we assume throughout this work.
The FPGA architecture described in the previous paragraph represents a generic soft-logic-based
FPGA. Modern FPGAs are more heterogeneous. That is, in addition to general purpose soft-logic
blocks, they also contain dedicated hard-blocks such as multipliers, block memories or even embedded
processors [36, 51, 4, 38]. In this work, we focus on the architecture and circuit design of the soft-logic
portion as it still forms the backbone of an FPGA and typically accounts for a large fraction of its area1
and critical path delay as shown in Section 4.2.6. However, since hard-blocks are an important part
of modern FPGA architectures, all our VPR [7] experiments are performed with architecture files that
contain multipliers and block memories along with our soft-logic blocks. We use the same multiplier and
block memory designs across all our VPR experiments, and hence they are constant and do not affect
the conclusions of our soft-logic investigations.
1In [50], it was reported that the core area of the largest Stratix III FPGA consists of ∼72% soft-logic and associated programmable routing; the other 28% being block memory and multipliers.
Chapter 2. Background 5
LB
SBCB
FPGA Tile
Routing Channel
CB
Figure 2.1: Tile-based FPGA.
K-LUT FF
Figure 2.2: Basic logic element (BLE).
2.1.1 Logic Block Architecture
Most FPGAs are built around the idea of using lookup tables (LUTs) to implement logic functions. A
K-input LUT can implement any combinational logic function of K inputs. Since digital designs are
rarely purely combinational, the basic logic element (BLE) of an FPGA consists of a K-LUT and a
flip-flop (FF) that both feed a 2:1 multiplexer which allows the output of the BLE to be driven by either
the LUT output or the FF output as illustrated in Figure 2.2 [7].
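As a concrete illustration of this behaviour, the sketch below models a BLE in Python: a K-LUT is simply a table of 2^K configuration bits indexed by its K inputs, and a 2:1 multiplexer selects either the combinational LUT output or the flip-flop output. The class and its names are illustrative, not code from this thesis.

```python
# Minimal behavioural model of a BLE: a K-LUT (2^K configuration bits
# indexed by the K inputs) feeding a 2:1 output mux that picks either
# the LUT output or the flip-flop output. Names are illustrative.

class BLE:
    def __init__(self, lut_bits, use_ff=False):
        self.lut_bits = lut_bits  # truth table of length 2^K
        self.use_ff = use_ff      # 2:1 mux select: registered output?
        self.ff = 0               # flip-flop state

    def lut(self, inputs):
        # The K input bits (MSB first) form the index into the table.
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.lut_bits[index]

    def clock(self, inputs):
        self.ff = self.lut(inputs)  # register the LUT output

    def output(self, inputs):
        return self.ff if self.use_ff else self.lut(inputs)

# A 2-LUT programmed as XOR: truth table entries for inputs 00,01,10,11.
xor_ble = BLE(lut_bits=[0, 1, 1, 0])
print(xor_ble.output([1, 0]))  # 1
```

Programming the FPGA amounts to loading `lut_bits` (and the mux select) from SRAM configuration cells; any 2-input function is obtained by changing only the table contents.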
Although an FPGA logic block could consist of a single BLE, it is much more common to group
several BLEs together in the same logic block to form a locally interconnected logic cluster as this fast
local interconnect can improve performance and save general routing area [7, 2]. The number of inputs
to a LUT (K) and the number of BLEs in a logic cluster (N) are two important architectural parameters
affecting the area and performance of an FPGA. Ahmed and Rose showed in [2] that K = 4 to 6 and
N = 4 to 10 are good choices in terms of area-delay product. Modern commercial architectures use
comparable values for N and K (Virtex 7: K=6, N=8 [51] and Stratix V: K=6, N=10 [35]).
As illustrated in Figure 2.3, a logic cluster’s local interconnect consists of two types of wires: local
feedback wires and cluster input wires. There are typically N local feedback wires in a cluster; one for
each BLE. Often, many BLEs in a cluster will share common inputs. Accordingly, the number of inputs
to a cluster (I) is less than the number of distinct BLE inputs in a cluster (i.e. N ×K). It was shown
in [2] that (2.1) is a good estimate of the number of inputs required to achieve 98% LUT utilization.
I = (K/2)(N + 1)     (2.1)
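Applying (2.1) to a few representative (K, N) pairs makes the input sharing concrete (these are illustrative calculations of the formula, not results from [2]):

```python
def cluster_inputs(K, N):
    """Estimated cluster inputs for ~98% LUT utilization, Eq. (2.1)."""
    return (K * (N + 1)) // 2

# For K = 6, N = 10 the estimate is well below the N*K = 60 distinct
# BLE inputs, reflecting input sharing among the BLEs in a cluster.
print(cluster_inputs(6, 10))  # 33
print(cluster_inputs(4, 4))   # 10
```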
BLE with internal details shown
BLE
BLE
Total of N BLEs
Local feedback
wires
I cluster inputs
K local routing MUXes
per BLE
K-LUT FF
N BLE outputs
Figure 2.3: Logic cluster architecture.
Local routing multiplexers connect multiple local interconnect wires to each BLE input. These
multiplexers are generally sparsely populated [29]. That is, BLE inputs can be connected to only a
fraction of the wires in the local interconnect; we refer to this fraction as Fclocal. Sparsely populating
the local routing multiplexers reduces their size and thus saves area. In [29], it was shown that reducing
Fclocal from 1.0 to 0.5 reduces area by 10% with no degradation in critical path delay. However, as
recommended by [29], between 2 to 5 spare cluster inputs should be added to (2.1) when sparsely
populating the local routing multiplexers to maintain routability.
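Under these conventions, the size of each local routing multiplexer follows directly from the cluster parameters: the local interconnect contains I cluster input wires plus N feedback wires, and each BLE input multiplexer connects to a fraction Fclocal of them. A hedged sketch (the helper name and example numbers are illustrative, not taken from [29]):

```python
def local_mux_size(I, N, fc_local):
    """Inputs per local routing mux under sparse population fc_local."""
    # I cluster input wires + N local feedback wires form the local
    # interconnect; each BLE input mux sees only a fraction of them.
    return round(fc_local * (I + N))

# Halving fc_local halves the mux size; per [29], fc_local = 0.5 saves
# ~10% area with no critical-path delay penalty (spare inputs aside).
print(local_mux_size(I=33, N=10, fc_local=1.0))  # 43
print(local_mux_size(I=33, N=10, fc_local=0.5))  # 22
```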
2.1.2 Routing Architecture
Logic blocks are interconnected by programmable routing channels that run horizontally and vertically
on top of a tile (Figure 2.1). The number of tracks in a routing channel is referred to as its width (W).
In this work, we assume that the width of horizontal and vertical routing channels are equivalent, but
it is possible for them to be different. For example, the horizontal routing channels on Stratix FPGAs
are wider than the vertical channels due to the rectangular layout of their logic blocks [34].
A routing track is composed of wire segments that span one or more tiles. The length (L) of a routing
segment specifies the number of tiles that it spans. For example, Figure 2.4 shows a routing channel
that consists of four tracks of L = 2 wire segments and four tracks of L = 4 wire segments. Note that
staggering the start points of wire segments as in Figure 2.4 is necessary for a tile-based layout as it
ensures that all tiles remain identical [8].
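The staggering requirement can be sketched concretely: for length-L segments, a group of L tracks starts its segments at offsets 0 through L-1, so every tile sees exactly one segment start per group and all tiles remain identical (an illustrative model, not thesis code):

```python
def segment_starts(L, num_tiles):
    """Tile indices where a new length-L segment begins, per track.

    Track t's segments start at tiles t, t+L, t+2L, ..., so across the
    L staggered tracks every tile sees exactly one segment start point,
    which keeps all tiles identical for a tile-based layout.
    """
    return {t: list(range(t, num_tiles, L)) for t in range(L)}

starts = segment_starts(L=4, num_tiles=8)
print(starts)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
# Every tile index 0..7 appears exactly once across the four tracks.
```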
A horizontal and a vertical routing channel intersect at each tile. The set of programmable switches
that allow connections to be made between routing tracks at this intersection is called a switch block
(SB in Figure 2.1). Switch block flexibility (FS) specifies the number tracks to which any track can
connect in a switch block. An FS of 3, where each horizontal track connects to another horizontal
track and two vertical tracks (and vice-versa), is common [49]. The specific tracks to which each track
connects is determined by the switch block pattern [7, 37] as well as the routing driver architecture. In
FPGA tiles
Length 2 wire segments
Length 4 wire segments
Figure 2.4: Routing segment lengths.
a multi-driver routing architecture (Figure 2.5a), a wire can be driven by multiple tri-state drivers at
multiple points along its length. In contrast, in a single-driver routing architecture (Figure 2.5b), a wire
can only be driven by a single multiplexer-based driver usually placed at one end of the wire. Figures
2.5a and 2.5b also show that logic block outputs connect to the routing tracks differently based on the
routing driver architecture. That is, multi-driver architectures connect logic block outputs directly to the
routing wires while single-driver architectures connect logic block outputs to the routing wires through
switch-block multiplexers. Although multi-driver routing architectures have been widely used in the
past [7, 2], single-driver routing has become the dominant routing architecture style in both academic
research [28, 27, 24] and commercial FPGAs [34, 33]. In this work, we focus on single-driver routing
architectures. In [28], Lemieux et al. found that FPGAs with single-driver routing had 9% lower delay
and were 25% smaller than FPGAs with multi-driver routing.
Connection block multiplexers connect multiple routing tracks to each logic block input (see Figure
2.5). The number of tracks that can connect to each logic block input is called the connection block
input flexibility (Fcin). Similarly, the number of routing wires that each logic block output can connect to
is given by the connection block output flexibility (Fcout). Reducing Fcin from W to 0.2W as the logic
cluster size increases from N = 1 to 20 and using an Fcout of W/N were found to be good choices in [7].
These interconnect flexibility values have generally been used as rules of thumb in subsequent FPGA
research.
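These rules of thumb translate directly into connection block multiplexer sizes; a hedged sketch (the function name and the example channel width are illustrative, not values from [7]):

```python
def connection_flexibility(W, N):
    """Rule-of-thumb connection block flexibilities following [7].

    Fcin ~ 0.2*W tracks per logic block input (for large clusters),
    Fcout ~ W/N routing wires reachable per logic block output.
    """
    fc_in = max(1, round(0.2 * W))
    fc_out = max(1, round(W / N))
    return fc_in, fc_out

# Example: a 320-track channel and N = 10 cluster (illustrative).
print(connection_flexibility(W=320, N=10))  # (64, 32)
```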
2.1.3 Commercial BLE Architectures
The BLEs of modern commercial FPGAs are much more complex than the commonly used academic BLE
described in Section 2.1.1 (Figure 2.2). Instead of a single K-LUT, some modern FPGA architectures
[33, 35, 51] use fracturable LUTs, which are LUTs that can be configured as one large LUT or multiple
smaller LUTs. For example, the Stratix V fracturable 6-LUT can be split into two 5-LUTs or four
4-LUTs provided that the functions being mapped to these LUTs meet certain constraints [35]. Modern
BLEs also commonly support configuring LUTs as memories (LUTRAM) or shift registers and usually
contain hard arithmetic carry logic [35, 52]. However, to keep the scope of this work tractable, we only
consider regular K-LUTs, which are still relevant, and we do not consider carry logic as current academic
CAD tools do not fully support this functionality.
The commonly used academic BLE shown in Figure 2.2 has a very limited ability to use both the
lookup table and flip-flop together. Modern commercial BLEs include additional 2:1 multiplexers to
allow the lookup table and flip-flop to be used in concert in many more ways [3, 52]. These extra
multiplexers are included in our designs and will be described in more detail in Section 3.2.
Chapter 2. Background 8
Figure 2.5: Multi-driver (a) and single-driver (b) routing architectures.
Figure 2.6: FPGA architecture assessment methodology with VPR.
2.2 FPGA Architecture Assessment Methodology
The quality of an FPGA in terms of area, performance and power consumption is a function of the
architectural parameters described in Section 2.1. These architecture parameters interact in complex
ways; hence determining the best choice for each parameter is a challenging task. Although there has
been some work towards developing analytical models to evaluate FPGA architectures [46, 26, 16],
the standard architecture assessment procedure used by both commercial FPGA manufacturers and
academic researchers is an experimental one that consists of implementing benchmark circuits on a
candidate architecture in order to evaluate its area, delay and power.
Figure 2.6 shows the standard academic CAD flow used to evaluate FPGA architectures [7]. The
CAD flow proceeds as follows. Benchmark circuits are first synthesized and mapped into lookup tables
(LUTs), flip-flops (FF) and hard-blocks (multipliers and block memories) based on a description of the
architecture. LUTs and FFs are then packed into clusters in a manner that attempts to keep related
LUTs and FFs in the same cluster such that connections between them can be routed through the logic
cluster’s fast local interconnect. Next, each cluster is placed into a specific logic block on the FPGA
that minimizes both the delay and the wire length of connections between logic clusters as much as
possible. Once all logic clusters have been placed, connections between logic blocks are routed through
the FPGA’s general purpose interconnect. The routing algorithm tries to minimize the benchmark
circuit’s critical path delay, while using the least amount of routing resources possible. Finally, timing
analysis is performed to determine the implemented benchmark circuit’s critical path delay and area is
calculated based on tile area and the number of logic blocks required by the placement.
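The structure of this flow can be sketched in a few lines. The stages below are trivial stand-ins with an invented cost model, not real CAD algorithms; only the pack-then-analyze structure of the flow comes from the text.

```python
# Toy sketch of the evaluation flow of Figure 2.6. Stages are invented
# stand-ins: real packing, placement and routing are far more involved.
def pack(primitives, n):
    """Greedily group LUT/FF primitives into clusters of at most n."""
    return [primitives[i:i + n] for i in range(0, len(primitives), n)]

def evaluate(primitives, n, tile_area, delay_per_hop):
    clusters = pack(primitives, n)
    # Stand-in "analysis": area scales with the logic blocks used, and the
    # critical path is modeled as crossing every cluster once.
    area = tile_area * len(clusters)
    critical_path_delay = delay_per_hop * len(clusters)
    return area, critical_path_delay

# 25 primitives packed into clusters of N = 10 -> 3 logic blocks.
area, delay = evaluate(list(range(25)), n=10, tile_area=5000.0,
                       delay_per_hop=1.0)
print(area, delay)  # 15000.0 3.0
```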
The packing, placement and routing phases of the flow of Figure 2.6 are performed by VPR [7].
Since many of the algorithms used by VPR are timing-based, the VPR architecture file must describe
Figure 2.7: Six transistor SRAM cell.
the delays through the lookup tables, routing multiplexers and any other circuitry that makes up the
FPGA. The delay of these circuits depends on the circuit topologies used, as well as the transistor sizing
of the FPGA circuitry. Consequently, evaluating an FPGA architecture requires first completing its
transistor-level design.
2.3 FPGA Circuit Design
As mentioned in Section 2.1, we only consider soft-logic-based FPGAs with single-driver routing architectures in this thesis. Soft-logic FPGA architectures consist entirely of SRAM cells, routing multiplexers,
lookup tables and flip-flops. This section describes commonly used circuit topologies and circuit design
practices for these structures.
2.3.1 SRAM cells
An FPGA typically contains millions of memory bits used to configure routing multiplexers and store
lookup table logic functions. Because there are so many of them, a key design goal for these memory
bits is small area. Stability is also important, as state flipping would cause problems such as incor-
rectly configured routing multiplexers. A six transistor SRAM cell (Figure 2.7) has been the standard
implementation in FPGA research [7] as it achieves both design goals reasonably well.
2.3.2 Routing Multiplexers
Routing multiplexers account for a large fraction of the area and delay of an FPGA. Consequently,
it is crucial to choose a circuit implementation that is as efficient as possible. There are a number
of approaches that can be taken to build a multiplexer but most commercial FPGAs and almost all
academic FPGA studies use an NMOS pass-transistor-based approach because each switch requires only
one transistor, minimizing area. Figure 2.8 shows three of the most commonly used pass-transistor
multiplexer topologies. Each multiplexer style possesses a different area-delay tradeoff that is a function
of the number of multiplexer inputs [27, 9]. For example, since it has just one pass-transistor on the
signal path, a 1-level multiplexer is generally faster than a 2-level multiplexer. But, for a large number
of inputs, a 1-level multiplexer requires more SRAM cells than a 2-level multiplexer and can thus have
larger area. Furthermore, if the number of inputs is large enough, a 1-level multiplexer could even
become slower than a 2-level multiplexer due to a greater number of transistors loading the output node.
Figure 2.8: Different 8:1 pass-transistor multiplexer topologies: (a) tree MUX, (b) 1-level MUX, (c) 2-level MUX.
Figure 2.9: Multiplexer followed by two-stage buffer with PMOS level-restorer.
It was shown in [9] that a 2-level multiplexer generally yields a lower area-delay product than a 1-level
or tree multiplexer. Commercial FPGAs also commonly use 2-level multiplexers [33].
Although they are beneficial in terms of area, pass-transistors have an important disadvantage: they
are incapable of passing a full logic-high voltage. That is, their output voltage saturates at approximately
VG − VTh where VG is the gate voltage and VTh is the threshold voltage of the transistor. In FPGA
circuitry, the output of a pass-transistor-based routing multiplexer is typically driven by a multi-stage
buffer [7, 30, 33]. Static power dissipation in these buffers caused by the reduced voltage swing of pass-
transistors has long been a cause for concern [7]. To mitigate this problem, gate boosting [7] (applying
a voltage larger than the supply voltage (VDD) on the pass-transistor gate) and PMOS level-restorers
[30, 33] have been used to help pull pass-transistor output voltages up to VDD. Figure 2.9 shows a
routing multiplexer followed by a two-stage buffer equipped with a PMOS level-restorer.
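The degraded logic-high and the effect of gate boosting can be illustrated with a quick numeric sketch. All voltage values below are illustrative assumptions, not figures from this thesis or from any particular process.

```python
# An NMOS pass-transistor's output high saturates near VG - VTh.
# Voltages are assumed example values (in volts).
def pass_transistor_vout_high(v_gate, v_th, v_in):
    """Highest voltage an NMOS pass-transistor can pass for input v_in."""
    return min(v_in, v_gate - v_th)

VDD, VTH = 0.8, 0.3
# Unboosted gate: the output is stuck roughly one threshold below VDD.
print(pass_transistor_vout_high(VDD, VTH, VDD))        # about 0.5 V
# Gate boosting (gate driven above VDD) recovers the full swing.
print(pass_transistor_vout_high(VDD + 0.3, VTH, VDD))  # about 0.8 V
```

The reduced swing in the first case is what leaves the first buffer stage partially on, causing the static power concern noted above.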
Figure 2.10: Fully encoded MUX tree 3-LUT.
2.3.3 Lookup Tables
Like routing multiplexers, lookup tables also use pass-transistor-based multiplexer circuitry, but the
multiplexer input and control connectivity is reversed. In a lookup table, SRAM cells connect to the
inputs of the multiplexer and hold the logic function's truth table, while the gates of the multiplexer
are controlled by the lookup table inputs. Consequently, lookup tables are generally implemented as
fully-encoded multiplexer trees, such that each level of the tree can be connected to a LUT input [7].
Figure 2.10 shows a 3-input fully encoded multiplexer tree lookup table followed by a two-stage buffer.
2.3.4 Flip-Flops
Flip-flops are generally implemented as standard master-slave registers [7]. However, some commercial
FPGAs use flip-flops that are more advanced. For example, Altera's Stratix V FPGAs use flip-flops
based on pulse latches and configurable pulse width generators to improve performance [35].
2.4 Modeling of FPGA Circuitry
Evaluating an FPGA architecture with the assessment methodology described in Section 2.2 requires that
we develop models that allow us to estimate the area and delay of FPGA circuitry because fabricating
an integrated circuit for each architecture to measure area and delay is obviously not practical. In this
section, we describe commonly used area and delay modeling approaches for FPGAs. These models are
also useful for transistor sizing, which we will discuss in Section 2.5.
Figure 2.11: Minimum-width transistor area model.
2.4.1 Area Modeling
Creating a complete layout is the best way to determine the exact area of an FPGA. However, this
process is much too time consuming when multiple designs need to be explored. A variety of different
approaches have been used to more quickly estimate area such as counting transistors or counting SRAM
cells, but the most widely used in FPGA research is the minimum-width transistor area model introduced
in [7]. In this model, layout area is expressed in units of minimum-width transistor areas. A minimum-
width transistor is defined as the smallest possible contactable transistor for a specific process technology
and one minimum-width transistor area is the area of this transistor plus the spacing to neighboring
transistors as shown in Figure 2.11. Unlike area models that simply count transistors or SRAM cells, the
minimum-width transistor area model provides an actual estimate of layout area. This is an important
distinction because as well as being more accurate, actual layout area estimates enable better estimates
of wire loads since wire lengths are layout dependent.
Transistors in FPGA circuitry often require more drive strength than that provided by a minimum-
width transistor. A transistor’s drive strength can be increased by either widening its diffusion region
(Figure 2.12b) or by adding parallel diffusion regions (Figure 2.12c). Consequently, increasing a transistor's drive strength increases its area. The widely-used area model of [7] estimates the layout area
of a transistor with drive strength x, in units of minimum-width transistor areas, with (2.2), which was
obtained by averaging the layout areas that result from either widening the diffusion region or adding
parallel diffusion regions to increase drive strength.
Area(x) = 0.5 + 0.5x (2.2)
Then, [7] calculates the area of an FPGA subcircuit by simply summing the areas of all the transistors
in that subcircuit. Note from (2.2) that doubling a transistor's drive strength does not double its area.
This is due to the fact that increasing a transistor’s drive strength only increases certain transistor
dimensions while others remain constant. For example, the spacing to neighboring transistors remains
the same regardless of a transistor’s drive strength.
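Eq. (2.2) and the per-subcircuit summation it feeds are simple enough to sketch directly; the drive strengths in the example are arbitrary.

```python
# Minimum-width transistor area model of [7], Eq. (2.2): a transistor of
# drive strength x occupies 0.5 + 0.5*x minimum-width transistor areas.
def transistor_area(x):
    """Layout area in minimum-width transistor areas, Eq. (2.2)."""
    return 0.5 + 0.5 * x

def subcircuit_area(drive_strengths):
    """Subcircuit area = sum of the areas of all its transistors."""
    return sum(transistor_area(x) for x in drive_strengths)

print(transistor_area(1))             # 1.0 (a minimum-width transistor)
print(transistor_area(2))             # 1.5 (doubling drive != doubling area)
print(subcircuit_area([1, 1, 4, 8]))  # 9.0
```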
Figure 2.12: Increasing drive strength with diffusion widening (b) or parallel diffusion regions (c). Note: Although not shown in the figure for simplicity, parallel diffusions must be connected together.
2.4.2 Delay Modeling
Time-domain circuit simulators such as HSPICE are generally the most accurate way to estimate the
delay of a circuit. However, time-domain simulation can be computationally intensive, making it impractical when a large number of delay measurements need to be obtained. For example, the timing
analysis phase of the architecture assessment flow described in Section 2.2 involves measuring delay for
the thousands of nets in a benchmark circuit; performing time-domain simulation for each one would lead
to prohibitively long runtimes. Instead, previous FPGA research work has typically modeled wires and
transistors as linear resistances and capacitances, such that a transistor-based circuit can be modeled as
an RC-tree network [22, 7, 24]. The delay of this network can then be estimated with the Elmore [15] or
the Penfield-Rubinstein [20] delay models, which are much quicker than time-domain simulations. With
the Elmore delay model, the delay T_D of a path is given by:

T_D = Σ_{i ∈ path} R_i · C(subtree_i)    (2.3)

where R_i is the equivalent resistance of element i along the path and C(subtree_i) is the total downstream capacitance rooted at element i.
An enhanced version of the Elmore delay model was proposed in [39]. Since it is more difficult to
model a buffer as a simple RC circuit due to the buffer’s intrinsic delay, [39] combines the Elmore delay
model with a common model of buffer delay where a buffer is modeled as a constant delay and a resistor.
This approach maps well to FPGA circuitry, which consists mostly of pass-transistors and buffers, and
was adopted as the delay model for VPR in [7]. With this model, the delay T_D of a path is given by:

T_D = Σ_{i ∈ path} (R_i · C(subtree_i) + T_buf,i)    (2.4)

where T_buf,i is the buffer's intrinsic delay if element i is a buffer, or 0 otherwise [7].
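This model reduces to a short computation over the elements of a path. The sketch below evaluates Eq. (2.4) on an invented two-element path; the resistance, capacitance and buffer-delay values are illustrative, not taken from this thesis.

```python
# VPR-style delay model of Eq. (2.4): for each element i on the path, add
# R_i times the total downstream capacitance C(subtree_i), plus the
# intrinsic delay T_buf,i if the element is a buffer.
def path_delay(path):
    """path: list of (R_i, C_subtree_i, T_buf_i) tuples."""
    return sum(r * c + t_buf for r, c, t_buf in path)

# Invented example: a pass-transistor (R = 2 kOhm) driving 10 fF, then a
# buffer (intrinsic delay 20 ps, output R = 1 kOhm) driving a 30 fF load.
path = [
    (2e3, 10e-15, 0.0),     # pass-transistor: Elmore term only (T_buf = 0)
    (1e3, 30e-15, 20e-12),  # buffer: Elmore term plus intrinsic delay
]
print(path_delay(path))  # about 70 ps
```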
2.5 Automated Transistor Sizing
Transistor sizing is a well-studied problem that consists of improving a circuit’s performance by increasing
the sizes of its transistors and thus provides yet another level, in addition to architecture and circuit
design, at which the area and delay characteristics of an FPGA can be adjusted. The transistor sizing
optimization problem is usually formulated in one of three ways:
1. Minimize some function of area and delay.
2. Minimize area subject to a delay constraint.
3. Minimize delay subject to an area constraint.
There has been much prior work on automated transistor sizing for custom circuitry. Fishburn and
Dunlop showed in [17] that modeling transistors as linear resistances and capacitances and calculating the
delay of the resulting RC circuits with the Elmore [15] or the Penfield-Rubinstein [20] delay model (i.e.
(2.3)) allows the transistor sizing problem to be formulated as a convex optimization problem, which
guarantees that any local minimum is the global minimum. With this useful property, [17] develops
TILOS, a transistor sizing tool for custom circuits based on a heuristic method that iteratively identifies
a circuit’s critical path and increases transistor sizes on that path until all timing constraints are met.
Despite the convexity of the problem, TILOS’s heuristic is such that it can terminate with a sub-
optimal solution [45]. Algorithms guaranteeing the optimal solution through convex optimization [44]
or mathematical relaxation techniques [10, 47] have subsequently been proposed but these algorithms,
along with TILOS, all suffer from their reliance on linear device models and the Elmore delay, which have
long been known to be inaccurate [40, 21]. To enhance accuracy, at the cost of increased computational
complexity, some transistor sizing algorithms have used time-domain simulation to obtain delay estimates
[14, 13].
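The TILOS-style critical-path heuristic can be sketched as a small loop. The one-line delay model below (stage delay inversely proportional to transistor size) and all the numbers are invented for illustration; TILOS itself uses sensitivity-based sizing on real RC models and is considerably more sophisticated.

```python
# Toy sizing loop in the spirit of TILOS [17]: repeatedly recompute stage
# delays, find the slowest stage, and upsize the transistor driving it
# until the path meets its timing constraint.
def tilos_like_sizing(intrinsic_delays, t_constraint, step=1.5, max_iters=100):
    sizes = [1.0] * len(intrinsic_delays)
    for _ in range(max_iters):
        delays = [d / s for d, s in zip(intrinsic_delays, sizes)]
        if sum(delays) <= t_constraint:   # timing constraint met
            return sizes
        worst = max(range(len(delays)), key=delays.__getitem__)
        sizes[worst] *= step              # upsize the critical stage
    return sizes

sizes = tilos_like_sizing([4.0, 2.0, 1.0], t_constraint=4.0)
print(sizes)  # the slowest stage receives the most upsizing
```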
The programmability of FPGAs adds unique features to the transistor sizing problem which makes
FPGA-specific transistor sizing tools valuable. Kuon and Rose proposed such a tool in [24]. Their FPGA
transistor sizing approach differs from transistor sizing algorithms for custom circuits because it deals
with two features unique to FPGAs. The first of these features is repetition. As described in
Section 2.1, an FPGA consists of an array of tiles. Since these tiles are all identical, transistor-level design
only needs to be performed for one of them. This design can then be replicated to obtain a complete
FPGA. Similar design space reductions can be found within a tile. For example, a switch block can
include over 100 logically equivalent multiplexers whose transistor-level design should be kept identical.
Consequently, only ∼80 unique transistors need to be sized when designing an FPGA’s soft-logic despite
there being billions of transistors on the chip, which is in contrast to transistor sizing for custom circuits
where the whole chip must be considered. This reduced design space makes HSPICE-based optimization
practical for FPGAs, but as we show in Section 3.7, we must still search this space intelligently to keep
runtime reasonable.
The second feature unique to FPGA transistor sizing is the undefined critical path. Because they are
programmable, FPGAs have application-dependent critical paths which implies that at design time, there
is no clear critical path to optimize for delay. To deal with this issue, [24] optimizes a representative
path that contains one of each type of FPGA subcircuit (LUTs, MUXes, etc.). Delay is taken as a
weighted sum of the delay of each subcircuit and the weighting scheme is chosen based on the frequency
with which each subcircuit was found on the critical paths of placed and routed benchmark circuits.
Optimizing a representative critical path still presents a huge design space which Kuon and Rose tackle
with a two-phased algorithm that consists of an exploratory phase that utilizes linear device models and
a TILOS-like transistor sizing heuristic to keep CPU times reasonable, followed by an HSPICE-based
fine-tuning phase that adjusts the transistor sizes to account for the inaccuracies of linear models.
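The weighted-sum delay metric of [24] amounts to the following short computation; the subcircuit names, delays and weights below are invented placeholders, not the weights used in that work.

```python
# Representative-path delay in the style of [24]: a weighted sum of
# subcircuit delays, with weights reflecting how often each subcircuit
# type appears on the critical paths of placed-and-routed benchmarks.
def representative_delay(delays, weights):
    return sum(weights[name] * d for name, d in delays.items())

# Invented example values (seconds and dimensionless weights).
delays = {"lut": 150e-12, "local_mux": 60e-12, "sb_mux": 90e-12}
weights = {"lut": 0.4, "local_mux": 0.2, "sb_mux": 0.4}
print(representative_delay(delays, weights))  # weighted path delay
```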
In [46], Smith et al. present a method that enables the rapid and concurrent optimization of high-
level architecture parameters and transistor sizes for FPGAs through the use of analytic architecture
models, linear device models and a convex optimization-based transistor sizing algorithm. They show
that this concurrent optimization can have a significant impact on architectural conclusions versus a
separate optimization.
Chapter 3

COFFE: Automated Optimization of FPGA Circuitry
When developing a new chip, FPGA architects are faced with two main tasks: choosing an architecture
for their FPGA and performing the transistor-level design of that architecture. As described in Section
2.2, choosing an architecture is typically accomplished experimentally with architecture exploration tools
such as VPR [7]. By implementing benchmark circuits on a proposed FPGA, these tools allow architects
to evaluate the area, delay and power impact of various architectural choices. Then, based on their
observations, architects can select an FPGA architecture that meets their design goals and constraints.
Transistor-level design consists of selecting circuit topologies for the various subcircuits that im-
plement the chosen architecture, as well as sizing the transistors of those subcircuits. Transistor-level
design is an essential precursor to the evaluation of an architecture because it provides accurate area,
delay and power estimates of the underlying FPGA circuitry; these estimates are required inputs to
the architecture exploration tools. Transistor sizing also provides an additional opportunity to tune the
area, delay and power of an FPGA. Therefore, developing a new FPGA is an iterative process that
involves performing the transistor-level design of various architectures before evaluating them through
synthesis, placement and routing experiments. This interdependence between architecture exploration
and transistor-level design necessitates automated design tools if high-quality results are to be obtained
in reasonable amounts of time.
In this chapter, we describe COFFE (Circuit Optimization For FPGA Exploration), a fully-automated
transistor sizing tool for FPGAs that we developed as part of this thesis. COFFE enables the design
flow detailed above by providing area, delay and power estimates of properly sized FPGA circuitry.
COFFE also enables design exploration of FPGA circuitry and we will use COFFE in such a capacity
in Chapter 4. Although COFFE solves the same problem as Kuon and Rose’s FPGA transistor sizing
tool [24] (see Section 2.5), we have made significant improvements which are necessary for FPGAs in
advanced process nodes; these improvements will be described in the following sections.
3.1 Introduction to COFFE
Figure 3.1 shows the FPGA design flow we wish to enable with COFFE. COFFE is used to perform
transistor-level optimization for some architecture of interest, thus producing accurate area and delay
Figure 3.1: FPGA design flow.
estimates for the subcircuits of this architecture (LUTs, routing multiplexers, etc.). These estimates are
used by VPR to evaluate the architecture through place and route experiments. Based on the results
of the assessment, the architecture parameters are adjusted and sent back to COFFE to begin a new
iteration of optimization and evaluation.
COFFE’s circuit optimizer makes area and performance trade-offs through transistor sizing. Like
[24], COFFE’s optimization objective is of the form AreabDelayc thus allowing for different area and
performance tradeoffs by varying b and c. Creating a complete layout is the most accurate way to obtain
the area and delay measurements needed during transistor sizing. However, for the iterative design flow
of Figure 3.1, this approach is impractical as layout is a very time consuming task. Instead, COFFE
estimates area with an improved version of the minimum-width transistor area model (see Section 3.4)
and measures delay with HSPICE simulations. Although previous FPGA transistor sizing tools have
used linearized models of transistors to measure delay during certain phases of the optimization, we
show in Section 3.5 that such models are highly inaccurate for the fine-grained transistor-level design we
wish to undertake in advanced process nodes such as the 22nm process we use in this work.
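The Area^b · Delay^c objective can be sketched as a simple cost function. The two design points below are invented to show how the exponents steer the trade-off; they are not results from this work.

```python
# COFFE-style optimization objective: cost = Area^b * Delay^c.
# Varying b and c trades off area against performance.
def cost(area, delay, b=1.0, c=1.0):
    return (area ** b) * (delay ** c)

a1, d1 = 1000.0, 1.0   # smaller but slower design point (arbitrary units)
a2, d2 = 1300.0, 0.8   # larger but faster design point

print(cost(a1, d1) < cost(a2, d2))            # True: area-delay product
                                              # favors the smaller design
print(cost(a1, d1, c=4) < cost(a2, d2, c=4))  # False: weighting delay
                                              # heavily favors the faster one
```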
COFFE automatically generates the SPICE netlists required for delay measurement based on the
input architecture parameters and the circuit topologies described in Sections 3.2 and 3.3 respectively.
These netlists are parametrized such that COFFE’s circuit optimizer can change the transistor sizes
by simply changing a transistor size parameter list. To obtain meaningful delays, COFFE is careful to
ensure that these netlists include realistic transistor and wire loading. Transistor loads are relatively
easy to determine based on architectural parameters and circuit topologies. Wire loads, on the other
hand, are layout dependent making them more difficult to determine since the exact layout is not known.
COFFE estimates wire loads with the model described in Section 3.6.
3.2 Architecture
Figure 3.2 shows the tile architecture that COFFE supports in its designs and Table 3.1 lists the archi-
tecture parameters that COFFE expects as inputs. Parameters listed in the top portion of Table 3.1
Figure 3.2: COFFE's supported tile architecture.
are commonly used in FPGA research and were described in Sections 2.1.1 and 2.1.2. COFFE supports
routing wires of any length (L) but currently, all routing wires in a channel must be of the same length.
That is, architectures with multiple wire segment lengths, such as an architecture with both L = 4 and
L = 8 wire segments, are not supported. Note that COFFE uses directional, single-driver routing wires.
Parameters listed in the bottom portion of Table 3.1 are new and help COFFE describe a more flexible
BLE than the commonly used academic BLE shown in Figure 2.2. The COFFE BLE still consists of
a K-input LUT and FF but supports optionally including additional 2-input multiplexers to allow the
LUT and FF to be used simultaneously in many more ways. These extra 2:1 MUXes can potentially
help improve density and speed and are similar to the ones used in Stratix [34]. The new multiplexers
and their use are described below.
1. Register feedback multiplexer. An FF output driving a LUT input is a common occurrence in placed
and routed benchmark circuits. With the BLE of Figure 2.2, this requires driving the FF output
onto the cluster’s local interconnect and connecting this signal to a LUT input in another BLE.
Including a register feedback multiplexer (MUX-A in Figure 3.2) on a LUT input allows the FF
output to directly drive a LUT input in the same BLE, which saves routing resources in addition
to providing a faster connection. COFFE’s Rfb parameter allows optionally including a register
feedback multiplexer on a LUT input. Each LUT input has an Rfb parameter.
2. FF input selection multiplexer. The register feedback multiplexer (MUX-A) is made more useful
by also including a FF input selection multiplexer (MUX-B). With this multiplexer, the FF can
accept its input either from the LUT output or from a BLE input. For the “FF output driving
a LUT input” example described previously, MUX-B enables first registering a BLE input signal
before connecting it to a LUT input via MUX-A, all in the same BLE. The FF input selection
Table 3.1: COFFE's expected input architecture parameters.

Parameter   Description
K           LUT size
N           Cluster size
I           Number of cluster inputs
Fcin        Cluster input connection flexibility
Fcout       Cluster output connection flexibility
W           Routing channel width
L           Wire segment length
FS          Switch block flexibility
Fclocal     Local interconnect to BLE input connection flexibility

Rfb         Register feedback per LUT input (on/off)
Rsel        Register input select (LUT only / LUT and BLE input)
Ofb         Number of local feedback outputs per BLE
Or          Number of general routing outputs per BLE
multiplexer is also useful for another reason: it enables the use of the LUT and FF of the same BLE
for two completely unrelated functions. COFFE’s Rsel parameter allows the optional inclusion of
the FF input selection multiplexer.
3. BLE output multiplexers. Using the LUT and FF of a BLE for two unrelated functions requires
that we can drive both the LUT output and FF output onto the local interconnect or general
routing at the same time. Prior academic work has assumed one output per BLE, but COFFE
adopts a more general model with BLEs that support variable numbers of local feedback and
general routing outputs, as specified by the Ofb and Or parameters respectively. All BLE outputs
can be driven by either the LUT output or the FF output (MUX-C and MUX-D).
COFFE uses a two-sided architecture, which means logic block inputs and outputs can only access
the two routing channels (one vertical and one horizontal) which run over top of the tile, as shown in
Figure 2.1. Implementing the routing wires over the logic in this way leads to a more efficient layout
and is the common commercial practice. Four-sided architectures (capable of accessing 2 vertical and 2
horizontal channels) have often been assumed in prior work but are less realistic since such architectures
would be difficult to lay out when the routing wires run over the logic. We have performed VPR
experiments with the architecture described in Section 3.8.1 (N = 10, K = 6) and found that using
the more realistic two-sided architecture results in a 3-4% critical path delay increase and 8-9% routed
wire length increase over a four-sided architecture. Note from Figure 3.2 that COFFE does not include
track buffers on connection block multiplexer inputs, which have often been used in academic work [7],
because these buffers are difficult to lay out and are not used in modern commercial architectures.
3.3 Circuit Topologies
COFFE uses a two-level multiplexer topology as shown in Figure 3.3a for all multiplexers except for the
2:1 multiplexers inside the BLE, which are implemented with a single level and a shared SRAM bit as
shown in Figure 3.3b. An important parameter in the design of two-level multiplexers is the size of each
level. If S1 and S2 are the sizes of the first and second levels respectively, any combination of S1 and S2
Figure 3.3: COFFE's multiplexer circuit topologies: (a) routing multiplexer topology, (b) 2:1 multiplexer topology.
such that S1 × S2 = MUXsize is a possible MUX topology. Since SRAM cells typically occupy 35-40%
of tile area (as shown in Section 4.2.6), we choose a multiplexer topology that minimizes the number
of SRAM cells required by having S1 ≈ S2. The output of each multiplexer is driven by a two-stage
buffer, enabling it to drive the frequently large downstream capacitance. Note from Figure 3.3 that
COFFE includes PMOS level restorers to help pull degraded pass-transistor outputs to VDD.
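The S1 ≈ S2 choice can be illustrated with a small search over the possible splits. This is a sketch of the selection rule only, not COFFE's actual code.

```python
import math

# Any split with S1 * S2 >= mux_size is a valid two-level topology;
# S1 ~ S2 ~ sqrt(mux_size) minimizes the S1 + S2 SRAM cells needed
# (versus mux_size cells for a one-hot 1-level multiplexer).
def two_level_split(mux_size):
    """Return (S1, S2) with S1*S2 >= mux_size minimizing S1 + S2."""
    best = None
    for s1 in range(1, mux_size + 1):
        s2 = math.ceil(mux_size / s1)
        if best is None or s1 + s2 < sum(best):
            best = (s1, s2)
    return best

print(two_level_split(16))  # (4, 4): 8 SRAM cells vs 16 for a 1-level mux
```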
COFFE implements LUTs as fully encoded multiplexer trees. Since the delay of a chain of pass-transistors increases quadratically with chain length, COFFE supports internal re-buffering along the chain.
Figure 3.4 shows a portion of a 6-LUT with internal re-buffering after 3 levels of pass-transistors. We
also experimented with an internal re-buffering topology that consisted of placing two separate inverters
along the pass-transistor chain: one after 2 pass-transistors and one after 4 pass-transistors. However,
due to the degraded output of pass-transistors, this topology requires a PMOS level-restorer at each in-
verter and skewed P/N ratios, making the two-stage buffer topology shown in Figure 3.4 more efficient.
The architectures that we use in this work are all 6-LUT architectures and are based on the circuit topology shown in Figure 3.4. Isolation inverters between the SRAM cells and the pass-transistor tree are
also included in COFFE’s LUT topologies. These inverters improve robustness by isolating the SRAM
cells from the constantly switching pass-transistor tree, thus preventing capacitive noise injection, which
could upset an SRAM cell. As well, the isolation inverters improve speed since they are larger than the
inverters inside the SRAM cell and can better drive a signal down the pass-transistor tree. Each level of
the pass-transistor tree is controlled by a LUT input driver. Each LUT input driver will have to drive
a different load and so COFFE sizes each one distinctly.
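The quadratic growth that motivates re-buffering follows directly from the Elmore model of Section 2.4.2: an n-stage chain of identical RC sections has delay R·C·n(n+1)/2. The resistance, capacitance and buffer-delay values below are invented to make the comparison concrete.

```python
# Elmore delay of a chain of n identical RC stages grows quadratically
# with n, so splitting a 6-deep pass-transistor chain into two 3-deep
# chains joined by a buffer can reduce delay.
def rc_chain_delay(n, r, c):
    """Elmore delay of n identical RC stages: sum_{i=1..n} i*R*C."""
    return r * c * n * (n + 1) / 2

R, C, T_BUF = 2e3, 2e-15, 10e-12          # assumed example values
unbuffered = rc_chain_delay(6, R, C)       # one 6-deep chain
rebuffered = 2 * rc_chain_delay(3, R, C) + T_BUF  # two 3-deep chains + buffer
print(unbuffered > rebuffered)  # True for these values
```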
Finally, similar to [7], COFFE implements flip-flops as a static transmission gate-based master-slave
register as shown in Figure 3.5. As we will show in Section 4.2.6, the impact of the flip-flops on critical
path delay and tile area is relatively small. The former is due to the fact that a timing path can consist
of at most two flip-flops: one at the beginning and one at the end. The latter is due to the fact that flip-
flop circuitry is inherently small compared to lookup tables and far outnumbered by routing multiplexer
circuitry. Consequently, we did not explore different flip-flop implementations.
Chapter 3. COFFE: Automated Optimization of FPGA Circuitry 22
Figure 3.4: Fully encoded MUX tree 6-LUT with internal re-buffering (partial view).
Figure 3.5: Static transmission gate-based master-slave register.
3.4 Area Modeling
The most accurate way to determine the area of an FPGA is to create a complete layout. However,
as described in Section 3.1, designing an FPGA is an iterative process, making layout of each iteration
impractical. FPGAs are generally transistor area limited [7]. Hence, a fast-to-compute but accurate
estimate of transistor layout area is needed. In this work, we use the minimum-width transistor area
model described in Section 2.4.1 but we make a number of improvements to enhance its accuracy in
advanced process nodes.
Figure 3.6 shows that (2.2) over-predicts transistor area by as much as 143% compared to our manual
layouts with TSMC’s 65nm layout rules (which were the most advanced layout rules to which we had
access). In [24], the constants in (2.2) were adjusted to match more advanced process rules but its
area estimates for our 65nm layouts are still inaccurate (Figure 3.6). Consequently, we develop a new
version of the minimum-width transistor area model whose accuracy is improved in several ways. First,
we assume reasonably square layouts. Recall from Section 2.4.1 that, to obtain (2.2), [7] averages the
Figure 3.6: Transistor area prediction accuracy of original (Eq. 2.2) and improved (Eq. 3.1) area models against TSMC 65nm layouts.
layout areas that result from either widening the diffusion region or adding parallel diffusion regions to
increase drive strength. For large transistors, however, both approaches yield layouts with very high
aspect-ratios (see Figures 3.7a and 3.7b). We found that smaller area can be obtained by keeping a
reasonably square transistor layout, which is accomplished by combining both diffusion widening and
parallel diffusion regions to increase a transistor’s drive strength as in Figure 3.7c. Therefore, our manual
layouts in Figure 3.6 use square layouts.
Second, we develop a new transistor area equation tailored towards more advanced process tech-
nologies by using a least-square fit of our 65nm layout areas versus drive-strengths to obtain area as a
function of drive-strength.
Area(x) = 0.447 + 0.128x + 0.391√x    (3.1)
Figure 3.6 shows that (3.1) predicts transistor area with much more accuracy than (2.2). We make
two further enhancements to the model to better estimate the layout density of different structures. The
area model described thus far does not account for the fact that in a design with both NMOS and PMOS
transistors, extra spacing is required for N-wells. It would be pessimistic to assume that each PMOS
transistor is in a separate well as the amount of N-well spacing required can be reduced by placing
multiple PMOS transistors in the same well. Although it is difficult to predict how much well-sharing
is possible in a given layout, our sample layouts suggest that well sharing can reduce the per-transistor
well spacing required by approximately 75% (see Appendix A for sample layout). With this estimate,
we derive the following equation to calculate the area of transistors requiring N-well spacing.
Area(x) = 0.518 + 0.127x + 0.428√x    (3.2)
COFFE calculates the area of NMOS pass-transistors with (3.1) and the area of CMOS transistors
(a) 15× minimum drive strength achieved through diffusion widening.
(b) 15× minimum drive strength achieved through parallel diffusion regions.
(c) 15× minimum drive strength achieved through both diffusion widening and parallel diffusion regions.
Figure 3.7: Combining diffusion widening and parallel diffusion regions yields denser layouts (c).
Figure 3.8: A switch-level model.
(e.g. inverters and transmission gates) with (3.2). We find that accounting for N-well spacing increases
our tile area estimates by ∼2% for a pass-transistor based FPGA.
Finally, despite the fact that 6 small transistors are required per SRAM cell, we use an area of 4
minimum-width transistors because a denser, more optimized layout is typical for such a frequently used
cell.
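The complete area model of this section can be condensed into a short estimator. The function below is a sketch reproducing (3.1), (3.2) and the 4-transistor SRAM rule; the `needs_nwell` flag is our own shorthand for COFFE's distinction between NMOS-only and CMOS structures:

```python
import math

def transistor_area(drive_strength, needs_nwell=False):
    """Area, in minimum-width transistor areas, of a transistor of the
    given drive strength (multiples of a minimum-width device).

    Uses Eq. (3.1) for NMOS-only structures such as pass-transistors,
    and Eq. (3.2) for CMOS structures that require N-well spacing
    (e.g. inverters and transmission gates)."""
    x = drive_strength
    if needs_nwell:
        return 0.518 + 0.127 * x + 0.428 * math.sqrt(x)
    return 0.447 + 0.128 * x + 0.391 * math.sqrt(x)

# A 6T SRAM cell is counted as only 4 minimum-width transistor areas
# because such a frequently used cell gets a denser custom layout.
SRAM_CELL_AREA = 4.0
```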
3.5 Delay Modeling
We define switch-level modeling as the characterization of complex, non-linear MOSFET transistors into
a set of linear resistances and capacitances (Figure 3.8). Although they are less accurate at modeling
transistor behavior than circuit simulators like SPICE, switch-level models are often used for delay
estimation [40, 17, 24, 46] because the delay of the equivalent RC circuits can be computed with the
Elmore [15] or the Penfield-Rubinstein [20] delay models which are much quicker than the time-domain
simulations required to measure delay with SPICE. In addition, [17] showed that when transistors are
treated as linear resistances and capacitances, the transistor sizing problem can be formulated as a
convex optimization problem, thus guaranteeing that a local minimum is always the global minimum.
The resistive and capacitive behavior of a transistor is influenced by a variety of factors such as
its operating-point, its size and the shape of the input waveform. Therefore, switch-level models are
most accurate when estimating delay for circuits that exhibit a high degree of regularity (e.g. a circuit
composed of a few basic gates with a limited number of P/N ratios) because many transistors will
experience similar operating conditions. Different resistance and capacitance values (Req, Cgate and
Cdiff ) can be used for each group of transistors experiencing similar operating conditions to construct
a reasonably accurate switch-level model.
FPGA circuit design consists of custom, fine-grained transistor-level design which can lead to a large
variation in transistor operating conditions. Using PTM 22nm HP device models [42] and HSPICE, we
experimented with switch-level modeling for some of the circuit topologies commonly used in FPGAs.
In the following sections, we highlight some of the reasons why we found that switch-level models were
not suitable for our purposes.
3.5.1 Non-Linearity of Transistor Resistance and Capacitance
We use a chain of five loaded inverters (Figure 3.9a) to find the equivalent switching resistance for the
NMOS and PMOS of an inverter. Using a large Cload to minimize the effects of transistor capacitances,
Figure 3.9: Circuits used to measure transistor resistance.
we simulate this circuit with HSPICE for several transistor widths and measure the rise and fall times of
the third inverter in the chain (to avoid end-effects). The rise time, trise, is measured as the time it takes
for the inverter output to rise from 0V to VDD/2 and the fall time, tfall, the time it takes for the output
to fall from VDD to VDD/2. Delay measurement in both cases starts when the input of the inverter is
at VDD/2. With the rise and fall times, NMOS and PMOS switching resistances can be computed as
RN = tfall/Cload and RP = trise/Cload. As shown in Figure 3.10, our experiments show that transistor
resistance varies with transistor width, particularly for smaller transistors. We found the same to be
true for transistor capacitance. This implies that transistor resistance and capacitance are non-linear
functions of transistor width and as a result, an accurate switch-level model would require a table of
pre-computed resistances and capacitances for many different transistor widths.
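The resistance extraction itself is straightforward once the transition times are known: it applies RN = tfall/Cload and RP = trise/Cload. The numeric values in the example below are illustrative stand-ins for HSPICE measurements:

```python
def switching_resistances(t_rise, t_fall, c_load):
    """Equivalent switching resistances from measured transition times.

    With a load capacitance large enough to dominate the node,
    R_P = t_rise / C_load and R_N = t_fall / C_load, where both times
    are measured against a VDD/2 threshold as described above."""
    return t_rise / c_load, t_fall / c_load

# Illustrative numbers only: a 3.8 ns fall time into a 1 pF load
# corresponds to an NMOS switching resistance of 3.8 kOhm.
r_p, r_n = switching_resistances(t_rise=8.0e-9, t_fall=3.8e-9, c_load=1.0e-12)
```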
3.5.2 Topology Dependence of Transistor Resistance
The switching resistance of an NMOS pass-transistor, a key building block for FPGA circuitry, is different
than that of an NMOS in an inverter. Furthermore, the resistance of a pass-transistor is different
during rising and falling transitions due to the NMOS’s inability to propagate a full rising transition.
Using HSPICE simulations, we measure the resistances of an NMOS pass-transistor by charging and
discharging a large capacitor through a single pass-transistor (Figures 3.9b and 3.9c). Again, trise and
tfall are measured from VDD/2.
In Table 3.2, we compare the rising and falling resistance of a pass-transistor to the resistance of
an NMOS in an inverter for a 4× minimum-width transistor. We can clearly see that the resistance
of the NMOS in the inverter (3.8k) is different from the rising (13.7k) and falling (1.9k) resistances of
a pass-transistor. The very large rising resistance is caused by the pass-transistor’s degraded output
voltage. It is possible to achieve more balanced rising and falling resistances by measuring trise and
Figure 3.10: Inverter NMOS and PMOS resistivity vs. transistor width.
Table 3.2: Resistance of a 4× minimum-width NMOS transistor for different circuit topologies (Figure 3.9) and switching-thresholds.

Circuit Topology           Transition Type  Switching-Threshold  Resistance (kΩ)
Chain of 5 inverters       fall             VDD/2                3.8
Single pass-transistor     fall             VDD/2                1.9
Single pass-transistor     rise             VDD/2                13.7
Single pass-transistor     fall             VDD/3                2.7
Single pass-transistor     rise             VDD/3                2.2
2 series pass-transistors  fall             VDD/3                2.8
2 series pass-transistors  rise             VDD/3                3.3
tfall at VDD/3 instead of VDD/2, which, in terms of circuit design, corresponds to lowering the switching-
thresholds of downstream inverters by skewing their P/N ratios. As shown in Table 3.2, at VDD/3 the
rising and falling resistances of a pass-transistor are 2.2k and 2.7k respectively. Table 3.2 also shows
that the resistance of an NMOS in a chain of 2 series connected pass-transistors (Figures 3.9d and 3.9e)
is different from both the single pass-transistor and the inverter.
The results of Table 3.2 demonstrate that the custom pass-transistor based topologies of FPGA
circuitry do not lend themselves well to switch-level modeling. Not only does resistance depend on
circuit topology, it also depends on the switching-threshold of downstream inverters and on transistor
dimensions (Section 3.5.1). The complexity of a switch-level model sufficiently accurate for the type
of fine-grained transistor level design we wish to undertake is impractical so we rely solely on circuit
simulation to estimate delay.
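Table 3.2 can be read as the seed of what a switch-level model for this circuitry would require: a lookup table keyed by topology, transition direction and downstream switching-threshold, repeated (per Section 3.5.1) for every transistor width of interest. A minimal sketch using the 4× minimum-width values from the table:

```python
# Switching resistance (kOhm) of a 4x minimum-width NMOS, taken from
# Table 3.2. Keys are (topology, transition, switching threshold). A
# usable switch-level model would need such a table for every transistor
# width as well, which is why we abandon this approach in favour of
# direct circuit simulation.
NMOS_4X_RESISTANCE_KOHM = {
    ("inverter_chain", "fall", "VDD/2"): 3.8,
    ("single_pass_transistor", "fall", "VDD/2"): 1.9,
    ("single_pass_transistor", "rise", "VDD/2"): 13.7,
    ("single_pass_transistor", "fall", "VDD/3"): 2.7,
    ("single_pass_transistor", "rise", "VDD/3"): 2.2,
    ("two_series_pass_transistors", "fall", "VDD/3"): 2.8,
    ("two_series_pass_transistors", "rise", "VDD/3"): 3.3,
}
```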
3.6 Wire Load Modeling
Past FPGA transistor sizing efforts have often only accounted for the loading effects of long wires such
as the general routing wires or the cluster local interconnect wires (i.e. the cluster input wires and local
feedback wires of Figure 3.2). In reality, an FPGA contains much more metal wiring. Ignoring this extra
metal is increasingly problematic, as the delay impact of wires is becoming ever more important with
shrinking feature sizes [18]. Accordingly, COFFE models all wire loading, even including the relatively
short metal connecting two transistors inside a multiplexer. COFFE estimates wire lengths based on
area estimates obtained with the model of Section 3.4 along with the following set of general layout
assumptions.
1. The layout of a sub-block (e.g. a multiplexer, a BLE, a logic block, etc.) is assumed to be square
such that its width is equal to its height.
2. The length of a wire that broadcasts a signal across a sub-block is equal to the width (which equals
the height) of that sub-block.
3. The length of a point-to-point wire between two sub-blocks is equal to 1/4 the sum of the width
of both sub-blocks.
For example, cluster local interconnect wires are broadcast wires with a single source and many
destinations so they span the height of a logic block. Wires that connect two inverters together inside a
buffer are point-to-point wires; they span 1/4 the width of each inverter.
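Under the layout assumptions above, wire length estimation reduces to simple geometry on the area estimates of Section 3.4. A sketch (the helper names below are ours, not COFFE's):

```python
import math

def broadcast_wire_length(block_area):
    """A broadcast wire spans the full width of its sub-block, and a
    sub-block is assumed square, so its width is sqrt(area)."""
    return math.sqrt(block_area)

def point_to_point_wire_length(area_a, area_b):
    """A point-to-point wire spans 1/4 of the summed widths of the two
    sub-blocks it connects."""
    return 0.25 * (math.sqrt(area_a) + math.sqrt(area_b))
```

For example, a broadcast wire across a 16 µm² logic block would be estimated at 4 µm, while a point-to-point wire between two such blocks would span 2 µm.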
The resistance and capacitance of a wire are obtained from its length estimate, as well as its metal
layer. The resistance and capacitance values of each metal layer are inputs to COFFE. COFFE imple-
ments all wires in the same metal layer, with the exception of general routing wires, which are placed in
a separate metal layer. Placing the general routing wires on their own metal layer allows a wider wire
pitch to be used when computing their resistance and capacitance; these long wires benefit from the
resulting lower resistance. To speed up SPICE simulation, rather
than modeling a wire as a distributed RC transmission line, COFFE includes the equivalent π-model in
the generated SPICE netlists.
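Converting a length estimate into netlist elements is then a matter of scaling the per-unit-length values of the chosen metal layer and splitting the capacitance between the two ends of a π-model. A sketch, using the intermediate-layer values of Table 3.6:

```python
def wire_pi_model(length_um, r_per_um, c_per_um):
    """Lumped pi-model of a wire: the full wire resistance in the
    middle, with half the total capacitance at each end. This replaces
    a distributed RC line to speed up SPICE simulation."""
    r = length_um * r_per_um
    c = length_um * c_per_um
    return {"R": r, "C_near": c / 2, "C_far": c / 2}

# A 10 um wire on the intermediate layer (54.825 Ohm/um, 0.175 fF/um):
pi = wire_pi_model(10.0, r_per_um=54.825, c_per_um=0.175)
```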
3.7 Transistor Sizing Algorithm
When transistors are treated as linear resistances and capacitances, the transistor sizing problem can be
formulated as a convex optimization problem [17]. Such a formulation has the highly useful property that
there is only one minimum: the global minimum. Past transistor sizing algorithms have exploited this
fact by either making a series of local optimizations in hopes of eventually reaching the global minimum
[17, 24] or by making use of mathematical programming techniques [44, 10, 47, 46]. In Section 3.5, we
showed that it is very difficult to obtain linear models of transistors that are sufficiently accurate for
the fine-grained transistor-level design of FPGA circuitry in advanced process nodes. Instead, we chose
to use HSPICE simulations to measure delay, which produces more accurate delay estimates, but also
makes the shape of the optimization space more ambiguous. Therefore, COFFE takes a more exhaustive
optimization approach and searches for a minimal cost solution by simulating many possible transistor
sizing combinations over a range of transistor sizes. Exhaustively searching the entire optimization space
in this way would lead to prohibitively long runtimes: there are ∼80 unique transistors to size in one
FPGA tile, and sweeping each transistor over ∼10 sizes would require ∼10^80 HSPICE simulations.
COFFE uses two techniques to confront this problem: divide-and-conquer and pre-determined P/N
ratios.
3.7.1 Divide-and-Conquer
COFFE reduces the transistor sizing combinations to examine by splitting the FPGA into loosely coupled
subcircuits S1, S2, ..., SN and sizing each Sj individually. This divide-and-conquer approach reduces
the search space but requires iteration to account for changes in loading. More specifically, since subcir-
cuits are usually loaded by other subcircuits, changing the transistor sizes of one subcircuit will change
the load on another. Because of this, COFFE performs multiple FPGA sizing iterations in which it sizes
each subcircuit Sj once with the loading coming from the last sizing of the other subcircuits. FPGA
sizing iterations are performed until no reduction in cost is achieved (implying loading has stabilized)
or until a maximum number of iterations have been completed. In our experience, COFFE finds a
transistor sizing solution after 2-4 iterations.
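The FPGA sizing iteration loop can be sketched as follows; `size_subcircuit` and `total_cost` are hypothetical callbacks standing in for COFFE's per-subcircuit search and cost evaluation:

```python
def size_fpga(subcircuits, size_subcircuit, total_cost, max_iterations=4):
    """Divide-and-conquer outer loop: re-size every subcircuit in turn
    (each sized against the current sizes of the others), then repeat
    until the overall cost stops improving -- implying that loading has
    stabilized -- or a maximum iteration count is reached."""
    best_cost = float("inf")
    for _ in range(max_iterations):
        for subcircuit in subcircuits:
            size_subcircuit(subcircuit)
        cost = total_cost()
        if cost >= best_cost:
            break  # no further cost reduction
        best_cost = cost
    return best_cost
```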
3.7.2 Pre-Determined P/N Ratios
COFFE further reduces the number of transistor sizing combinations evaluated by using pre-determined
P/N ratios to size the NMOS and PMOS transistors of inverters as a unit instead of as individual
transistors. Although COFFE chooses these P/N ratios intelligently, they may not always be the best
choice for every transistor sizing combination. Therefore, COFFE later re-computes the P/N ratios
of the best candidate transistor sizing solutions, ensuring that the P/N ratios of the final solution are
tailored to it.
3.7.3 Detailed Algorithm
The steps listed below describe how COFFE sizes a subcircuit (the step numbers match the labels in
Figure 3.11). To give a concrete example, we will assume that the subcircuit being sized (Sj) is a
two-level multiplexer such as the one shown in Figure 3.3a.
1. First, COFFE selects transistor sizing ranges for subcircuit Sj based on the initial sizes of transis-
tors in that subcircuit. These initial transistor sizes originate from one of two places. On the first
FPGA sizing iteration, the initial transistor sizes are their starting sizes, which are hard-coded in
COFFE and were chosen based on designer intuition (however, since COFFE sweeps transistor
sizes over many possible values, we believe that the transistor sizing algorithm is not very sensitive
to the starting sizes of transistors). On subsequent FPGA sizing iterations, the initial sizes are the
sizes obtained on the previous FPGA sizing iteration (i.e. the previous transistor sizing solution).
To explore the impact of both growing and shrinking transistor sizes, COFFE chooses transistor
sizing ranges that place the initial transistor sizes near the center. For example, if transistor lvl1
has an initial size of 6, COFFE would choose a transistor sizing range of 1→ 10 for this transistor
(assuming the size of this transistor is swept over 10 integer values). Note that, by default, COFFE
uses integer granularity when sweeping transistor sizes.
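A sketch of this range-selection rule (our paraphrase; the number of values per range and the minimum size are configuration details that may differ in COFFE):

```python
def sizing_range(initial_size, n_values=10, min_size=1):
    """Integer sizing range that places the initial transistor size near
    the center, clipped so no transistor is swept below the minimum
    size. E.g. an initial size of 6 gives the range 1..10; an initial
    size at the minimum cannot be centered and the range simply starts
    at the minimum."""
    low = max(min_size, initial_size - n_values // 2)
    return list(range(low, low + n_values))
```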
Figure 3.11: COFFE's transistor sizing algorithm.
2. With the transistor sizing ranges chosen in step 1, COFFE creates a set {Sj,i | i = 1, 2, ..., K} of transistor
sizing combinations to explore. For example, there are four sizeable elements in the subcircuit of
Figure 3.3a: lvl1, lvl2, buf1 and buf2. If each transistor sizing range consists of 10 values, the set
Sj,i would consist of all 10,000 possible transistor sizing combinations in these ranges.
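Enumerating this set is a Cartesian product over the per-element sizing ranges, which the sketch below builds with itertools.product; the element names are those of the Figure 3.3a example:

```python
import itertools

def sizing_combinations(ranges):
    """Cartesian product of per-element sizing ranges: the set of all
    transistor sizing combinations explored for one subcircuit."""
    names = sorted(ranges)
    return [dict(zip(names, combo))
            for combo in itertools.product(*(ranges[n] for n in names))]

# Four sizeable elements with 10 values each -> 10,000 combinations.
combos = sizing_combinations({
    "lvl1": range(1, 11), "lvl2": range(1, 11),
    "buf1": range(1, 11), "buf2": range(1, 11),
})
```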
3. Recall that, to reduce the search space, COFFE sizes the NMOS and PMOS of inverters as a unit
based on pre-determined P/N ratios. COFFE computes these P/N ratios by equalizing the rise
and fall times of the inverters of subcircuit Sj (buf1 and buf2) for a mid-range transistor sizing
combination. Although the transistor sizes in this mid-range combination can sometimes be the
same as the initial transistor sizes of step 1, they can also be different. For example, if the initial
size of transistor buf1 was 1, COFFE would choose a transistor sizing range of 1 to 5 as it is
impossible to make buf1 smaller than 1. Consequently, COFFE would use a size of 3 for buf1
when computing these P/N ratios.
4. Using the P/N ratios of step 3, COFFE calculates area (Aj,i) and measures delay (Tj,i) for each
transistor sizing combination (Sj,i) of the set created in step 2. Since the rise and fall times (trise
and tfall) will not remain perfectly balanced as we evaluate different transistor sizing combinations,
COFFE uses the average of the rise and fall times in this step (tavg = (trise + tfall)/2) because
Table 3.3: Rise-fall re-balancing and the effect of M on COFFE's transistor sizing solutions (example).

(a) Six top-ranked transistor sizing combinations before re-balancing.

Sizing Combination  Cost   tfall (ps)  trise (ps)
A                   1.000  161         163
B                   1.006  157         162
C                   1.010  168         165
D                   1.013  189         179
E                   1.017  163         164
F                   1.021  196         181

(b) Re-balancing the top-ranked solution (M = 1).

Sizing Combination  Cost   tfall (ps)  trise (ps)
A                   0.983  156         156
B                   1.006  157         162
C                   1.010  168         165
D                   1.013  189         179
E                   1.017  163         164
F                   1.021  196         181

(c) Re-balancing the 5 top-ranked solutions (M = 5).

Sizing Combination  Cost   tfall (ps)  trise (ps)
B                   0.982  153         153
A                   0.983  156         156
E                   0.987  157         157
C                   0.995  161         161
D                   1.052  181         180
F                   1.021  196         181
in step 5, COFFE will re-balance the P/N ratios of candidate transistor sizing solutions and this
re-balancing typically makes the final delay closer to tavg than to max(trise, tfall). With its area
(Aj,i) and delay (Tj,i), COFFE computes the cost (Cj,i) of each transistor sizing combination (Sj,i)
based on the desired optimization objective.
5. COFFE sorts the transistor sizing combinations based on cost from lowest to highest and then
re-balances the rise and fall times on a user-specified M number of top-ranked transistor sizing
combinations. Since this re-balancing changes the area and delay of these M top-ranked transistor
sizing combinations, their area (Aj,i), delay (Tj,i) and cost (Cj,i) are updated. This final rise-fall
re-balancing may re-order the final ranking. This is exemplified in Table 3.3 where Table 3.3a
shows the six top-ranked transistor sizing combinations for our example subcircuit. Notice how
the rise and fall times are not all perfectly balanced due to the use of pre-computed P/N ratios.
Tables 3.3b and 3.3c show what happens to the final ranking when COFFE re-balances the rise and
fall times for M = 1 and M = 5 respectively. Two important observations must be made. First,
re-balancing the rise and fall times generally yields a more efficient solution. Second, re-balancing
many top-ranked transistor sizing solutions can change the final ranking. Note from Tables 3.3b
and 3.3c that after re-balancing, the delay of our example subcircuit (Figure 3.3a) is often smaller
than both the trise and tfall of Table 3.3a. This happens because there are two inverters in series
in this subcircuit and re-balancing the P/N ratios of both of them can improve both the rising and
falling transition times.
6. After re-balancing the rise and fall times on the M top-ranked transistor sizing combinations,
COFFE chooses the minimum cost solution Sj,sol from those M transistor sizing combinations.
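Steps 4 through 6 amount to rank-by-average-delay, re-balance the top M, then re-rank. The sketch below captures that control flow, with `evaluate` and `rebalance` as hypothetical stand-ins for the HSPICE-based measurements:

```python
def pick_best_combination(combos, evaluate, rebalance, cost, m=5):
    """Steps 4-6: rank all sizing combinations by cost computed from
    the average of rise and fall times, re-balance the P/N ratios of
    the top M candidates, then pick the cheapest re-balanced one.

    `evaluate` and `rebalance` each return (area, t_rise, t_fall)."""
    def provisional_cost(s):
        area, t_rise, t_fall = evaluate(s)
        return cost(area, (t_rise + t_fall) / 2)

    ranked = sorted(combos, key=provisional_cost)
    rebalanced = []
    for s in ranked[:m]:
        area, t_rise, t_fall = rebalance(s)
        # After re-balancing, rise and fall times are equalized,
        # so the worst-case delay is what matters.
        rebalanced.append((cost(area, max(t_rise, t_fall)), s))
    return min(rebalanced, key=lambda pair: pair[0])[1]
```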
7. Once the best cost sizing combination Sj,sol has been selected, COFFE checks if Sj,sol is on one
of the boundaries of the transistor sizing ranges of step 1. For example, if the sizing range of
transistor lvl1 was 1 → 10, and Sj,sol specifies that lvl1 = 10, then Sj,sol is on the boundary of
lvl1’s sizing range. If this is the case, we may not have explored a large enough size range as lvl1
may benefit from being made bigger than 10. Consequently, when Sj,sol is on a transistor sizing
range boundary, COFFE adjusts the transistor sizing ranges around Sj,sol (e.g. lvl1’s new sizing
range would be 5 → 14) and the process is repeated (return to step 2) until a solution that is
contained entirely within the sizing ranges is found, implying that we have searched a large enough
size range.
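The boundary check and range adjustment of step 7 can be sketched as follows (our paraphrase of the rule, using the lvl1 = 10 example from the text):

```python
def on_range_boundary(solution, ranges):
    """A best solution lying on the edge of any sizing range suggests
    that the search window may have been too small."""
    return any(solution[name] in (r[0], r[-1]) for name, r in ranges.items())

def recenter_range(best_size, n_values=10, min_size=1):
    """New sizing range centered on the boundary solution: lvl1 = 10
    with an old range of 1..10 yields a new range of 5..14."""
    low = max(min_size, best_size - n_values // 2)
    return list(range(low, low + n_values))
```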
With divide-and-conquer and inverter rise-fall balancing, we reduce the number of transistor sizing
combinations to examine from ∼10^80 to the much more tractable ∼3 × 12 × 10^4. That is, for ∼12
subcircuits containing ∼4 sizeable items (transistors or inverters), we try ∼10 possible sizes per sizeable
item, or ∼10^4 combinations per subcircuit. This is done ∼3 times to account for changes in loading
(i.e. the FPGA sizing iterations).
Total runtime is ∼4h for M = 1 or ∼10h for M = 5 on a single Intel Xeon E5-1620 3.6GHz processor
core.
3.8 Impact of Improved Wire Load Modeling
In this section, we examine the impact of improved wire load modeling on the area and delay of an
FPGA by using COFFE to perform transistor sizing under different wire loading scenarios. Section
3.8.1 describes the architecture that we will use to perform this analysis. This architecture will also be
the base architecture for our circuit design investigations in Chapter 4. Similarly, Section 3.8.2 describes
our target process technology for both this section and Chapter 4.
3.8.1 Base Architecture
For our circuit design investigations to be of value, it is important that we conduct our experiments on
FPGA architectures that are known to be good and that are relevant to current commercial FPGAs. We
use COFFE to perform the transistor-level design for all our experiments and thus our architecture style
matches the one described in Section 3.2. Unless stated otherwise, we use the architecture parameters
listed in Table 3.4. Most of these parameters were selected based on results from prior work or common
commercial practices (see Section 2.1). The number of cluster inputs (I) is set to 40 based on (2.1) plus 7
spare cluster inputs which are required to keep the sparsely populated local interconnect (Fclocal = 0.5)
highly routable [29]. Our routing channel width is set to W = 320 by adding 30% more routing tracks
to the minimum channel width required to route our biggest benchmark circuit. We chose this channel
width because some of the wire load estimates used during transistor sizing depend on the absolute channel width, and
it is common in commercial FPGAs to choose a channel width sufficient to route even difficult circuits.
Since the architecture we use is fairly different from prior work in terms of logic cluster outputs (e.g.
two outputs per BLE and single-driver routing wires), Fcout is determined experimentally. In Section
4.1, we show that for this architecture, an Fcout = 0.025W produces an FPGA with the best area-delay
product. Table 3.5 details the subcircuits per tile for this architecture.
Table 3.4: Base architecture parameters.
Parameter  Value   Parameter  Value
K          6       Fs         3
N          10      Fclocal    0.5
I          40      Rfb        "on" for LUT-input C,
Fcin       0.2                "off" for all other LUT inputs
Fcout      0.025   Rsel       LUT & BLE input
W          320     Ofb        1
L          4       Or         2
Table 3.5: Subcircuit count per tile for base architecture.
Subcircuit                     Size  Count
Local routing multiplexers     25:1  60
Connection block multiplexers  64:1  40
Switch block multiplexers      10:1  160
BLEs                           –     10
SRAM cells                     –     3050
Table 3.6: Metal layer data used by COFFE for all circuit design investigations (ITRS [19]).
Metal Layer   Half-Pitch  Aspect Ratio  R (Ω/µm)  C (fF/µm)
Intermediate  24nm        1.9           54.825    0.175
Semi-global   48nm        2.12          7.862     0.215
Global        96nm        2.34          1.131     0.250
3.8.2 Target Process Technology
We wish to investigate a number of circuit design questions in advanced process technology. Therefore,
we use PTM 22nm HP predictive SPICE models [42] for all our investigations. The nominal supply
voltage in this process is VDD = 0.8V . We extract wire resistance and capacitance per unit length for
a 22nm process from ITRS 2011 [19]. The metal stack extracted from ITRS is shown in Table 3.6.
We implement all wires in the intermediate layer (minimum width and spacing) except for the general
routing wires which we implement in the semi-global layer (2× minimum width and spacing) for its lower
resistance.
3.8.3 Results
We begin by using COFFE to size the transistors of our FPGA without including the loading effects of
any wires. COFFE’s optimization objective is set to minimize the product of tile area and representative
path delay and we re-balance the rise and fall times of the 5 top-ranked transistor sizing combinations
(M = 5). The resulting tile area and representative path delay are shown in the first row of Table 3.7.
Then, we gradually add groups of wires to our FPGA, re-sizing its transistors after every addition. As
shown in Table 3.7, each time we add wires, we observe an increase in delay as well as an increase in tile
area because COFFE chooses larger transistor sizes in an effort to cope with the extra wire loading.
Table 3.7: Impact of wire loading.
Wire load                                                                Tile area (µm²)  Delay (ps)
No wires                                                                 836              58
Inter-cluster routing only                                               899              79
Inter-cluster routing & cluster local interconnect                       905              85
Inter-cluster routing, cluster local interconnect & logic-to-routing(a)  919              98
All wires                                                                938              112

(a) We use an input track-access span of 0.5 and an output track-access span of 0.25 for logic-to-routing wires in this section. See Section 4.4 for the definition of track-access span.
Table 3.7 clearly shows that it is important to account for the effects of more than just the inter-
cluster routing wires. In fact, 24% of the delay comes from two groups of wires that have often been
overlooked in prior work: logic-to-routing wires and smaller wires like those inside multiplexers and
lookup tables (which are included in the All wires row of Table 3.7). The logic-to-routing wires are those
that connect specific routing tracks to cluster inputs (through connection block multiplexers) as well
as cluster outputs to specific routing tracks (through switch block multiplexers) and they can span a
significant fraction of a tile. We study the impact of the lengths of these wires in more detail in Section
4.4.
3.9 Integration of COFFE with VPR
As illustrated in Figure 3.1, using COFFE with VPR enables more thorough architecture investigations.
To obtain the most accurate investigations, VPR’s area and delay models need to be aligned with the
enhancements made by COFFE. Therefore, the following code changes were made to VPR and will
appear in version 8.0.
1. VPR can calculate transistor area with (3.1), the improved area estimation equation, instead of
(2.2), the original equation. An option at the top of rr_graph_area.c, which must be set before
compilation, allows one to select which equation to use. VPR uses (3.1) by default.
2. VPR now assumes an area of 4 minimum-width transistors for SRAM cells instead of 6 (const
float trans_sram_bit = 4. in rr_graph_area.c).
3. A new option at the top of rr_graph_area.c allows VPR to optionally include track buffers in its
area calculations instead of always including them. By default, they are not included.
4. Similarly, an option allows VPR to optionally include track buffers in its loading and delay models.
The area and delay outputs of COFFE can be used to create VPR architecture files for architecture
exploration. In Section 4.4, we created a number of VPR architecture files for our circuit design exper-
iments. These architecture files are available at: http://www.eecg.utoronto.ca/~vaughn/software.
html. All architecture files describe an identical architecture (the base architecture of Section 3.8.1),
but they have different area and delays due to different circuit designs (e.g. switch type (pass-transistor
or transmission gate) and voltage levels (supply voltages and gate voltages)). These architecture files
are highly commented such that they are self-documenting.
Chapter 4
Efficient FPGA Circuitry
Circuit design is a crucially important part of obtaining efficient FPGAs. While the architecture of
an FPGA defines the style and flexibility of its resources, the circuit design of those resources is what
defines the area, delay and power of the architecture. In Chapter 3, we described COFFE, an automated
FPGA transistor sizing tool that produces the accurate area, delay and power estimates of properly sized
FPGA circuitry needed during architecture exploration. However, COFFE can be used in a different
role. An automated transistor sizing tool also enables design space exploration of FPGA circuitry as
well as investigations relating to the interaction between architecture and circuit design. In this chapter,
we use COFFE to explore a number of such circuit design related questions. Unless stated otherwise,
the experiments in this chapter use the base architecture described in Section 3.8.1 and the process
technology data described in Section 3.8.2.
4.1 Fcout for Single-Driver Routing and Multiple BLE Outputs
Previous work has shown that Fcout = W/N is an appropriate cluster output pin flexibility [7], which
for our architecture would lead to Fcout = 0.1W . However, our cluster output architecture differs
significantly from that of [7]: it has two outputs per BLE and single-driver routing wires. Therefore, we
re-investigate cluster output pin flexibility. The area tradeoffs are as follows. Smaller Fcout values lead
to smaller switch block multiplexers as there are fewer connections from the cluster outputs to routing
wires. However, larger channels are needed due to poorer routability, leading to a larger number of
switch block multiplexers. The delay tradeoffs are similar. Smaller values of Fcout reduce loading and
lead to faster cluster outputs but might lead to circuitous routing.
We use VPR to place and route the MCNC benchmark circuits [53] on three architectures with differ-
ent values of Fcout to find an equally routable channel width for each architecture (i.e. same W/Wmin
where Wmin is the average minimum channel width required to successfully route the benchmarks).
Table 4.1 shows the number of switch block multiplexers per tile required for each architecture as well as
the size of these multiplexers. We use COFFE to size the transistors of all three architectures of Table
4.1. Then, we place and route the MCNC benchmark circuits for each architecture using the channel
widths of Table 4.1 to obtain the tile areas and critical path delays shown in Table 4.2. Based on these
results, we find that Fcout = 0.025W gives the best area-delay product for our N = 10, K = 6 and
Fcin = 0.2W architecture. We did not explore values of Fcout smaller than 0.025 because very low
Table 4.1: Effect of Fcout on channel width and switch block multiplexers.
Fcout    W     # of SB MUXes per Tile    SB MUX Size
0.250    288   144                       19:1
0.100    296   148                       13:1
0.025    320   160                       10:1
Table 4.2: Area and delay for different Fcout values.
Fcout    W     Tile Area (µm2)    Critical Path Delay (ns)    Area-Delay Product
0.250    288   936                7.96                        1.00
0.100    296   891                7.83                        0.95
0.025    320   873                7.84                        0.92
cluster output pin connectivities can be problematic for some channel widths. That is, if Fcout is too
low, it might not be possible to drive all starting wires in an adjacent switch block with a logic block
output. For the architecture that we use in this work, an Fcout = 0.025 results in each starting wire in a
switch block being driven by a single logic block output. Since single-driver routing reduces the portion
of a routing channel that can be accessed by logic cluster outputs to W/L, it seems intuitive that Fcout
should be lower than it is for architectures with tri-state driver routing [7] where the whole channel is
accessible. Also, prior work has not modeled Fcout dependent wire loading in detail, while COFFE lets
us take this detailed interaction into account in our architecture study. For all other experiments in this
work, we use an Fcout = 0.025W .
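The area-delay products in Table 4.2 follow from multiplying tile area by critical path delay and normalizing to the Fcout = 0.250 row. The sketch below reproduces that arithmetic; small differences in the second decimal can arise from rounding of the tabulated averages.

```python
# Sketch: normalized area-delay products for the three Fcout candidates,
# using the tile areas (um^2) and critical path delays (ns) of Table 4.2.
# Values are normalized to the Fcout = 0.250 row, as in the table.

results = {
    0.250: (936, 7.96),  # (tile area, critical path delay)
    0.100: (891, 7.83),
    0.025: (873, 7.84),
}

base = results[0.250][0] * results[0.250][1]
ad_product = {fc: (a * d) / base for fc, (a, d) in results.items()}

best = min(ad_product, key=ad_product.get)
print(best)  # Fcout = 0.025 gives the lowest area-delay product
```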
4.2 Transmission Gate FPGAs
4.2.1 Pass-Transistor Scaling Challenges
Although they are incapable of fully passing logic-high voltages, NMOS pass-transistors have been widely
used in commercial and academic FPGA circuitry due to the very small switch they enable. To benefit
from the area advantage of pass-transistors without suffering from an excessive amount of static power
dissipation due to their reduced voltage swing, FPGA circuitry typically includes a combination of
gate boosting and PMOS level-restorers to help pull degraded pass-transistor output voltages up to
VDD (see Section 2.3.2). However, as process technology scales, pass-transistor output voltages become
increasingly degraded due to the voltage scaling trends illustrated in Figure 4.1. That is, as process
technology scales, VDD is continually scaled down to reduce dynamic power dissipation and to keep
electric fields on shrinking feature sizes within acceptable bounds. To maintain good performance, VTh
is also scaled down, though at a slower rate than VDD to keep leakage current from growing too large
[6]. Recall from Section 2.3.2 that the logic-high output of a pass-transistor is degraded by a voltage
equal to VTh. Therefore, as VTh becomes an increasing fraction of VDD (Figure 4.1 shows that VTh/VDD
rose from ∼0.2 to ∼0.4 between 1997 and 2009), pass-transistor output voltages become increasingly
degraded. For the 22nm process we use in this work for example, the output of a non-gate boosted
pass-transistor switches only between 0V and 0.55V, whereas VDD is 0.8V. In addition, the slew rate
of the rising waveform above 0.45V is very slow. Consequently, the inverter sensing this signal (whose input
Figure 4.1: VDD and VTh scaling trends [12].
can remain near VDD/2 for some time) can experience a high short-circuit current and a slow switching
speed. Furthermore, recent work has shown that pass-transistor based FPGAs are very sensitive to
aging induced by positive bias temperature instability which has become larger with the new high-k gate
dielectrics [23, 5].
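As a rough numeric sketch of this degradation (the 0.25V effective threshold below is inferred from the 0.8V/0.55V figures above, not taken from the PTM model card):

```python
# First-order sketch of pass-transistor logic-high degradation: the output
# can rise to at most VG - VTh(effective). With no gate boosting (VG = VDD)
# in the 22nm process used here, a 0.8V supply yields a ~0.55V output,
# implying an effective threshold (with body effect) of about 0.25V.
# The 0.25V value is an inference from the text, not a model-card number.

def pass_transistor_vout_max(vg: float, vth_eff: float) -> float:
    """Maximum logic-high voltage at a pass-transistor output."""
    return vg - vth_eff

VDD, VTH_EFF = 0.8, 0.25
print(round(pass_transistor_vout_max(VDD, VTH_EFF), 2))        # 0.55 (degraded)
print(round(pass_transistor_vout_max(VDD + 0.2, VTH_EFF), 2))  # 0.75 with 0.2V of gate boosting
```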
To increase the pass-transistor output voltage, one can apply larger amounts of gate boosting, but
this poses a reliability risk as larger VGS values accelerate device aging. Furthermore, the latest high-k
gate processes do not offer a “mid-oxide” thickness transistor; such transistors were available in 90nm
through 40nm conventional oxide processes to give a reduced gate leakage transistor option to designers
[48]. These mid-oxide thickness transistors were excellent pass-gates as their thicker oxide allowed a high
level of gate boosting without compromising reliability. With PMOS level-restorers the issue is one of
robustness. A VTh that is a larger fraction of VDD means it takes longer for level-restorers to turn on
(which increases short-circuit currents) or, in the extreme case, they might not turn on at all. Reliability
concerns, a higher susceptibility to device aging, performance degradation and increasing short-circuit
power dissipation make the pass-transistor an increasingly less desirable switch.
4.2.2 Replacing Pass-Transistors with Transmission Gates
One solution to the pass-transistor scaling problem detailed in the previous section is to stop using
pass-transistors entirely and instead build FPGAs out of CMOS transmission gates. Transmission gates,
which consist of an NMOS transistor and a PMOS transistor in parallel, are capable of passing a full
rail-to-rail voltage swing, making them more robust than pass-transistors at low VDD. Although the
idea of using transmission gates in certain parts of the FPGA circuitry is not new [41, 27], building
FPGAs entirely out of transmission gates has typically been avoided because the area of a transmission
gate-based FPGA would be significantly larger than that of a pass-transistor-based FPGA. However,
there is no prior work that quantifies how much larger a transmission gate FPGA would be, and since
transmission gates are faster than pass-transistors due to their full voltage swing, it is unclear where in
the area-delay optimization space a fully transmission gate-based FPGA would fall in relation to a fully
pass-transistor-based FPGA. In this work, we locate them both in advanced process technology (with
PTM 22nm high-performance models [42]) by designing each type of FPGA from scratch, complete with
architectural design, circuit design and detailed transistor sizing.
To ensure our comparison is accurate, we choose an identical architecture for both our pass-transistor
and transmission gate FPGAs; specifically, the one described in Section 3.8.1. Since we use COFFE to
perform the transistor-level design of each FPGA, the circuit topologies for our pass-transistor FPGAs
are those described in Section 3.3. COFFE uses similar circuit topologies for our transmission gate
FPGAs but replaces pass-transistors with transmission gates. Figure 4.2b shows a transmission gate
implementation of a generic two-level routing multiplexer. Note that transmission gate routing multi-
plexers do not include PMOS level-restorers because transmission gates can pass a full voltage swing.
Our transmission gate 6-LUT topology is similar to the pass-transistor topology shown in Figure 3.4
but pass-transistors are replaced with transmission gates and the level-restoring PMOS transistors are
removed.
4.2.3 Gate-Boosting Strategy
Commercial FPGAs have often used a voltage greater than VDD on the gates of pass-transistors, which
is often called gate boosting. In addition to bringing the degraded logic-high outputs of pass-transistors
closer to VDD to mitigate static power dissipation concerns (see Section 2.3.2), gate boosting improves
performance. That is, the more VG is boosted above VDD, the faster a pass-transistor circuit will become
due to faster and larger swinging pass-transistor outputs. Therefore, a thorough comparison of pass-
transistor and transmission gate FPGAs should include an analysis of the effect of gate boosting on both
switch types because even though transmission gate FPGAs do not suffer from degraded outputs, gate
boosting will improve their performance.
Gate boosting a routing multiplexer is achieved by connecting the SRAM cells to separate power
and/or ground rails (VSRAM+ and VSRAM− in Figure 4.2). Setting VSRAM+ above VDD will effectively
apply a higher voltage to the gates of NMOS transistors inside the multiplexer (provided the cell contains
a logic-high value). In addition to increasing VSRAM+, transmission gate FPGAs can set VSRAM− below
0V to improve PMOS transistor performance. Since SRAM cells only switch at configuration time, gate
boosting a routing multiplexer does not increase dynamic power consumption and high-VTh, low-leakage
transistors can be used in the SRAM cells to minimize static power consumption (their speed is not
important). Through HSPICE simulation, we found that boosting the voltage by 200mV on an SRAM
cell built from PTM 22nm low-power transistors increased its static leakage by 3.6×. However, the
SRAM contribution to the chip-wide static power consumption remained below 2mW1.
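The chip-level estimate in the accompanying footnote can be sketched as:

```python
# Sketch of the footnoted chip-level SRAM leakage estimate: boosting
# VSRAM+ from 0.8V to 1.0V raises a tile's total SRAM leakage from
# 14.06nA to 51.08nA (a 3.6x increase), but even across a large FPGA
# (~36,000 tiles [4]) the static power stays under 2mW.

leak_nominal_per_tile = 14.06e-9   # A, VSRAM+ = 0.8V
leak_boosted_per_tile = 51.08e-9   # A, VSRAM+ = 1.0V
tiles = 36_000
vsram_plus = 1.0                   # V

increase = leak_boosted_per_tile / leak_nominal_per_tile
chip_static_power = leak_boosted_per_tile * tiles * vsram_plus

print(f"{increase:.1f}x leakage increase")            # 3.6x leakage increase
print(f"{chip_static_power * 1e3:.2f} mW chip-wide")  # 1.84 mW chip-wide
```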
Gate boosting a lookup table can be accomplished by running the LUT input drivers (see Figure
3.4) at a higher voltage. This would increase dynamic power consumption as the LUT inputs toggle
frequently during device operation and would also require level-converters on the time critical LUT input
path. Gate boosting lookup tables is thus both more complex and less beneficial than gate boosting
routing multiplexers. Therefore, we do not gate boost lookup tables.
Too much gate boosting will cause faster aging by accelerating time-dependent dielectric breakdown
1Increasing VSRAM+ from 0.8V to 1.0V increased a tile's total SRAM leakage from 14.06nA to 51.08nA. Large modern FPGAs can have as many as 36,000 tiles [4]. Therefore, total chip leakage is 51.08nA × 1.0V × 36,000 = 0.0018W.
(a) Pass-transistor implementation of a two-level routing multiplexer, showing the SRAM cell details (VSRAM+, VSRAM−, WL, BL), the two multiplexer levels (transistors lvl1 and lvl2), the level-restorer and the two-stage output buffer (buf1 and buf2).
(b) Transmission gate implementation of a two-level routing multiplexer, with the same structure but no level-restorer.
Figure 4.2: Generic two-level routing multiplexer with two-stage buffer implemented with pass-transistors (a) and transmission gates (b).
Figure 4.3: Effect of different gate boosting strategies on transmission gate switch block multiplexer delay (VDD = 0.8V). The three panels plot delay reduction (%) when only the NMOS transistor is gate boosted, when only the PMOS transistor is gate boosted, and when both are; bars of the same color correspond to the same SRAM overstress (0.1V to 0.4V).
and bias-temperature instability or could even destroy the transistor. It is difficult to determine exactly
how much gate boosting is tolerable without compromising reliability, particularly for newer processes,
making it an active topic of investigation [32, 11]. Thus, since it is unclear exactly how much gate
boosting is safe for a 22nm process, we sweep the gate voltage over three values (VDD, VDD + 0.1V and
VDD + 0.2V ) thus providing a general indication of the effect of gate boosting from which a safe gate
boosting level can be chosen.
One final question must be answered before we have a complete gate boosting strategy: how do we
gate boost the transmission gates? A transmission gate can be gate boosted by applying a voltage larger
than VDD on the gate of the NMOS transistor, by applying a voltage smaller than 0V on the gate of
the PMOS transistor or by a mixture of both. It isn’t immediately clear which one of these options
is best. Therefore, we experiment with different levels of gate boosting on our completely optimized,
non-gate boosted, transmission gate FPGA design. Figure 4.3 shows the delay reductions observed in
the switch block multiplexers; results for other multiplexers follow the same trend. Gate boosting only
the NMOS transistor (leftmost bar graph) results in almost twice the delay reduction that is obtained
when only the PMOS transistor is gate boosted and results in nearly the same amount of delay reduction
obtained when both transistors are gate boosted. Therefore, we choose to only gate boost the NMOS
transistors of transmission gates since the additional delay reduction achieved by also gate boosting the
PMOS transistors probably does not merit the creation of a new supply plane. As well, some transistors
in the configuration SRAMs will be subjected to a voltage difference of VSRAM+ − VSRAM−. Hence,
simultaneously gate boosting both NMOS and PMOS transistors by some voltage increases the reliability
risk versus gate boosting only the NMOS transistors by that voltage. Bars of the same color in Figure
4.3 have the same stress on the SRAM cells.
Thus, our gate boosting strategy is the following. We gate boost the NMOS transistors of routing
multiplexers (but not LUTs) for both pass-transistor and transmission gate FPGAs. This is accomplished
by increasing the VSRAM+ voltage, which we sweep over three values: VDD, VDD+0.1V and VDD+0.2V .
Figure 4.4: CAD flow for each FPGA. In the transistor-level design stage, COFFE v0.1 (driven by HSPICE with PTM 22nm models, together with its area and wire models) takes the architecture and circuit design and produces transistor sizes, delay per subcircuit and power per subcircuit. In the measurement stage, these feed tile area calculations, VPR architecture files (used to place and route benchmarks with VPR, yielding critical path delay and subcircuit usage counts) and power calculations, producing tile area, critical path delay and power.
4.2.4 Methodology
Our goal is to examine how the area, delay and power of transmission gate FPGAs compare to that of
pass-transistor FPGAs. In the previous section, we chose a gate boosting strategy that involves sweeping
the gate voltage over three values. Consequently, we will look at six different FPGAs representing
three levels of gate boosting for both pass-transistor and transmission gate switches. All six FPGAs
have identical architectural parameters (described in Section 3.8.1) but they differ in circuit design.
Throughout the remainder of this section, we refer to these FPGAs as implementations. Figure 4.4
shows the CAD flow used to obtain tile area, critical path delay and dynamic power for each FPGA
implementation.
To obtain a fair comparison, we must first size the transistors of each FPGA implementation such
that all are optimized for a common objective (top portion of Figure 4.4). Note that it is important to
perform transistor sizing for each level of gate boosting because gate boosting affects the voltage-transfer
characteristics of the circuits. We use COFFE2 to size the transistors of our six FPGA implementa-
tions. The optimization objective is set to minimize area-delay product and we optimize each subcircuit
individually (local optimization). When transistor sizing is complete, COFFE yields the final transistor
sizes along with the delay and dynamic power of each subcircuit for each FPGA implementation. Dy-
namic power is obtained for each subcircuit by using HSPICE to measure the average current required
to propagate a rising and a falling transition through the subcircuit and then multiplying it by VDD.
2This work was performed with an earlier version of COFFE than the one described in Chapter 3. In this version, COFFE did not account for the extra area needed for N-well spacing. That is, the area model consisted of only Equation 3.1. For our transmission gate FPGAs, this difference in area modeling implicitly assumed that the extra PMOS transistors can be placed in existing N-wells. If this is not possible and additional N-wells are required, we estimate that transmission gate FPGA area would increase by at most 7%, which would not significantly affect our overall conclusions.
Table 4.3: Pass-transistor and transmission gate FPGA tile area for different levels of gate boosting.
VG            Pass-transistor Tile Area (µm2)    Transmission Gate Tile Area (µm2)    TG/PT
VDD           875                                1006                                 15.0%
VDD + 0.1V    873                                1010                                 15.7%
VDD + 0.2V    887                                1015                                 14.5%
Table 4.4: Switch block multiplexer transistor sizes for PT and TG implementations for different levels of gate boosting (see Figure 4.2 for transistor labels). Note that with the exception of P/N ratios, COFFE uses integer granularity.

Type    VG           lvl1 (PMOS/NMOS)    lvl2 (PMOS/NMOS)    buf1 (PMOS/NMOS)    buf2 (PMOS/NMOS)
PT      VDD + 0.0    -/3                 -/3                 3/3.2               20.7/11
PT      VDD + 0.1    -/2                 -/2                 7.0/3               31.6/12
PT      VDD + 0.2    -/2                 -/2                 12.6/3              37.5/14
TG      VDD + 0.0    1/1                 1/1                 3/4.3               35.7/21
TG      VDD + 0.1    1/1                 1/1                 4/4.3               39.9/19
TG      VDD + 0.2    1/1                 1/1                 5.6/4               44.9/19
Once all FPGA implementations have been optimized, tile area, critical path delay and power can
be measured (bottom portion of Figure 4.4). Tile area is obtained by first calculating the area of each
FPGA subcircuit based on the final transistor sizes obtained from COFFE. Then, the subcircuit areas
are multiplied by the number of subcircuits in a tile (Table 3.5) and summed to obtain total area. A
VPR architecture file is created for each of the six FPGA implementations with the delay-per-subcircuit
values obtained from COFFE. Critical path delay is measured experimentally with VPR by placing and
routing MCNC [53] and VTR [43] benchmarks on each FPGA for five different placement seeds. To
compute relative total power, we multiply the power-per-subcircuit numbers by the average number
of times each subcircuit is used in VPR placed and routed benchmarks. Since we are only interested
in a relative power comparison between our six FPGA implementations, we do not need to perform a
functional simulation to obtain toggle activities as we expect them to be the same across implementations
except for very slight glitch changes due to small variations in timing.
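The measurement-stage aggregation described above amounts to two weighted sums, sketched below with illustrative placeholder numbers (not the actual Table 3.5 counts or COFFE outputs):

```python
# Sketch of the two weighted sums used in the measurement stage:
# tile area = sum(area per subcircuit x count per tile), and
# relative power = sum(power per subcircuit x average usage count from
# placed-and-routed benchmarks). All numbers below are illustrative
# placeholders, not the real Table 3.5 counts or COFFE outputs.

subcircuit_area = {"sb_mux": 2.1, "cb_mux": 1.4, "lut": 18.0}   # um^2 each
count_per_tile = {"sb_mux": 160, "cb_mux": 64, "lut": 10}

tile_area = sum(subcircuit_area[s] * count_per_tile[s] for s in subcircuit_area)

power_per_use = {"sb_mux": 1.0, "cb_mux": 0.7, "lut": 3.2}      # relative units
avg_usage = {"sb_mux": 5200, "cb_mux": 2100, "lut": 900}        # from VPR routing

relative_power = sum(power_per_use[s] * avg_usage[s] for s in power_per_use)

print(tile_area, relative_power)
```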
4.2.5 Results
Table 4.3 shows the tile area for pass-transistor (PT) and transmission gate (TG) FPGAs with differ-
ent levels of gate boosting. VDD is 0.8V , the nominal VDD for this process, throughout this section.
The results indicate that transmission gate FPGAs are approximately 15% larger than pass-transistor
FPGAs. At first glance, a 15% tile area increase may seem surprisingly small as building FPGAs out
of transmission gates instead of pass-transistors doubles the number of transistors per switch in the
routing multiplexers and lookup tables, which make up a large fraction of an FPGA tile. There are
three contributing factors to this modest area increase. First, although transmission gate FPGAs re-
quire two transistors for each switch instead of just one, the area of each switch is not doubled due to
differences in transistor sizing. That is, the area of a transmission gate is usually equal to 2 minimum-
width transistor areas, as COFFE usually finds that minimum-size transmission gates (minimum size
Table 4.5: Pass-transistor and transmission gate FPGA critical path delay for different levels of gate boosting (VTR benchmarks).

VG            Pass-transistor Delay (ns)    Transmission Gate Delay (ns)    TG/PT
VDD           23.3                          17.4                            -25.4%
VDD + 0.1V    18.9                          15.8                            -16.3%
VDD + 0.2V    16.2                          14.4                            -10.7%
Table 4.6: Pass-transistor and transmission gate FPGA area-delay product for different levels of gate boosting (VTR benchmarks).

VG            Pass-transistor    Transmission Gate    TG/PT
VDD           1.00               0.86                 -14%
VDD + 0.1V    0.81               0.78                 -3%
VDD + 0.2V    0.70               0.72                 2%
Table 4.7: Pass-transistor and transmission gate FPGA relative dynamic power for different levels of gate boosting (VTR benchmarks).

VG            Pass-transistor    Transmission Gate    TG/PT
VDD           1.00               1.04                 3.8%
VDD + 0.1V    0.99               1.05                 6.4%
VDD + 0.2V    1.02               1.06                 4.4%
NMOS and minimum size PMOS) are most efficient while the area of a pass-transistor is usually 1.26
to 1.51 minimum-width transistor areas, as COFFE usually sizes pass-transistors 2 to 3 times larger
than the minimum size (see Appendix C). Second, SRAM cell area is constant for both types of FPGAs
and accounts for a large fraction of tile area (see Section 4.2.6). Finally, transmission gate FPGA area
benefits from the removal of PMOS level-restorers, which are required in pass-transistor FPGAs but not
in transmission gate FPGAs.
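A back-of-the-envelope sketch of the first factor, assuming that equation (3.1) takes the diffusion-sharing form area(x) = 0.447 + 0.128x + 0.391√x (an assumption made here for illustration; the exact coefficients are given in Chapter 3), reproduces the 1.26 and 1.51 figures quoted above:

```python
import math

# Sketch of the minimum-width-transistor-area accounting behind the ~15%
# figure. ASSUMPTION: equation (3.1) has the diffusion-sharing form
# area(x) = 0.447 + 0.128x + 0.391*sqrt(x), in minimum-width transistor
# areas, with x = drive strength in minimum widths. This form reproduces
# the 1.26 and 1.51 areas quoted for 2x and 3x pass-transistors.

def trans_area(x: float) -> float:
    return 0.447 + 0.128 * x + 0.391 * math.sqrt(x)

pt_2x, pt_3x = trans_area(2), trans_area(3)  # typical pass-transistor sizes
tg_min = 2 * trans_area(1)                   # NMOS + PMOS, both minimum size
print(round(pt_2x, 2), round(pt_3x, 2), round(tg_min, 2))  # 1.26 1.51 1.93
```

So a minimum-size transmission gate costs roughly 2 minimum-width transistor areas, while a typical 2x-3x pass-transistor already costs 1.26 to 1.51, which is why doubling the transistor count does not double the switch area.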
Gate boosting does not significantly affect tile area. In general, we noticed that as the level of gate
boosting is increased on pass-transistor FPGAs, our transistor sizing tool tends to reduce pass-transistor
sizes but increases buffer sizes resulting in an FPGA that has similar tile area but reduced delay. Due
to their larger area, our transistor sizing tool almost always chooses minimum sized transmission gates.
The buffers in transmission gate FPGAs are larger than those of pass-transistor FPGAs due to more
transistor and wire loading. The P/N ratios of buffers are also different for different levels of gate
boosting, as the signal swings at the buffer inputs are changing. Table 4.4 shows the transistor sizes
for a switch block multiplexer in units of minimum contactable transistor width (45nm in this 22nm
process). Transistor sizes for all subcircuits are given in Appendix C.
Table 4.5 shows average critical path delay for all 6 FPGA designs for the VTR benchmark set
(MCNC benchmarks yielded similar results and hence results are not shown). The results show that,
with no gate boosting, transmission gate FPGAs are 25% faster than pass-transistor FPGAs. As the
(a) Tile area breakdown: SB MUX 31.3%, CB MUX 22.0%, Local MUX 16.4%, FF 1.2%, Cluster Output 2.2%, LUT 26.9%.
(b) Critical path delay breakdown: SB MUX 36.7%, CB MUX 14.5%, Local MUX 16.2%, FF 0.4%, Cluster Output 4.7%, LUT 25.4%, Other 2.0%.
Figure 4.5: Tile area and critical path delay breakdown.
level of gate boosting is increased, the delay gap is reduced but transmission gate FPGAs remain faster.
The higher speed with transmission gates is due to the increased multiplexer output voltage swing and
the fact that we now have two switch transistors in parallel, providing lower resistance. The resistance of
transmission gates is further reduced in advanced processes because highly strained silicon has narrowed
the gap between PMOS and NMOS carrier mobility. For example, in the 22nm process we use, PMOS
transistor drive strength is 66% of the NMOS transistor drive strength for the same width. In older
process nodes (e.g. 0.35µm), PMOS transistor drive strength was only 37% that of NMOS transistors
[7].
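A first-order resistor model (our illustration, not an analysis from the thesis) shows why the narrowing PMOS/NMOS drive gap matters for transmission gates:

```python
# First-order sketch (not from the thesis) of why strained silicon helps
# transmission gates: modeling each device as a resistor inversely
# proportional to its drive strength, the parallel NMOS+PMOS resistance
# falls as PMOS drive approaches NMOS drive.

def tg_resistance_vs_nmos(pmos_to_nmos_drive: float) -> float:
    """TG on-resistance relative to an NMOS-only switch of the same width."""
    r_n = 1.0
    r_p = 1.0 / pmos_to_nmos_drive
    return (r_n * r_p) / (r_n + r_p)   # parallel combination

print(tg_resistance_vs_nmos(0.66))  # 22nm: PMOS drive 66% of NMOS -> ~0.60
print(tg_resistance_vs_nmos(0.37))  # 0.35um: 37% of NMOS -> ~0.73
```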
The area-delay product for each FPGA design is given in Table 4.6. With no gate boosting, transmis-
sion gate FPGAs have an area-delay product that is 14% lower than pass-transistor FPGAs. However,
given the right amount of gate boosting (in this case somewhere between +0.1V and +0.2V), pass-
transistor FPGAs eventually become more efficient than transmission gate FPGAs.
Table 4.7 shows dynamic power, normalized to the non-gate boosted pass-transistor FPGA imple-
mentation. Transmission gate FPGAs consume slightly more power than pass-transistor FPGAs. This
is likely due to their larger tile area. The small decrease in power consumption experienced by pass-
transistor FPGAs with 0.1V of gate boosting is due to reduced short-circuit current. With 0.2V of gate
boosting however, the gains from reduced short-circuit current are lost due to the power increase from
higher voltage swings in the internals of the pass-transistor multiplexers.
4.2.6 Area and Delay Breakdown
Figure 4.5a shows the area contributions of different FPGA subcircuits for our pass-transistor FPGA
with 0.1V of gate boosting. Approximately 28% of the area is devoted to BLEs (LUT + FF) leaving 72%
of the area to routing. This number is lower than the 90% routing area commonly quoted in academic
work (e.g. [31]), but is higher than the commercial Stratix V architecture where routing area is said
to account for only 50% of tile area [35]. This discrepancy could be due to our architecture having
fewer features than commercial architectures (e.g adders, more complex FFs, LUTRAM, etc.). SRAM
cells account for 40% of tile area for pass-transistor FPGAs and 35% of tile area for transmission gate
FPGAs.
Figure 4.6: Critical path delay (ns) versus VDD for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG (curves: PT and TG, each with VG = VDD and with VG fixed at 0.8V).
The critical path contributions for our pass-transistor FPGA with 0.1V of gate boosting are shown
in Figure 4.5b. Approximately 26% of the critical path delay comes from the BLEs, 72% comes from
the routing and 2% comes from hard multipliers and block memory (where we use Stratix IV-like delay
values). The area and delay breakdowns for our other pass-transistor and transmission gate FPGAs
follow the same trends and are given in Appendix D.
4.3 Separating VDD and VG for Low-Power FPGAs
An FPGA that employs adaptive voltage scaling can trade delay for power by using an operating VDD
that is lower than its nominal supply voltage (VDDn). To reduce the delay penalty without adversely
affecting power, the resulting low-power FPGA can mimic the concept of gate boosting by lowering VDD
but not VG. What is particularly interesting about decoupling VDD and VG in this way is the fact that,
as long as VG <= VDDn, “gate boosting” low-power FPGAs does not pose a reliability risk as it does
for FPGAs running at VDDn where any amount of gate boosting results in VG > VDDn.
We explore the idea of adaptive voltage scaling with different VDD and VG on our non-gate boosted
pass-transistor and transmission gate FPGA implementations from Section 4.2 (that have been fully
optimized for VDD = 0.8V ) by experimenting with two low-power FPGA schemes. In the first, VDD and
VG are kept equal and are both lowered below 0.8V to produce a low-power mode. In the second, VG
is maintained at 0.8V and only VDD is lowered, resulting in a “gate boosted” low-power mode. Figures
4.6 and 4.7 show critical path delay and dynamic power (normalized to the pass-transistor FPGA with
VDD = VG = 0.8V ) respectively for both schemes. The results show that lowering VDD and VG to
0.6V results in a 2× power reduction for both pass-transistor and transmission gate FPGAs but a 6× and 2.5× increase in delay, respectively. However, if we maintain VG at 0.8V when VDD is lowered
to 0.6V, pass-transistor and transmission gate FPGA delays improve by 65% and 18% respectively at
no additional power cost. Clearly pass-transistor FPGAs are a very poor choice for low-power if gate
voltages are not maintained at VDDn.
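As a sanity check (standard αCV²f reasoning, not the thesis's HSPICE measurements), lowering VDD from 0.8V to 0.6V cuts switching energy per transition by (0.6/0.8)² ≈ 0.56, consistent with the roughly 2× power reduction observed:

```python
# Sanity check using standard alpha*C*V^2*f reasoning (our illustration,
# not the thesis's HSPICE measurements): lowering VDD from 0.8V to 0.6V
# cuts dynamic energy per transition by (0.6/0.8)^2 ~ 0.56, i.e. close to
# the ~2x power reduction reported, before accounting for frequency or
# short-circuit effects.

v_nominal, v_low = 0.8, 0.6
energy_ratio = (v_low / v_nominal) ** 2
print(round(energy_ratio, 3), round(1 / energy_ratio, 2))  # 0.562 1.78
```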
Figure 4.7: Normalized dynamic power versus VDD for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG (curves: PT and TG, each with VG = VDD and with VG fixed at 0.8V).
Figure 4.8: Power-delay product versus VDD for pass-transistor (PT) and transmission gate (TG) FPGAs for different VDD and VG (curves: PT and TG, each with VG = VDD and with VG fixed at 0.8V).
Figure 4.8 shows that decoupling VDD and VG for low-power FPGAs is very beneficial. If we maintain
VG at 0.8V, the VDD yielding minimal power-delay product shifts from 0.8V to 0.7V where we experience
a 25% power reduction. In addition, the results indicate that transmission gate FPGAs always achieve
lower power-delay product than pass-transistor FPGAs in the low-power regime with a 26% advantage
at 0.6V.
Figure 4.9: Cluster output wire load for different locality. The wire load of cluster output A spans half a tile (locality), while that of cluster output B spans a full tile (no locality); both drive switch block multiplexers in the routing channel.
Figure 4.10: Cluster input wire load for different locality. The multiplexer input wire load for cluster input A can span up to half a tile, while that for cluster input B can span up to a full tile.
4.4 Track-Access Locality
In Section 3.8, we showed that wire loading at the logic-to-routing interface has a considerable impact
on delay. Prior academic work has implicitly assumed that logic cluster pins can access all the routing
tracks in an adjacent channel but has not considered the large logic-to-routing wire loading that this
creates. It is possible to reduce this wire load by imposing limits on the lengths of logic-to-routing wires.
We refer to this concept as track-access locality and we define track-access span as the portion of a routing
channel that can be accessed by a logic cluster input or output. A large span implies little locality and
vice-versa. Figure 4.9 illustrates this concept for logic cluster outputs. In the figure, output A can only
reach half of the routing tracks in a channel (the 50% physically close to it) while output B can reach
all of them. Output A has a track-access span of 1/2; output B has a track-access span of 1. Clearly,
output B has twice as much wire load as output A. Thus, output A is faster than output B. Figure 4.10
illustrates the same concept as it applies to logic cluster inputs. The wire loading associated with cluster
inputs comes from the wires required to connect routing tracks to the connection block multiplexers.
This wire loading is seen by the routing wire drivers and will tend to slow down the general routing
wires.
Track-access locality should not be confused with Fcin and Fcout. While Fcin and Fcout specify the
Table 4.8: Effect of cluster output track-access locality on area and delay. Input track-access span is set to 0.5.

Cluster Output Track-Access Span    Tile Area (µm2)    Delay (ps)    Area-Delay Product
1.00                                959                113           1.08
0.75                                934                115           1.07
0.50                                930                114           1.06
0.25                                938                112           1.05
Table 4.9: Effect of cluster input track-access locality on area and delay. Output track-access span is set to 0.25.

Cluster Input Track-Access Span    Tile Area (µm2)    Delay (ps)    Area-Delay Product
1.00                               952                127           1.21
0.75                               953                117           1.11
0.50                               938                112           1.05
0.25                               955                105           1.00
number of tracks to which a logic cluster input or output will connect, track-access span specifies the
fraction of the routing channel in which the logic cluster input or output is authorized to make these
connections. Consequently, track-access span defines an upper bound on Fcin and Fcout. For example,
if the cluster input track-access span is set to 0.25, Fcin cannot be made larger than 0.25, because when
Fcin = 0.25, a cluster input pin already connects to every routing track to which it has access.
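To make this relationship concrete, the following Python sketch (hypothetical helper functions, not part of COFFE; the window is approximated as centered on the pin) computes the tracks a pin may access for a given span and shows how the span caps the number of Fc connections:

```python
def accessible_tracks(pin_track, span, channel_width):
    """Window of track indices a pin may access: the fraction `span`
    of the channel that is physically closest to the pin."""
    window = max(1, round(span * channel_width))
    # Center the window on the pin, clamped to stay inside the channel.
    start = min(max(pin_track - window // 2, 0), channel_width - window)
    return range(start, start + window)

def fc_connections(span, channel_width, fc):
    """Tracks a pin actually connects to: Fc * W, capped at span * W."""
    return min(round(fc * channel_width), round(span * channel_width))

W = 100
# Span 0.5: the pin can reach half of the channel's tracks.
assert len(accessible_tracks(pin_track=10, span=0.5, channel_width=W)) == 50
# With span 0.25, even a requested Fcin of 0.4 saturates at 0.25 * W:
# the pin already connects to every track it has access to.
assert fc_connections(span=0.25, channel_width=W, fc=0.4) == 25
```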
We use COFFE to size the transistors of the FPGA architecture described in Section 3.8.1 for
different degrees of track-access locality. Table 4.8 shows the effect of cluster output locality on tile
area and representative path delay while Table 4.9 shows results for cluster input locality. The results
suggest that reducing the input track-access span can lead to a large reduction in loading (∼17% delay
reduction for a span of 0.25). The effect is smaller for cluster outputs, but we still observe a small reduction
in overall area-delay product. Although track-access locality seems beneficial from a delay perspective,
it could have a negative impact on routability since increasing locality could reduce the interconnect
flexibility. It follows that the ideal track-access span will likely also depend on the values of Fcin and
Fcout. For example, for our Fcin = 0.2 and Fcout = 0.025 architecture, cluster outputs may be better
suited to high locality because they connect to relatively few routing multiplexers due to
a low Fcout value. A detailed analysis of these tradeoffs was not performed in this work but merits
future research. VPR currently does not support track-access locality – it spreads switches across the
entire routing channel. Therefore, VPR code changes would also be necessary to investigate track-access
locality thoroughly. When used with an architecture exploration tool such as VPR, COFFE enables
a thorough evaluation of such architectural issues which combine changes in connectivity, loading and
transistor sizing.
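The normalized area-delay products in Tables 4.8 and 4.9 can be re-derived from their area and delay columns. The short check below (plain Python; values transcribed from Table 4.9, normalized to the best case of span 0.25) reproduces the reported products:

```python
# Input track-access span -> (tile area in um^2, delay in ps), Table 4.9.
rows = {
    1.00: (952, 127),
    0.75: (953, 117),
    0.50: (938, 112),
    0.25: (955, 105),
}
# Normalize each area * delay product to the smallest product.
best = min(a * d for a, d in rows.values())
norm = {s: round(a * d / best, 2) for s, (a, d) in rows.items()}
assert norm == {1.00: 1.21, 0.75: 1.11, 0.50: 1.05, 0.25: 1.00}
```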
Chapter 5
Conclusions and Future Work
5.1 Summary
The transistor-level design of an FPGA has a large impact on its area, delay and power characteristics.
High-quality transistor-level design is thus important to obtain efficient FPGAs. Transistor-level design is also essential to thorough architecture exploration, where multiple candidate architectures must each be designed at the transistor level before they can be evaluated through placement and routing experiments. Automated transistor-level design tools are therefore invaluable for creating highly efficient FPGA architecture implementations in reasonable amounts of time.
In this thesis, we developed COFFE, a new fully-automated transistor sizing tool for FPGAs1. We
showed that for fine-grained transistor-level design in advanced process nodes, modeling transistors as
linear resistances and capacitances as in previous transistor sizing tools is highly inaccurate. For that
reason, COFFE maintains all circuit non-linearities by relying exclusively on HSPICE simulations to
measure delay. COFFE estimates area with a version of the minimum-width transistor area model to
which we’ve made a number of improvements to enhance its accuracy in advanced process nodes. We
showed that only accounting for the loading effects of long wires as has been done in prior work can lead
to delay under-predictions of 24%. To ensure realistic transistor sizing, COFFE automatically models all
wire loads, without requiring manual layout. These enhanced models have an important architectural
impact: they favor larger transistors in FPGA lookup tables and multiplexers.
In the second part of this thesis, we used COFFE to investigate a number of FPGA circuit design
related questions. First, we re-investigated logic block output pin flexibility (Fcout) as this is an architec-
tural question that hasn’t been fully investigated for single-driver routing architectures and multi-output
BLEs. We found that for an N = 10, K = 6 architecture, an Fcout = 0.025W yields an FPGA with
better area delay product than the Fcout = 0.1W recommended by prior work.
Second, we compared the area, delay and power of transmission gate-based FPGAs to those of
pass-transistor FPGAs in 22nm process technology as pass-transistor performance and reliability have
been degrading with technology scaling. We showed that transmission gate FPGAs consume 15% more
area than pass-transistor FPGAs but are 25%, 16% and 10% faster for 0V, 0.1V, and 0.2V of gate
boosting respectively. In terms of area-delay product, transmission gate FPGAs are 14% better than
pass-transistor FPGAs without gate boosting but 2% worse with 0.2V of gate boosting. Clearly, if gate boosting is not permitted, building FPGAs out of transmission gates is the better choice. However, given enough gate boosting, pass-transistor FPGAs are still more efficient. Even if 0.2V of gate boosting is safe, however, a case can be made for transmission gate FPGAs due to the reliability concerns associated with pass-transistors in advanced process technology, as they incur only a 2% area-delay product increase and a 5% power increase.

Footnote 1: COFFE is available online at: http://www.eecg.utoronto.ca/~vaughn/software.html
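The area-delay tradeoff quoted above can be sanity-checked to first order from the individual area and delay figures (the thesis numbers come from full COFFE runs, so small rounding differences against the quoted 14% and 2% are expected):

```python
# Transmission gate FPGAs: 15% larger, and 25% (no boosting) or
# 10% (0.2V boosting) faster than pass-transistor FPGAs.
area_ratio = 1.15
ad_no_boost = area_ratio * (1 - 0.25)  # ~0.86: about 14% better area-delay
ad_boost_02 = area_ratio * (1 - 0.10)  # ~1.04: slightly worse area-delay
assert abs(ad_no_boost - 0.8625) < 1e-9
assert abs(ad_boost_02 - 1.035) < 1e-9
```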
Third, we explored the idea of using separate VDD and VSRAM+ voltages for low-power FPGAs. We
found that maintaining VSRAM+ at the nominal VDD of 0.8V when lowering VDD to 0.6V to reduce power
helps reduce the delay penalty associated with low-power operation. We also found that transmission
gate FPGAs always have a better power-delay product than pass-transistor FPGAs. Therefore, if low-
VDD operation is desired, transmission gate FPGAs that maintain VSRAM+ at the nominal VDD yield
the best power-delay product.
Finally, we investigated a new architectural question concerning the wire loading at the interface
between routing channels and logic blocks. We found that, at a possible cost in routability, restricting
the portion of a routing channel that can be accessed by a logic block input can reduce delay by up to
17%.
5.2 Future Work
COFFE enables three categories of future work. The first involves using COFFE alongside VPR to
perform architecture exploration in advanced process nodes. Investigations into the best values of architecture parameters such as lookup table size and logic cluster size have been performed before. However, as process technology scales, the characteristics of the underlying circuitry may change,
causing the answer to these architectural questions to change as well. COFFE makes re-visiting these
questions much easier as it provides automated transistor-level design for each architecture. Architectural
extensions to COFFE also fall into this category of future work. For example, COFFE could be extended
to support fracturable LUTs and carry chains.
The second category of future work enabled by COFFE consists of circuit design investigations. In
this thesis, we investigated the area, delay and power impact of replacing pass-transistors in FPGAs with
transmission gates in conventional 22nm process technology. It would be interesting to also investigate
the impact of using FinFETs to build both pass-transistor and transmission gate FPGAs. Future work
could also include circuit topology investigations such as optimal internal buffer placement in LUTs and
efficient multiplexer topologies.
The final category of future work consists of exploring the interactions between architecture and
circuit design. For example, our circuit-level investigation of track-access locality found that reducing the portion of a routing channel that can be accessed by a logic block input is beneficial for delay. However, this could also have an architectural implication: reduced
routability. Therefore, this is an architecture and circuit design interaction that merits a more thorough
investigation through modifications to VPR to examine the impact of track-access locality on routability.
Appendix A
N-well Sharing Sample Layout
Figure A.2 shows how pass-transistor multiplexers such as that of Figure A.1 can be laid out to efficiently
share N-wells. In this sample layout, the PMOS transistors of two multiplexers share an N-well by being
placed in a vertical strip that is two transistors wide. This two-transistor-wide strip of PMOS transistors
can be made taller as needed to add more multiplexers to the layout. Thus, the layout of Figure A.2 is
such that transistors requiring N-well spacing only require it on one side, which effectively reduces the
amount of N-well spacing required by 75% compared to a layout where each transistor is in a separate
well. Note that the layout of Figure A.2 is a simplified representation, as it assumes minimum-width
transistors and does not show metal layout details.
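The 75% figure follows from simple side-counting. The sketch below is an illustrative model (not derived from actual design rules): it assumes a transistor in its own well pays N-well spacing on all four sides, while a transistor in the shared two-wide strip pays it on one side only:

```python
# Per-transistor N-well spacing cost, counted in sides that need spacing.
spacing_sides_isolated = 4  # each transistor in its own separate well
spacing_sides_shared = 1    # shared two-wide strip: spacing on one side only
reduction = 1 - spacing_sides_shared / spacing_sides_isolated
assert reduction == 0.75    # the 75% reduction quoted in the text
```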
Figure A.1: A single-level 4:1 pass-transistor multiplexer with two-stage buffer and level restorer. (Schematic labels: inputs in1-in4, output out, pass transistors q1-q9, internal nodes n1 and n2.)
Figure A.2: Sample multiplexer layout with N-well sharing. (Layout of the pass-transistor 4:1 single-level MUX with two-stage buffer and level-restorer of Figure A.1; the legend marks the N-well, NMOS and PMOS regions; transistors q1-q9 and nodes n1, n2 correspond to Figure A.1.)
Appendix B
FPGA Circuitry Schematics
This appendix gives pass-transistor circuit schematics for all FPGA subcircuits designed in this thesis.
Figure B.1: 6-LUT. (Schematic labels: inputs IN_A-IN_F fed by LUT input drivers; SRAM cells; pass-transistor levels lvl1-lvl6; buffers buf1-buf5.)
Figure B.2: LUT input driver. (Schematic labels: buffers buf1-buf3.)
Figure B.3: LUT input driver with register feedback multiplexer. (Schematic labels: register feedback multiplexer lvl1 selecting between the FF output and the local routing multiplexer output; SRAM cell; level-restorer; buffers buf1-buf5.)
Figure B.4: Two-level multiplexer used for switch block, connection block and local routing multiplexers. (Schematic labels: first- and second-level pass transistors lvl1 and lvl2; SRAM cells; level-restorer; two-stage buffer buf1/buf2; output out.)
Figure B.5: 2:1 multiplexer used for BLE outputs. (Schematic labels: pass transistors lvl1; SRAM cell; level-restorer; two-stage buffer buf1/buf2; output out.)
Figure B.6: Flip-flop with register input selection multiplexer. (Schematic labels: input select MUX lvl1; master-slave register with buffers buf1-buf6, clock transistors clk1/clk2, and set/reset transistors.)
Appendix C
Detailed Transistor Sizing Results
This appendix gives transistor sizes for all subcircuits of our pass-transistor (PT) and transmission gate
(TG) FPGAs for different levels of gate boosting. See Appendix B for corresponding schematics with
transistor labels.
Table C.1: Lookup table transistor sizes.

Type  VG          buf1          lvl1          lvl2          lvl3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   5.5    2      −      3      −      4      −      3
PT    VDD + 0.1   5.5    2      −      3      −      4      −      3
PT    VDD + 0.2   5.5    2      −      3      −      4      −      3
TG    VDD + 0.0   2      2.7    1      1      1      1      1      1
TG    VDD + 0.1   2      2.7    1      1      1      1      1      1
TG    VDD + 0.2   2      2.7    1      1      1      1      1      1

Type  VG          buf2          buf3          lvl4          lvl5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   1      4.7    6.1    2      −      9      −      10
PT    VDD + 0.1   1      4.7    6.1    2      −      9      −      10
PT    VDD + 0.2   1      4.7    6.1    2      −      9      −      10
TG    VDD + 0.0   3.4    2      8.0    7      6      6      4      4
TG    VDD + 0.1   3.4    2      8.0    7      6      6      4      4
TG    VDD + 0.2   3.4    2      8.0    7      6      6      4      4

Type  VG          lvl6          buf4          buf5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      7      3      14.0   9.9    4
PT    VDD + 0.1   −      7      3      13.9   10.0   4
PT    VDD + 0.2   −      7      3      13.3   10.0   4
TG    VDD + 0.0   3      3      7.8    7      9.0    6
TG    VDD + 0.1   3      3      7.1    7      8.5    6
TG    VDD + 0.2   3      3      7.7    7      9.6    7
Table C.2: Switch block multiplexer transistor sizes.

Type  VG          lvl1          lvl2          buf1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      3      −      3      3      3.2    20.7   11
PT    VDD + 0.1   −      2      −      2      7.0    3      31.6   12
PT    VDD + 0.2   −      2      −      2      12.6   3      37.5   14
TG    VDD + 0.0   1      1      1      1      3      4.3    35.7   21
TG    VDD + 0.1   1      1      1      1      4      4.3    39.9   19
TG    VDD + 0.2   1      1      1      1      5.6    4      44.9   19
Table C.3: Connection block multiplexer transistor sizes.

Type  VG          lvl1          lvl2          buf1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      2      −      3      3      3.5    17.8   4
PT    VDD + 0.1   −      2      −      3      4.9    2      16.0   3
PT    VDD + 0.2   −      2      −      3      10.6   2      16.4   3
TG    VDD + 0.0   1      1      1      1      3      4.6    14.0   6
TG    VDD + 0.1   1      1      1      1      3      3.1    14.1   5
TG    VDD + 0.2   1      1      1      1      3.2    2      16.2   5
Table C.4: Local routing multiplexer transistor sizes. Note: we do not give a size for buf2 of the local routing multiplexer, as it is replaced by the LUT input driver of Figure B.2.

Type  VG          lvl1          lvl2          buf1
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      2      −      3      2      3.9
PT    VDD + 0.1   −      2      −      3      4.0    2
PT    VDD + 0.2   −      2      −      2      8.2    2
TG    VDD + 0.0   1      1      1      1      1      1.1
TG    VDD + 0.1   1      1      1      1      2      2.2
TG    VDD + 0.2   1      1      1      1      2.5    2
Table C.5: BLE output to local interconnect.

Type  VG          lvl1          buf1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      3      2      4.6    16.6   4
PT    VDD + 0.1   −      2      2.6    2      10.0   3
PT    VDD + 0.2   −      2      4.2    2      12.6   4
TG    VDD + 0.0   1      1      1.6    1      6.8    4
TG    VDD + 0.1   1      1      2      2.4    9.0    5
TG    VDD + 0.2   1      1      1.5    1      7.2    4
Table C.6: BLE output to general routing.

Type  VG          lvl1          buf1          buf2 (a)
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      3      3      6.3    32.6   20
PT    VDD + 0.1   −      3      5.0    5      39.1   23
PT    VDD + 0.2   −      2      5.1    4      39.9   24
TG    VDD + 0.0   1      1      4.4    3      41.4   27
TG    VDD + 0.1   1      1      4      4.8    41.5   28
TG    VDD + 0.2   1      1      4.2    4      41.9   28

(a) These transistor sizes were generated with an older version of COFFE. COFFE currently sizes the BLE output driver (buf2) smaller than shown in this table. For example, for a pass-transistor FPGA with VG = VDD + 0.2V, buf2 has NMOS = 4 and PMOS = 9.9.
Table C.7: Flip-flop and register selection multiplexer transistor sizes.

Type  VG          lvl1          buf1          clk1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   −      5      3.5    3      1      1      4      3
PT    VDD + 0.1   −      5      8.2    3      1      1      4      3
PT    VDD + 0.2   −      5      12.6   3      1      1      4      3
TG    VDD + 0.0   2      2      3      3.2    1      1      4      3
TG    VDD + 0.1   2      2      3      3.7    1      1      4      3
TG    VDD + 0.2   2      2      3.0    3      1      1      4      3

Type  VG          buf3          clk2          buf4          buf5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   1.3    1      1      1      1.3    1      1.3    1
PT    VDD + 0.1   1.3    1      1      1      1.3    1      1.3    1
PT    VDD + 0.2   1.3    1      1      1      1.3    1      1.3    1
TG    VDD + 0.0   1.3    1      1      1      1.3    1      1.3    1
TG    VDD + 0.1   1.3    1      1      1      1.3    1      1.3    1
TG    VDD + 0.2   1.3    1      1      1      1.3    1      1.3    1

Type  VG          buf6          set           reset
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   9.4    4      1      1      1      1
PT    VDD + 0.1   9.7    4      1      1      1      1
PT    VDD + 0.2   9.0    4      1      1      1      1
TG    VDD + 0.0   8.0    4      1      1      1      1
TG    VDD + 0.1   7.8    5      1      1      1      1
TG    VDD + 0.2   7.9    5      1      1      1      1
Table C.8: LUT input driver A.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   6.6    3      2.0    1      3      3.3
PT    VDD + 0.1   6.6    3      2.0    1      3      3.3
PT    VDD + 0.2   7.8    4      2.0    1      4      4.0
TG    VDD + 0.0   5.8    4      1.5    1      5.9    4
TG    VDD + 0.1   5.8    4      1.5    1      5.9    4
TG    VDD + 0.2   5.8    4      1.5    1      5.9    4
Table C.9: LUT input driver B.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.5    2      1.9    1      2      2.1
PT    VDD + 0.1   4.5    2      1.9    1      2      2.1
PT    VDD + 0.2   8.1    4      2.0    1      5.0    4
TG    VDD + 0.0   4.4    3      1.6    1      4.8    3
TG    VDD + 0.1   4.4    3      1.6    1      4.8    3
TG    VDD + 0.2   4.4    3      1.6    1      4.8    3
Table C.10: LUT input driver C with register feedback multiplexer (Figure B.3).

Type  VG          buf1          lvl1          buf2
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   8.6    4      −      3      2      6.3
PT    VDD + 0.1   7.0    3      −      2      2      2.4
PT    VDD + 0.2   7.9    3      −      2      2.2    2
TG    VDD + 0.0   6.1    4      1      1      1.2    1
TG    VDD + 0.1   7.2    4      1      1      2      2.11
TG    VDD + 0.2   8.2    4      1      1      2.2    2

Type  VG          buf3          buf4          buf5
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.0    2      1.8    1      2.1    2
PT    VDD + 0.1   3.6    2      2.2    1      2.2    2
PT    VDD + 0.2   3.6    2      2.2    1      2.2    2
TG    VDD + 0.0   2.8    2      1.5    1      3.3    2
TG    VDD + 0.1   3.0    2      1.4    1      2.3    2
TG    VDD + 0.2   3.0    2      1.5    1      3.3    2
Table C.11: LUT input driver D.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.7    2      1.9    1      2      2.2
PT    VDD + 0.1   4.7    2      1.9    1      2      2.2
PT    VDD + 0.2   8.1    4      2.0    1      5.0    4
TG    VDD + 0.0   5.8    4      1.5    1      6.2    4
TG    VDD + 0.1   5.8    4      1.5    1      6.2    4
TG    VDD + 0.2   5.8    4      1.5    1      6.2    4
Table C.12: LUT input driver E.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.7    2      1.9    1      2      2.4
PT    VDD + 0.1   4.7    2      1.9    1      2      2.4
PT    VDD + 0.2   6.1    3      2.0    1      3.6    3
TG    VDD + 0.0   4.6    3      1.6    1      4.9    3
TG    VDD + 0.1   4.6    3      1.6    1      4.9    3
TG    VDD + 0.2   4.6    3      1.6    1      4.9    3
Table C.13: LUT input driver F.

Type  VG          buf1          buf2          buf3
                  PMOS   NMOS   PMOS   NMOS   PMOS   NMOS
PT    VDD + 0.0   4.3    2      1.9    1      2      2.2
PT    VDD + 0.1   4.3    2      1.9    1      2      2.2
PT    VDD + 0.2   6.2    3      1.9    1      3.6    3
TG    VDD + 0.0   4.7    3      1.5    1      4.9    3
TG    VDD + 0.1   4.7    3      1.5    1      4.9    3
TG    VDD + 0.2   4.7    3      1.5    1      4.9    3
Appendix D
Area and Delay Breakdown
Table D.1: Tile area breakdown.

Type  VG          SB MUX   CB MUX   Local MUX   LUT     FF     BLE Output
PT    VDD + 0.0   31.5%    22.0%    16.4%       26.9%   1.1%   2.1%
PT    VDD + 0.1   31.3%    22.0%    16.4%       26.9%   1.2%   2.2%
PT    VDD + 0.2   32.1%    21.8%    16.1%       26.8%   1.1%   2.1%
TG    VDD + 0.0   31.5%    24.3%    17.2%       24.1%   1.0%   1.9%
TG    VDD + 0.1   31.7%    24.2%    17.2%       24.0%   1.0%   1.9%
TG    VDD + 0.2   32.0%    24.1%    17.2%       23.8%   1.0%   1.9%
Table D.2: Critical path delay breakdown.

Type  VG          SB MUX   CB MUX   Local MUX   LUT     FF     BLE Output   Other
PT    VDD + 0.0   37.1%    18.4%    17.0%       20.0%   0.4%   5.1%         1.9%
PT    VDD + 0.1   36.7%    14.5%    16.2%       25.4%   0.4%   4.7%         2.0%
PT    VDD + 0.2   36.2%    12.9%    13.4%       29.4%   0.5%   4.8%         2.8%
TG    VDD + 0.0   41.9%    16.2%    14.8%       20.5%   0.4%   3.8%         2.3%
TG    VDD + 0.1   41.5%    15.4%    12.8%       22.7%   0.5%   4.2%         3.0%
TG    VDD + 0.2   40.1%    13.7%    12.5%       26.1%   0.5%   3.9%         3.3%