WArPE 1.0
Wisconsin Architecture Power Estimator
MICRO ARCHITECTURAL POWER ESTIMATION TOOL
1 Introduction
2 WArPE Processor Model
2.1 Microarchitecture
2.1.1 Instruction Fetch
2.1.2 Instruction Decode / Dispatch Stage
2.1.3 Instruction Execution and Writeback
3 Analytical Models
3.1 Power Density Model
3.2 Analytical RAM Model
3.2.1 Decoder Buffer
3.2.2 Decoder
3.2.3 Wordline
3.2.4 Bitline
3.2.5 Sense Amplifier
3.2.6 Output driver
3.2.7 Generic mux
3.2.8 Comparator
3.3 Latch Model
3.4 Special Model for Issue Window
4 Options, Configuration, Output
4.1 Options
4.2 Configuration files
4.2.1 Basic configuration file
4.2.2 Process Technology Data File
4.3 Output file
5 File Structure
5.1.1 power.h
5.1.2 power.c
5.1.3 anal.h
5.1.4 anal.c
5.1.5 sim-outorder.c, main.c
5.2 Control Flow
6 References
Appendix
Index
Table of figures
Figure 1: Micro architecture of a simple superscalar processor
Figure 2: Table of all the activity counts associated with the fetch stage
Figure 3: Activity Counters associated with the decode/dispatch stage
Figure 4: Instruction Issue Window
Figure 5: Activity counters associated with the execution and the writeback stage
Figure 6: Decoder Buffer
Figure 7: Static decoder schematic
Figure 8: Circuits used in the two stages
Figure 9: Dynamic decoder
Figure 10: Word line
Figure 11: Bitline
Figure 12: Sense Amplifier architecture
Figure 13: Sense Amplifier circuit
Figure 14: Output driver
Figure 15: n-bit comparator
Figure 17: A Pipeline Latch
Figure 18: Instruction Issue Window
Figure 19: Basic Configuration File
Figure 20: Technology File
Figure 21: Output File
1 Introduction
Power consumption and dissipation have become critical design considerations in
modern microprocessors. For battery powered devices, such as laptop PCs and PDAs,
total power consumption is the major issue. For high performance applications such as
servers, the need to dissipate high power requires expensive packaging and cooling
technologies. Furthermore, in large-scale systems, power consumption can be a major
operating expense.
Microprocessors can be made more power efficient at a number of levels, ranging from
the circuit level, to the gate level, all the way up to software. Our particular interest is in
improving power efficiency at the microarchitecture level. For studying and developing
power efficient microarchitectures, power estimation tools are almost essential, and an
important part of our research effort has been the development of a flexible and accurate
power estimation tool, WArPE.
WArPE uses detailed microarchitecture simulation to measure energy-consuming
activities and execution time. These simulation-derived measurements can then be turned
into power estimates, given energy estimates for each of the activities. WArPE is based
on the simplescalar simulator [1], a performance simulator widely used among academic
researchers. An important element of power estimation is the energy consumed by each
of the modeled microarchitecture-level activities. In WArPE, these energy estimates can
be supplied directly by the user as empirical data, or for many important subsystems they
can be generated via analytical models that are part of WArPE.
Other power estimation tools based on the simplescalar simulator have been developed
[3,4]. WArPE is distinguished from these other estimators in two ways:
1) It can take chip technology data as an input and scale energy numbers
appropriately.
2) The instruction fetch, decode, rename and issue pipeline is modeled in detail,
including latches.
This document describes the internal structure and usage of the WArPE tool. Section 2
describes the detailed structure of the simulator, including estimation methodology.
Section 3 describes the analytical models used, and Section 4 covers the
options, configuration files and output file details. Section 5 discusses the file structure
of the simulator.
2 WArPE Processor Model
WArPE models a modern dynamically scheduled superscalar processor. The processor is
divided into a number of functional unit blocks (FUBs). The processor is simulated in
much the same way as a performance simulator. At the end of each cycle, the estimator
determines the activity for each FUB, and uses this activity to estimate energy consumed
by that block. The total energy consumed by all the FUBs during each cycle yields an
instantaneous power estimate, and the average over all the cycles gives an average power
estimate. The instantaneous power is useful when di/dt is of concern; it can be estimated
by computing the difference in power consumption between consecutive cycles.
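The per-cycle bookkeeping described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not WArPE's actual code: the type `power_trace` and the functions `cycle_power`, `trace_add` and `trace_average` are hypothetical names, and energies are assumed to be in joules with the cycle time in seconds.

```c
#include <stddef.h>

/* Hypothetical per-run accumulator; WArPE's own data structures may differ. */
typedef struct {
    double prev_power;     /* power of the previous cycle, in watts */
    double sum_power;      /* running sum of per-cycle power */
    double max_delta;      /* largest cycle-to-cycle power change seen */
    unsigned long cycles;  /* number of cycles accumulated so far */
} power_trace;

/* Sum the energy (joules) of all FUBs for one cycle and convert it to power. */
double cycle_power(const double *fub_energy, size_t n_fubs, double cycle_time)
{
    double total = 0.0;
    for (size_t i = 0; i < n_fubs; i++)
        total += fub_energy[i];
    return total / cycle_time;
}

/* Record one cycle; returns the power change relative to the previous cycle,
   which serves as the di/dt indicator described in the text. */
double trace_add(power_trace *t, double power)
{
    double delta = (t->cycles == 0) ? 0.0 : power - t->prev_power;
    double mag = delta < 0.0 ? -delta : delta;
    if (mag > t->max_delta)
        t->max_delta = mag;
    t->prev_power = power;
    t->sum_power += power;
    t->cycles++;
    return delta;
}

/* Average power over all recorded cycles. */
double trace_average(const power_trace *t)
{
    return t->cycles ? t->sum_power / (double)t->cycles : 0.0;
}
```

A caller would zero-initialize a `power_trace`, call `cycle_power` and `trace_add` once per simulated cycle, and read `trace_average` and `max_delta` when the simulation ends.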
The per-activity energy estimates are determined before the simulation starts, in one of
the following ways:
1) A general analytical model for RAM FUBs
2) A power density model for non-RAM FUBs
3) Latch models (primarily in the instruction pipeline)
4) Special models for critical FUBs such as the issue window
The following sections describe the overall superscalar microarchitecture, including the
specific FUBs that are modeled. This is followed by descriptions of the RAM and power
density analytical models. The latch models are described along with the instruction
pipeline, and special models are described where the specific FUB is discussed.
2.1 Microarchitecture
In this section we touch upon some of the details of how the individual instruction
pipeline units are modeled. The generic microarchitecture of a pipelined superscalar
processor is shown in Figure 1.
Figure 1: Micro architecture of a simple superscalar processor
The associated units include the branch prediction tables, instruction translation
lookaside buffer, data caches, data translation lookaside buffer, reorder buffer, register
file, result bus, etc. For most of these we have an approximate analytical model. There is no
analytical model for the latches. We now describe some details of the power models of
each pipeline stage.
2.1.1 Instruction Fetch
The instruction fetch stage involves access to the instruction cache and the ITLB as well as
the branch prediction logic. The FUBs representing this stage include those for the new PC
generation logic (npc), the logic associated with the branch target buffer access (btblog), the
actual branch target buffer RAM structure (btbcac), the return stack buffer (rsbcac), three
FUBs for the L1 instruction cache (one for the logic circuits that access the cache (il1log),
one for the L1 tag structure (il1tag) and one for the L1 instruction cache array itself
(il1cac)), and the latches at the end of the pipeline stage (fdlatch). WArPE has analytical
models for almost all of these FUBs. Because most of these structures are cache-like or
CAM-like, they have invalidate, replacement, write-back, read and write counters associated
with them. Figure 2 lists all the counters associated with this stage of execution.
Counter No.  Name of the counter  Description
0            Brupdate             branch update activity
1            Brlookup             branch lookup activity
2            Rsbpop               return stack pop activity
3            Rsbpush              return stack push activity
4            Il1acc               il1 access activity
5            Il1wbk               il1 writebacks activity
6            Il1rep               il1 replacements activity
7            Il1inv               il1 invalidations activity
12           Il2acc               il2 access activity
13           Il2wbk               il2 writebacks activity
14           Il2rep               il2 replacements activity
15           Il2inv               il2 invalidations activity
24           Itlbmis              itlb miss activity
27           Itlbacc              itlb access activity
28           Itlbwbk              itlb writebacks activity
29           Itlbrep              itlb replacements activity
30           Itlbinv              itlb invalidations activity
35           Npc                  next pc logic activity
69           Fdlatch_active       Latch after fetch stage active
70           Fdlatch_stall        Latch after fetch stage stalled
71           Fdlatch_empty        Latch after fetch stage empty
Figure 2. Table of all the activity counts associated with the fetch stage
To build power numbers for these structures, we map each table to an approximate cache
structure. The CACTI tools, which are used by almost all the existing simulators, do this
mapping for us. CACTI finds an optimal cache organization for each table from parameters
such as the cache size, associativity and number of sets, assuming that some cache
optimizations have been done at the circuit level. The numbers of row and column decoders
are calculated from this mapping. The power models for the caches and the decoders are the
same as suggested by Wilton and Jouppi [2]. Currently, there are no analytical models for
the write-back, replacement or invalidation logic circuits, but the simulator maintains a
count of these activities. To calculate the power, we multiply the activity counts by
approximate power numbers obtained from industry. However, the user can input any
numbers and hence customize the simulator.
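The count-times-constant calculation described above can be sketched as follows. This is only an illustration: the enum names mirror the first few counters of Figure 2, but the identifiers, the function `activity_energy` and the idea of storing one energy value per event are assumptions, not WArPE's actual interface.

```c
#include <stddef.h>

/* A few of the fetch-stage counters, numbered as in Figure 2.
   The identifier names are hypothetical. */
enum { BRUPDATE = 0, BRLOOKUP = 1, RSBPOP = 2, RSBPUSH = 3, IL1ACC = 4,
       N_COUNTERS = 5 };

/* Energy for one cycle: sum over activities of (event count * energy per event).
   The per-event values are user-supplied constants in this sketch. */
double activity_energy(const unsigned *count, const double *energy_per_event,
                       size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += (double)count[i] * energy_per_event[i];
    return total;
}
```

Because the constants are plain table entries, replacing the industry-derived numbers with user data only means filling `energy_per_event` from a different source.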
At the end of the fetch stage is a set of pipeline latches, which may be of variable width.
These latches may be in the Active, Stalled or Empty state, with each state consuming a
different amount of energy. The simulator keeps an account of the number of latches in
each state per cycle. This gives the power consumed each cycle by the latches. More
detail on the latch power model follows in Section 3.3.
2.1.2 Instruction Decode / Dispatch Stage
The decode stage comprises the decoders as well as the register aliasing table associated
with the register renaming logic. These units are represented in the simulator with FUBs for
the dispatch queue (dispatchq), the instruction decoder (decodepla), the decoder logic for
handling mispredictions (decodemisp), the decoder stall logic (decodestall), the register
aliasing table (ratarr), the input/output dependence checks (ratidep, ratodep), the register
aliasing table stall (ratstall) and the latches at the end of the pipe stage (dilatch). There
are counters associated with decoder stall and mispredict activity as well as with the decoder
access itself. The register aliasing table has counters associated with the table itself as
well as with input and output dependence checking activity. A list of all the counters is
given in Figure 3. Presently, we have an analytical model only for the register aliasing table
cache. The rest of the activity counters are multiplied by the power numbers obtained from
the user input file (pfa mode).
Counter No.  Name of the counter  Description of the counter
36           Dispatchqrd          dispatchq read activity
37           Dispatchqwr          dispatchq write activity
38           Dispatchqrel         dispatchq release activity
39           Dispatchqrec         dispatchq recover activity
40           Decoder              decoder activity
41           Decodemispchk        decoder mispredict detect activity
42           Decodemisp           decoder mispredict correction activity
43           Decodestallchk       decoder stall detect activity
44           Decodestall          decoder stall block activity
45           Ratidep              rat idep allocation activity
46           Ratodep              rat odep allocation activity
47           Ratstallchk          rat stall detection activity
48           Ratstall             rat stall block activity
72           Dilatch_active       Latch after decode stage active
73           Dilatch_stall        Latch after decode stage stall
74           Dilatch_empty        Latch after decode stage empty
Figure 3: Activity Counters associated with the decode/dispatch stage
The decoded instructions are moved into another set of latches, which again may vary in
size and in number. These latches may model the delay associated with the renaming logic or
the actual decoding of the instruction. As before, the latches can be in one of three states
(Active, Stalled or Empty), with power numbers that may be the same as for the previous
latches. We maintain a per-cycle record of the state of the latches (Dilatch_active,
Dilatch_stall, Dilatch_empty) and calculate the per-cycle contribution to total power.
Figure 4: Instruction Issue Window
Another innovative idea in this power simulator is the treatment of the issue window. The
simulator models both collapsible and non-collapsible instruction issue windows with the
same FUB: isw. Collapsing the instruction window consumes some power. The simulator has a
counter to record these movements per cycle (Iswcolmoved), and the user can supply the power
associated with them. The issue window can also be viewed as a set of fixed-length latches
with the same three states as before. The Active state (Iswact) now corresponds to the number
of instructions ready to be issued that cycle, while the Stalled state (Iswstall) corresponds
to instructions that are still waiting for their operands to become ready. The Empty state
(Iswempty) represents the unoccupied slots of the issue window each cycle. A detailed power
model for the issue window is explained in Section 3.4.
2.1.3 Instruction Execution and Writeback
The selected instructions are then issued to the corresponding functional units or are
stored in the load/store queues. The FUBs for this stage include those for the integer
functional units (fuint), floating point functional units (fufp), the L1 data cache logic
circuit (dl1log), L1 data cache tag structure (dl1tag) and L1 data cache (dl1cac), the
corresponding FUBs for the unified L2 cache (ul2log, ul2tag, ul2cac), the load/store queue
(lsqrdyq) and the data TLB (dtlbcac). The simulator does not have an analytical model for any
of the functional units, but the load/store queues can be modeled as a pair of cache-like
structures along with a CAM-like structure, with analytical models for both. Another
structure associated with the execution stage is the data cache. The simulator models the
data cache along the same lines as the instruction cache, using the CACTI tools. There are
counters for data cache access (dl2acc), write-back (dl2wbk), replacement (dl2rep) and
invalidation (dl2inv). The data TLB is modeled along the lines of the instruction TLB and
hence has the CAM-like analytical model. The results generated by the functional units are
broadcast through the result bus, but the current version of the simulator does not
calculate the power consumed by this result bus.
All the activities associated with the initialization and the utilization of the register
update unit are represented with the FUBs for the RUU array (ruuarr) and the RUU writeback
(ruuwb). A complete list of all the FUBs and counters is included in the appendix to this
manual. The counters associated with this stage are as follows:
Counter No.  Name of the counter  Description of the counter
8            Dl1acc               dl1 access activity
9            Dl1wbk               dl1 writebacks activity
10           Dl1rep               dl1 replacements activity
11           Dl1inv               dl1 invalidations activity
16           Dl2acc               dl2 access activity
17           Dl2wbk               dl2 writebacks activity
18           Dl2rep               dl2 replacements activity
19           Dl2inv               dl2 invalidations activity
20           Ul2acc               ul2 access activity
21           Ul2wbk               ul2 writebacks activity
22           Ul2rep               ul2 replacements activity
23           Ul2inv               ul2 invalidations activity
25           Dtlbmis              dtlb miss activity
26           Ul2mis               ul2 miss activity
31           Dtlbacc              dtlb access activity
32           Dtlbwbk              dtlb writebacks activity
33           Dtlbrep              dtlb replacements activity
34           Dtlbinv              dtlb invalidations activity
45           Ratidep              rat idep allocation activity
46           Ratodep              rat odep allocation activity
47           Ratstallchk          rat stall detection activity
48           Ratstall             rat stall block activity
49           Ruuarr               ruu array activity
50           Ruurdyqsch           ruu readyq allocation activity
51           Ruurec               ruu recover activity
52           Ruuret               ruu retire activity
53           Ruurdyqcam           ruu readyq dependence check activity
54           Ruurdyqrel           ruu readyq resource release activity
55           Lsqarr               lsq array activity
56           Lsqrdyqsch           lsq readyq allocation activity
57           Lsqrec               lsq recover activity
58           Lsqret               lsq retire activity
59           Lsqrdyqcam           lsq readyq dependence check activity
60           Lsqrdyqrel           lsq readyq resource release activity
61           Ruuarb               ruu arbitration activity
62           Ruuwb                ruu writeback scheduler activity
63           Ruuwbq               ruu writebackq activity
64           Lsqarb               lsq arbitration activity
65           Lsqwb                lsq writeback scheduler activity
66           Lsqwbq               lsq writebackq activity
67           Fuint                functional unit integer
68           Fufp                 functional unit floating point
Figure 5: Activity counters associated with the execution and the writeback stage
3 Analytical Models
Architectural power estimation methodologies can be broadly classified into
empirical methods and analytical methods. These can further be classified into
fixed-activity and activity-sensitive methods. One of the earliest methods of power
estimation was a fixed-activity method called the Power Factor Approximation (PFA) method,
described by Liu and Svensson [5]. Power estimation techniques have come a long way
since then, with activity-based models, transition-sensitive models and so on. The basic
estimation methodology is, however, the same: we either calculate the power density
constants associated with each structure (the analytical model) or take the power
constants as input from the user (the pfa model).
3.1 Power Density Model
Several architectural power estimation schemes have been discussed in literature [6][7].
In WArPE we use a scheme similar to Power Factor Approximation (PFA) [5]. We
express the power dissipation in terms of the active/inactive power density of each FUB,
the area of the FUB and the activity factor, which is determined via performance
simulation.
power = {(active power density)*(activity) + (inactive power density)*(1–activity)}*area
The power density and area numbers are either determined empirically from the real
design and scaled to the required technology or are estimated by considering circuit
complexity, logic styles, etc. The power density numbers are further divided based on the
following circuit styles:
Dynamic logic
Static logic
PLA circuits
Memory type regular circuits
Clock circuits
Thus for every FUB, one has to define 5*3 = 15 different numbers, corresponding to
active power density, inactive power density and area for each of the five circuit styles.
The user can supply these through the configuration file. However, it is not always possible
to obtain or estimate these numbers. To overcome this problem we have included
routines that model FUBs analytically. Presently, we can construct models for
most regular memory type structures like caches, register files, register renaming tables,
branch target buffers and reorder buffers. The simulator is designed in such a way that
models can be updated and new models can be added relatively easily.
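As an illustration of the power density model, the sketch below evaluates the formula of this section for one FUB, using the 15 numbers (active density, inactive density and area for each of the five circuit styles). The names `style_params` and `fub_power` are hypothetical, and applying a single activity factor to all five styles and summing the results is an assumption made for this example.

```c
/* Circuit styles listed in Section 3.1 (identifier names are hypothetical). */
enum { STYLE_DYNAMIC, STYLE_STATIC, STYLE_PLA, STYLE_MEMORY, STYLE_CLOCK,
       N_STYLES };

typedef struct {
    double pd_active;    /* active power density, W/m^2 */
    double pd_inactive;  /* inactive power density, W/m^2 */
    double area;         /* area occupied by this circuit style, m^2 */
} style_params;

/* power = (pd_active*activity + pd_inactive*(1 - activity)) * area,
   evaluated per circuit style and summed over the five styles.
   'activity' is the activity factor in [0, 1] from performance simulation. */
double fub_power(const style_params p[N_STYLES], double activity)
{
    double total = 0.0;
    for (int s = 0; s < N_STYLES; s++)
        total += (p[s].pd_active * activity +
                  p[s].pd_inactive * (1.0 - activity)) * p[s].area;
    return total;
}
```

For example, a FUB containing only static logic with an active density of 100 W/m^2, an inactive density of 20 W/m^2 and an area of 0.5 m^2 (deliberately unrealistic round numbers) gives (100*0.25 + 20*0.75)*0.5 = 20 W at an activity of 0.25.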
In order to take physical structure into consideration, a few more options have been
added. The analytical models can, and in fact will have to, be refined continuously to
improve the accuracy of the results. Models for other regular structures like PLAs can also
be added.
3.2 Analytical RAM Model
In the analytical mode, power constants are generated using the analytical models provided.
Presently, we have the capability to model most of the regular and simple logic based
structures. The idea is to break FUBs into smaller components for which analytical models
are present. The models are based on a circuit time-delay-energy simulation model similar
to that used by Wilton and Jouppi [2]. Some of the differences include a choice of static
vs. dynamic logic for the decoder and a single-ended read option for register files. These
models can be used to construct power constants for FUBs that contain regular, memory-type
building blocks. The FUBs that have already been modeled are the instruction and data
caches, TLBs, branch target cache, register allocation table and return address stack.
Other units that can be modeled are the register update unit and load/store queue arrays.
For example, a cache can be divided into a decoder buffer, row decoder, wordlines,
bitlines, sense amplifiers, column decoder and output MUXes. The models generate power
numbers by calculating the effective switching capacitance. The effective capacitance is
estimated by adding the gate, drain and routing capacitances together. These are
calculated by functions that take the width and length of the poly used as inputs. The
length of all transistors is assumed to be constant and equal to the Leff defined in the
technology file. The list of these functions (included in anal.c) follows.
gatecap(): returns the gate capacitance of the transistor.
gatecappass(): returns the gate capacitance for a pass transistor.
draincapp(): returns the drain capacitance for a p-type transistor. It has the added
feature of optimizing for stacked transistors, for example the n-type
transistors in a 4-input NAND.
draincapn(): similar function for an n-type transistor.
The following sections describe each of the basic models provided. An example of the
usage of these models to create more complex models will be given in the last chapter.
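To make the capacitance-to-energy step concrete, here is a minimal sketch. It does not call gatecap() or the other anal.c functions, since their signatures are not given in this manual; the struct `node_caps` and both functions are hypothetical. It relies only on the standard result that charging a capacitance C from 0 to Vdd draws C*Vdd^2 from the supply. Whether WArPE charges C*Vdd^2 or half of it per transition is not stated here, so treat the factor as an assumption.

```c
/* Capacitances of one switching node, in farads (illustrative inputs only). */
typedef struct {
    double gate;     /* gate capacitance of the driven transistors */
    double drain;    /* drain capacitance of the driving transistors */
    double routing;  /* wire capacitance */
} node_caps;

/* Effective switching capacitance: gate + drain + routing, as in the text. */
double effective_cap(node_caps c)
{
    return c.gate + c.drain + c.routing;
}

/* Energy (joules) drawn from the supply when the node charges from 0 to Vdd. */
double switching_energy(node_caps c, double vdd)
{
    return effective_cap(c) * vdd * vdd;
}
```

With 1 fF of gate, 2 fF of drain and 1 fF of routing capacitance at Vdd = 5 V, the effective capacitance is 4 fF and the energy per charging event is 4e-15 * 25 = 1e-13 J.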
3.2.1 Decoder Buffer
The decoder buffer, as the name suggests, buffers the address lines that go into the
decoders. The buffer is an important element if the address lines feed into a large number
of gates. Presently, the sizes of the buffer transistors are fixed. These could be changed
depending on the number of gates connected to the lines and the speed required. The
following figure shows the buffer architecture.
Figure 6: Decoder Buffer (one single buffer per line; ADDR BITS * 2 lines, BIT and NBIT)
3.2.2 Decoder
Two types of decoder models have been included, depending on the type of circuits they
use. The first one is a static decoder that is based on a two level decoding scheme. The
first stage is constructed from 3x8 and 2x4 NAND based decoders. The second stage
consists of an n-input NOR for every output bit, where n is the number of minterms in
stage 1. The following schematic brings out the basic architecture of this decoder.
Figure 7: Static decoder schematic (stage 1: N 3x8 and 2x4 decoders using NAND gates; stage 2: N-input NOR gate)
Figure 8: Circuits used in the two stages (structure of a decoder, e.g. a 2x4 decoder built from four 2-input NAND gates driven by BIT and NBIT; second stage NOR gate, e.g. a 4-input NOR)
The second type of decoder is the dynamic decoder, which is based on a domino NOR.
However, this decoder should be limited to about six inputs. The
following figure shows a schematic of the dynamic decoder.
Figure 9: Dynamic decoder.
3.2.3 Wordline
The wordline power model includes both the wordline as well as the wordline driver. The
driver size is computed using a function called WLdriver_size(). The inputs to this
function are the capacitance driven and the rise-time expected. The rise-time has been
assumed to be period/8 due to lack of data. This can be changed by changing the entry in
tech.h. The model also takes into account single ended read type cells, used in register
files. A schematic of the wordline is shown below.
Figure 10: Word line.
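The rise-time assumption used for sizing the wordline driver is simple arithmetic, sketched below. The function names are hypothetical, and the clock frequency is assumed to be given in MHz, as in the technology file; this does not reproduce WLdriver_size() itself.

```c
/* Clock period in seconds for a frequency given in MHz. */
double clock_period(double f_mhz)
{
    return 1.0 / (f_mhz * 1e6);
}

/* Wordline rise time assumed by the model: one eighth of the clock period. */
double wordline_rise_time(double f_mhz)
{
    return clock_period(f_mhz) / 8.0;
}
```

At 100 MHz the period is 10 ns, so the assumed rise time is 1.25 ns.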
3.2.4 Bitline
The bitline model takes into account the precharge transistors, line capacitance and
isolation transistors. Several minute features have been added and detailed comments in
the code explain these. The basic schematic of the bitline is shown below.
Figure 11: Bitline (labels: precharge, equalizer, rows, columns, isolation pass gate)
3.2.5 Sense Amplifier
The sense amplifier is shared by many bitlines using a column MUX. However, one
should not multiplex more than eight bitlines together due to leakage issues. The MUX is
a standard pass-gate based MUX with a column decoder. The basic architecture and the
sense amplifier circuit used are shown below.
Figure 12: Sense Amplifier architecture (column MUXes feeding a shared sense amplifier)
Figure 13: Sense Amplifier circuit (inputs BIT and BITN, output OUT)
3.2.6 Output driver
The output driver uses an array of tri-state drivers like the one shown in the schematic
below.
Figure 14: Output driver (inputs: sel and sense amp out; output: out)
3.2.7 Generic mux
This is a standard pass-gate based MUX. The only specifications required are the number
of inputs to be multiplexed into one bit and the number of output bits. The generic MUX,
as the name suggests, can be used to model a general MUX.
3.2.8 Comparator
The comparator design is shown in Fig. 15.
Figure 15: n-bit comparator (per-bit inputs a0/na0 and b0/nb0, precharged output out, repeated for the number of bits to compare)
3.3 Latch Model
At the end of the fetch stage are the pipeline latches associated with that stage. These
pipeline latches model the delay incurred in moving an instruction from the fetch stage to
the decode stage. This delay could be due to the BTB lookup or to obtaining the branch
prediction. The latches can be of variable size, and the number of latches also varies
depending upon the delay to be modeled. The latch size varies because some information may
be added at a later latch in the pipeline. At any time a latch can be in one of three states:
Active: a new instruction was moved into the latch that cycle.
Stalled: the latch is still holding the instruction that it held in the previous cycle.
Empty: the latch is not storing anything that cycle.
The power associated with each of these states is different and is read from the input file.
Figure 16: Simple Architecture along with the Pipeline latches
This breakdown of energy-consuming activity allows for a form of clock gating where
active instructions may consume more energy than stalled instructions, and where valid
instructions may consume more energy than invalid ones (i.e. empty pipeline slots). For
example, consider the logic shown in Figure 17. Here, a typical pipeline latch is shown,
as might appear in the decode pipeline. An input multiplexor (typically built into the
latch) is used to "recirculate" latched pipeline values when the hold signal is active. In
addition, the valid bit from the preceding stage is used to gate the latch itself; if there is
no valid data being fed into the latch, then the latch is not clocked.
Figure 17: A Pipeline Latch
A Valid Bit from the previous stage is used to gate the clock signal. A hold signal from the succeeding stage is used to switch the
multiplexor and recirculate data being stalled.
In this system, a certain amount of energy is consumed if an instruction moves up the
pipeline (the hold signal is inactive) and is latched into the next stage. A different (lower)
amount is consumed if the hold signal is active, the multiplexor feeds the same data back
into the latch and the latch is clocked, but the logic following the latch does not see any
of its inputs change. Finally, a different (still lower) amount of energy is consumed if the
valid signal is off, and the latch is not clocked at all. Similarly, in the issue queue, a
particular issue queue slot may consume different amounts of energy depending on
whether or not it holds an active instruction and whether or not the instruction actually
issues. The pipeline latches were taken from a high-end design environment. A 2-to-1
static mux was used to re-circulate the data when stalled. Each cycle the simulator
maintains an account of latches in various states and the total power the latches would
consume each cycle. This is one of the innovative ideas in this simulator.
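The three-state latch accounting described in this section can be sketched as follows. All names are hypothetical. In particular, the mapping from the valid and hold signals to a state, with the valid bit checked first, is an assumption based on the description of Figure 17 rather than on WArPE's source.

```c
typedef enum { LATCH_ACTIVE, LATCH_STALLED, LATCH_EMPTY } latch_state;

/* Classify a latch from its control signals.
   'valid' comes from the previous stage, 'hold' from the next stage.
   Assumption: an invalid input means the latch is not clocked (Empty). */
latch_state classify_latch(int valid, int hold)
{
    if (!valid)
        return LATCH_EMPTY;
    return hold ? LATCH_STALLED : LATCH_ACTIVE;
}

/* Number of latches in each state during one cycle. */
typedef struct { unsigned active, stalled, empty; } latch_counts;

/* Per-latch energy for each state, as read from the input file. */
typedef struct { double active, stalled, empty; } latch_energy;

/* Total latch energy for one cycle. */
double latch_cycle_energy(latch_counts n, latch_energy e)
{
    return n.active * e.active + n.stalled * e.stalled + n.empty * e.empty;
}
```

With two active, one stalled and one empty latch at 4.0, 2.0 and 0.5 energy units respectively, the cycle total is 2*4.0 + 2.0 + 0.5 = 10.5 units.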
3.4 Special Model for Issue Window
As stated before, the simulator models both collapsible and non-collapsible instruction
issue windows with the same FUB: isw. Collapsing the instruction window consumes some
power. The simulator has a counter to record these movements per cycle (Iswcolmoved), and
the user can supply the power associated with them. The issue window can also be viewed as
a set of fixed-length latches with the same three states as before. The Active state
(Iswact) now corresponds to the number of instructions ready to be issued that cycle, while
the Stalled state (Iswstall) corresponds to instructions that are still waiting for their
operands to become ready. The Empty state (Iswempty) represents the unoccupied slots of the
issue window each cycle.
Figure 18: Instruction Issue Window
For the issue queue, wakeup logic is modeled by counting the energy in the comparators.
For the selection logic, energy of one arbiter cell was supplied. Then the number of
arbiter cells per arbiter was calculated based on the number of entries in the issue queue.
We assume one arbiter per issue port – in our case four issue ports. Every entry in the
issue queue has some comparators (for tag match). The wakeup logic associated with this
issue window involves tag comparison and has a level of XOR gates followed by NAND
gates. Assuming that the NAND gates are smaller than the XOR, the simulator records
the power consumed in these XOR gates each cycle. There are counters associated with
each of the states of the issue window latches as well as with data movement between
these latches for a collapsible window.
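A minimal sketch of the per-cycle issue window state counts follows. The struct fields are named after the Iswact, Iswstall and Iswempty counters, but `isw_entry`, `isw_counts` and `isw_count_states` are hypothetical, and representing each slot by a valid flag and a ready flag is an assumption made for the example.

```c
#include <stddef.h>

/* One issue window slot (hypothetical representation). */
typedef struct {
    int valid;  /* slot holds an instruction */
    int ready;  /* all source operands are available */
} isw_entry;

/* Per-cycle totals corresponding to Iswact, Iswstall and Iswempty. */
typedef struct { unsigned act, stall, empty; } isw_counts;

/* Count the slots in each state for the current cycle. */
isw_counts isw_count_states(const isw_entry *w, size_t n)
{
    isw_counts c = {0, 0, 0};
    for (size_t i = 0; i < n; i++) {
        if (!w[i].valid)
            c.empty++;   /* unoccupied slot */
        else if (w[i].ready)
            c.act++;     /* ready to be issued this cycle */
        else
            c.stall++;   /* still waiting for operands */
    }
    return c;
}
```

The three counts can then be multiplied by the per-state energies in the same way as for the pipeline latches.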
4 Options, Configuration, Output
This section describes the options, configuration files and output files used in the WArPE
power estimation tool.
4.1 Options
The estimator options (in addition to the underlying simplescalar options) are defined
below. These options have been registered in the original simplescalar option database.
Implementing these options required modification of some of the original
sim-outorder.c code.
-power_config <filename>: This option specifies the power simulator
                          configuration file. The file must have
                          read permissions. The default file name is
                          power.txt.
-power_outfile <filename>: This option specifies the file into which output
                           statistics are dumped. The default file name is
                           power_output.txt.
-tech_file <filename>: This option specifies the technology definition
                       file name. The file must have read permissions.
                       The default file name is technology.def.
-technology <technology>: This option specifies the power simulation
                          technology. The technology is defined by an
                          identifier listed in the technology file.
                          Eg. -technology 0.25um. The default
                          technology is 0.8um.
-sim_limit <limit>: This option specifies the number of instructions (in
                    millions) at which the simulation stops and data is
                    dumped into the output file.
4.2 Configuration files
Following is a description of the various configuration files used in the WArPE estimator.
Configuration files provide an easy and effective way of defining the large number of
parameters used in the simulator.
4.2.1 Basic configuration file
This is the file defined by the -power_config option. It defines the power densities,
areas, mode of operation, i.e. pfa (empirical) or anal (analytical model), power thresholds,
and physical partitioning parameters. This file can be generated by saving a Microsoft
Excel worksheet in tab-delimited text format.
The file has three main options:
1) -global <max. power threshold> <max. di/dt threshold>
These define the power and di/dt thresholds for the full chip. The unit is watts.
2) <unit> <mode> <maxpowerth> <maxdidtth> <dyn_pda> <dyn_pdi> <dyn_a>
<sta_pda> <sta_pdi> <sta_a> <clk_pda> <clk_pdi> <clk_a> <mem_pda>
<mem_pdi> <mem_a> <pla_pda> <pla_pdi> <pla_a>
unit: name of the FUB (Functional Unit Block) as defined in power_init().
mode: pfa: directs the simulator to use empirical data i.e. dyn_pda,…,pla_a.
anal: directs the simulator to use analytical model for the FUB.
maxpowerth: maximum power threshold for the FUB.
maxdidtth: maximum di/dt threshold for the FUB.
dyn_pda: dynamic circuit power density - active
dyn_pdi: dynamic circuit power density - inactive
dyn_a: dynamic circuit area
sta_pda: static circuit power density – active
sta_pdi: static circuit power density – inactive
sta_a: static circuit area
clk_pda: clock circuit power density – active
clk_pdi: clock circuit power density – inactive
clk_a: clock circuit area
mem_pda: memory type circuit power density – active
mem_pdi: memory type circuit power density – inactive
mem_a: memory type circuit area
pla_pda: PLA power density – active
pla_pdi: PLA power density – inactive
pla_a: PLA circuit area
The units of the power densities are W/m², and the units of area are m².
3) -<unit name> <nwl> <nbl> <nsp> <logic_style> <rd mode>
E.g. –itlbcac 1 2 1 static dual
This option specifies the physical partition. In the example given above, it
defines the partition for itlb. The option name is a “-” followed by the FUB
name.
<nwl> : The number of partitions of the wordline. Each partition has a
different decoder and wordline driver. The partitions however
share sense amplifiers.
<nbl> : The number of partitions of the bitline. Each partition has separate
sense amplifiers and decoders.
<nsp> : Similar to the bitline partition, but the partitions share a decoder.
<logic_style> : The type of logic used for decoders, static or dynamic.
<rd mode> : Defines the read mode i.e. dual for dual rail and single for single
ended (used in small register files).
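A partition entry has a fixed shape, so reading one can be illustrated briefly. The following is a minimal sketch in C, assuming the entry sits on one line and starts with a plain ASCII hyphen as in Figure 19. The names partition_t and parse_partition() are hypothetical and are not taken from the WArPE sources.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one physical-partition entry. */
typedef struct {
    char unit[32];   /* FUB name without the leading '-' */
    int  nwl;        /* number of wordline partitions */
    int  nbl;        /* number of bitline partitions */
    int  nsp;        /* bitline-like partitions that share a decoder */
    int  dynamic;    /* 1 = dynamic decoder logic, 0 = static */
    int  dual_rail;  /* 1 = dual-rail read, 0 = single-ended read */
} partition_t;

/* Parse a line such as "-itlbcac 1 2 1 static dual".
 * Returns 0 on success and -1 if the line is malformed. */
int parse_partition(const char *line, partition_t *p)
{
    char logic[16], rd[16];

    if (sscanf(line, " -%31s %d %d %d %15s %15s",
               p->unit, &p->nwl, &p->nbl, &p->nsp, logic, rd) != 6)
        return -1;

    if (strcmp(logic, "static") == 0)
        p->dynamic = 0;
    else if (strcmp(logic, "dynamic") == 0)
        p->dynamic = 1;
    else
        return -1;

    if (strcmp(rd, "dual") == 0)
        p->dual_rail = 1;
    else if (strcmp(rd, "single") == 0)
        p->dual_rail = 0;
    else
        return -1;

    return 0;
}
```

Calling parse_partition("-itlbcac 1 2 1 static dual", &p) fills p with one wordline partition, two bitline partitions, static decoders and dual-rail reads. Rejecting unknown logic styles and read modes early keeps bad entries out of the analytical models.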
4.2.2 Process Technology Data File
This file contains the processing technology data for several generations. It must at least
contain the data for the technology defined by the –technology option. Some of the data
provided in the technology file is not used presently. It will be used in later revisions, e.g.
for dual Vt technologies. The format for the technology data is as follows:
<tech> <Leff> <Vdd> <f> <Vtl> <Vth> <Iol> <Ioh>
Eg. 0.8um 0.80 5.00 100 0.75 0.75 1 1
<tech>: Technology identifier. It should match the identifier supplied using the
–technology option.
<Leff>: The effective channel length in microns.
<Vdd>: The supply voltage used in the technology, in volts.
<f>: The clock frequency in MHz.
<Vtl>: For use in dual voltage circuits. This is the lower threshold voltage.
<Vth>: Higher threshold voltage.
<Iol>: Leakage current for the lower threshold voltage in nA/um.
<Ioh>: Leakage current for the higher threshold voltage in nA/um.
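The eight-field record described above maps directly onto a small structure. The following is a minimal sketch in C, assuming whitespace-separated fields in the order given; tech_entry_t, parse_tech_line() and tech_matches() are hypothetical names and do not come from tech.c.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical in-memory form of one technology record. */
typedef struct {
    char   tech[16];  /* identifier, e.g. "0.8um" */
    double leff_um;   /* effective channel length in microns */
    double vdd;       /* supply voltage in volts */
    double f_mhz;     /* clock frequency in MHz */
    double vtl;       /* lower threshold voltage */
    double vth;       /* higher threshold voltage */
    double iol;       /* leakage current at the lower threshold */
    double ioh;       /* leakage current at the higher threshold */
} tech_entry_t;

/* Parse "<tech> <Leff> <Vdd> <f> <Vtl> <Vth> <Iol> <Ioh>".
 * Returns 0 on success and -1 when the line is not a data record,
 * which also covers a column-header line. */
int parse_tech_line(const char *line, tech_entry_t *t)
{
    int n = sscanf(line, "%15s %lf %lf %lf %lf %lf %lf %lf",
                   t->tech, &t->leff_um, &t->vdd, &t->f_mhz,
                   &t->vtl, &t->vth, &t->iol, &t->ioh);
    return n == 8 ? 0 : -1;
}

/* True when the record is the one requested on the command line. */
int tech_matches(const tech_entry_t *t, const char *id)
{
    return strcmp(t->tech, id) == 0;
}
```

A reader of the technology file could call parse_tech_line() on each line and stop at the first record for which tech_matches() succeeds for the requested identifier.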
4.3 Output file
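This file contains the output power statistics, written when the number of simulated
instructions reaches sim_limit or when the simulation ends. For each FUB it lists the
cumulative power, the maximum power, the maximum di/dt power, and the number of power and
di/dt threshold violations, followed by the global statistics for the full chip. Sample
configuration files and a sample output file are shown below.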
-global 10 10
Npclog pfa 1 1 7.72 0.772 3.20E+04 6.05 0.605 2.56E+05 8.43 8.43 3.20E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
Btblog pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 2.49E+05 8.43 8.43 1.31E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
Btbcac anal 1 1 7.72 0.772 1.50E+05 6.05 0.605 9.00E+05 8.43 8.43 1.50E+05 10.75 1.075 1.80E+06 91.75 9.175 0.00E+00
Rsbcac anal 1 1 7.72 0.772 3.85E+04 6.05 0.605 7.70E+04 8.43 8.43 1.93E+04 10.75 1.075 5.78E+04 91.75 9.175 0.00E+00
Itlbcac anal 1 1 7.72 0.772 1.50E+05 6.05 0.605 3.00E+05 8.43 8.43 3.75E+04 10.75 1.075 2.63E+05 91.75 9.175 0.00E+00
dtlbcac anal 1 1 7.72 0.772 1.20E+04 6.05 0.605 4.00E+05 8.43 8.43 4.00E+04 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00
pmhlog pfa 1 1 7.72 0.772 6.00E+04 6.05 0.605 2.00E+05 8.43 8.43 2.00E+04 10.75 1.075 1.20E+05 91.75 9.175 0.00E+00
il1log pfa 1 1 7.72 0.772 2.40E+05 6.05 0.605 1.68E+06 8.43 8.43 2.40E+05 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00
il1tag anal 1 1 7.72 0.772 5.28E+05 6.05 0.605 7.92E+05 8.43 8.43 2.64E+05 10.75 1.075 3.70E+06 91.75 9.175 0.00E+00
il1cac anal 1 1 7.72 0.772 0.00E+00 6.05 0.605 1.32E+06 8.43 8.43 3.30E+05 10.75 1.075 4.95E+06 91.75 9.175 0.00E+00
dl1log pfa 1 1 7.72 0.772 3.60E+05 6.05 0.605 1.68E+06 8.43 8.43 1.20E+05 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00
dl1tag anal 1 1 7.72 0.772 2.64E+05 6.05 0.605 7.92E+05 8.43 8.43 2.64E+05 10.75 1.075 3.96E+06 91.75 9.175 0.00E+00
dl1cac anal 1 1 7.72 0.772 0.00E+00 6.05 0.605 1.32E+06 8.43 8.43 3.30E+05 10.75 1.075 4.95E+06 91.75 9.175 0.00E+00
dispatchq pfa 1 1 7.72 0.772 6.50E+05 6.05 0.605 4.88E+05 8.43 8.43 1.63E+05 10.75 1.075 3.25E+05 91.75 9.175 0.00E+00
decodepla pfa 1 1 7.72 0.772 3.20E+04 6.05 0.605 4.80E+04 8.43 8.43 1.60E+04 10.75 1.075 0.00E+00 91.75 9.175 6.40E+04
decodemisp pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 7.43E+04 8.43 8.43 8.25E+03 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
decodestall pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 5.23E+04 8.43 8.43 2.75E+03 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
ratarr anal 1 1 7.72 0.772 2.08E+05 6.05 0.605 5.20E+05 8.43 8.43 5.20E+04 10.75 1.075 2.60E+05 91.75 9.175 0.00E+00
ruuarr pfa 1 1 7.72 0.772 9.10E+04 6.05 0.605 1.82E+05 8.43 8.43 4.55E+04 10.75 1.075 1.37E+05 91.75 9.175 0.00E+00
lsqarr pfa 1 1 7.72 0.772 4.55E+04 6.05 0.605 9.10E+04 8.43 8.43 2.28E+04 10.75 1.075 6.83E+04 91.75 9.175 0.00E+00
ruurdyq pfa 1 1 7.72 0.772 1.50E+04 6.05 0.605 2.00E+04 8.43 8.43 2.50E+03 10.75 1.075 1.25E+04 91.75 9.175 0.00E+00
lsqrdyq pfa 1 1 7.72 0.772 7.50E+03 6.05 0.605 1.00E+04 8.43 1250 4.00E+04 10.75 1.075 6.25E+03 91.75 9.175 0.00E+00
ruuarb pfa 1 1 7.72 0.772 1.05E+05 6.05 0.605 6.30E+05 8.43 8.43 1.05E+05 10.75 1.075 2.10E+05 91.75 9.175 0.00E+00
ruuwb pfa 1 1 7.72 0.772 2.00E+05 6.05 0.605 1.20E+06 8.43 8.43 2.00E+05 10.75 1.075 4.00E+05 91.75 9.175 0.00E+00
lsqarb pfa 1 1 7.72 0.772 1.05E+05 6.05 0.605 6.30E+05 8.43 8.43 1.05E+05 10.75 1.075 2.10E+05 91.75 9.175 0.00E+00
lsqwb pfa 1 1 7.72 0.772 2.00E+05 6.05 0.605 1.20E+06 8.43 8.43 2.00E+05 10.75 1.075 4.00E+05 91.75 9.175 0.00E+00
fuint pfa 1 1 7.72 0.772 8.50E+04 6.05 0.605 2.38E+05 8.43 8.43 1.70E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
fufp pfa 1 1 7.72 0.772 1.13E+05 6.05 0.605 3.15E+05 8.43 8.43 2.25E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
ul2log pfa 1 1 7.72 0.772 1.44E+05 6.05 0.605 6.72E+05 8.43 8.43 4.80E+04 10.75 1.075 9.60E+04 91.75 9.175 0.00E+00
ul2tag anal 1 1 7.72 0.772 3.60E+05 6.05 0.605 2.88E+06 8.43 8.43 3.60E+05 10.75 1.075 3.60E+06 91.75 9.175 0.00E+00
ul2cac anal 1 1 7.72 0.772 1.50E+06 6.05 0.605 6.00E+06 8.43 8.43 0.00E+00 10.75 1.075 2.25E+07 91.75 9.175 0.00E+00
Biu pfa 1 1 7.72 0.772 5.00E+05 6.05 0.605 4.00E+06 8.43 8.43 5.00E+05 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00
fdlatch_0 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
fdlatch_1 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
fdlatch_3 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
fdlatch_4 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
dilatch_0 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
dilatch_1 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
42
dilatch_2 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
dilatch_3 pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
isw pfa1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0
-dl1cac 1 1 1 static dual -dl1tag 1 1 1 static dual
-dl2cac 1 1 1 static dual -dl2tag 1 1 1 static dual
-il1cac 1 1 1 static dual -il1tag 1 1 1 static dual
-il2cac 1 1 1 static dual -il2tag 1 1 1 static dual
-dtlbcac 1 1 1 static dual
-itlbcac 1 1 1 static dual
-btbcac 1 1 1 static dual
-regfile 1 1 1 static single
Figure 19: Basic Configuration File
tech L(um) Vdd(V) f(MHz) Vtl(V) Vth(V) Iol(nA/um) Ioh(nA/um)
0.8um 0.80 5.00 100 0.75 0.75 0.01 0.01
0.6um 0.60 3.30 200 0.65 0.65 0.01 0.01
0.35um 0.35 2.50 300 0.55 0.55 0.1 0.1
0.25um 0.25 1.50 450 0.45 0.45 0.1 0.1
0.18um 0.18 1.05 700 0.35 0.35 1 0.1
0.15um 0.15 1.00 1000 0.30 0.35 1 0.1
0.13um 0.13 1.00 1500 0.28 0.35 1 0.1
0.1um 0.10 0.75 2250 0.25 0.35 1 0.1
0.07um 0.07 0.60 3300 0.25 0.35 10 0.1
Figure 20: Technology File.
Sun May 19 17:07:59 2002
Power simulation checkpoint at 200000051 instructions
functional cumulative maximum maximum maximum power maximum didt
block name power power didt power violations violations
npclog 4.354e+06 8.262e+06 7.813e+06 0 0
btblog 6.775e+05 8.097e+06 7.835e+06 0 0
btbcac 1.59e+06 2.135e+07 2.092e+07 0 0
itlbcac 2.293e+05 4.446e+05 4.335e+05 0 0
rsbcac 3.414e+05 1.546e+06 1.245e+06 0 0
dtlbcac 4.024e+06 3.801e+07 3.716e+07 0 0
pmhlog 4.667e+05 3.132e+06 3.132e+06 0 0
il1log 3.548e+07 6.648e+07 6.3e+07 0 0
il1tag 1.071e+08 2.033e+08 1.962e+08 0 0
il1cac 1.062e+07 2.029e+07 1.979e+07 0 0
dl1log 1.338e+07 1.819e+08 1.628e+08 0 0
dl1tag 4.12e+07 5.679e+08 5.091e+08 0 0
dl1cac 1.485e+07 2.117e+08 1.905e+08 876705 0
dispatchq 0 0 0 0 0
decodepla 0 0 0 0 0
decodemisp 0 0 0 0 0
decodestall 0 0 0 0 0
ratarr 8.569e+07 2.715e+08 2.384e+08 0 0
ruuarr 2.734e+07 1.864e+08 1.133e+08 0 0
lsqarr 4.258e+06 2.924e+07 2.741e+07 0 0
ruurdyq 1.041e+06 7.845e+06 6.668e+06 0 0
lsqrdyq 7.525e+06 2.3e+07 1.464e+07 0 0
ruuarb 3.15e+07 2.795e+08 1.242e+08 0 0
ruuwb 7.137e+07 1.775e+08 1.745e+08 0 0
lsqarb 3.267e+07 2.795e+08 1.242e+08 0 0
lsqwb 2.487e+07 1.627e+08 1.597e+08 0 0
fuint 3.489e+06 8.958e+06 8.605e+06 0 0
fufp 4.671e+05 5.928e+06 5.461e+06 0 0
ul2log 1.833e+06 5.953e+07 5.85e+07 0 0
ul2tag 1.653e+07 5.574e+08 5.485e+08 0 0
ul2cac 1.352e+07 8.154e+08 8.102e+08 0 0
biu 8.242e+06 2.582e+08 2.512e+08 0 0
isw 1.625e+06 0 1.311e+06 0 0
fdlatch_0 6.458e+04 9.83e+04 7.782e+04 0 0
fdlatch_1 6.442e+04 9.83e+04 7.782e+04 0 0
fdlatch_2 6.387e+04 9.83e+04 7.782e+04 0 0
fdlatch_3 6.329e+04 9.83e+04 7.782e+04 0 0
dilatch_0 6.24e+04 9.83e+04 7.782e+04 0 0
dilatch_1 6.167e+04 9.83e+04 7.782e+04 0 0
dilatch_2 6.133e+04 9.83e+04 7.782e+04 0 0
dilatch_3 5.725e+04 9.83e+04 7.782e+04 0 0
Global statistics:
Total power = 566797441.827776
Maximum power = 3490027519.397630
Maximum didt power = 3198001037.129858
Power violations = 19489894
Didt power violations = 1204832
Figure 21: Output File.
5 File Structure
The simulator is based on SimpleScalar [1]. The power simulation functions are kept in
separate files to minimize changes to the original code. In a few places, however, it was
unavoidable, or simply much more convenient, to modify the original SimpleScalar files.
The file structure is as follows.
power.c: The main power number generation file. It contains routines for power
calculation. Any new power calculation routines, e.g. clock-gated
power calculation, should be included in this file.
power.h: This file contains all the declarations for variables, structures and
functions and definitions used in power.c.
anal.c: Contains all the analytical models. Any new models developed should
be placed in this file.
anal.h: Contains declarations and definitions for variables and functions used
in anal.c.
tech.c: Technology processing file. Reads from the technology file and
calculates scaling factors for the required technology. The base
technology used is 0.8 um and all simulations are performed by scaling
the 0.8um technology.
tech.h: Contains all the device size definitions for 0.8 um base technology.
sim-outorder.c and main.c have also been modified as described later.
5.1.1 power.h
As mentioned earlier, power.c contains routines for power computation and power.h is
the supporting header file. The simulator is designed using a FUB-centric approach. All
the power numbers specific to a FUB are stored together in one structure. The structure is
shown below. Not all the elements are used; some are present for future expansion.
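typedef struct {
    char name[32];              /* FUB name */
    double active_power;        /* empirical (pfa) active power */
    double active_power_rd;     /* analytical model: read power */
    double active_power_wr;     /* analytical model: write power */
    double static_power;
    double inactive_power;
    double active_power_lt;     /* latch power: active */
    double stall_power_lt;      /* latch power: stalled */
    double empty_power_lt;      /* latch power: empty */
    double active_power_cg;     /* clock-gated values (not used yet) */
    double active_power_wr_cg;
    double active_power_rd_cg;
    double inactive_power_cg;
    double maxpowerth;          /* thresholds from the configuration file */
    double maxdidtth;
    double cum_power;           /* accumulated every cycle */
    double prev_power;          /* power in the previous cycle */
    double max_power;
    double max_didt;
    double max_powerx;          /* number of power threshold violations */
    double max_didtx;           /* number of di/dt threshold violations */
} fub_t;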
The element name stores the name of the FUB. Because it is a 32-byte character array,
the name can be at most 31 characters plus the terminating null. The next five elements
store power numbers, whose meaning is evident from their names. Note that active power
comes in three flavors. When using the empirical method, only active_power is used. It is
the sum of the (power density)*(area) products for the five different circuit styles. When
analytical models are used, the read and write operations can be separated, and these
consume different amounts of power, hence the rd and wr suffixes. The element
inactive_power is presently redundant but can be used in the empirical mode for standby
mode. The next three elements are power values for latches only. The next four elements
are the clock-gated power numbers, which are presently not used. Notice that clock gating
does not affect static power, and hence static_power_cg is not present. The elements
maxpowerth and maxdidtth are the maximum power and maximum di/dt power thresholds for the
FUB. These values are defined in the configuration file. cum_power accumulates the power
after every cycle and is finally divided by the number of cycles to get the average power
dissipated. prev_power, max_power and max_didt are the previous cycle power, maximum
power and maximum di/dt power respectively. Finally, max_powerx and max_didtx keep track
of the number of threshold violations.
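The two calculations described above, the empirical active power and the average power, can be sketched as follows. This is only an illustration in C: style_t, pfa_power() and average_power() are hypothetical names, and forming the inactive value from the inactive densities in the same way is an assumption rather than something the text states.

```c
/* Empirical data for one circuit style (illustrative layout). */
typedef struct {
    double pd_active;    /* power density when active */
    double pd_inactive;  /* power density when inactive */
    double area;         /* area, in the units of the configuration file */
} style_t;

/* The five circuit styles of the basic configuration file. */
enum { STYLE_DYN, STYLE_STA, STYLE_CLK, STYLE_MEM, STYLE_PLA, NUM_STYLES };

/* Sum of (power density)*(area) over the five circuit styles.
 * active != 0 selects the active densities, otherwise the inactive ones
 * (the inactive case is an assumption made for this sketch). */
double pfa_power(const style_t s[NUM_STYLES], int active)
{
    double p = 0.0;
    int i;

    for (i = 0; i < NUM_STYLES; i++)
        p += (active ? s[i].pd_active : s[i].pd_inactive) * s[i].area;
    return p;
}

/* Average power: the accumulated per-cycle power divided by the cycle count. */
double average_power(double cum_power, unsigned long cycles)
{
    return cycles ? cum_power / (double)cycles : 0.0;
}
```

In this sketch the result of pfa_power() would be stored in active_power when the FUB is initialized, and average_power() would be applied to cum_power when the statistics are dumped.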
A similar structure of type glb_power_t is used to track the full chip power numbers.
Its elements are essentially the sum of the corresponding elements of the FUB structures.
Another important structure is power_t, which is used to exchange power numbers. It has
three elements, active_power_rd, active_power_wr and static_power, which are
self-explanatory.
The activity counts are tracked using two arrays of counters, one for present cycle counts
and the other for cumulative counts. Specific counters can be accessed by using the
counter name as the index, e.g. pres_count[Ruuarr]. Ninety-three counters have
presently been declared. New counters can be added simply by adding their names to the
#define list and updating NUM_POWER_COUNTERS. As a convention, only the first
character of the counter name is capitalized.
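The counter scheme can be illustrated with a cut-down version. The sketch below is in C and shows only the first three counters from the appendix; cum_count, count_event() and end_of_cycle() are hypothetical names, and folding the present-cycle counts into the cumulative array at the end of each cycle is an assumption about how the two arrays are kept in step.

```c
#include <string.h>

/* Counter names double as array indices; only three are shown here. */
#define Brupdate 0
#define Brlookup 1
#define Rsbpop   2
#define NUM_POWER_COUNTERS 3   /* must equal the number of names above */

static unsigned long pres_count[NUM_POWER_COUNTERS];  /* present cycle */
static unsigned long cum_count[NUM_POWER_COUNTERS];   /* whole run */

/* Record one event of the given kind in the present cycle. */
void count_event(int counter)
{
    pres_count[counter]++;
}

/* Fold the present-cycle counts into the cumulative counts, then
 * clear the present-cycle counts for the next cycle. */
void end_of_cycle(void)
{
    int i;

    for (i = 0; i < NUM_POWER_COUNTERS; i++)
        cum_count[i] += pres_count[i];
    memset(pres_count, 0, sizeof pres_count);
}
```

Adding a counter in this scheme means adding one #define with the next free index and increasing NUM_POWER_COUNTERS by one, which is why the two must be updated together.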
As more and more features are added to the simulator, new elements can be added to
these structures and new counters can be defined for more detail/functionality. This
makes the simulator amenable to future development.
Finally, there is a structure, which is used to maintain the power parameter database. The
structure type is called power_db. It stores the following data
name: Name of a FUB/variable/file.
S: The number of sets in a cache like structure.
OR
The value of a variable, for example: decode width.
A: Associativity.
B: The block size in number of bits.
b: The output size in bits.
nwl, nbl, nsp, logic, rd_mode as defined in section 4.2.1.
The power_db structure is also used to store the various filenames. The convention
used is that the first element of the database has name “root”. The next element’s name is
the configuration filename. The third element’s name is the output filename. The fourth is
the technology filename and the fifth is the technology identifier. This was found to be a
way to avoid the addition of an extra field to the database. All other elements are then
added in any order. This concludes the discussion of the important structures used. All
other structures are self-explanatory.
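The database layout and its naming convention can be sketched as below. This is an illustration in C under stated assumptions: the field types, the fixed capacity DB_MAX, and the helpers db_init(), db_add_cache() and db_find() are all hypothetical and are not the add_param()/get_param() routines of power.c.

```c
#include <string.h>

#define DB_MAX 64   /* assumed capacity, for the sketch only */

typedef struct {
    char name[64];        /* FUB, variable, or file name */
    int  S;               /* number of sets, or the value of a variable */
    int  A;               /* associativity */
    int  B;               /* block size in bits */
    int  b;               /* output size in bits */
    int  nwl, nbl, nsp;   /* partitioning, as in section 4.2.1 */
    int  logic, rd_mode;  /* decoder logic style and read mode */
} power_db;

/* Positions fixed by the naming convention described above. */
enum { DB_ROOT, DB_CONFIG, DB_OUTPUT, DB_TECH_FILE, DB_TECH_ID, DB_RESERVED };

static power_db db[DB_MAX];
static int db_len;

/* Append a zeroed entry carrying only a name; returns its index or -1. */
static int db_add_name(const char *name)
{
    if (db_len >= DB_MAX)
        return -1;
    memset(&db[db_len], 0, sizeof db[db_len]);
    strncpy(db[db_len].name, name, sizeof db[db_len].name - 1);
    return db_len++;
}

/* Create the five leading entries in their conventional order. */
int db_init(const char *config, const char *output,
            const char *tech_file, const char *tech_id)
{
    db_len = 0;
    if (db_add_name("root") < 0 || db_add_name(config) < 0 ||
        db_add_name(output) < 0 || db_add_name(tech_file) < 0 ||
        db_add_name(tech_id) < 0)
        return -1;
    return 0;
}

/* Add a cache-like structure; returns its index or -1 when full. */
int db_add_cache(const char *name, int S, int A, int B, int b)
{
    int i = db_add_name(name);

    if (i < 0)
        return -1;
    db[i].S = S; db[i].A = A; db[i].B = B; db[i].b = b;
    return i;
}

/* Look up a parameter by name, skipping the reserved leading entries. */
const power_db *db_find(const char *name)
{
    int i;

    for (i = DB_RESERVED; i < db_len; i++)
        if (strcmp(db[i].name, name) == 0)
            return &db[i];
    return NULL;
}
```

Storing the file names in the name field of the leading entries is what lets the structure do without a dedicated file-name field, at the cost of callers having to know the positions.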
5.1.2 power.c
power.c contains power estimation routines and option handling routines. These routines
are described below.
add_param(), get_param()
These functions are used to add and retrieve parameters from the power simulation
database. The former adds a structure of type power_db to the database while the latter
retrieves the same from the database.
search_opt(), print_opt()
search_opt() is used to retrieve the physical structure parameters (nwl, nbl, nsp,
logic style, read mode) on giving the option name. print_opt() prints all the
elements of the power parameter database in a tabular form. It is helpful in debugging.
dump_fub_stats()
This function dumps all the power statistics to the screen or into the specified file. The
file dump is selected by mode = 0 and the screen dump by a non-zero mode.
power_init()
This function allocates memory for all the FUB structures and calls init() on each
FUB. It also reads the thresholds specified by the –global option and initializes the global
power structure.
init()
This function reads the power densities and areas of the FUBs from the basic
configuration file in case of the pfa mode. If the mode is anal, it instead calls
calc_anal(). The function initializes all the power variables inside the structure.
Finally, it adds the FUB to the FUB database.
calc_anal(), array_power()
These functions calculate the power numbers when in anal mode. calc_anal() calls
array_power(), which in turn calls routines from anal.c to generate the power
constants.
power_update()
All the functions mentioned before are called only at the beginning of the simulation.
This routine, however, is called every cycle to update the power variables.
power_update() multiplies the access counts by the active power constants if the count is
non-zero, or else uses the inactive power constants. Presently, no clock-gating feature is
incorporated, but the infrastructure is already in place. The function also checks for
power threshold and di/dt threshold violations. At the end of the function the present
cycle counters are reset, whereas the cumulative counts keep accumulating.
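The per-cycle bookkeeping described above can be sketched for a single FUB with a single counter. This is a simplified illustration in C, not the power_update() of power.c: fub_sketch_t and fub_update() are hypothetical, and treating di/dt power as the absolute change in power between consecutive cycles is an assumption of the sketch.

```c
/* Cut-down FUB record holding only the fields this sketch needs. */
typedef struct {
    double active_power;    /* power constant per access */
    double inactive_power;  /* power used when there were no accesses */
    double maxpowerth;      /* power threshold */
    double maxdidtth;       /* di/dt power threshold */
    double cum_power;       /* running sum of per-cycle power */
    double prev_power;      /* power in the previous cycle */
    double max_power;       /* largest per-cycle power seen */
    double max_didt;        /* largest cycle-to-cycle change seen */
    double max_powerx;      /* number of power threshold violations */
    double max_didtx;       /* number of di/dt threshold violations */
} fub_sketch_t;

/* Update one FUB for one cycle and reset its present-cycle counter. */
void fub_update(fub_sketch_t *f, unsigned long *pres_count)
{
    double p, didt;

    /* Active constant scaled by the access count, else the inactive constant. */
    p = *pres_count ? (double)*pres_count * f->active_power
                    : f->inactive_power;

    /* Assumed definition: magnitude of the change since the last cycle. */
    didt = p - f->prev_power;
    if (didt < 0.0)
        didt = -didt;

    f->cum_power += p;
    if (p > f->max_power)
        f->max_power = p;
    if (didt > f->max_didt)
        f->max_didt = didt;
    if (p > f->maxpowerth)
        f->max_powerx += 1.0;
    if (didt > f->maxdidtth)
        f->max_didtx += 1.0;

    f->prev_power = p;
    *pres_count = 0;   /* cumulative counts are kept elsewhere */
}
```

With active_power = 2, inactive_power = 0.5 and thresholds of 5 and 3, a cycle with three accesses followed by an idle cycle yields powers of 6 and 0.5, one power violation and two di/dt violations.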
5.1.3 anal.h
This is the header file for anal.c. It contains all the function declarations for the
functions present in anal.c.
5.1.4 anal.c
This file contains all the analytical models. The analytical models are described in more
detail in section 3. This section describes the interfaces of all the functions in
anal.c.
decoder_buffer_power()
This function takes the number of address bits and number of rows as inputs and
generates power constants for the decoder buffer. The decoder buffer is meant to feed
into all decoders needed for an array. Presently, the size of the buffer is constant,
however, in the future this can be made dependent on number of decoders that it feeds
into.
decoder_power()
This function generates the power numbers for the decoder. It takes the number of rows
and logic style as inputs.
routing_power()
This function estimates the power dissipated due to the routing in the decoder. It takes rows,
columns and cell type as inputs. It needs the number of columns as an input because the
decoder buffer is assumed to be at the center of all the partitions, as described in
section 3.
wordline_power()
This function calculates the power for the wordline, including the wordline driver. The
wordline driver size depends on the number of columns and on the particular kind of
memory cell used (i.e. read mode and cell size), both of which are inputs. The size
is then calculated using the WLdriver_size() function [].
bitline_power()
This function calculates the power for the bitlines, including the precharge and isolation
transistors. It takes the number of rows, columns, cell type and read mode as inputs. In
the single ended read mode, no pre-charging is used. Instead, the bitlines are driven by
the cell transistors. Hence, this scheme can be used for relatively small structures like
register files.
senseamp_power()
This is used for calculating the sense amplifier power constants. It is assumed that the
nodes of the sense amp are charged by a separate pre-charge circuit. The inputs to this
function are the number of sense amps and the number of bitlines sharing one senseamp.
outmux_power()
This function calculates the power for the output MUX. The inputs to the function are the
numbers of inputs to the MUX and the number of outputs.
compare_power()
This function calculates the power for the comparator. This model is useful for tag arrays
and register update unit type FUBs.
genmux_power()
This calculates constants for a generic MUX. The inputs to the function are number of
output bits and number of bits being multiplexed into one bit.
driver_size()
This function calculates the driver size for driving a capacitance with a desired rise time.
The capacitance and rise time are inputs. The voltage swing is assumed to be from 0-
Vdd.
bldriver_size()
This is similar to driver_size() except that the voltage swing is Vsense-Vprecharge.
This function is mainly used to calculate pre-charge transistor sizes for bit lines in
low-power cache implementations.
gatecap(), gatecappass()
These functions are used to calculate the gate capacitance for a given transistor width and
poly length. The latter is used specifically for pass transistors.
draincapp(), draincapn()
These are used to calculate the drain capacitance for p and n-type transistors respectively.
They also take the number of stacked transistors as input to optimize the configuration [].
leakage()
This function calculates the leakage power or static power for a given transistor size with
a given threshold. Presently, it’s a very rough calculation and much more work can be
done in the future.
log2()
This function returns the logarithm to the base two, rounded down to the nearest integer. It
is mainly used for address bit calculations for a given number of rows.
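A rounded-down base-two logarithm needs no floating point. The following is a minimal sketch in C; the name floor_log2() is hypothetical and was chosen because the C99 header math.h already declares a log2() that operates on doubles. Returning -1 for an input of zero is a choice made for this sketch.

```c
/* Floor of the base-two logarithm of n, for n >= 1.
 * Returns -1 for n == 0, where the logarithm is undefined. */
int floor_log2(unsigned long n)
{
    int bits = -1;

    /* Each right shift removes one binary digit; the number of
     * shifts needed to reach zero, minus one, is the floor. */
    while (n) {
        n >>= 1;
        bits++;
    }
    return bits;
}
```

For an array with 256 rows this gives 8 address bits. For a row count that is not a power of two, such as 1000, it gives 9, so a caller that needs enough bits to address every row would have to round up instead.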
5.1.5 sim-outorder.c, main.c
These files have been slightly modified for the power simulator. Following is a list of
changes made.
1 In main.c, a power option database called pow_odb has been added. This is used
in sim_print_stats() to dump the power statistics. Another change made is
the power_init() function call added after sim_init() to initialize the
power simulation.
2 In sim-outorder.c, several global variables have been added. These have
been well commented. In sim_reg_options(), the five new options have been
registered. The power_update() function call has been added in
sim_main(). And finally, power_database() has been added. This function
essentially processes options and adds them to the power database for use in the
analytical models.
5.2 Control Flow
The control flow for the power simulation is as follows.
1 main.c starts the simulation.
2 sim-outorder.c: sim_reg_options() registers the power options into the options
database.
3 sim-outorder.c: power_database() creates the power database using options read
from the configuration file and the options database.
4 power.c: power_init() calls power.c: init(), which calls power.c: calc_anal(),
which in turn calls power.c: array_power().
5 array_power() calls the anal.c routines decoder_buffer_power(), decoder_power(),
routing_power(), wordline_power(), bitline_power() and senseamp_power().
6 sim-outorder.c calls power.c: power_update() every cycle.
7 power.c: dump_fub_stats() dumps the statistics, and control returns to main.c.
This completes the control flow description of the main functions in the power simulator.
6 References
[1] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report,
Computer Sciences Department, University of Wisconsin, June 1997.
[2] S.J.E. Wilton and N.P. Jouppi. An Enhanced Access and Cycle Time Model for
On-Chip Caches. Western Research Laboratory Report, May 1993.
[3] D. Brooks, V. Tiwari, M. Martonosi. Wattch: A Framework for Architectural-
Level Power Analysis and Optimizations. In Proc. International Symposium on Computer
Architecture, Jun. 2000.
[4] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye. Energy-driven
integrated hardware-software optimizations using SimplePower. In Proc.
International Symposium on Computer Architecture, Jun. 2000.
[5] D. Liu and C. Svensson. Power Consumption Estimation in CMOS VLSI Chips.
IEEE Journal of Solid-State Circuits, 29(6), pp. 663-670, Jun. 1994.
[6] P. Landman and J. Rabaey. Activity-Sensitive Architectural Power Analysis. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(6), page
571, Jun. 1996.
[7] R. Chen, M. Irwin, and R. Bajwa. An architectural level power estimator. In
Power-Driven Microarchitecture Workshop at ISCA25, 1998.
Appendix
Sl. No. | Name of the FUB | Description | Models supported
1 | npclog | Next pc generation logic | PFA
2 | btblog | BTB logic | PFA
3 | btbcac | BTB cache | PFA/Anal
4 | itlbcac | Instruction TLB | PFA/Anal
5 | rsbcac | Return Stack Buffer | PFA/Anal
6 | dtlbcac | Data TLB | PFA/Anal
7 | pmhlog | Page miss handler | PFA
8 | il1log | L1 instruction cache logic | PFA
9 | il1tag | L1 instruction cache tag | PFA/Anal
10 | il1cac | L1 instruction cache array | PFA/Anal
11 | dl1log | L1 data cache logic | PFA
12 | dl1tag | L1 data cache tag | PFA/Anal
13 | dl1cac | L1 data cache array | PFA/Anal
14 | dispatchq | Dispatch Queue | PFA
15 | decodepla | Instruction decoder | PFA
16 | decodemisp | Misprediction handling logic | PFA
17 | decodestall | Decoder Stall logic | PFA
18 | ratarr | Register Aliasing table | PFA/Anal
19 | ruuarr | Register update unit / reorder buffer | PFA
20 | lsqarr | Load/Store queue | PFA
21 | ruurdyq | Reorder ready queue | PFA
22 | lsqrdyq | Load/Store ready queue | PFA
23 | ruuarb | Reorder arbitration logic | PFA
24 | ruuwb | Reorder write back scheduler | PFA
25 | lsqarb | Load/store arbitration logic | PFA
26 | lsqwb | Load/store write back scheduler | PFA
27 | fuint | Integer functional unit | PFA
28 | fufp | Floating point functional unit | PFA
29 | ul2log | Unified L2 cache logic | PFA
30 | ul2tag | Unified L2 cache tag | PFA/Anal
31 | ul2cac | Unified L2 cache array | PFA/Anal
32 | biu | Bus/IO unit | PFA
33 | fdlatch | Fetch Decode latch | PFA
34 | dilatch | Decode Issue Latch | PFA
35 | isw | Instruction Issue Window | PFA
Table of FUBs: Shows the various functional unit blocks with the models existing in the simulator.
PFA: Power Factor Approximation. Anal: Analytical models exist.
Sl. No. | Name of the counter | Associated FUB | Description
0 | Brupdate | BTB cache | branch update activity
1 | Brlookup | BTB cache | branch lookup activity
2 | Rsbpop | Return Stack Buffer | return stack pop activity
3 | Rsbpush | Return Stack Buffer | return stack push activity
4 | Il1acc | L1 instruction cache | il1 access activity
5 | Il1wbk | L1 instruction cache | il1 writebacks activity
6 | Il1rep | L1 instruction cache | il1 replacements activity
7 | Il1inv | L1 instruction cache | il1 invalidations activity
8 | Dl1acc | L1 data cache | dl1 access activity
9 | Dl1wbk | L1 data cache | dl1 writebacks activity
10 | Dl1rep | L1 data cache | dl1 replacements activity
11 | Dl1inv | L1 data cache | dl1 invalidations activity
12 | Il2acc | L2 instruction cache | il2 access activity
13 | Il2wbk | L2 instruction cache | il2 writebacks activity
14 | Il2rep | L2 instruction cache | il2 replacements activity
15 | Il2inv | L2 instruction cache | il2 invalidations activity
16 | Dl2acc | L2 data cache | dl2 access activity
17 | Dl2wbk | L2 data cache | dl2 writebacks activity
18 | Dl2rep | L2 data cache | dl2 replacements activity
19 | Dl2inv | L2 data cache | dl2 invalidations activity
20 | Ul2acc | L2 unified cache | ul2 access activity
21 | Ul2wbk | L2 unified cache | ul2 writebacks activity
22 | Ul2rep | L2 unified cache | ul2 replacements activity
23 | Ul2inv | L2 unified cache | ul2 invalidations activity
24 | Itlbmis | Instruction TLB | itlb miss activity
25 | Dtlbmis | Data TLB | dtlb miss activity
26 | Ul2mis | L2 unified cache | ul2 miss activity
27 | Itlbacc | Instruction TLB | itlb access activity
28 | Itlbwbk | Instruction TLB | itlb writebacks activity
29 | Itlbrep | Instruction TLB | itlb replacements activity
30 | Itlbinv | Instruction TLB | itlb invalidations activity
31 | Dtlbacc | Data TLB | dtlb access activity
32 | Dtlbwbk | Data TLB | dtlb writebacks activity
33 | Dtlbrep | Data TLB | dtlb replacements activity
34 | Dtlbinv | Data TLB | dtlb invalidations activity
35 | Npc | Next pc generation logic | next pc logic activity
36 | Dispatchqrd | Dispatch Queue | dispatchq read activity
37 | Dispatchqwr | Dispatch Queue | dispatchq write activity
38 | Dispatchqrel | Dispatch Queue | dispatchq release activity
39 | Dispatchqrec | Dispatch Queue | dispatchq recover activity
40 | Decoder | Instruction decoder | decoder activity
41 | Decodemispchk | Instruction decoder | decoder mispredict detect activity
42 | Decodemisp | Instruction decoder | decoder mispredict correction activity
43 | Decodestallchk | Instruction decoder | decoder stall detect activity
44 | Decodestall | Instruction decoder | decoder stall block activity
45 | Ratidep | Register Aliasing table | rat idep allocation activity
46 | Ratodep | Register Aliasing table | rat odep allocation activity
47 | Ratstallchk | Register Aliasing table | rat stall detection activity
48 | Ratstall | Register Aliasing table | rat stall block activity
49 | Ruuarr | Reorder buffer | ruu array activity
50 | Ruurdyqsch | Reorder buffer | ruu readyq allocation activity
51 | Ruurec | Reorder buffer | ruu recover activity
52 | Ruuret | Reorder buffer | ruu retire activity
53 | Ruurdyqcam | Reorder buffer | ruu readyq dependence check activity
54 | Ruurdyqrel | Reorder buffer | ruu readyq resource release activity
55 | Lsqarr | Load/Store queue | lsq array activity
56 | Lsqrdyqsch | Load/Store queue | lsq readyq allocation activity
57 | Lsqrec | Load/Store queue | lsq recover activity
58 | Lsqret | Load/Store queue | lsq retire activity
59 | Lsqrdyqcam | Load/Store queue | lsq readyq dependence check activity
60 | Lsqrdyqrel | Load/Store queue | lsq readyq resource release activity
61 | Ruuarb | Reorder buffer | ruu arbitration activity
62 | Ruuwb | Reorder buffer | ruu writeback scheduler activity
63 | Ruuwbq | Reorder buffer | ruu writebackq activity
64 | Lsqarb | Load/Store queue | lsq arbitration activity
65 | Lsqwb | Load/Store queue | lsq writeback scheduler activity
66 | Lsqwbq | Load/Store queue | lsq writebackq activity
67 | Fuint | Integer functional unit | functional unit integer
68 | Fufp | Floating point functional unit | functional unit floating point
69 | Fdlatch_active | Fetch Decode latch | Latch after fetch stage active
70 | Fdlatch_stall | Fetch Decode latch | Latch after fetch stage stalled
71 | Fdlatch_empty | Fetch Decode latch | Latch after fetch stage empty
72 | Dilatch_active | Decode Issue Latch | Latch after decode stage active
73 | Dilatch_stall | Decode Issue Latch | Latch after decode stage stalled
74 | Dilatch_empty | Decode Issue Latch | Latch after decode stage empty
75 | Iswact | Instruction Issue Window | Issue window latch active
76 | Iswstall | Instruction Issue Window | Issue window latch stalled
77 | Iswempty | Instruction Issue Window | Issue window latch empty
78 | Iswcolmoved | Instruction Issue Window | Collapsible Issue window latch moved
Table of Counters: Note that the number of counters would vary with the number of latches. If there
are three latches after the fetch stage, there would be 9 Fdlatch (69-77) counters and same for the latches
after the decode stage.
Index
A
active power 12, 15, 18
activity 12, 16
add_param() 17
anal 6
anal.c 1, 14, 18, 19, 25
anal.h 1, 14, 19
analytical 3, 6, 13, 14, 15, 19, 22, 25
array_power() 18
B
bitline_power() 20
bldriver_size() 21
C
calc_anal() 18
clk_a 6, 7
clk_pda 6, 7
clk_pdi 6, 7
Clock circuits 13
clock frequency 8
compare_power() 20
configuration file 5,6
control flow 1, 22
cum_power 15, 16
D
decoder_buffer_power() 19
decoder_power() 19
di/dt 6, 12, 16, 19
draincapn() 21, 25
draincapp() 21, 25
driver_size() 21
dump_fub_stats() 18
dyn_a 6, 7
dyn_pda 6, 7
dyn_pdi 6, 7
Dynamic logic 13
E
empirical 3, 6, 15
estimation 3, 12, 13, 17
F
FUB 6, 7, 12, 13, 15, 16, 17, 18
fub_t 15
G
gatecap() 21, 25
gatecappass() 21, 25
genmux_power() 21
get_param() 17
glb_power_t 16
global 6, 9, 18, 22
I
inactive power 12, 18
init() 18, 22
Ioh 8, 10
Iol 8, 10
L
leakage() 21
Leff 8, 25
log2() 21
logic_style 7
M
main.c 1, 14, 22
max_didt 15, 16
max_didtx 15, 16
max_power 15, 16
max_powerx 15, 16
maxdidtth 6, 15, 16
maxpowerth 6, 15, 16
mem_a 6, 7
mem_pda 6, 7
mem_pdi 6, 7
Memory type regular circuits 13
methodology 1, 4, 12
mode 6, 7, 8, 13, 15, 17, 18, 20
N
nbl 7, 17, 18
nsp 7, 17, 18
NUM_POWER_COUNTERS 16
nwl 7, 17, 18
O
option database 5
Options 5
outmux_power() 20
output 5
P
pfa 6, 9, 18
physical structure 13, 18
PLA circuits 13
pla_a 6, 7
pla_pda 6, 7
pla_pdi 6, 7
pow_odb 22
power threshold 6, 19
power.c 1, 14, 15, 17
power.h 1, 14, 15
power.txt 5
power_config 5, 6
power_db 17
power_init() 6, 18, 22
power_outfile 5
power_output.txt 5
power_update() 18, 22
pres_count 16
prev_power 15, 16
print_opt() 18
Process Technology 1, 8
R
routing_power() 19
S
search_opt() 18
senseamp_power() 20
sim_limit 6
sim-outorder.c 1, 5, 14, 22
sta_a 6, 7
sta_pda 6, 7
sta_pdi 6, 7
Static logic 13
static_power 15, 16
T
tech 5, 8, 10, 14, 29
tech_file 5
technology 5
technology.def 5
U
unit 6, 7, 20, 25
V
Vdd 8, 10, 21
Vth 8, 10
Vtl 8, 10
W
wordline_power() 20