
WArPE 1.0

Wisconsin Architecture Power Estimator

MICRO

ARCHITECTURAL

POWER ESTIMATION TOOL

1 Introduction

2 WArPE Processor Model

2.1 Microarchitecture

2.1.1 Instruction Fetch

2.1.2 Instruction Decode / Dispatch Stage

2.1.3 Instruction Execution and Writeback

3 Analytical Models

3.1 Power Density Model

3.2 Analytical RAM Model

3.2.1 Decoder Buffer

3.2.2 Decoder

3.2.3 Wordline

3.2.4 Bitline

3.2.5 Sense Amplifier

3.2.6 Output driver

3.2.7 Generic mux

3.2.8 Comparator

3.3 Latch Model

3.4 Special Model for Issue Window

4 Options, Configuration, Output

4.1 Options

4.2 Configuration files

4.2.1 Basic configuration file

4.2.2 Process Technology Data File

4.3 Output file

5 File Structure

5.1.1 power.h

5.1.2 power.c

5.1.3 anal.h

5.1.4 anal.c

5.1.5 sim-outorder.c, main.c

5.2 Control Flow

6 References

Appendix

Index

Table of figures

Figure 1: Micro architecture of a simple superscalar processor

Figure 2: Table of all the activity counts associated with the fetch stage

Figure 3: Activity Counters associated with the decode/dispatch stage

Figure 4: Instruction Issue Window

Figure 5: Activity counters associated with the execution and the writeback stage

Figure 6: Decoder Buffer

Figure 7: Static decoder schematic

Figure 8: Circuits used in the two stages

Figure 9: Dynamic decoder

Figure 10: Word line

Figure 11: Bitline

Figure 12: Sense Amplifier architecture

Figure 13: Sense Amplifier circuit

Figure 14: Output driver

Figure 15: n-bit comparator

Figure 17: A Pipeline Latch

Figure 18: Instruction Issue Window

Figure 19: Basic Configuration File

Figure 20: Technology File

Figure 21: Output File

1 Introduction

Power consumption (and dissipation) has become a critical design consideration in modern microprocessors. For battery-powered devices, such as laptop PCs and PDAs, total power consumption is the major issue. For high-performance applications such as servers, the need to dissipate high power requires expensive packaging and cooling technologies. Furthermore, in large-scale systems, power consumption can be a major operating expense.

Microprocessors can be made more power efficient at a number of levels, ranging from the circuit level, to the gate level, all the way up to software. Our particular interest is in improving power efficiency at the microarchitecture level. For studying and developing power-efficient microarchitectures, power estimation tools are almost essential, and an important part of our research effort has been the development of a flexible and accurate power estimation tool, WArPE.

WArPE uses detailed microarchitecture simulation to measure energy-consuming activities and execution time. Given an energy estimate for each activity, these simulation-derived measurements can then be turned into power estimates. WArPE is based on the SimpleScalar simulator [1], a performance simulator widely used among academic researchers. An important element of power estimation is the energy consumed by each of the modeled microarchitecture-level activities. In WArPE, these energy estimates can be supplied directly by the user as empirical data or, for many important subsystems, generated by analytical models that are part of WArPE.

Other power estimation tools based on the SimpleScalar simulator have been developed [3,4]. WArPE is distinguished from these other estimators in a number of ways:

1) It can take chip technology data as an input and scale energy numbers appropriately.

2) The instruction fetch, decode, rename and issue pipeline is modeled in detail, including latches.

This document describes the internal structure and usage of the WArPE tool. Section 2 describes the detailed structure of the simulator, including the estimation methodology. Section 3 describes the analytical models used, and Section 4 covers the options, configuration files and output file details. Section 5 discusses the file structure of the simulator.

2 WArPE Processor Model

WArPE models a modern dynamically scheduled superscalar processor. The processor is divided into a number of functional unit blocks (FUBs) and is simulated in much the same way as in a performance simulator. At the end of each cycle, the estimator determines the activity of each FUB and uses this activity to estimate the energy consumed by that block. The total energy consumed by all the FUBs during a cycle yields an instantaneous power estimate, and the average over all cycles gives an average power estimate. The instantaneous power is useful when di/dt is of concern; di/dt can be estimated by computing the difference in power consumption between consecutive cycles.
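The per-cycle accounting described above can be sketched as follows. This is a minimal illustration rather than WArPE's actual code: the function names, the dictionary layout and the sample numbers are all assumptions made for the example. Energies are given in nJ and the cycle time in ns, so the resulting power is in watts.

```python
def power_trace(cycle_energies_nj, cycle_time_ns):
    """Instantaneous power (W) per cycle: total FUB energy / cycle time.

    cycle_energies_nj holds one dict per simulated cycle, mapping a FUB name
    to the energy (nJ) that block consumed in that cycle (hypothetical layout).
    """
    return [sum(fubs.values()) / cycle_time_ns for fubs in cycle_energies_nj]


def average_power(trace):
    """Average power over all simulated cycles."""
    return sum(trace) / len(trace)


def cycle_to_cycle_delta(trace):
    """Power difference between consecutive cycles, used as a di/dt proxy."""
    return [later - earlier for earlier, later in zip(trace, trace[1:])]


# Made-up per-cycle energies for two FUBs.
energies = [
    {"il1cac": 2.0, "fuint": 1.0},
    {"il1cac": 2.0, "fuint": 3.0},
    {"il1cac": 0.0, "fuint": 1.0},
]
trace = power_trace(energies, cycle_time_ns=2.0)
print(trace, average_power(trace), cycle_to_cycle_delta(trace))
```

The largest magnitude in the delta list would be the value to compare against a di/dt threshold.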

The per-activity energy estimates are determined before the simulation starts, in one of the following ways:

1) A general analytical model for RAM FUBs

2) A power density model for non-RAM FUBs

3) Latch models (primarily in the instruction pipeline)

4) Special models for critical FUBs such as the issue window

The following sections describe the overall superscalar microarchitecture, including the specific FUBs that are modeled. This is followed by descriptions of the RAM and power density analytical models. The latch models are described along with the instruction pipeline, and special models are described where the specific FUB is discussed.

2.1 Microarchitecture

In this section we touch upon some of the details of how the individual instruction pipeline units are modeled. The generic microarchitecture of a pipelined superscalar processor is shown in Figure 1.

Figure 1: Micro architecture of a simple superscalar processor

The associated units include the branch prediction tables, instruction translation lookaside buffer, data caches, data translation lookaside buffer, reorder buffer, register file, result bus, etc. For most of these we have an approximate analytical model. There is no analytical model for the latches. We now describe some details of the power models of each pipeline stage.

2.1.1 Instruction Fetch

The instruction fetch stage involves access to the instruction cache and the ITLB as well as the branch prediction logic. The FUBs representing this stage include the new-PC generation logic (npc), the logic associated with branch target buffer access (btblog), the branch target buffer RAM structure itself (btbcac), the return stack buffer (rsbcac), three FUBs for the L1 instruction cache, and the latches at the end of the pipeline stage (fdlatch). The three L1 instruction cache FUBs cover the logic circuits that access the cache (il1log), the L1 tag structure (il1tag) and the physical L1 instruction cache array (il1cac). WArPE has analytical models for almost all of these FUBs. Because most of these structures are cache- or CAM-like, they have invalidate, replacement, writeback, read and write counters associated with them. Figure 2 lists all the counters associated with this stage.

Counter No.  Name of the counter  Description
0            Brupdate             branch update activity
1            Brlookup             branch lookup activity
2            Rsbpop               return stack pop activity
3            Rsbpush              return stack push activity
4            Il1acc               il1 access activity
5            Il1wbk               il1 writebacks activity
6            Il1rep               il1 replacements activity
7            Il1inv               il1 invalidations activity
12           Il2acc               il2 access activity
13           Il2wbk               il2 writebacks activity
14           Il2rep               il2 replacements activity
15           Il2inv               il2 invalidations activity
24           Itlbmis              itlb miss activity
27           Itlbacc              itlb access activity
28           Itlbwbk              itlb writebacks activity
29           Itlbrep              itlb replacements activity
30           Itlbinv              itlb invalidations activity
35           Npc                  next pc logic activity
69           Fdlatch_active       Latch after fetch stage active
70           Fdlatch_stall        Latch after fetch stage stalled
71           Fdlatch_empty        Latch after fetch stage empty

Figure 2: Table of all the activity counts associated with the fetch stage

To build power numbers for these structures, we map each table to an approximate cache structure. The CACTI tools, which are used by almost all existing simulators, perform this mapping. Given parameters such as the cache size, associativity and number of sets, CACTI finds an optimal cache organization for each table, assuming that some cache optimizations have been done at the circuit level. The numbers of row and column decoders follow from this mapping. The power models for the caches and the decoders are the same as those suggested by Wilton and Jouppi [2]. Currently, there are no analytical models for the writeback, replacement or invalidation logic circuits, but the simulator maintains a count of these activities. To calculate the power, we multiply the activity counts by approximate power numbers obtained from industry. The user can, however, supply different numbers and thereby customize the simulator.
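The count-times-constant calculation used for the writeback, replacement and invalidation activities can be sketched as below. The counter names come from Figure 2; the function names, the per-event energy values and the choice to treat a counter without a user-supplied number as contributing zero are assumptions made for illustration only.

```python
def activity_energy(counts, energy_per_event):
    """Total energy = sum over counters of (event count * energy per event).

    A counter with no user-supplied energy contributes nothing here; that
    default is an assumption of this sketch, not documented WArPE behaviour.
    """
    return sum(n * energy_per_event.get(name, 0.0) for name, n in counts.items())


def activity_power(counts, energy_per_event, elapsed_time):
    """Average power over the interval in which the counts were collected."""
    return activity_energy(counts, energy_per_event) / elapsed_time


# Hypothetical counts and user-supplied energies (arbitrary units).
counts = {"Il1wbk": 3, "Il1rep": 5, "Il1inv": 1}
user_energy = {"Il1wbk": 2.0, "Il1rep": 1.0}
print(activity_energy(counts, user_energy))
print(activity_power(counts, user_energy, elapsed_time=2.0))
```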

At the end of the fetch stage is a set of pipeline latches, which may be of variable width. These latches may be in the Active, Stalled or Empty state, with each state consuming a different amount of energy. The simulator keeps an account of the number of latches in each state per cycle, which gives the power consumed by the latches each cycle. More detail on the latch power model follows in Section 3.3.

2.1.2 Instruction Decode / Dispatch Stage

The decode stage comprises the decoders as well as the register aliasing table associated with the register renaming logic. These units are represented in the simulator by FUBs for the dispatch queue (dispatchq), the instruction decoder (decodepla), the decoder logic for handling mispredictions (decodemisp), the logic for stalling the decoder (decodestall), the register aliasing table (ratarr), input/output dependence checking (ratidep, ratodep), register aliasing table stall (ratstall) and the latches at the end of the pipe stage (dilatch). There are counters associated with decoder stall and mispredict activity as well as with the decoder access itself. The register aliasing table has counters associated with the table itself as well as with input and output dependence checking activity. A list of all the counters is given in Figure 3. Presently, we have an analytical model only for the register aliasing table cache. The remaining activity counters are multiplied by the power numbers obtained from the user input file (pfa mode).

Counter No.  Name of the counter  Description of the counter
36           Dispatchqrd          dispatchq read activity
37           Dispatchqwr          dispatchq write activity
38           Dispatchqrel         dispatchq release activity
39           Dispatchqrec         dispatchq recover activity
40           Decoder              decoder activity
41           Decodemispchk        decoder mispredict detect activity
42           Decodemisp           decoder mispredict correction activity
43           Decodestallchk       decoder stall detect activity
44           Decodestall          decoder stall block activity
45           Ratidep              rat idep allocation activity
46           Ratodep              rat odep allocation activity
47           Ratstallchk          rat stall detection activity
48           Ratstall             rat stall block activity
72           Dilatch_active       Latch after decode stage active
73           Dilatch_stall        Latch after decode stage stall
74           Dilatch_empty        Latch after decode stage empty

Figure 3: Activity Counters associated with the decode/dispatch stage

The decoded instructions are moved into another set of latches, which again may vary in both size and number. These latches may model the delay associated with the renaming logic or with the actual decoding of the instruction. As before, the latches can be in one of three states (Active, Stalled or Empty), each with its own power number, which may be the same as for the previous latches. We maintain a per-cycle record of the state of the latches (Dilatch_active, Dilatch_stall, Dilatch_empty) and calculate the per-cycle contribution to total power.

Figure 4: Instruction Issue Window

Another innovative idea in this power simulator concerns the issue window. The simulator models both collapsible and non-collapsible instruction issue windows with the same FUB, isw. Collapsing the instruction window consumes some power. The simulator has a counter that records these movements each cycle (Iswcolmoved), and the user can supply the power associated with them. The issue window can also be viewed as a set of fixed-length latches with the same three states as before. The Active state (Iswact) now corresponds to the number of instructions ready to be issued that cycle, while the Stalled state (Iswstall) corresponds to instructions that are still waiting for their operands to become ready. The Empty state (Iswempty) represents the unoccupied entries of the issue window each cycle. A detailed power model is explained in Section 3.4.

2.1.3 Instruction Execution and Writeback

The selected instructions are then issued to the corresponding functional units or stored in the load/store queue. The FUBs for this stage include those for the integer functional units (fuint), the floating point functional units (fufp), the L1 data cache logic circuit (dl1log), the L1 data cache tag structure (dl1tag) and the L1 data cache (dl1cac), the corresponding FUBs for the unified L2 cache (ul2log, ul2tag, ul2cac), the load/store queue (lsqrdyq) and the data TLB (dtlbcac). The simulator does not have an analytical model for any of the functional units, but the load/store queue can be modeled as a pair of cache-like structures together with a CAM-like structure, with analytical models for both. Another structure associated with the execution stage is the data cache. The simulator models the data cache along the same lines as the instruction cache, using the CACTI tools. There are counters for data cache access (dl2acc), writeback (dl2wbk), replacement (dl2rep) and invalidation (dl2inv). The data TLB is modeled along the lines of the instruction TLB and hence uses the CAM-like analytical model. The results generated by the functional units are broadcast on the result bus, but the current version of the simulator does not calculate the power consumed by this bus.

All the activities associated with the initialization and the utilization of the register update unit are represented by the FUBs for the RUU array (ruuarr) and the RUU writeback (ruuwb). A complete list of all the FUBs and counters is included in the appendix to this manual. The counters associated with this stage are as follows:

Counter No.  Name of the counter  Description of the counter
8            Dl1acc               dl1 access activity
9            Dl1wbk               dl1 writebacks activity
10           Dl1rep               dl1 replacements activity
11           Dl1inv               dl1 invalidations activity
16           Dl2acc               dl2 access activity
17           Dl2wbk               dl2 writebacks activity
18           Dl2rep               dl2 replacements activity
19           Dl2inv               dl2 invalidations activity
20           Ul2acc               ul2 access activity
21           Ul2wbk               ul2 writebacks activity
22           Ul2rep               ul2 replacements activity
23           Ul2inv               ul2 invalidations activity
25           Dtlbmis              dtlb miss activity
26           Ul2mis               ul2 miss activity
31           Dtlbacc              dtlb access activity
32           Dtlbwbk              dtlb writebacks activity
33           Dtlbrep              dtlb replacements activity
34           Dtlbinv              dtlb invalidations activity
45           Ratidep              rat idep allocation activity
46           Ratodep              rat odep allocation activity
47           Ratstallchk          rat stall detection activity
48           Ratstall             rat stall block activity
49           Ruuarr               ruu array activity
50           Ruurdyqsch           ruu readyq allocation activity
51           Ruurec               ruu recover activity
52           Ruuret               ruu retire activity
53           Ruurdyqcam           ruu readyq dependence check activity
54           Ruurdyqrel           ruu readyq resource release activity
55           Lsqarr               lsq array activity
56           Lsqrdyqsch           lsq readyq allocation activity
57           Lsqrec               lsq recover activity
58           Lsqret               lsq retire activity
59           Lsqrdyqcam           lsq readyq dependence check activity
60           Lsqrdyqrel           lsq readyq resource release activity
61           Ruuarb               ruu arbitration activity
62           Ruuwb                ruu writeback scheduler activity
63           Ruuwbq               ruu writebackq activity
64           Lsqarb               lsq arbitration activity
65           Lsqwb                lsq writeback scheduler activity
66           Lsqwbq               lsq writebackq activity
67           Fuint                functional unit integer
68           Fufp                 functional unit floating point

Figure 5: Activity counters associated with the execution and the writeback stage

3 Analytical Models

Architectural power estimation methodologies can be broadly classified into empirical methods and analytical methods. These can further be classified into fixed-activity and activity-sensitive methods. One of the earliest methods of power estimation was a fixed-activity method called the Power Factor Approximation (PFA) method, described by Liu and Svensson [5]. Power estimation techniques have come a long way since then, with activity-based models, transition-sensitive models and so on. The basic estimation methodology is, however, the same: we either calculate the power density constants associated with each structure (the analytical model) or take the power constants as input from the user (the pfa model).

3.1 Power Density Model

Several architectural power estimation schemes have been discussed in the literature [6][7]. In WArPE we use a scheme similar to Power Factor Approximation (PFA) [5]. We express the power dissipation in terms of the active and inactive power densities of each FUB, the area of the FUB and the activity factor, which is determined via performance simulation.

power = {(active power density)*(activity) + (inactive power density)*(1 - activity)} * area

The power density and area numbers are either determined empirically from a real design and scaled to the required technology, or estimated by considering circuit complexity, logic styles, etc. The power density numbers are further divided based on the following circuit styles:

Dynamic logic

Static logic

PLA circuits

Memory type regular circuits

Clock circuits

Thus, for every FUB, one has to define 5*3 = 15 different numbers, corresponding to the active power density, inactive power density and area for each of the five circuit styles. The user can supply these through the configuration file. However, it is not always possible to obtain or estimate these numbers. To overcome this problem we have included routines that can analytically model FUBs. Presently, we can construct models for most regular memory-type structures such as caches, register files, register renaming tables, branch target buffers and reorder buffers. The simulator is designed so that models can be updated and new models added relatively easily.
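As a worked illustration of the power density formula, the sketch below evaluates it for one FUB described by the 15 numbers. The field names follow the configuration file columns (dyn_pda, dyn_pdi, dyn_a and so on). Summing the five circuit styles and applying a single activity factor to all of them are assumptions of this sketch; the manual does not spell out how the styles are combined.

```python
STYLES = ("dyn", "sta", "clk", "mem", "pla")


def fub_power(params, activity):
    """Power of one FUB from its 15 power density / area numbers.

    For each circuit style:
        (active density * activity + inactive density * (1 - activity)) * area
    The per-style results are summed (assumed combination rule).
    """
    total = 0.0
    for style in STYLES:
        active = params[style + "_pda"]    # active power density, W/m^2
        inactive = params[style + "_pdi"]  # inactive power density, W/m^2
        area = params[style + "_a"]        # area, m^2
        total += (active * activity + inactive * (1.0 - activity)) * area
    return total


# Hypothetical FUB with only dynamic and memory-type circuits.
params = {style + suffix: 0.0 for style in STYLES for suffix in ("_pda", "_pdi", "_a")}
params.update(dyn_pda=100.0, dyn_pdi=20.0, dyn_a=0.5,
              mem_pda=40.0, mem_pdi=0.0, mem_a=0.25)
print(fub_power(params, activity=0.25))
```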

In order to take physical structure into consideration, a few more options have been added. The analytical models can, and in fact will have to, be refined continuously to improve the accuracy of the results. Models for other regular structures such as PLAs can also be added.

3.2 Analytical RAM Model

In the analytical mode, power constants are generated using the analytical models provided. Presently, we have the capability to model most regular and simple logic-based structures. The idea is to break FUBs into smaller components for which analytical models are present. The models are based on a circuit time-delay-energy simulation model similar to that used by Wilton and Jouppi [2]. Some of the differences include a choice of static vs. dynamic logic for the decoder and a single-ended read option for register files. These models can be used to construct power constants for FUBs that contain regular, memory-type building blocks. The FUBs that have already been modeled are the instruction and data caches, TLBs, branch target cache, register allocation table and return address stack. Other units that can be modeled are the register update unit and load/store queue arrays.

For example, a cache can be divided into a decoder buffer, row decoder, wordlines, bitlines, sense amplifiers, column decoder and output MUXes. The models generate power numbers by calculating the effective switching capacitance, which is estimated by adding the gate, drain and routing capacitances together. These are calculated by functions that take the width and length of the polysilicon used as inputs. The length of all transistors is assumed to be constant and equal to the Leff defined in the technology file. The list of these functions (included in anal.c) follows.

gatecap(): returns the gate capacitance of the transistor.

gatecappass(): returns the gate capacitance for a pass transistor.

draincapp(): returns the drain capacitance for a p-type transistor. It has the added feature of optimizing for stacked transistors, for example the n-type transistors in a 4-input NAND.

draincapn(): the corresponding function for an n-type transistor.
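To show how a capacitance total becomes a power number, here is a minimal sketch. It does not reproduce the signatures or internals of gatecap(), draincapp() or draincapn(), which this manual does not list; it simply takes the three capacitance components as given. The energy and power expressions are the standard CMOS full-swing relations, assumed here rather than taken from anal.c, and all function names are hypothetical.

```python
def effective_capacitance(c_gate, c_drain, c_route):
    """Effective switching capacitance: gate + drain + routing (farads)."""
    return c_gate + c_drain + c_route


def switching_energy(c_eff, vdd):
    """Energy drawn from the supply for one full-swing 0 -> 1 transition.

    Standard CMOS relation E = C * Vdd^2 (an assumption of this sketch).
    """
    return c_eff * vdd ** 2


def dynamic_power(c_eff, vdd, freq_hz, activity):
    """Average switching power: activity * C * Vdd^2 * f."""
    return activity * switching_energy(c_eff, vdd) * freq_hz


# Toy numbers chosen for easy arithmetic, not real device values.
c_eff = effective_capacitance(1.0, 2.0, 3.0)
print(c_eff, switching_energy(c_eff, 2.0), dynamic_power(c_eff, 2.0, 10.0, 0.5))
```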

The following sections describe each of the basic models provided. An example of the

usage of these models to create more complex models will be given in the last chapter.

3.2.1 Decoder Buffer

The decoder buffer, as the name suggests, buffers the address lines that go into the decoders. The buffer is an important element if the address lines feed into a large number of gates. Presently, the sizes of the buffer transistors are fixed. These could be changed depending on the number of gates connected to the lines and the speed required. Figure 6 shows the buffer architecture.

Figure 6: Decoder Buffer.

[Figure 6 labels: ADDR BITS * 2 (BIT and NBIT); VDD; GND; Decoder Buffer; Single Buffer]

3.2.2 Decoder

Two types of decoder models have been included, depending on the type of circuits they use. The first is a static decoder based on a two-level decoding scheme. The first stage is constructed from 3x8 and 2x4 NAND-based decoders. The second stage consists of an n-input NOR for every output bit, where n is the number of minterms in stage 1. Figure 7 brings out the basic architecture of this decoder.

Figure 7: Static decoder schematic

[Figure 7 labels: Stage 1: N decoders (3x8, 2x4) using NAND gates; Stage 2: N input NOR gate]

Figure 8: Circuits used in the two stages

[Figure 8 labels: BIT and NBIT; outputs from four 2-input NAND gates; structure of decoder, e.g. 2x4 decoder; second stage NOR gate, e.g. 4-input NOR; out]

The second type of decoder is the dynamic decoder, which is based on a domino NOR. However, the number of inputs to this decoder should be limited to around six. Figure 9 shows a schematic of the dynamic decoder.

Figure 9: Dynamic decoder.

[Figure 9 labels: out; precharge]

3.2.3 Wordline

The wordline power model includes both the wordline and the wordline driver. The driver size is computed by a function called WLdriver_size(). The inputs to this function are the capacitance driven and the expected rise time. For lack of data, the rise time has been assumed to be period/8; this can be changed by editing the entry in tech.h. The model also takes into account the single-ended read cells used in register files. A schematic of the wordline is shown in Figure 10.

Figure 10: Word line.

[Figure 10 labels: columns; wordline driver]

3.2.4 Bitline

The bitline model takes into account the precharge transistors, line capacitance and isolation transistors. Several finer details have also been modeled; detailed comments in the code explain these. The basic schematic of the bitline is shown in Figure 11.

Figure 11: Bitline

[Figure 11 labels: precharge; precharge equalizer; rows; columns; isolation; pass gate]

3.2.5 Sense Amplifier

The sense amplifier is shared by many bitlines using a column MUX. However, one should not multiplex more than eight bitlines together, due to leakage issues. The MUX is a standard pass-gate-based MUX with a column decoder. The basic architecture and the sense amplifier circuit are shown in Figures 12 and 13.

Figure 12: Sense Amplifier architecture

[Figure 12 labels: colmux; sense amplifier; MUX; BIT; BITN; OUT]

Figure 13: Sense Amplifier circuit.

[Figure 13 labels: BIT; BITN; Vdd; Gnd]

3.2.6 Output driver

The output driver uses an array of tri-state drivers like the one shown in Figure 14.

Figure 14: Output driver.

[Figure 14 labels: VDD; GND; sel; sense amp out; out]

3.2.7 Generic mux

This is a standard pass-gate-based MUX that, as the name suggests, can be used to model a general MUX. The only specifications required are the number of inputs to be multiplexed into one bit and the number of output bits.

3.2.8 Comparator

The comparator design is shown in Figure 15.

Figure 15: n-bit comparator

[Figure 15 labels: a0; na0; b0; nb0; precharge; Vdd; out; # of bits to compare]

3.3 Latch Model

At the end of the fetch stage are the pipeline latches associated with that stage. These pipeline latches model the delay incurred in moving an instruction from the fetch stage to the decode stage. Such delays could be due to the BTB lookup or to obtaining the branch prediction. The latches can be of variable size, and the number of latches also varies with the delay to be modeled. The latch length varies because some information may be added at a later latch in the pipeline. At any time, each latch is in one of three states:

Active: a new instruction was moved into the latch that cycle.

Stalled: the latch is still holding the instruction it held in the previous cycle.

Empty: the latch is not storing anything that cycle.

The power associated with each of these states is different and is read from the input file.
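The three-state latch accounting can be sketched as follows. This is an illustrative model only: the function names, the use of an instruction identifier (or None) to represent latch contents, and the energy values are assumptions, not WArPE's data structures.

```python
def latch_state(held_before, held_now):
    """Classify a latch for the current cycle.

    held_before / held_now are the instruction identifiers in the latch in
    the previous and current cycle, or None when the latch holds nothing.
    """
    if held_now is None:
        return "empty"
    if held_now == held_before:
        return "stalled"  # same instruction recirculated from last cycle
    return "active"       # a new instruction moved in this cycle


def latch_cycle_energy(state_counts, energy_per_state):
    """Energy consumed by a group of latches in one cycle.

    state_counts maps each state to the number of latches in that state;
    energy_per_state holds the user-supplied energy for each state.
    """
    return sum(state_counts[s] * energy_per_state[s] for s in state_counts)


# Hypothetical per-state energies (arbitrary units).
energies = {"active": 4.0, "stalled": 2.0, "empty": 0.5}
print(latch_state(None, 7), latch_state(7, 7), latch_state(7, None))
print(latch_cycle_energy({"active": 2, "stalled": 1, "empty": 1}, energies))
```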


Figure 16: Simple Architecture along with the Pipeline latches

This breakdown of energy-consuming activity allows for a form of clock gating where

active instructions may consume more energy than stalled instructions, and where valid

instructions may consume more energy than invalid ones (i.e. empty pipeline slots). For

example, consider the logic shown in Figure 17. Here, a typical pipeline latch is shown,

as might appear in the decode pipeline. An input multiplexor (typically built into the

latch) is used to "recirculate" latched pipeline values when the hold signal is active. In

addition, the valid bit from the preceding stage is used to gate the latch itself; if there is

no valid data being fed into the latch, then the latch is not clocked.


Figure 17: A Pipeline Latch

A Valid Bit from the previous stage is used to gate the clock signal. A hold signal from the succeeding stage is used to switch the

multiplexor and recirculate data being stalled.

In this system, a certain amount of energy is consumed if an instruction moves up the pipeline (the hold signal is inactive) and is latched into the next stage. A different (lower) amount is consumed if the hold signal is active: the multiplexor feeds the same data back into the latch and the latch is clocked, but the logic following the latch does not see any of its inputs change. Finally, a different (still lower) amount of energy is consumed if the valid signal is off and the latch is not clocked at all. Similarly, in the issue queue, a particular slot may consume different amounts of energy depending on whether or not it holds an active instruction and whether or not the instruction actually issues. The pipeline latches were taken from a high-end design environment. A 2-to-1 static mux was used to recirculate the data when stalled. Each cycle, the simulator keeps an account of the latches in the various states and of the total power the latches consume that cycle. This is one of the innovative ideas in this simulator.

[Figure 17 labels: latch; MUX; clock; valid from previous stage; data in; data out; hold from next stage]

3.4 Special Model for Issue Window

As stated before, the simulator models both collapsible and non-collapsible instruction issue windows with the same FUB, isw. Collapsing the instruction window consumes some power. The simulator has a counter that records these movements each cycle (Iswcolmoved), and the user can supply the power associated with them. The issue window can also be viewed as a set of fixed-length latches with the same three states as before. The Active state (Iswact) now corresponds to the number of instructions ready to be issued that cycle, while the Stalled state (Iswstall) corresponds to instructions that are still waiting for their operands to become ready. The Empty state (Iswempty) represents the unoccupied entries of the issue window each cycle.
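One plausible way to count the movements in a collapsible window is sketched below. It assumes that the window is an ordered list with the oldest entry at index 0, that issued entries are removed and the survivors shift toward index 0, and that each surviving entry that changes position counts as one movement. How WArPE actually increments Iswcolmoved is not stated here, so treat the function and its counting rule as hypothetical.

```python
def collapse_window(window, issued):
    """Remove issued entries and compact the survivors toward index 0.

    window: list of entries, with None marking an empty slot.
    issued: set of indices whose instructions issue this cycle.
    Returns (new_window, moved), where moved is the number of surviving
    entries that ended up in a different slot.
    """
    survivors = [(i, entry) for i, entry in enumerate(window)
                 if entry is not None and i not in issued]
    moved = sum(1 for new_pos, (old_pos, _) in enumerate(survivors)
                if new_pos != old_pos)
    new_window = [entry for _, entry in survivors]
    new_window += [None] * (len(window) - len(new_window))
    return new_window, moved


# Entry "b" issues, so "c" and "d" each shift down by one slot.
print(collapse_window(["a", "b", "c", "d", None, None], {1}))
```

Multiplying the movement count by a user-supplied energy per movement would then give the collapse energy for the cycle; a non-collapsible window would simply report zero movements.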


Figure 18: Instruction Issue Window

For the issue queue, the wakeup logic is modeled by counting the energy consumed in the comparators. For the selection logic, the energy of one arbiter cell was supplied; the number of arbiter cells per arbiter was then calculated from the number of entries in the issue queue. We assume one arbiter per issue port (four issue ports in our case). Every entry in the issue queue has some comparators for tag matching. The wakeup logic involves tag comparison and has a level of XOR gates followed by NAND gates. Assuming that the NAND gates are smaller than the XOR gates, the simulator records the power consumed in the XOR gates each cycle. There are counters associated with each of the states of the issue window latches as well as with data movement between these latches for a collapsible window.

4 Options, Configuration, Output

This section describes the options, configuration files and output files used in the WArPE

power estimation tool.

4.1 Options

The estimator options (in addition to the underlying SimpleScalar options) are defined below. These options have been registered in the original SimpleScalar option database. Implementing them required modifying some of the original sim-outorder.c code.

-power_config <filename>: This option specifies the power simulator configuration file. The file must have read permissions. The default file name is power.txt.

-power_outfile <filename>: This option specifies the file into which output statistics are dumped. The default file name is power_output.txt.

-tech_file <filename>: This option specifies the technology definition file name. The file must have read permissions. The default file name is technology.def.

-technology <technology>: This option specifies the power simulation technology. The technology is defined by an identifier listed in the technology file, e.g. -technology 0.25um. The default technology is 0.8um.

-sim_limit <limit>: This option specifies the number of instructions (in millions) at which the simulation stops and data is dumped into the output file.

4.2 Configuration files

Following is a description of the various configuration files used in the WArPE estimator.

Configuration files provide an easy and effective way of defining the large number of

parameters used in the simulator.


4.2.1 Basic configuration file

This is the file defined by the –power_config option. It defines the power densities,

areas, mode of operation i.e. pfa (empirical) or anal (analytical model), power thresholds,

and physical partitioning parameters. This file can be generated by saving a Microsoft

Excel worksheet in tab delimited text format.

The file has three main option:

1) –global <max. power threshold> <max. di/dt threshold>

These define the power and di/dt thresholds for the full chip. The unit is watts.

2) <unit> <mode> <maxpowerth> <maxdidtth> <dyn_pda> <dyn_pdi> <dyn_a>

<sta_pda> <sta_pdi> <sta_a> <clk_pda> <clk_pdi> <clk_a> <mem_pda>

<mem_pdi> <mem_a> <pla_pda> <pla_pdi> <pla_a>

unit: name of the FUB (Functional Unit Block) as defined in power_init().

mode: pfa: directs the simulator to use empirical data i.e. dyn_pda,…,pla_a.

anal: directs the simulator to use analytical model for the FUB.

maxpowerth: maximum power threshold for the FUB.

maxdidtth: maximum di/dt threshold for the FUB.

dyn_pda: dynamic circuit power density - active

dyn_pdi: dynamic circuit power density - inactive

dyn_a: dynamic circuit area

sta_pda: static circuit power density – active

sta_pdi: static circuit power density – inactive

sta_a: static circuit area

clk_pda: clock circuit power density – active

clk_pdi: clock circuit power density – inactive

clk_a: clock circuit area

mem_pda: memory type circuit power density – active

mem_pdi: memory type circuit power density – inactive

mem_a: memory type circuit area

pla_pda: PLA power density – active

pla_pdi: PLA power density – inactive

pla_a: PLA circuit area

The units of the power densities are W/m2, and the units of area are m2.

3) -<unit name> <nwl> <nbl> <nsp> <logic_style> <rd mode>

E.g. -itlbcac 1 2 1 static dual

This option specifies the physical partitioning. The example above defines the partitioning for the ITLB. The option name is a "-" followed by the FUB name.

<nwl> : The number of partitions of the wordline. Each partition has a

different decoder and wordline driver. The partitions however

share sense amplifiers.

<nbl> : The number of partitions of the bitline. Each partition has separate

sense amplifiers and decoders.

<nsp> : Similar to bitline partition but shares decoder.

<logic_style> : The type of logic used for decoders, static or dynamic.

<rd mode> : Defines the read mode i.e. dual for dual rail and single for single

ended (used in small register files).
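As an illustration of the partition option format, the following C sketch parses one such line. The type partition_opt and the function parse_partition() are hypothetical names chosen for this example; the parsing code in the simulator itself is not reproduced here and may well differ.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one physical-partition option line. */
typedef struct {
    char unit[32];        /* FUB name without the leading '-' */
    int  nwl;             /* number of wordline partitions */
    int  nbl;             /* number of bitline partitions */
    int  nsp;             /* bitline-like partitions that share a decoder */
    char logic_style[8];  /* "static" or "dynamic" */
    char rd_mode[8];      /* "dual" or "single" */
} partition_opt;

/* Parse a line such as "-itlbcac 1 2 1 static dual".
 * Returns 0 on success and -1 if the line is malformed. */
int parse_partition(const char *line, partition_opt *out)
{
    int n = sscanf(line, " -%31s %d %d %d %7s %7s",
                   out->unit, &out->nwl, &out->nbl, &out->nsp,
                   out->logic_style, out->rd_mode);
    if (n != 6)
        return -1;
    if (strcmp(out->logic_style, "static") != 0 &&
        strcmp(out->logic_style, "dynamic") != 0)
        return -1;
    if (strcmp(out->rd_mode, "dual") != 0 &&
        strcmp(out->rd_mode, "single") != 0)
        return -1;
    return 0;
}
```

Calling parse_partition("-itlbcac 1 2 1 static dual", &p) fills p.unit with "itlbcac" and sets nwl, nbl and nsp to 1, 2 and 1. A line without the leading "-" is rejected.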

4.2.2 Process Technology Data File

This file contains the process technology data for several generations. It must contain at least the data for the technology selected by the -technology option. Some of the data in the technology file is not used at present; it will be used in later revisions, e.g. for dual-Vt technologies. The format of the technology data is as follows:

<tech> <Leff> <Vdd> <f> <Vtl> <Vth> <Iol> <Ioh>

Eg. 0.8um 0.80 5.00 100 0.75 0.75 1 1

<tech>: Technology identifier. It should match the identifier supplied using the

-technology option.

<Leff>: The effective channel length in microns.

<Vdd>: The supply voltage of the technology, in volts.

<f>: The clock frequency in MHz.

<Vtl>: The lower threshold voltage, in volts, for use in dual-threshold (dual-Vt) circuits.

<Vth>: The higher threshold voltage, in volts.

<Iol>: Leakage current for the lower threshold voltage, in nA/um.

<Ioh>: Leakage current for the higher threshold voltage, in nA/um.

4.3 Output file

This file contains the output power statistics generated when the number of simulated instructions reaches sim_limit or when the simulation ends. It is a plain-text table with one row per FUB, followed by the global statistics. Sample configuration files and a sample output file are shown below.

-global 10 10

npclog pfa 1 1 7.72 0.772 3.20E+04 6.05 0.605 2.56E+05 8.43 8.43 3.20E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00

btblog pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 2.49E+05 8.43 8.43 1.31E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00

btbcac anal 1 1 7.72 0.772 1.50E+05 6.05 0.605 9.00E+05 8.43 8.43 1.50E+05 10.75 1.075 1.80E+06 91.75 9.175 0.00E+00

rsbcac anal 1 1 7.72 0.772 3.85E+04 6.05 0.605 7.70E+04 8.43 8.43 1.93E+04 10.75 1.075 5.78E+04 91.75 9.175 0.00E+00

itlbcac anal 1 1 7.72 0.772 1.50E+05 6.05 0.605 3.00E+05 8.43 8.43 3.75E+04 10.75 1.075 2.63E+05 91.75 9.175 0.00E+00

dtlbcac anal 1 1 7.72 0.772 1.20E+04 6.05 0.605 4.00E+05 8.43 8.43 4.00E+04 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00

pmhlog pfa 1 1 7.72 0.772 6.00E+04 6.05 0.605 2.00E+05 8.43 8.43 2.00E+04 10.75 1.075 1.20E+05 91.75 9.175 0.00E+00

il1log pfa 1 1 7.72 0.772 2.40E+05 6.05 0.605 1.68E+06 8.43 8.43 2.40E+05 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00

il1tag anal 1 1 7.72 0.772 5.28E+05 6.05 0.605 7.92E+05 8.43 8.43 2.64E+05 10.75 1.075 3.70E+06 91.75 9.175 0.00E+00

il1cac anal 1 1 7.72 0.772 0.00E+00 6.05 0.605 1.32E+06 8.43 8.43 3.30E+05 10.75 1.075 4.95E+06 91.75 9.175 0.00E+00

dl1log pfa 1 1 7.72 0.772 3.60E+05 6.05 0.605 1.68E+06 8.43 8.43 1.20E+05 10.75 1.075 2.40E+05 91.75 9.175 0.00E+00

dl1tag anal 1 1 7.72 0.772 2.64E+05 6.05 0.605 7.92E+05 8.43 8.43 2.64E+05 10.75 1.075 3.96E+06 91.75 9.175 0.00E+00

dl1cac anal 1 1 7.72 0.772 0.00E+00 6.05 0.605 1.32E+06 8.43 8.43 3.30E+05 10.75 1.075 4.95E+06 91.75 9.175 0.00E+00

dispatchq pfa 1 1 7.72 0.772 6.50E+05 6.05 0.605 4.88E+05 8.43 8.43 1.63E+05 10.75 1.075 3.25E+05 91.75 9.175 0.00E+00

decodepla pfa 1 1 7.72 0.772 3.20E+04 6.05 0.605 4.80E+04 8.43 8.43 1.60E+04 10.75 1.075 0.00E+00 91.75 9.175 6.40E+04

decodemisp pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 7.43E+04 8.43 8.43 8.25E+03 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00

decodestall pfa 1 1 7.72 0.772 0.00E+00 6.05 0.605 5.23E+04 8.43 8.43 2.75E+03 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00

ratarr anal 1 1 7.72 0.772 2.08E+05 6.05 0.605 5.20E+05 8.43 8.43 5.20E+04 10.75 1.075 2.60E+05 91.75 9.175 0.00E+00

ruuarr pfa 1 1 7.72 0.772 9.10E+04 6.05 0.605 1.82E+05 8.43 8.43 4.55E+04 10.75 1.075 1.37E+05 91.75 9.175 0.00E+00

lsqarr pfa 1 1 7.72 0.772 4.55E+04 6.05 0.605 9.10E+04 8.43 8.43 2.28E+04 10.75 1.075 6.83E+04 91.75 9.175 0.00E+00

ruurdyq pfa 1 1 7.72 0.772 1.50E+04 6.05 0.605 2.00E+04 8.43 8.43 2.50E+03 10.75 1.075 1.25E+04 91.75 9.175 0.00E+00

lsqrdyq pfa 1 1 7.72 0.772 7.50E+03 6.05 0.605 1.00E+04 8.43 1250 4.00E+04 10.75 1.075 6.25E+03 91.75 9.175 0.00E+00

ruuarb pfa 1 1 7.72 0.772 1.05E+05 6.05 0.605 6.30E+05 8.43 8.43 1.05E+05 10.75 1.075 2.10E+05 91.75 9.175 0.00E+00

ruuwb pfa 1 1 7.72 0.772 2.00E+05 6.05 0.605 1.20E+06 8.43 8.43 2.00E+05 10.75 1.075 4.00E+05 91.75 9.175 0.00E+00

lsqarb pfa 1 1 7.72 0.772 1.05E+05 6.05 0.605 6.30E+05 8.43 8.43 1.05E+05 10.75 1.075 2.10E+05 91.75 9.175 0.00E+00

lsqwb pfa 1 1 7.72 0.772 2.00E+05 6.05 0.605 1.20E+06 8.43 8.43 2.00E+05 10.75 1.075 4.00E+05 91.75 9.175 0.00E+00

fuint pfa 1 1 7.72 0.772 8.50E+04 6.05 0.605 2.38E+05 8.43 8.43 1.70E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00

fufp pfa 1 1 7.72 0.772 1.13E+05 6.05 0.605 3.15E+05 8.43 8.43 2.25E+04 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00

ul2log pfa 1 1 7.72 0.772 1.44E+05 6.05 0.605 6.72E+05 8.43 8.43 4.80E+04 10.75 1.075 9.60E+04 91.75 9.175 0.00E+00

ul2tag anal 1 1 7.72 0.772 3.60E+05 6.05 0.605 2.88E+06 8.43 8.43 3.60E+05 10.75 1.075 3.60E+06 91.75 9.175 0.00E+00

ul2cac anal 1 1 7.72 0.772 1.50E+06 6.05 0.605 6.00E+06 8.43 8.43 0.00E+00 10.75 1.075 2.25E+07 91.75 9.175 0.00E+00

biu pfa 1 1 7.72 0.772 5.00E+05 6.05 0.605 4.00E+06 8.43 8.43 5.00E+05 10.75 1.075 0.00E+00 91.75 9.175 0.00E+00

fdlatch_0 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

fdlatch_1 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

fdlatch_3 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

fdlatch_4 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

dilatch_0 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

dilatch_1 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

dilatch_2 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

dilatch_3 pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

isw pfa 1 1 86 34 10 0 0 0 0 0 0 0 0 0 0 0 0

-dl1cac 1 1 1 static dual -dl1tag 1 1 1 static dual

-dl2cac 1 1 1 static dual -dl2tag 1 1 1 static dual

-il1cac 1 1 1 static dual -il1tag 1 1 1 static dual

-il2cac 1 1 1 static dual -il2tag 1 1 1 static dual

-dtlbcac 1 1 1 static dual

-itlbcac 1 1 1 static dual

-btbcac 1 1 1 static dual

-regfile 1 1 1 static single

Figure 19: Basic Configuration File

tech     L(um)   Vdd(V)   f(MHz)   Vtl(V)   Vth(V)   Iol(nA/um)   Ioh(nA/um)
0.8um    0.80    5.00     100      0.75     0.75     0.01         0.01
0.6um    0.60    3.30     200      0.65     0.65     0.01         0.01
0.35um   0.35    2.50     300      0.55     0.55     0.1          0.1
0.25um   0.25    1.50     450      0.45     0.45     0.1          0.1
0.18um   0.18    1.05     700      0.35     0.35     1            0.1
0.15um   0.15    1.00     1000     0.30     0.35     1            0.1
0.13um   0.13    1.00     1500     0.28     0.35     1            0.1
0.1um    0.10    0.75     2250     0.25     0.35     1            0.1
0.07um   0.07    0.60     3300     0.25     0.35     10           0.1

Figure 20: Technology File.
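To make the technology file format concrete, the following C sketch parses one data row and looks up the row whose identifier matches the requested technology. The names tech_row, parse_tech_row() and find_tech() are assumptions made for this example; the real reader lives in tech.c and may be organized differently.

```c
#include <stdio.h>
#include <string.h>

/* One row of the technology file (hypothetical type for this sketch). */
typedef struct {
    char   tech[16];   /* technology identifier, e.g. "0.25um" */
    double leff_um;    /* effective channel length in microns */
    double vdd;        /* supply voltage in volts */
    double f_mhz;      /* clock frequency in MHz */
    double vtl, vth;   /* lower and higher threshold voltages in volts */
    double iol, ioh;   /* leakage currents in nA/um */
} tech_row;

/* Parse one data row. Returns 0 on success and -1 otherwise,
 * which is also what happens for the column-header line. */
int parse_tech_row(const char *line, tech_row *r)
{
    int n = sscanf(line, "%15s %lf %lf %lf %lf %lf %lf %lf",
                   r->tech, &r->leff_um, &r->vdd, &r->f_mhz,
                   &r->vtl, &r->vth, &r->iol, &r->ioh);
    return n == 8 ? 0 : -1;
}

/* Scan the lines of a technology file for the row matching `id`.
 * Returns 0 and fills *r if found, -1 if the technology is missing. */
int find_tech(const char *const lines[], int nlines,
              const char *id, tech_row *r)
{
    for (int i = 0; i < nlines; i++) {
        if (parse_tech_row(lines[i], r) == 0 && strcmp(r->tech, id) == 0)
            return 0;
    }
    return -1;
}
```

The lookup fails cleanly when the requested identifier is not present, which corresponds to the requirement that the file must contain at least the technology named on the command line.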

Sun May 19 17:07:59 2002

Power simulation checkpoint at 200000051 instructions

functional    cumulative    maximum    maximum       maximum power    maximum didt

block name    power         power      didt power    violations       violations

npclog 4.354e+06 8.262e+06 7.813e+06 0 0

btblog 6.775e+05 8.097e+06 7.835e+06 0 0

btbcac 1.59e+06 2.135e+07 2.092e+07 0 0

itlbcac 2.293e+05 4.446e+05 4.335e+05 0 0

rsbcac 3.414e+05 1.546e+06 1.245e+06 0 0

dtlbcac 4.024e+06 3.801e+07 3.716e+07 0 0

pmhlog 4.667e+05 3.132e+06 3.132e+06 0 0

il1log 3.548e+07 6.648e+07 6.3e+07 0 0

il1tag 1.071e+08 2.033e+08 1.962e+08 0 0

il1cac 1.062e+07 2.029e+07 1.979e+07 0 0

dl1log 1.338e+07 1.819e+08 1.628e+08 0 0

dl1tag 4.12e+07 5.679e+08 5.091e+08 0 0

dl1cac 1.485e+07 2.117e+08 1.905e+08 876705 0

dispatchq 0 0 0 0 0

decodepla 0 0 0 0 0

decodemisp 0 0 0 0 0

decodestall 0 0 0 0 0

ratarr 8.569e+07 2.715e+08 2.384e+08 0 0

ruuarr 2.734e+07 1.864e+08 1.133e+08 0 0

lsqarr 4.258e+06 2.924e+07 2.741e+07 0 0

ruurdyq 1.041e+06 7.845e+06 6.668e+06 0 0

lsqrdyq 7.525e+06 2.3e+07 1.464e+07 0 0

ruuarb 3.15e+07 2.795e+08 1.242e+08 0 0

ruuwb 7.137e+07 1.775e+08 1.745e+08 0 0

lsqarb 3.267e+07 2.795e+08 1.242e+08 0 0

lsqwb 2.487e+07 1.627e+08 1.597e+08 0 0

fuint 3.489e+06 8.958e+06 8.605e+06 0 0

fufp 4.671e+05 5.928e+06 5.461e+06 0 0

ul2log 1.833e+06 5.953e+07 5.85e+07 0 0

ul2tag 1.653e+07 5.574e+08 5.485e+08 0 0

ul2cac 1.352e+07 8.154e+08 8.102e+08 0 0

biu 8.242e+06 2.582e+08 2.512e+08 0 0

isw 1.625e+06 0 1.311e+06 0 0

fdlatch_0 6.458e+04 9.83e+04 7.782e+04 0 0

fdlatch_1 6.442e+04 9.83e+04 7.782e+04 0 0

fdlatch_2 6.387e+04 9.83e+04 7.782e+04 0 0

fdlatch_3 6.329e+04 9.83e+04 7.782e+04 0 0

dilatch_0 6.24e+04 9.83e+04 7.782e+04 0 0

dilatch_1 6.167e+04 9.83e+04 7.782e+04 0 0

dilatch_2 6.133e+04 9.83e+04 7.782e+04 0 0

dilatch_3 5.725e+04 9.83e+04 7.782e+04 0 0

Global statistics:

Total power = 566797441.827776

Maximum power = 3490027519.397630

Maximum didt power = 3198001037.129858

Power violations = 19489894

Didt power violations = 1204832

Figure 21: Output File.

5 File Structure

The simulator is essentially based on Simplescalar [1]. Care has been taken to keep the

power simulation functions in separate files thus minimizing the modification of the

original code. However, at some places it was inevitable or rather much more convenient

to modify the original Simplescalar files. The file structure is as follows.

power.c: The main power number generation file. It contains the routines for power calculation. Any new power calculation routines, e.g. for clock-gated power, should be added to this file.

power.h: This file contains all the declarations for variables, structures and

functions and definitions used in power.c.

anal.c: Contains all the analytical models. Any new models developed should

be placed in this file.

anal.h: Contains declarations and definitions for variables and functions used

in anal.c.

tech.c: Technology processing file. It reads the technology file and calculates scaling factors for the required technology. The base technology is 0.8 um, and all simulations are performed by scaling the 0.8 um technology.

tech.h: Contains all the device size definitions for 0.8 um base technology.

sim-outorder.c and main.c have also been modified as described later.

5.1.1 power.h

As mentioned earlier, power.c contains routines for power computation and power.h is

the supporting header file. The simulator is designed using a FUB-centric approach. All the power numbers specific to a FUB are stored together in one structure, shown below. Not all the elements are used; some are present for future expansion.

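typedef struct {
    char name[32];              /* FUB name */
    double active_power;        /* active power (empirical mode) */
    double active_power_rd;     /* active read power (analytical mode) */
    double active_power_wr;     /* active write power (analytical mode) */
    double static_power;
    double inactive_power;      /* presently redundant */
    double active_power_lt;     /* latch power values */
    double stall_power_lt;
    double empty_power_lt;
    double active_power_cg;     /* clock-gated values, presently unused */
    double active_power_wr_cg;
    double active_power_rd_cg;
    double inactive_power_cg;
    double maxpowerth;          /* maximum power threshold */
    double maxdidtth;           /* maximum di/dt threshold */
    double cum_power;           /* power accumulated over all cycles */
    double prev_power;          /* power in the previous cycle */
    double max_power;           /* maximum power */
    double max_didt;            /* maximum di/dt power */
    double max_powerx;          /* number of power threshold violations */
    double max_didtx;           /* number of di/dt threshold violations */
} fub_t;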

The element name stores the name of the FUB, which can be at most 31 characters long, since name[32] must also hold the terminating null character. The next four elements store power numbers whose meaning is evident from their names. Note that active power comes in three flavors. When using the

empirical method, only active_power is used. It is the sum of the (power

density)*(area) products for the five different circuit styles. When analytical models are

used, the read and write operations are modeled separately and consume different amounts of power, hence the rd and wr suffixes. The element inactive_power is presently redundant but could be used in the empirical mode to model standby. The next three

numbers are power values for latches only. The next four elements are the clock gated

power numbers which are presently not being used. Notice that clock gating does not

affect static power and hence static_power_cg is not present. The elements

maxpowerth and maxdidtth are the maximum power and maximum di/dt power

thresholds for the FUB. These values are defined in the configuration file. cum_power

keeps accumulating the power after every cycle and is finally divided by the number of

cycles to get the average power dissipated. prev_power, max_power and

max_didt are the previous cycle power, maximum power and maximum di/dt power

respectively. Finally, max_powerx and max_didtx keep track of the number of

threshold violations.
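The empirical active power and the averaging of cum_power described above can be sketched in C as follows. The type style_t, the enum of circuit styles and both function names are hypothetical and only mirror the configuration-file fields (dyn, sta, clk, mem, pla); this is not the code from power.c.

```c
/* Power density / area triple for one circuit style (hypothetical type). */
typedef struct {
    double pd_active;    /* e.g. dyn_pda */
    double pd_inactive;  /* e.g. dyn_pdi */
    double area;         /* e.g. dyn_a   */
} style_t;

/* The five circuit styles named in the basic configuration file. */
enum { DYN, STA, CLK, MEM, PLA, NUM_STYLES };

/* Empirical (pfa) active power: the sum over the five circuit styles
 * of (active power density) * (area). */
double pfa_active_power(const style_t s[NUM_STYLES])
{
    double p = 0.0;
    for (int i = 0; i < NUM_STYLES; i++)
        p += s[i].pd_active * s[i].area;
    return p;
}

/* Average power: the accumulated per-cycle power divided by the
 * number of simulated cycles (0.0 if no cycle has been simulated). */
double average_power(double cum_power, unsigned long cycles)
{
    return cycles ? cum_power / (double)cycles : 0.0;
}
```

Keeping the five styles in one array makes the sum a single loop, at the cost of a fixed style order that must match the column order of the configuration file.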

A similar structure of type glb_power_t is used to track the full chip power numbers.

Its elements are essentially the sum of the corresponding elements of the FUB structures.

Another important structure is power_t, which is used to exchange power numbers. It has three elements, active_power_rd, active_power_wr and static_power, which are self-explanatory.

The activity counts are tracked using two arrays of counters, one for present cycle counts

and the other for cumulative counts. Specific counters can be accessed by using the

counter name as the index, e.g. pres_count[Ruuarr]. Ninety-three counters have

presently been declared. New counters can be added simply by adding their names to the

#define list and updating NUM_POWER_COUNTERS. As a convention, only the first

character of the counter name is in caps.
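A minimal C sketch of this counter scheme is given here. Only pres_count, Ruuarr and NUM_POWER_COUNTERS are names taken from the text; cum_count, count_event() and end_of_cycle() are assumed names, and the list is shortened to three counters with made-up index values.

```c
/* Shortened stand-in for the #define list in power.h.
 * The index values are illustrative, not the real ones. */
#define Brupdate 0
#define Brlookup 1
#define Ruuarr   2
#define NUM_POWER_COUNTERS 3  /* update whenever a counter is added */

static unsigned long pres_count[NUM_POWER_COUNTERS]; /* present cycle */
static unsigned long cum_count[NUM_POWER_COUNTERS];  /* cumulative (assumed name) */

/* Record one event on the given counter in both arrays. */
void count_event(int counter)
{
    pres_count[counter]++;
    cum_count[counter]++;
}

/* At the end of a cycle only the present-cycle counts are cleared;
 * the cumulative counts keep growing. */
void end_of_cycle(void)
{
    for (int i = 0; i < NUM_POWER_COUNTERS; i++)
        pres_count[i] = 0;
}
```

Adding a counter then amounts to one more #define and a larger NUM_POWER_COUNTERS, which is the extension path the text describes.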

As more and more features are added to the simulator, new elements can be added to

these structures and new counters can be defined for more detail/functionality. This

makes the simulator amenable to future development.

Finally, a structure of type power_db is used to maintain the power parameter database. It stores the following data:

name: Name of a FUB/variable/file.

S: The number of sets in a cache like structure.

OR

The value of a variable, for example: decode width.

A: Associativity.

B: The block size in number of bits.

b: The output size in bits.

nwl, nbl, nsp, logic, rd_mode as defined in section 4.2.1.

The power_db structure is also used to store the various filenames. The convention

used is that the first element of the database has name “root”. The next element’s name is

the configuration filename. The third element’s name is the output filename. The fourth is

the technology filename and the fifth is the technology identifier. This was found to be a

way to avoid the addition of an extra field to the database. All other elements are then

added in any order. This concludes the discussion of the important structures used. All

other structures are self-explanatory.

5.1.2 power.c

power.c contains power estimation routines and option handling routines. These routines

are described below

add_param(), get_param()

These functions are used to add and retrieve parameters from the power simulation

database. The former adds a structure of type power_db to the database while the latter

retrieves the same from the database.
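One possible shape for such a database is sketched below in C. The entry layout follows the field list given for power_db, but the fixed-size array, the field widths and the function names db_add() and db_get() are assumptions for this example; the actual signatures of add_param() and get_param() are not given in this manual.

```c
#include <string.h>

#define MAX_PARAMS 64   /* arbitrary capacity chosen for this sketch */

/* One database entry, modeled on the power_db field list. */
typedef struct {
    char name[64];      /* name of a FUB, variable or file */
    int  S;             /* number of sets, or the value of a variable */
    int  A;             /* associativity */
    int  B;             /* block size in bits */
    int  b;             /* output size in bits */
    int  nwl, nbl, nsp; /* physical partitioning */
    char logic[8];      /* "static" or "dynamic" */
    char rd_mode[8];    /* "dual" or "single" */
} power_db_entry;

static power_db_entry db[MAX_PARAMS];
static int db_len = 0;

/* Append a copy of *e to the database. Returns 0, or -1 if it is full. */
int db_add(const power_db_entry *e)
{
    if (db_len >= MAX_PARAMS)
        return -1;
    db[db_len++] = *e;
    return 0;
}

/* Return the first entry with the given name, or NULL if none exists. */
const power_db_entry *db_get(const char *name)
{
    for (int i = 0; i < db_len; i++) {
        if (strcmp(db[i].name, name) == 0)
            return &db[i];
    }
    return NULL;
}
```

A linear search is adequate here because the database holds only a few dozen entries and is consulted during initialization rather than every cycle.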

search_opt(), print_opt()

search_opt() is used to retrieve the physical structure parameters (nwl, nbl, nsp,

logic style, read mode) on giving the option name. print_opt() prints all the

elements of the power parameter database in a tabular form. It is helpful in debugging.

dump_fub_stats()

This function dumps all the power statistics to the screen or into the specified file. The file dump is selected by mode = 0 and the screen dump by a non-zero mode.

power_init()

This function allocates memory for all the FUB structures and calls init() on each FUB. It also reads the thresholds specified by the -global option and initializes the global power structure.

init()

This function reads the power densities and areas of the FUB from the basic configuration file when the mode is pfa. If the mode is anal, it calls calc_anal() instead. The function initializes all the power variables inside the structure. Finally, it adds the FUB to the FUB database.

calc_anal(), array_power()

These functions calculate the power numbers when in anal mode. calc_anal() calls

array_power(), which in turn calls routines from anal.c to generate the power

constants.

power_update()

All the functions mentioned before are called only at the beginning of the simulation.

This routine, however, is called every cycle to update the power variables.

power_update() multiplies the access counts by the active power constants if the count is non-zero; otherwise it uses the inactive power constants. Presently, no clock-gating feature is

incorporated, but the infrastructure has already been laid. The function also checks for

power threshold and di/dt threshold violations. At the end of the function the present

cycle power counters are reset whereas the cumulative counts keep on going.
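The per-cycle bookkeeping can be sketched for a single FUB with a single counter as follows. The type fub_power is a cut-down stand-in for fub_t and update_fub() is a hypothetical name. Note also the assumption that the di/dt quantity is the absolute change in power between consecutive cycles; the manual does not spell out how it is computed.

```c
/* Subset of the fub_t fields needed for the per-cycle update. */
typedef struct {
    double active_power, inactive_power;
    double maxpowerth, maxdidtth;
    double cum_power, prev_power, max_power, max_didt;
    double max_powerx, max_didtx;
} fub_power;

/* Update one FUB for the cycle that just ended and reset its
 * present-cycle counter. */
void update_fub(fub_power *f, unsigned long *pres_count)
{
    /* Active constant scaled by the access count, or the inactive
     * constant when the FUB was not accessed this cycle. */
    double p = (*pres_count != 0)
                   ? (double)*pres_count * f->active_power
                   : f->inactive_power;

    /* ASSUMPTION: di/dt power is |P(this cycle) - P(previous cycle)|. */
    double didt = p - f->prev_power;
    if (didt < 0.0)
        didt = -didt;

    f->cum_power += p;
    if (p > f->max_power)
        f->max_power = p;
    if (didt > f->max_didt)
        f->max_didt = didt;
    if (p > f->maxpowerth)
        f->max_powerx += 1.0;   /* power threshold violation */
    if (didt > f->maxdidtth)
        f->max_didtx += 1.0;    /* di/dt threshold violation */

    f->prev_power = p;
    *pres_count = 0;            /* cumulative counts are kept elsewhere */
}
```

With active_power = 2.0 and three accesses in a cycle, the cycle power is 6.0; if maxpowerth is 5.0, that cycle is also recorded as one power violation.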

5.1.3 anal.h

This is the header file for anal.c. It contains all the function declarations for the

functions present in anal.c.

5.1.4 anal.c

This file contains all the analytical models, which are described in more detail in section 3. This section describes the interfaces of the functions in anal.c.

decoder_buffer_power()

This function takes the number of address bits and number of rows as inputs and

generates power constants for the decoder buffer. The decoder buffer is meant to feed

into all decoders needed for an array. Presently, the size of the buffer is constant,

however, in the future this can be made dependent on number of decoders that it feeds

into.

decoder_power()

This function generates the power numbers for the decoder. It takes the number of rows

and logic style as inputs.

routing_power()

This function estimates the power dissipated due to the routing in the decoder. It takes the number of rows, the number of columns and the cell type as inputs. The number of columns is needed because the decoder buffer is assumed to be at the center of all the partitions, as described in section 3.

wordline_power()

This function calculates the power for the wordline, including the wordline driver. The wordline driver size depends on the number of columns and on the kind of memory cell used (i.e. read mode and cell size), both of which are inputs. The size is then calculated using the WLdriver_size() function [].

bitline_power()

This function calculates the power for the bitlines, including the precharge and isolation

transistors. It takes the number of rows, columns, cell type and read mode as inputs. In

the single ended read mode, no pre-charging is used. Instead, the bitlines are driven by

the cell transistors. Hence, this scheme can be used for relatively small structures like

register files.

senseamp_power()

This is used for calculating the sense amplifier power constants. It is assumed that the

nodes of the sense amp are charged by a separate pre-charge circuit. The inputs to this

function are the number of sense amps and the number of bitlines sharing one senseamp.

outmux_power()

This function calculates the power for the output MUX. The inputs to the function are the

numbers of inputs to the MUX and the number of outputs.

compare_power()

This function calculates the power for the comparator. This model is useful for tag arrays

and register update unit type FUBs.

genmux_power()

This calculates constants for a generic MUX. The inputs to the function are number of

output bits and number of bits being multiplexed into one bit.

driver_size()

This function calculates the driver size for driving a capacitance with a desired rise time. The capacitance and rise time are inputs. The voltage swing is assumed to be from 0 to Vdd.

bldriver_size()

This is similar to driver_size(), except that the voltage swing is from Vsense to Vprecharge. This function is mainly used to calculate pre-charge transistor sizes for bitlines in low-power cache implementations.

gatecap(), gatecappass()

These functions are used to calculate the gate capacitance for a given transistor width and

poly length. The latter is used specifically for pass transistors.

draincapp(), draincapn()

These are used to calculate the drain capacitance for p-type and n-type transistors respectively. They also take the number of stacked transistors as an input to optimize the configuration [].

leakage()

This function calculates the leakage power or static power for a given transistor size with

a given threshold. Presently, it’s a very rough calculation and much more work can be

done in the future.
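For orientation only, a first-order leakage estimate of this kind can be written as supply voltage times leakage current, with the leakage current taken as the per-micron value from the technology file times the transistor width. The following C sketch makes exactly that assumption; it is not the formula used in anal.c, and the function name and parameter list are hypothetical.

```c
/* ASSUMED first-order model, not taken from anal.c:
 *   P_leak = Vdd * Io * W
 * vdd_v         supply voltage in volts
 * io_na_per_um  leakage current in nA per micron of width
 *               (Iol or Ioh, depending on the threshold used)
 * width_um      transistor width in microns
 * Returns the leakage power in watts. */
double leakage_power_w(double vdd_v, double io_na_per_um, double width_um)
{
    double i_amps = io_na_per_um * 1e-9 * width_um;
    return vdd_v * i_amps;
}
```

Under this model, a transistor 1000 um wide with 1 nA/um of leakage at a 1 V supply dissipates about 1 microwatt.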

log2()

This function returns logarithm to the base two, rounded off to the next lowest integer. It

is mainly used for address bit calculations for a given number of rows.
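An integer base-two logarithm that rounds down can be sketched in C as follows. The name ilog2_floor() is chosen for this example, partly because C99 already declares a floating-point log2() in <math.h>; the routine in anal.c may be implemented differently.

```c
/* Floor of the base-two logarithm of n, for n >= 1.
 * Returns -1 for n == 0, where the logarithm is undefined. */
int ilog2_floor(unsigned long n)
{
    int bits = -1;
    while (n != 0) {
        n >>= 1;   /* drop one binary digit per iteration */
        bits++;
    }
    return bits;
}
```

For a power-of-two number of rows this gives the exact number of address bits, e.g. ilog2_floor(1024) is 10. For other row counts the result is rounded down, so ilog2_floor(9) is 3.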

5.1.5 sim-outorder.c, main.c

These files have been slightly modified for the power simulator. Following is a list of

changes made.

1 In main.c, a power option database called pow_odb has been added. This is used

in sim_print_stats() to dump the power statistics. Another change made is

the power_init() function call added after sim_init() to initialize the

power simulation.

2 In sim-outorder.c, several global variables have been added. These have

been well commented. In sim_reg_options(), the five new options have been

registered. The power_update() function call has been added in

sim_main(). And finally, power_database() has been added. This function

essentially processes options and adds them to the power database for use in the

analytical models.

5.2 Control Flow

The following outline depicts the control flow for the power simulation. Indented entries are called by the entry above them.

main.c

sim-outorder.c: sim_reg_options()
    registers the power options into the options database.

sim-outorder.c: power_database()
    creates the power database using options read from the configuration file and the options database.

power.c: power_init()
    power.c: init()
        power.c: calc_anal()
            power.c: array_power()
                anal.c: decoder_buffer_power()
                anal.c: decoder_power()
                anal.c: routing_power()
                anal.c: wordline_power()
                anal.c: bitline_power()
                anal.c: senseamp_power()

sim-outorder.c

power.c: power_update()
    called every cycle.

power.c: dump_fub_stats()

main.c

This completes the control flow description of the main functions in the power simulator.

6 References

[1] D. Burger and T. Austin. The SimpleScalar Tool Set, Version 2.0. Technical report, Computer Sciences Department, University of Wisconsin, June 1997.

[2] S. J. E. Wilton and N. P. Jouppi. An Enhanced Access and Cycle Time Model for On-Chip Caches. Western Research Laboratory Report, May 1993.

[3] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In Proc. International Symposium on Computer Architecture, June 2000.

[4] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. S. Kim, and W. Ye. Energy-Driven Integrated Hardware-Software Optimizations Using SimplePower. In Proc. International Symposium on Computer Architecture, June 2000.

[5] D. Liu and C. Svensson. Power Consumption Estimation in CMOS VLSI Chips. IEEE Journal of Solid-State Circuits, 29(6), pp. 663-670, June 1994.

[6] P. Landman and J. Rabaey. Activity-Sensitive Architectural Power Analysis. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 15(6), page 571, June 1996.

[7] R. Chen, M. Irwin, and R. Bajwa. An Architectural Level Power Estimator. In Power-Driven Microarchitecture Workshop at ISCA25, 1998.

Appendix

Sl. No.  Name of the FUB  Description                             Models supported
1        npclog           Next pc generation logic                PFA
2        btblog           BTB logic                               PFA
3        btbcac           BTB cache                               PFA/Anal
4        itlbcac          Instruction TLB                         PFA/Anal
5        rsbcac           Return Stack Buffer                     PFA/Anal
6        dtlbcac          Data TLB                                PFA/Anal
7        pmhlog           Page miss handler                       PFA
8        il1log           L1 instruction cache logic              PFA
9        il1tag           L1 instruction cache tag                PFA/Anal
10       il1cac           L1 instruction cache array              PFA/Anal
11       dl1log           L1 data cache logic                     PFA
12       dl1tag           L1 data cache tag                       PFA/Anal
13       dl1cac           L1 data cache array                     PFA/Anal
14       dispatchq        Dispatch Queue                          PFA
15       decodepla        Instruction decoder                     PFA
16       decodemisp       Misprediction handling logic            PFA
17       decodestall      Decoder Stall logic                     PFA
18       ratarr           Register Aliasing table                 PFA/Anal
19       ruuarr           Register update unit / reorder buffer   PFA
20       lsqarr           Load/Store queue                        PFA
21       ruurdyq          Reorder ready queue                     PFA
22       lsqrdyq          Load/Store ready queue                  PFA
23       ruuarb           Reorder arbitration logic               PFA
24       ruuwb            Reorder write back scheduler            PFA
25       lsqarb           Load/store arbitration logic            PFA
26       lsqwb            Load/store write back scheduler         PFA
27       fuint            Integer functional unit                 PFA
28       fufp             Floating point functional unit          PFA
29       ul2log           Unified L2 cache logic                  PFA
30       ul2tag           Unified L2 cache tag                    PFA/Anal
31       ul2cac           Unified L2 cache array                  PFA/Anal
32       biu              Bus/IO unit                             PFA
33       fdlatch          Fetch Decode latch                      PFA
34       dilatch          Decode Issue Latch                      PFA
35       isw              Instruction Issue Window                PFA

Table of FUBs: Shows the various functional unit blocks with the models existing in the simulator. PFA: Power Factor Approximation. Anal: Analytical models exist.

Sl No.  Name of the counter  Associated FUB                   Description
0       Brupdate             BTB cache                        branch update activity
1       Brlookup             BTB cache                        branch lookup activity
2       Rsbpop               Return Stack Buffer              return stack pop activity
3       Rsbpush              Return Stack Buffer              return stack push activity
4       Il1acc               L1 Instruction cache             il1 access activity
5       Il1wbk               L1 Instruction cache             il1 writebacks activity
6       Il1rep               L1 Instruction cache             il1 replacements activity
7       Il1inv               L1 Instruction cache             il1 invalidations activity
8       Dl1acc               L1 Data cache                    dl1 access activity
9       Dl1wbk               L1 Data cache                    dl1 writebacks activity
10      Dl1rep               L1 Data cache                    dl1 replacements activity
11      Dl1inv               L1 Data cache                    dl1 invalidations activity
12      Il2acc               L2 Instruction cache             il2 access activity
13      Il2wbk               L2 Instruction cache             il2 writebacks activity
14      Il2rep               L2 Instruction cache             il2 replacements activity
15      Il2inv               L2 Instruction cache             il2 invalidations activity
16      Dl2acc               L2 Data cache                    dl2 access activity
17      Dl2wbk               L2 Data cache                    dl2 writebacks activity
18      Dl2rep               L2 Data cache                    dl2 replacements activity
19      Dl2inv               L2 Data cache                    dl2 invalidations activity
20      Ul2acc               L2 Unified cache                 ul2 access activity
21      Ul2wbk               L2 Unified cache                 ul2 writebacks activity
22      Ul2rep               L2 Unified cache                 ul2 replacements activity
23      Ul2inv               L2 Unified cache                 ul2 invalidations activity
24      Itlbmis              Instruction TLB                  itlb miss activity
25      Dtlbmis              Data TLB                         dtlb miss activity
26      Ul2mis               L2 Unified cache                 ul2 miss activity
27      Itlbacc              Instruction TLB                  itlb access activity
28      Itlbwbk              Instruction TLB                  itlb writebacks activity
29      Itlbrep              Instruction TLB                  itlb replacements activity
30      Itlbinv              Instruction TLB                  itlb invalidations activity
31      Dtlbacc              Data TLB                         dtlb access activity
32      Dtlbwbk              Data TLB                         dtlb writebacks activity
33      Dtlbrep              Data TLB                         dtlb replacements activity
34      Dtlbinv              Data TLB                         dtlb invalidations activity
35      Npc                  Next pc generation logic         next pc logic activity
36      Dispatchqrd          Dispatch Queue                   dispatchq read activity
37      Dispatchqwr          Dispatch Queue                   dispatchq write activity
38      Dispatchqrel         Dispatch Queue                   dispatchq release activity
39      Dispatchqrec         Dispatch Queue                   dispatchq recover activity
40      Decoder              Instruction decoder              decoder activity
41      Decodemispchk        Instruction decoder              decoder mispredict detect activity
42      Decodemisp           Instruction decoder              decoder mispredict correction activity
43      Decodestallchk       Instruction decoder              decoder stall detect activity
44      Decodestall          Instruction decoder              decoder stall block activity
45      Ratidep              Register Aliasing table          rat idep allocation activity
46      Ratodep              Register Aliasing table          rat odep allocation activity
47      Ratstallchk          Register Aliasing table          rat stall detection activity
48      Ratstall             Register Aliasing table          rat stall block activity
49      Ruuarr               Reorder buffer                   ruu array activity
50      Ruurdyqsch           Reorder buffer                   ruu readyq allocation activity
51      Ruurec               Reorder buffer                   ruu recover activity
52      Ruuret               Reorder buffer                   ruu retire activity
53      Ruurdyqcam           Reorder buffer                   ruu readyq dependence check activity
54      Ruurdyqrel           Reorder buffer                   ruu readyq resource release activity
55      Lsqarr               Load/Store queue                 lsq array activity
56      Lsqrdyqsch           Load/Store queue                 lsq readyq allocation activity
57      Lsqrec               Load/Store queue                 lsq recover activity
58      Lsqret               Load/Store queue                 lsq retire activity
59      Lsqrdyqcam           Load/Store queue                 lsq readyq dependence check activity
60      Lsqrdyqrel           Load/Store queue                 lsq readyq resource release activity
61      Ruuarb               Reorder buffer                   ruu arbitration activity
62      Ruuwb                Reorder buffer                   ruu writeback scheduler activity
63      Ruuwbq               Reorder buffer                   ruu writebackq activity
64      Lsqarb               Load/Store queue                 lsq arbitration activity
65      Lsqwb                Load/Store queue                 lsq writeback scheduler activity
66      Lsqwbq               Load/Store queue                 lsq writebackq activity
67      Fuint                Integer functional unit          functional unit integer
68      Fufp                 Floating point functional unit   functional unit floating point
69      Fdlatch_active       Fetch Decode latch               Latch after fetch stage active
70      Fdlatch_stall        Fetch Decode latch               Latch after fetch stage stalled
71      Fdlatch_empty        Fetch Decode latch               Latch after fetch stage empty
72      Dilatch_active       Decode Issue Latch               Latch after decode stage active
73      Dilatch_stall        Decode Issue Latch               Latch after decode stage stall
74      Dilatch_empty        Decode Issue Latch               Latch after decode stage empty
75      Iswact               Instruction Issue Window         Issue window latch active
76      Iswstall             Instruction Issue Window         Issue window latch stalled
77      Iswempty             Instruction Issue Window         Issue window latch empty
78      Iswcolmoved          Instruction Issue Window         Collapsible Issue window latch moved

Table of Counters: Note that the number of counters varies with the number of latches. If there are three latches after the fetch stage, there would be 9 Fdlatch counters (69-77), and likewise for the latches after the decode stage.

Index

A

active power 12, 15, 18

activity 12, 16

add_param() 17

anal 6

anal.c 1, 14, 18, 19, 25

anal.h 1, 14, 19

analytical 3, 6, 13, 14, 15, 19, 22, 25

array_power() 18

B

bitline_power() 20

bldriver_size() 21

C

calc_anal() 18

clk_a 6, 7

clk_pda 6, 7

clk_pdi 6, 7

Clock circuits 13

clock frequency 8

compare_power() 20

configuration file 5, 6

control flow 1, 22

cum_power 15, 16

D

decoder_buffer_power() 19

decoder_power() 19

di/dt 6, 12, 16, 19

draincapn() 21, 25

draincapp() 21, 25

driver_size() 21

dump_fub_stats() 18

dyn_a 6, 7

dyn_pda 6, 7

dyn_pdi 6, 7

Dynamic logic 13

E

empirical 3, 6, 15

estimation 3, 12, 13, 17

F

FUB 6, 7, 12, 13, 15, 16, 17, 18

fub_t 15

G

gatecap() 21, 25

gatecappass() 21, 25

genmux_power() 21

get_param() 17

glb_power_t 16

global 6, 9, 18, 22

I

inactive power 12, 18

init() 18, 22

Ioh 8, 10

Iol 8, 10

L

leakage() 21

Leff 8, 25

log2() 21

logic_style 7

M

main.c 1, 14, 22

max_didt 15, 16

max_didtx 15, 16

max_power 15, 16

max_powerx 15, 16

maxdidtth 6, 15, 16

maxpowerth 6, 15, 16

mem_a 6, 7

mem_pda 6, 7

mem_pdi 6, 7

Memory type regular circuits 13

methodology 1, 4, 12

mode 6, 7, 8, 13, 15, 17, 18, 20

N

nbl 7, 17, 18

nsp 7, 17, 18

NUM_POWER_COUNTERS 16

nwl 7, 17, 18

O

option database 5

Options 5

outmux_power() 20

output 5

P

pfa 6, 9, 18

physical structure 13, 18

PLA circuits 13

pla_a 6, 7

pla_pda 6, 7

pla_pdi 6, 7

pow_odb 22

power threshold 6, 19

power.c 1, 14, 15, 17

power.h 1, 14, 15

power.txt 5

power_config 5, 6

power_db 17

power_init() 6, 18, 22

power_outfile 5

power_output.txt 5

power_update() 18, 22

pres_count 16

prev_power 15, 16

print_opt() 18

Process Technology 1, 8

R

routing_power() 19

S

search_opt() 18

senseamp_power() 20

sim_limit 6

sim-outorder.c 1, 5, 14, 22

sta_a 6, 7

sta_pda 6, 7

sta_pdi 6, 7

Static logic 13

static_power 15, 16

T

tech 5, 8, 10, 14, 29

tech_file 5

technology 5

technology.def 5

U

unit 6, 7, 20, 25

V

Vdd 8, 10, 21

Vth 8, 10

Vtl 8, 10

W

wordline_power() 20
