
Integrated Circuit Design

Truong Cong Dung Nghi

Chip design styles

• Full-custom
‣ Transistors are hand-drawn
‣ Best performance (although now almost extinct)
‣ Recent processors are semi-custom (Sun, AMD, Intel)

• Standard-cell-based ASICs
‣ Use only standard cells from a library
‣ Dominant design style for non-processor, communications, and multimedia ASICs

• Cheaper alternatives (for small volumes)
‣ Sea of Gates (mask-programmable gate arrays)
‣ Field-Programmable Gate Arrays (FPGAs)
- On-the-fly reconfigurable interconnect

• Flexibility vs. cost
‣ Tighter control over transistors increases design cost
- Can yield faster designs, but they are harder to verify and more expensive


Modern Application-Specific IC (ASIC)

• Multiple functional blocks
‣ Put together at the top level
‣ Lots of modeling
- Behavioral level
- Architecture/functional level
‣ Different teams on each block: need to make sure things work together
‣ Many levels of hierarchy: lots of iteration and reuse

• Many architectural choices
‣ Use lots of implementation tricks
- Micro-architecture and algorithmic transforms
- Straightforward solutions can be many times the chip size


Gate Arrays

• Mask costs can be cut by prefabricating arrays of transistors on wafers

• Only the metal layers are customized for each design:
‣ Fixed-size unit transistors
‣ Metal connections personalize the design

• Two kinds
‣ Channeled gate arrays: leave space between rows of transistors for routing
‣ Sea-of-Gates: route over the top of unused transistors


Gate Array - Pros and Cons

• Cheaper and quicker, since fewer masks to make
‣ Wafers with diffusion and poly finished can be stockpiled

• Memory is inefficient when made from a gate array
‣ Embedded gate arrays add multiple fixed memory blocks to improve density (=> structured ASICs)
‣ Cell-based arrays are designed to provide an efficient memory cell (6 transistors in the basic cell)

• Logic is slow and big due to fixed transistors and wiring overhead
‣ Advanced cell-based arrays hardwire logic functions (NANDs/NORs/LUTs), which are personalized with metal


Field-Programmable Gate Arrays (FPGA)

• Each cell in the array contains a programmable logic function

• The array has programmable interconnect between logic functions

• Arrays are mass-produced and programmed by the customer after fabrication
‣ Can be programmed by blowing fuses, loading SRAM bits, or loading FLASH memory

• The overhead of programmability makes FPGAs expensive and slow, but startup costs are low, so they are much cheaper than an ASIC for small volumes


Standard cell ASICs

• Also called Cell-Based ICs (CBICs)

• Fixed library of cells

• Design
‣ Cells synthesized from a hardware description language (Verilog, VHDL)
‣ Cells manually entered in a schematic
‣ Placed and routed automatically (most desirable)

• Full set of masks for production
‣ Most popular today, but increasingly expensive due to mask costs in advanced technologies
‣ Currently a mask set costs a couple of $M
- FPGAs are therefore getting increased attention


ASIC flow

The ASIC flow (figure): Design Capture → Pre-Layout Simulation → Logic Synthesis → Physical design (Floorplanning → Placement → Routing) → Circuit Extraction → Tapeout

This is the most common design approach for designs up to 500 MHz clock rates.

Figure courtesy of Anantha Chandrakasan; used with permission. (Vladimir Stojanovic, course materials for 6.973 Communication System Design, Spring 2006, MIT OpenCourseWare, http://ocw.mit.edu/, Massachusetts Institute of Technology.)


Architectural Techniques


Micro-architectures and transformations

• There is a large number of implementations of the same functionality

• These implementations present different points in the area-time-power design space

• Behavioral transformations allow exploring the design space at a high level

• Optimization metrics:
‣ Area of the design
‣ Throughput or sample time T
‣ Latency: clock cycles between the input and the associated output change
‣ Power consumption
‣ Energy of executing a task


Basic micro-architectural techniques

• Parallelism, pipelining: used for power reduction.

• Time-multiplexing: used for area reduction.


Slide 3.1

This chapter discusses architectural techniques for area and energy reduction in chips for digital signal processing. Parallelism, time-multiplexing, pipelining, interleaving, and folding are compared in the energy-delay space of pipeline logic as a systematic way to evaluate different architectural options. The energy-delay analysis is extended to include area comparison and to quantify time-space tradeoffs. The energy-area-performance representation will thus serve as a basis for evaluating multiple architectural techniques. It will also give insight into which architectural transformations need to be made to track scaling of the underlying technology for the most cost- and energy-efficient solutions.

Slide 3.2

Three basic architectural techniques are parallelism, pipelining, and time-multiplexing. Figure (a) shows the reference datapath with logic blocks A and B between pipeline registers. Starting from the reference architecture, we can make transformations into parallel, pipelined, or time-multiplexed designs without affecting the data throughput.

Parallelism and pipelining are used for power reduction. We could introduce parallelism as shown in Figure (b). This is accomplished by replicating the input register and the datapath, and adding a multiplexer before the output register. Parallelism trades increased area (blocks shaded in gray) for reduced speed of the pipeline logic (A and B run at half the original speed), which allows the supply voltage to be reduced to decrease power. Another option is to pipeline the blocks, as shown in Figure (c). An extra pipeline register is inserted between logic blocks A and B. This lets blocks A and B run at half the speed and likewise allows supply voltage reduction, which decreases power. The logic depth is reduced at the cost of increased latency.

Time-multiplexing is used for area reduction, as shown in Figures (d) and (e). The reference case shown in Figure (d) has two blocks to execute two operations of the same kind. An alternative is to do the same with a single hardware module, running at twice the reference speed, to process the two data streams as shown in Figure (e). If the area of the multiplexer and demultiplexer is less than the area of block A, time-multiplexing results in an area reduction. This is most commonly the case, since datapath logic is much more complex than a de/multiplexing operation.

[Figure: Basic micro-architectural techniques — (a) reference, (b) parallel, (c) pipeline, (d) reference for time-mux, (e) time-multiplex. Logic blocks A and B run at rate f in (a)-(c); in (e) the shared block runs at 2f.]

Chapter 3: Architectural Techniques (with Borivoje Nikolić, University of California, Berkeley)


Starting Design: Reference Datapath

• Critical-path delay = tadder + tcomparator (= 25 ns)
‣ Total capacitance being switched = Cref
‣ VDD = VDD,ref = 5 V
‣ Power for reference datapath: Pref = fref · Cref · VDD,ref²


Slide 3.3

To establish a baseline for the architecture study, let's consider a reference datapath consisting of an adder that adds two operands A and B, followed by a comparator that compares the sum to a third operand C [1]. The critical-path delay is tadder + tcomparator, equal to 25 ns in this example (the delay for the 0.6-µm technology used in [1]). Let Cref be the total capacitance switched at the reference voltage VDD,ref = 5 V. The switching power of the reference datapath is then fref · Cref · VDD,ref². Now let's see what kind of tradeoffs can be made by restructuring the datapath using parallelism.

Slide 3.4

Parallelism is now employed on the previous design. The two modules now process odd and even samples of the incoming data, which are then combined by the output multiplexer. The datapath is replicated and routed to the output multiplexer. The clock rate for each of the blocks is reduced by half to maintain the same throughput. Since the speed requirement on each block is relaxed, the supply voltage can also be reduced by a factor of 1.7. The total capacitance is now 2.15 Cref, including the overhead capacitance of 0.15 Cref. The total switching power, however, is (fref/2) · 2.15 Cref · (VDD,ref/1.7)² ≈ 0.36 Pref. Therefore, over 60% power reduction is achieved without performance penalty by using a parallel-2 architecture.
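As a quick sanity check of these numbers, the following minimal Python sketch evaluates the dynamic-power model P = f · C · VDD² in normalized units, using the 1.7× voltage scaling and the 2.15·Cref capacitance quoted above (the function name and normalization are illustrative, not from the original source):

```python
# Normalized dynamic-power model, P = f * C * VDD^2.
# Reference values are normalized to 1.0; the parallel-2 numbers
# (f/2, 2.15*Cref including 0.15*Cref overhead, VDD/1.7) follow the text.

def switching_power(f, C, vdd):
    """Dynamic switching power in normalized units."""
    return f * C * vdd ** 2

P_ref = switching_power(1.0, 1.0, 1.0)             # reference datapath
P_par = switching_power(1.0 / 2, 2.15, 1.0 / 1.7)  # parallel-2 datapath

print(f"P_par / P_ref = {P_par / P_ref:.2f}")  # prints 0.37; the text rounds to ~0.36*Pref
```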

[1] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Low-Power CMOS Digital Design," IEEE J. Solid-State Circuits, vol. 27, no. 4, pp. 473-484, Apr. 1992.


Parallel Datapath Architecture

• Parallelism is employed on the previous design.

• The two modules process odd and even samples of the incoming data, which are then combined by the output multiplexer.

• The datapath is replicated and routed to the output multiplexer.

• The clock rate of the parallel datapath can be reduced by half with the same throughput as the reference datapath → fpar = fref/2
‣ VDD,par = VDD,ref/1.7, Cpar = 2.15 · Cref
‣ Ppar = (fref/2) · (2.15 · Cref) · (VDD,ref/1.7)² ≈ 0.36 · Pref



Parallelism Adds Latency


Slide 3.5

It should, however, be kept in mind that parallelism adds latency, as illustrated in this slide. The top half of the slide shows the incoming data stream A and the corresponding output Z. The design has a single-cycle latency, so samples of Z are delayed by one clock cycle with respect to samples of A. The parallel-2 design takes samples of A and propagates the resulting samples of Z after two clock cycles. Therefore, one extra cycle of latency is introduced. Parallelism is thus possible only if extra latency is allowed, which is the case in most low-power applications. Besides energy reduction, parallelism can also be used to improve performance: if the clock rate for each of the blocks were kept at the original rate, the throughput would double.

Slide 3.6

The plot in this slide shows energy per operation (Eop) versus throughput for architectures with varying degrees of parallelism [2]. We see that parallelism can be used to reduce power at the same frequency, or to increase the frequency for the same energy. Squares show the minimum energy-delay product (EDP) point for each of the architectures. The results indicate that as the amount of parallelism increases, the minimum-EDP point corresponds to a higher throughput and also a higher energy. With an increasing amount of parallelism, more energy is needed to support the overhead, and leakage energy has more impact.

The plot also shows the increased performance range of micro-architectural tuning as compared to circuit-level optimization. The reference datapath (min-delay sizing at reference voltage) is shown as the black dot in the energy-delay space. This point is also used as the reference for energy. The reference design shows the tunability at the circuit level, which is limited in performance to ±30% around the reference point, as discussed in Chapter 2. Clearly, parallelism goes much beyond ±30% in extending performance while keeping the energy consumption low. However, the area and performance benefits of parallelism come at the cost of increased area, which imposes a practical (cost) limit on the amount of parallelism.

[Figure: Parallelism adds latency — with level of parallelism P = 2, output samples Z1, Z2, ... appear one clock cycle later relative to the input samples A1, A2, ... than in the single-cycle-latency reference design.]

[Figure: Increasing level of parallelism (area AP ≈ P · Aref) — Eop (normalized) vs. throughput (1/FO4) for the reference design and parallelism of 2 to 5. Parallelism improves throughput for the same energy, or improves energy for the same throughput, at the cost of increased area. Is having more parallelism always better?]

[2] D. Marković et al., "Methods for True Energy-Performance Optimization," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1282-1293, Aug. 2004.


Pipelined Datapath Architecture

• Critical-path delay is less → max(tadder, tcomparator)
‣ Keeping clock rate constant: fpipe = fref
‣ Voltage can be dropped → VDD,pipe = VDD,ref/1.7
‣ Capacitance slightly higher: Cpipe = 1.15 · Cref
‣ Ppipe = fref · (1.15 · Cref) · (VDD,ref/1.7)² ≈ 0.39 · Pref


[Figure: More parallelism is not always better — total energy Etot = ESw + N·ELk + Eoverhead vs. supply voltage VDD for reference and parallel designs. Leakage and overhead start to dominate at high levels of parallelism, causing the minimum energy (dot) to increase; the optimum voltage also increases with parallelism. [J. M. Rabaey, UCB]]


Slide 3.7

Area is not the only factor limiting the level of parallelism. Total energy also increases at some point, due to the large impact of leakage. Let's look at the E-D tradeoff curve to explain this effect. The total energy expended in a parallel design is the sum of the switching energy of the active block, the leakage energy of all the parallel blocks, and the energy incurred in the multiplexing. Even if we neglect the multiplexing overhead (which is a fair assumption for complex blocks), the increase in leakage energy at high levels of parallelism implies that the minimum-energy point shifts to a higher voltage. The total energy to perform one operation with higher levels of parallelism is thus higher, since the increased leakage and overhead cost pull the energy per operation to a higher value.

Slide 3.8

Let us now use the same example to look into pipelining. We introduce latency via a set of pipeline registers to cut the critical path of tadder + tcomparator down to max(tadder, tcomparator). The clock frequency is the same as in the original design. Given the reduced critical path, the voltage can now be reduced by a factor of 1.7 (the same as in the parallelism example of Slide 3.4). The capacitance is slightly higher than the reference capacitance, to account for the switching energy dissipated in the pipeline registers. The switching power of the pipelined design is thus fref · (1.15 Cref) · (VDD,ref/1.7)² ≈ 0.39 Pref. Pipelining, therefore, can also be used to reduce the overall power consumption.
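The same normalized model reproduces the pipelined number (again a toy sketch, not the authors' code):

```python
# Pipelined datapath: clock rate unchanged (f = f_ref), switched
# capacitance 1.15*C_ref, supply lowered by 1.7x; P = f * C * VDD^2.
f, C, vdd = 1.0, 1.15, 1.0 / 1.7
P_pipe = f * C * vdd ** 2                 # normalized to P_ref = 1
print(f"P_pipe / P_ref = {P_pipe:.2f}")   # prints 0.40; the text rounds to ~0.39*Pref
```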



Parallelism and Pipelining in E-D Space


Slide 3.14

Let us further analyze parallelism and pipelining using the energy-delay space for pipeline logic. The block diagrams show reference, parallel, and pipeline implementations of a logic function. Parallelism simply slows down the computation by introducing redundancy in area. It directs operands to parallel branches in an interleaved fashion and later combines them at the output. Pipelining introduces an extra pipeline register to relax the timing constraints on blocks A and B, so that we can then scale the voltage to reduce energy.

The energy-delay behavior of logic blocks during these transformations is very important to understand, because pipeline logic is a fundamental micro-architectural component. From a datapath-logic standpoint, parallelism and pipelining are equivalent to moving the reference energy-delay point toward reduced Energy/Op and increased Time/Op on the energy-delay curve. Clearly, the energy saving is largest when the reference design is in the steep-slope region of the energy-delay curve. This corresponds to the minimum-delay design (which explains why parallelism and pipelining were so effective in the previous examples). It is also possible to reduce energy when the performance requirement on the reference design is beyond its minimum achievable delay, as illustrated on Slide 3.6. In that case, parallelism would be needed to meet the performance first, followed by voltage scaling to reduce energy. How is the energy minimized?

Slide 3.15

Minimum energy is achieved when the leakage energy is about one half of the switching energy, ELk/ESw ≈ 0.5. This ratio can be derived analytically [4]: the optimal ELk/ESw is a logarithmic function of the logic depth (LD), the switching activity (α), and a technology parameter (K). The minimum is quite broad, due to the logarithmic dependence, with total energy staying within about 10% of the minimum for ELk/ESw from 0.1 to 10.

These diagrams show reference, parallel, and pipeline implementations of an ALU.

[Figure: Parallelism and pipelining in the E-D space — (a) reference, (b) parallel, (c) pipeline, (d) reference for time-mux, (e) time-multiplex; parallel and pipeline transformations move the reference point along the energy-delay curve toward lower Energy/op and higher Time/op. It is important to link back to the E-D tradeoff.]

[Figure: Minimum energy at ELk/ESw ≈ 0.5 — Eop (normalized) vs. ELeakage/ESwitching for nominal, parallel, and pipeline ALU designs: a large (ELk/ESw)opt, a flat Eop minimum, and optimal designs that have high leakage. Minimum-energy design points: nominal at VTH,ref − 180 mV and 0.81 VDD,max; parallel at VTH,ref − 95 mV and 0.57 VDD,max; pipeline at VTH,ref − 140 mV and 0.52 VDD,max.

Topology:      Inv   Add   Dec
(ELk/ESw)opt:  0.8   0.5   0.2]

[4] V. Stojanović et al., "Energy-Delay Tradeoffs in Combinational Logic using Gate Sizing and Supply Voltage Optimization," in Proc. Eur. Solid-State Circuits Conf., Sept. 2002, pp. 211-214.


Time Multiplexing

• Time multiplexing does the opposite of pipelining/parallelism.

• It executes a number of parallel data streams on the same processing element (A).

• Due to the parallel-to-serial conversion at the input of block A, logic block A works with up-sampled data (2·f) to maintain the external throughput.


In this experiment, VTH was swept in increments of 5 mV, and for each VTH the optimal VDD and sizing were found to minimize energy. Each point, hence, has an ELk/ESw ratio that corresponds to minimum energy. The plot illustrates several points. First, the energy minimum is very flat and roughly corresponds to equalizing the leakage and switching components of energy. This is the case in all three architectures. The result coincides with the three examples from the circuit level: inverter chain, adder, and memory decoder, as shown in the table. Second, parallel and pipeline designs are more energy-efficient than the reference design, because their logic runs slower and voltage scaling is possible. Third, the supply voltage at the minimum-energy point is lower in the pipeline design than in the parallel design, because the pipeline design has smaller area and less leakage. This is also evidenced by a lower VTH in the pipeline design.

Optimal designs, therefore, have high leakage when the total energy is minimized. This makes sense for high-activity designs, where we can balance the ELk/ESw ratio to minimize energy. In the idle mode ESw = 0, so the energy minimization problem reduces to leakage minimization.

Slide 3.16

Time multiplexing does just the opposite of pipelining/parallelism. It executes a number of parallel data streams on the same processing element (A). Due to the parallel-to-serial conversion at the input of block A, logic block A works with up-sampled data (2·f) to maintain the external throughput. This imposes a more aggressive timing constraint on logic block A. The reference design, hence, moves toward higher energy and lower delay in the E-D space. Since the hardware for logic is shared, the area of the time-multiplexed design is reduced.
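To make the transformation concrete, here is a small behavioral sketch, assuming a stand-in logic block A (x + 1) and made-up stream values: one block A clocked at 2f behind a parallel-to-serial converter produces the same outputs as the two-block reference.

```python
# Behavioral sketch of Figures (d) vs (e): two parallel streams, each
# processed by its own block A, versus one shared A at rate 2f that
# alternates between the streams. A is a placeholder function here;
# the point is only that the outputs match while the hardware is shared.

def A(x):
    return x + 1  # stand-in for the real logic block

stream0 = [0, 2, 4, 6]
stream1 = [1, 3, 5, 7]

# (d) reference: two instances of A, each running at rate f
ref_out = [(A(a), A(b)) for a, b in zip(stream0, stream1)]

# (e) time-multiplexed: serialize the inputs (P/S), run one A at 2f,
# then de-serialize the results back into two streams (S/P)
serial_in  = [s for pair in zip(stream0, stream1) for s in pair]
serial_out = [A(s) for s in serial_in]
mux_out    = list(zip(serial_out[0::2], serial_out[1::2]))

assert ref_out == mux_out  # same throughput and results, half the A-blocks
```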

[Figure: Time multiplexing — (a) reference for time-mux, (b) time-multiplex; in the E-D space the time-multiplexed design moves from the reference point toward higher Energy/op and lower Time/op.]


Data-Stream Interleaving

• We frequently encounter parallel data in signal-processing algorithms, such as those found in multi-carrier communication systems.

• We can time-interleave parallel data streams on the same hardware in order to take advantage of the difference between the application speed requirements and the speed of the technology.

• Parallel-to-serial (P/S) conversion is applied to the incoming data stream to time-interleave the samples.

• The processing element in the interleaved architecture then runs at a rate N times faster to process all incoming data streams.


Slide 3.17

We frequently encounter parallel data in signal processing algorithms, such as those found in multi-carrier communication systems or medical applications. The example in the upper figure shows N processing elements (PEs) that process N independent streams of data. Given the speed of nanoscale technologies, the processing element can work much faster than the application requires. Under such conditions we can time-interleave parallel data streams on the same hardware, taking advantage of the difference between the application speed requirements and the speed of the technology. The interleaving approach is used to save hardware area.

Parallel-to-serial (P/S) conversion is applied to the incoming data stream to time-interleave the samples. The processing element in the interleaved architecture then runs at a rate N times faster to process all incoming data streams. A serial-to-parallel (S/P) converter then splits the output data stream back into N parallel channels. It is important to note that extra pipeline registers need to be introduced for the interleaved processing element.

Slide 3.18

Let us consider a simple example of a processing element to illustrate interleaving. We assume a simple add-multiply operation and two input streams, C1 and C2. The processing element shown in this slide is a two-stream interleaved implementation of the function 1/(1 − a·z⁻¹). Two input streams at a rate fs are interleaved into one stream at a rate 2fs. The key point of interleaving is to add an extra pipeline stage (red) as a memory element for the stream C2. This extra pipeline register can be pushed into the multiplier, which is convenient because the multiplier now needs to operate at a higher speed to support interleaving. Interleaving thus means up-sampling and pipelining of the computation. It saves area by sharing the datapath logic (adder, multiplier). In recursive systems, the total loop latency has to equal the number of input sub-channels to ensure correct functionality.
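A cycle-level sketch of this two-stream interleaved recursion (hypothetical helper names; the per-stream state registers play the role of the extra pipeline stage in the loop) might look like:

```python
# Sketch of z(k) = x(k) + a*z(k-1), i.e. H(z) = 1/(1 - a*z^-1), with two
# streams interleaved through one shared adder/multiplier at rate 2*fs.
# The two-entry state list is the two-cycle loop latency the slide requires.

def interleaved_iir(x0, x1, a):
    state = [0.0, 0.0]            # one register per stream: the z(k-1) values
    z0, z1 = [], []
    # up-sampled schedule at 2*fs: c1(k), c2(k), c1(k+1), c2(k+1), ...
    for k, x in enumerate(s for pair in zip(x0, x1) for s in pair):
        ch = k % 2                # which stream owns this cycle
        state[ch] = x + a * state[ch]
        (z0 if ch == 0 else z1).append(state[ch])
    return z0, z1

def iir(x, a):
    """Single-stream reference recursion for comparison."""
    z, zprev = [], 0.0
    for xk in x:
        zprev = xk + a * zprev
        z.append(zprev)
    return z

x0, x1, a = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5
assert interleaved_iir(x0, x1, a) == (iir(x0, a), iir(x1, a))
```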

[Figure: Data-stream interleaving — N parallel PEs processing N streams at fsymbol (PE too fast, large area) vs. an interleaved architecture with P/S and S/P converters around one pipelined PE running at N·fsymbol (reduced area, P/S overhead); PE performs a recursive operation.]

[Figure: PE performs recursive operation (interleave = up-sample and pipeline) — input streams c1 and c2 at rate fs are merged into the sequence c1(k), c2(k), c1(k+1), c2(k+1), ... at rate 2fs and processed by a single multiply-add loop with coefficient a.]





Folding


Slide 3.19

The example in this slide is an N-data-stream generalization of the previous example. The initial PE that implements H(z) = 1/(1 − c·z⁻¹) is now interleaved to operate on N data streams. This implies that N − 1 additional registers are added to the original single-channel design. Since the design now needs to run N times faster, the extra registers can be used to pipeline the adder and multiplier. This can result in further performance improvement or energy reduction. The number of pipeline stages pushed into the adder/multiplier is chosen so as to keep a uniform logic depth on all paths. If, for a given cycle time, N is greater than the number of registers needed to pipeline the adder and multiplier (a and m, respectively), then b = N − a − m extra registers are needed to balance the latency of the design (for example, N = 8, a = 3, m = 2 gives b = 3 balancing registers). Data-stream interleaving is applicable to parallel data processing.

Slide 3.20

Folding, which is similar in principle to data-stream interleaving, is used for time-serial data processing. It involves up-sampling and pipelining of recursive loops, as in data-stream interleaving. Folding also needs an additional data-ordering step (the mux in the folded architecture) to support time-serial execution. Like interleaving, folding reduces area. As shown in the slide, a reference design with N PEs in series is transformed into a folded design with a single PE. This is possible if the PE can run faster than the application requires, so the excess speed can be traded for reduced area. The folded architecture operates at a rate N times higher than the reference architecture. The input to the PE is multiplexed between the external input and the PE's output.
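The mux-around-a-single-PE schedule can be sketched behaviorally as follows (PE is a hypothetical placeholder function; only the scheduling is the point):

```python
# Behavioral sketch of folding: N copies of a processing element in
# series, versus one PE clocked at N*f whose input mux selects the
# external input on the first pass and the PE's own output afterwards.

def PE(x):
    return 2 * x + 1  # stand-in for the real processing element

N = 4

def reference_chain(x):
    """(a) reference: N PEs in series, each running at rate f."""
    for _ in range(N):
        x = PE(x)
    return x

def folded(x):
    """(b) folded: one PE at rate N*f with an input mux."""
    acc = PE(x)                  # mux selects the external input
    for _ in range(N - 1):
        acc = PE(acc)            # mux selects the PE's own output
    return acc

assert reference_chain(3.0) == folded(3.0)  # same result, 1/N the PE area
```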

[Figure: Data-stream interleaving example — the recursive operation z(k) = x(k) + c·z(k − 1) applied to N data streams x1, x2, ..., xN at rate N·fclk; the multiply-add loop is pipelined with a + b + m = N registers, where the b extra registers balance the latency.]

[Figure: Folding — N PE blocks in series at fsymbol (PE too fast, large area) vs. one highly pipelined PE at N·fsymbol whose input mux selects between the external input and the PE output (reduced area); PE performs a recursive operation.]



Architectural Transformations

• Parallelism and pipelining:
‣ reduce energy, increase area

• Time multiplexing:
‣ increase energy, reduce area

• Interleaving and folding:
‣ ≈ constant energy, reduce area


Slide 3.25

In contrast to pipelining and parallelism, time multiplexing requires the datapath logic to run faster in order to process many streams sequentially. Starting from the reference design, this means shorter time available for computation. Voltage and/or sizing need to increase in order to achieve a shorter delay. As a result of resource sharing, the area of the time-multiplexed design is reduced compared to the reference design. The area reduction comes at the expense of increased energy per operation.

Slide 3.26

Interleaving and folding reduce the area for the same energy by sharing logic gates. Both techniques involve up-sampling and interleaving, so there is no timing slack available to exploit; hence, the supply voltage remains constant. That is why interleaving and folding map back to the reference point in the energy-delay space, but move toward reduced area in the energy-area space.

We can use these architectural techniques to reach a desired energy-delay-area point, and this slide shows a systematic way of doing so.

[Figure: Architectural transformations in the energy-delay and energy-area spaces — VDD scaling moves a design along its energy-delay curve; from the reference point, parallelism and pipelining move toward lower energy (larger area), time-multiplexing toward higher energy (smaller area), and interleaving/folding stay at the reference energy while reducing area.]


Summary

• Architectural parallelism and pipelining can be used to reduce power (improve energy efficiency) through voltage scaling
‣ Equivalently, performance can be improved for the same energy per operation

• Performance-centric designs (those that maximize BIPS) require shorter logic pipelines (fewer FO4 stages)

• Energy-performance-optimal designs have roughly equal leakage and switching components of energy
‣ Otherwise, one can be traded for the other for further energy reduction

• Architectural techniques for direct (parallelism, pipelining) and recursive (interleaving, folding) systems can be analyzed in the area-energy-performance plane for compact comparison
‣ Latency (the number of pipeline stages) is dictated by the cycle time