UC Regents Spring 2014 © UCB. CS 152 L3: Metrics
2014-1-28 John Lazzaro
(not a prof - “John” is always OK)
CS 152: Computer Architecture and Engineering
www-inst.eecs.berkeley.edu/~cs152/
TA: Eric Love
Lecture 3 – Metrics
Tuesday, February 11, 14
Topics for today’s lecture
Metrics: Estimating the “goodness” of a CPU design ... so that we can redesign the CPU to be “better”.
A case study in microcode control: the Motorola 68000, the CPU that powered the original Macintosh. [see Lecture 5 slides for this topic]
Short Break.
Administrivia: Will announce office hours soon ...
Todd Hamilton, iWatch concept.
On the drawing board ...
Gray-scale computer graphics model.
Color computer graphics model ...
Animated model ...
Then the baton is passed to us. We use models to do stepwise refinement of the silicon that powers the consumer product.
Four metrics:
Performance: Execution time of a program. (Today’s focus.)
Cost: How many dollars to manufacture. (For a later lecture ...)
Energy: Joules required to execute a program. (For a later lecture ...)
Time to Market: Will we ship a product before our competitors? (For a later lecture ...)
CS 152 L6: Performance. UC Regents Fall 2006 © UCB

Performance Measurement (as seen by the customer)

Who (sensibly) upgrades CPUs often? A professional who turns CPU cycles into money, and who is cycle-limited.
Artist tool: animation, video special effects.
How to decide to buy a new machine?
Measure After Effects “execution time” on a representative render “workload”.
“Night flight”: city map and clouds computed “on the fly” with fractals. CPU intensive; trivial I/O. (Still shot from the movie.)
Interpreting Execution Time

Performance = 1 / Execution Time

PowerBook G4, 1.25 GHz. Execution Time: 1265 seconds = 2.85 renders/hour.

1.5 GHz PB (Y) is N times faster than 1.25 GHz PB (X). N is ?

N = Performance (Y) / Performance (X) = Execution Time (X) / Execution Time (Y) = 1.19

PB 1.5 GHz: 3.4 renders/hour. PB 1.25 GHz: 2.85 renders/hour. Might make the difference in meeting a deadline ...
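The renders/hour and speedup arithmetic above can be checked with a short sketch (the 1265 s and 3.4 renders/hour figures come from the slide; the function name is ours):

```python
# Performance = 1 / Execution Time; speedup N = ExecTime(X) / ExecTime(Y).
def renders_per_hour(exec_time_s: float) -> float:
    """One render takes exec_time_s seconds; how many fit in an hour?"""
    return 3600.0 / exec_time_s

time_x = 1265.0                  # 1.25 GHz PowerBook: seconds per render
perf_x = renders_per_hour(time_x)
perf_y = 3.4                     # 1.5 GHz PowerBook: renders/hour, from the slide

n = perf_y / perf_x              # same ratio as ExecTime(X) / ExecTime(Y)
print(f"{perf_x:.2f} renders/hour, speedup N = {n:.2f}")
```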
2 CPUs: Execution Time vs Throughput
Execution Time: time for one job to complete.
Throughput: # of independent jobs/hour completed.
2 CPUs vs 1 CPU, otherwise similar.
Assume G5 MP execution time is 1.8x faster because AE isn’t parallelized on Opteron CPUs. Implies parallel code on a Mac.
However, G5 and Opteron may have the same throughput.
Performance Measurement (as seen by a CPU designer)
Q. Why do we care about After Effects’ performance? A. We want the CPU we are designing to run it well!
Step 1: Analyze the right measurement!
CPU Time: Time the CPU spends running the program under measurement. Guides CPU design.
Response Time: Total time: CPU Time + time spent waiting (for disk, I/O, ...). Guides system design.
Measuring CPU time (Unix):
% time <program name>
25.77u 0.72s 0:29.17 90.8%
CPU time: Proportional to Instruction Count

CPU time / Program ∝ Machine Instructions / Program

Rationale: Every additional instruction you execute takes time.
Q. Static count (lines of program printout)? Or dynamic count (trace of execution)? A. Dynamic.
Q. How does an architect influence the number of machine instructions needed to run an algorithm? A. Create new instructions: instruction set architect.
Q. Once the ISA is set, who can influence instruction count? A. Compiler writer, application developer.
CPU time: Proportional to Clock Period

Time / Program ∝ Time / One Clock Period

Rationale: We measure each instruction’s execution time in “number of cycles”. By shortening the period for each cycle, we shorten execution time.
Q. What ultimately limits an architect’s ability to reduce the clock period?
Q. How can architects (not technologists) reduce the clock period?
We will revisit these questions later in lecture ...
Completing the performance equation

Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

We need all three terms, and only these terms, to compute CPU Time!
“CPI”: the average number of clock cycles per instruction for the program (the Cycles / Instruction term).
What factors make different programs have different CPIs? Instruction mix varies. Cache behavior varies. Branch prediction varies.
Q. When is it OK to compare clock rates? A. When the other right-hand-side terms are equal.
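The three-term equation is easy to exercise numerically. A sketch with hypothetical numbers (the instruction count, CPI, and clock rate below are made up for illustration, not from the slide):

```python
# Seconds/Program = (Instructions/Program) x (Cycles/Instruction) x (Seconds/Cycle)
def cpu_time_s(instructions: float, cpi: float, clock_hz: float) -> float:
    """CPU time in seconds; Seconds/Cycle is the reciprocal of the clock rate."""
    return instructions * cpi * (1.0 / clock_hz)

# Hypothetical workload: 10 billion dynamic instructions, CPI 1.5, 2 GHz clock.
t = cpu_time_s(instructions=10e9, cpi=1.5, clock_hz=2e9)
print(f"CPU time = {t:.1f} seconds")
```

Note how comparing clock rates alone is meaningless here: halving the period only halves `t` if the other two terms stay fixed.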
UC Regents Fall 2008 © UCB. CS 194-6 L5: Pipelining

Consider the Lecture 2 single-cycle CPU ...
All instructions take 1 cycle to execute every time they run.
CPI of any program running on the machine? 1.0
“Average CPI for the program” is a more useful concept for more complicated machines ...
Recall Lecture 2: Multiflow VLIW CPU

Two 32-bit instruction slots per long word, each with fields: opcode rs rt rd shamt funct
Slot 1 syntax: ADD $8 $9 $10. Semantics: $8 = $9 + $10
Slot 2 syntax: ADD $7 $8 $9. Semantics: $7 = $8 + $9

N x 32-bit VLIW yields a factor of N speedup! Multiflow: N = 7, 14, or 28 (3 CPUs in the product family).

Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

Q. Which right-hand-side term decreases with “N”? A. Instructions / Program gets smaller. We hope Cycles / Instruction doesn’t grow.
Consider a machine with a data cache ...

Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

A program’s load instructions “stride” through every memory address. The cache never “hits”, so every load goes to DRAM (100x slower than loads that go to cache).
Thus, the average number of cycles for load instructions is higher for this program.
Thus, the average number of cycles for all instructions (Cycles / Instruction) is higher for this program.
Thus, the program takes longer to run: Seconds / Program grows!
CPI as an analytical tool to guide design

Program instruction mix and per-class cycle counts (throughput, not latency):

Class      | Mix  | Cycles
Multiply   | 30%  | 5
Other ALU  | 20%  | 1
Load       | 20%  | 2
Store      | 10%  | 2
Branch     | 20%  | 2

Machine CPI = (5 x 30 + 1 x 20 + 2 x 20 + 2 x 10 + 2 x 20) / 100 = 2.7 cycles/instruction

Where the program spends its time: Multiply 56%, Load 15%, Branch 15%, Store 7% (= 20/270), Other ALU 7%.

Now we know how to optimize the design ...
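The weighted-CPI computation above is mechanical; a sketch using the slide's instruction mix and cycle counts:

```python
# Machine CPI = sum over classes of (mix fraction x cycles), and each class's
# share of execution time is its (mix x cycles) contribution over the total.
mix    = {"Multiply": 0.30, "Other ALU": 0.20, "Load": 0.20,
          "Store": 0.10, "Branch": 0.20}
cycles = {"Multiply": 5, "Other ALU": 1, "Load": 2, "Store": 2, "Branch": 2}

machine_cpi = sum(mix[c] * cycles[c] for c in mix)                 # 2.7
time_share  = {c: mix[c] * cycles[c] / machine_cpi for c in mix}   # Multiply ~56%

print(f"Machine CPI = {machine_cpi:.1f} cycles/instruction")
for c, share in time_share.items():
    print(f"{c:>9}: {share:5.1%} of execution time")
```

This is why the multiplier is the place to optimize: 30% of the instructions account for over half the cycles.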
Final thoughts: Performance Equation

Seconds / Program = (Instructions / Program) x (Cycles / Instruction) x (Seconds / Cycle)

The goal is to optimize execution time, not individual equation terms.
Cycles / Instruction is the CPI of the program. It reflects the program’s instruction mix; machines are optimized with respect to program workloads.
Seconds / Cycle is the clock period. Optimize it jointly with machine CPI.
Invented the “one ISA, many implementations” business model.
Amdahl’s Law (of Diminishing Returns)

If enhancement “E” makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup “S”?

Where the program spends its time: Multiply 52%; the remaining 48% is split among Branch (16%), Load (16%), and other instruction classes (8% + 8%).

S = 1 / ((post-enhancement %) / 100%) = 1 / (48% / 100%) = 2.08

Attributed to Gene Amdahl -- “Amdahl’s Law”.
What is the lesson of Amdahl’s Law? We must enhance computers in a balanced way!
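The limiting case above (the enhanced fraction goes to zero time) can be sketched in one line:

```python
# Amdahl's Law, infinite-enhancement case: if a fraction f of execution time
# becomes free, the untouched (1 - f) bounds the speedup at 1 / (1 - f).
def amdahl_infinite(enhanced_fraction: float) -> float:
    return 1.0 / (1.0 - enhanced_fraction)

# Multiply is 52% of execution time, so even an infinitely fast multiplier
# only doubles overall performance.
print(f"S = {amdahl_infinite(0.52):.2f}")
```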
Amdahl’s Law in Action

A program we wish to run on N CPUs spends 30% of its time running code that cannot be recoded to run in parallel (Serial 30%, Parallel 70%).

S = 1 / ((30% + (70% / N)) / 100%)

CPUs:    2     3     4     5     ∞
Speedup: 1.54  1.88  2.1   2.3   3.3
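The speedup table follows directly from the formula; a sketch with the slide's 30% serial fraction:

```python
# Amdahl's Law for N CPUs: the serial fraction runs at full cost, the
# parallel fraction is divided by N.
def speedup(n_cpus: float, serial: float = 0.30) -> float:
    return 1.0 / (serial + (1.0 - serial) / n_cpus)

for n in (2, 3, 4, 5):
    print(f"N = {n}: S = {speedup(n):.2f}")
print(f"N = inf: S = {speedup(float('inf')):.2f}")   # limit is 1/0.30
```

Note how quickly the curve flattens: going from 2 to 5 CPUs buys less than going from 1 to 2.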
Real-world 2006: 2 CPUs vs 4 CPUs

20 in iMac: Core Duo 2, 2.16 GHz, $1500. 2 cores on one die.
Mac Pro: 2 Dual-Core Xeons, 2.66 GHz, $3200 w/ 20 inch display. 4 cores on two dies.
Caveat: Mac Pro CPUs are server-class and have architectural advantages (better I/O, ECC DRAM, etc.)
ZIPing a file: very difficult to parallelize. Simple audio and video tasks: easier to parallelize.
Amdahl’s Law + real-world legacy code issues in action. Source: MACWORLD
Break

Timing
CPU time: Proportional to Clock Period

Time / Program ∝ Time / One Clock Period

Rationale: We measure each instruction’s execution time in “number of cycles”. By shortening the period for each cycle, we shorten execution time.
Q. What ultimately limits an architect’s ability to reduce the clock period?
Q. How can architects (not technologists) reduce the clock period?
In this part of the lecture, we answer these questions ...
UC Regents Fall 2008 © UCB. CS 194-6 L6: Timing

Goal: Determine the minimum clock period

[Single-cycle MIPS datapath diagram: PC register, instruction memory, 32-bit register file (rs1, rs2, ws, wd ports, RegWr enable), sign extender, ALU, data memory (Din, Dout, Addr, MemWr enable), PC + 4 adder and branch adder, and control lines RegDest, ExtOp, ALUsrc, ALUctr, MemToReg, PCSrc, Equal; combinational logic between clocked state elements driven by Clk.]
UC Regents Fall 2013 © UCB. CS 250 L3: Timing

A Logic Circuit Primer
“Models should be as simple as possible, but no simpler ...” Albert Einstein.

Inverters: A simple transistor model
1/28/04 © UCB Spring 2004. CS152 / Kubiatowicz

Design Refinement
Informal System Requirement -> Initial Specification -> Intermediate Specification -> Final Architectural Description -> Intermediate Specification of Implementation -> Final Internal Specification -> Physical Implementation
(refinement: increasing level of detail)

Logic Components (elements of the design zoo)
° Wires: Carry signals from one point to another. Single bit (no size label) or multi-bit bus (size label).
° Combinational Logic: Like function evaluation. Data goes in, results come out after some propagation delay.
° Flip-Flops: Storage elements. After a clock edge, the input is copied to the output; otherwise, the flip-flop holds its value. Also: a “latch” is a storage element that is level-triggered.

Basic Combinational Elements + DeMorgan Equivalence

Wire (Out = In): In 0 -> Out 0, In 1 -> Out 1.
Inverter (Out = NOT In): In 0 -> Out 1, In 1 -> Out 0.

NAND gate (Out = NOT(A AND B)):   NOR gate (Out = NOT(A OR B)):
A B | Out                         A B | Out
0 0 |  1                          0 0 |  1
0 1 |  1                          0 1 |  0
1 0 |  1                          1 0 |  0
1 1 |  0                          1 1 |  0

DeMorgan’s Theorem: NOT(A OR B) = NOT(A) AND NOT(B), and NOT(A AND B) = NOT(A) OR NOT(B). So each gate has an equivalent form with the bubbles (inverters) moved to the opposite side: a NAND is an OR of inverted inputs, and a NOR is an AND of inverted inputs.
Delay Model: CMOS

Review: General C/L Cell Delay Model
° A combinational cell (symbol) is fully specified by:
• functional (input -> output) behavior: truth table, logic equation, VHDL
• load factor of each input
• critical propagation delay from each input to each output, for each transition:
- THL(A, o) = Fixed Internal Delay + Load-Dependent Delay x Load
° The linear model composes: delay grows linearly with output capacitance Cout, with a fixed internal delay plus a delay-per-unit-load slope.

Basic Technology: CMOS
° CMOS: Complementary Metal Oxide Semiconductor
• NMOS (N-Type Metal Oxide Semiconductor) transistors: apply a HIGH (Vdd = 5V) to the gate and the transistor becomes a “conductor”; apply a LOW (GND = 0V) and the conduction path shuts off.
• PMOS (P-Type Metal Oxide Semiconductor) transistors: apply a HIGH (Vdd) to the gate and the conduction path shuts off; apply a LOW (GND) and the transistor becomes a “conductor”.

Basic Components: CMOS Inverter
° Inverter operation: when In is “0”, the PMOS pull-up conducts and charges Out to Vdd (“1”) while the NMOS is open; when In is “1”, the NMOS pull-down conducts and discharges Out to GND (“0”) while the PMOS is open.
pFET: a switch. “On” if its gate is grounded.
nFET: a switch. “On” if its gate is at Vdd.

Correctly predicts logic output for simple static CMOS circuits. Extensions to model subtler circuit families, or to predict timing, have not worked well ...
Transistors as water valves. If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets ...

An “on” p-FET fills up the capacitor with charge.
An “on” n-FET empties the bucket.
[Figure: water level vs. time. On a “0” -> “1” transition the bucket fills (water level rises); on a “1” -> “0” transition it empties (water level falls).]

This model is often good enough ... (Cartoon physics)
What is the bucket? A gate’s “fan-out”.

“Fan-out”: The number of gate inputs driven by a gate’s output.
Driving other gates slows a gate down. Driving wires slows a gate down. Driving its own parasitics slows a gate down.
(Spring 2003 EECS150 – Lec10-Timing: gate switching behavior waveforms for an inverter and a NAND gate.)

Fanout

A closer look at fan-out ...
Series Connection
Two inverters G1 and G2 in series: Vin -> V1 -> Vout, with capacitance C1 on the node between them.
° Total propagation delay = sum of individual delays = d1 + d2 (each delay measured where the waveform crosses Vdd/2).
° Capacitance C1 has two components:
• the capacitance of the wire connecting the two gates
• the input capacitance of the second inverter

Calculating Aggregate Delays
G1 drives two gates from node V1: G2 (output V2) and G3 (output V3).
° Sum delays along serial paths.
° Delay (Vin -> V2) need not equal Delay (Vin -> V3):
• Delay (Vin -> V2) = Delay (Vin -> V1) + Delay (V1 -> V2)
• Delay (Vin -> V3) = Delay (Vin -> V1) + Delay (V1 -> V3)
° Critical Path = the longest among the N parallel paths.
° C1 = wire C + Cin of Gate 2 + Cin of Gate 3.

Characterize a Gate
° Input capacitance for each input.
° For each input-to-output path, and for each output transition type (H->L, L->H, H->Z, L->Z ... etc.):
- internal delay (ns)
- load-dependent delay (ns / fF)
° Example: 2-input NAND gate.
Delay A -> Out, for Out: Low -> High, as a function of Cout: 0.5 ns intercept, slope = 0.0021 ns / fF.
For A and B: Input Load (I.L.) = 61 fF.
For either A -> Out or B -> Out:
Tlh = 0.5 ns, Tlhf = 0.0021 ns / fF; Thl = 0.1 ns, Thlf = 0.0020 ns / fF.
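The linear cell-delay model is easy to evaluate. A sketch using the NAND numbers above (Tlh = 0.5 ns, slope 0.0021 ns/fF, input load 61 fF); the particular output load is an assumption for illustration:

```python
# Linear cell-delay model: delay = fixed internal delay + slope x output load.
def gate_delay_ns(internal_ns: float, slope_ns_per_ff: float, load_ff: float) -> float:
    return internal_ns + slope_ns_per_ff * load_ff

# Assume the NAND output drives two identical NAND inputs (61 fF each)
# plus 50 fF of wire capacitance -- hypothetical but plausible loading.
load_ff = 2 * 61.0 + 50.0       # 172 fF total
tplh = gate_delay_ns(internal_ns=0.5, slope_ns_per_ff=0.0021, load_ff=load_ff)
print(f"Tplh = {tplh:.3f} ns")
```

Doubling the fan-out doubles only the load-dependent term, which is why the model is linear in Cout.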
A Specific Example: 2-to-1 MUX
Y = (A and !S) or (B and S), built from an inverter on S (Gate 1) and NAND gates (Gates 2 and 3), joined by Wires 0, 1, and 2.
° Input Load (I.L.):
• A, B: I.L. (NAND) = 61 fF
• S: I.L. (INV) + I.L. (NAND) = 50 fF + 61 fF = 111 fF
° Load-Dependent Delay (L.D.D.): same as Gate 3.
• TAYlhf = 0.0021 ns / fF, TAYhlf = 0.0020 ns / fF
• TBYlhf = 0.0021 ns / fF, TBYhlf = 0.0020 ns / fF
• TSYlhf = 0.0021 ns / fF, TSYhlf = 0.0020 ns / fF
The linear model works for reasonable fan-out.
Gate Delay
• Fan-out: the delay of a gate is proportional to its output capacitance, because gates #2 and #3 turn on/off at a later time. (It takes longer for the output of gate #1 to reach the switching threshold of gates #2 and #3 as we add more output capacitance.)
• Driving more gates adds delay. The delay time of an inverter driving 4 inverters is the FO4 (fanout of four) delay.
Propagation delay graphs ...

[Figure: cascaded gates. Vin and Vout waveforms for a chain of inverters; 1 -> 0 and 0 -> 1 transitions alternate down the chain, each stage adding delay, with the inverter transfer function setting the switching threshold.]
Worst-case delay through combinational logic

x = g(a, b, c, d, e, f)
• “Fan-in”: what is the delay in this circuit?
• Critical Path: the path with the maximum delay, from any input to any output. In general, we include register set-up and clk-to-Q times in the critical path calculation.
• Why do we care about the critical path?
If d going 0-to-1 switches x 0-to-1, the delay is T1. If a going 0-to-1 switches x 0-to-1, the delay is T2. T2 might be the worst-case delay path (the critical path); it would be surprising if T1 > T2.
Why “might”? Wires have delay too ...

Wire Delay
• Even in those cases where the transmission-line effect is negligible, wires possess distributed resistance and capacitance; the time constant associated with distributed RC is proportional to the square of the length.
• For short wires on ICs, resistance is insignificant (relative to the effective R of transistors), but C is important. Typically around half of the C of a gate load is in the wires.
• For long wires on ICs (busses, clock lines, global control signals, etc.), resistance is significant, so the distributed RC effect dominates. Signals are typically “rebuffered” to reduce delay: v1 -> v2 -> v3 -> v4.

Looks benign, but ...
Clocked Logic Circuits
From Delay Models to Timing Analysis

[Background image: pages from the IEEE Journal of Solid-State Circuits, vol. 36, no. 11, November 2001: a low-power ARM-compatible microprocessor paper (process SEM cross section, Fig. 1; pipeline organization, Fig. 2).]

Timing Analysis: What is the smallest T that produces correct operation?

f = 1/T: 1 MHz -> 1 μs; 10 MHz -> 100 ns; 100 MHz -> 10 ns; 1 GHz -> 1 ns.

Example: a parallel-to-serial converter. Two flip-flops a and b feed a mux whose output is re-registered, all clocked with period T.

T ≥ time(clk->Q) + time(mux) + time(setup)
T ≥ τ(clk->Q) + τ(mux) + τ(setup)
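The clock-period inequality above is the whole computation. A sketch with hypothetical component delays (the 0.2/0.5/0.1 ns numbers are made up to exercise the formula, not taken from the slide):

```python
# Minimum clock period for a register-to-register path:
# T >= t_clk_to_q + t_logic + t_setup   (here the logic is just the mux).
def min_period_ns(clk_to_q_ns: float, logic_ns: float, setup_ns: float) -> float:
    return clk_to_q_ns + logic_ns + setup_ns

t = min_period_ns(clk_to_q_ns=0.2, logic_ns=0.5, setup_ns=0.1)
print(f"T >= {t:.1f} ns, so f <= {1e3 / t:.0f} MHz")
```

The smallest legal T is set by the slowest such path in the design, which is why critical-path delay determines the clock rate.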
Timing Analysis and Logic Delay

If our clock period T > the worst-case delay through the combinational logic, does this ensure correct operation?
1/28/04 ©UCB Spring 2004CS152 / Kubiatowicz
Lec3.9
General C/L Cell Delay Model
° Combinational Cell (symbol) is fully specified by:
• functional (input -> output) behavior
- truth-table, logic equation, VHDL
• Input load factor of each input
• Propagation delay from each input to each output for each transition
- THL(A, o) = Fixed Internal Delay + Load-dependent-delay x load
° Linear model composes
[Figure: a combinational logic cell (inputs A, B, ..., output X) beside a plot of Delay vs. output load Cout: the y-intercept is the internal delay, the slope is the delay per unit load, with Ccritical marked on the load axis.]
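The linear delay model can be sketched in a few lines; the coefficients below are hypothetical, not taken from any real cell library:

```python
# Sketch of the linear cell-delay model above.
# The coefficients are hypothetical, not from a real cell library.
def t_hl(internal_delay_ps, delay_per_ff_ps, load_ff):
    """Propagation delay = fixed internal delay + load-dependent delay x load."""
    return internal_delay_ps + delay_per_ff_ps * load_ff

# e.g., a cell with 40 ps internal delay and 2.5 ps/fF slope driving 12 fF:
print(t_hl(40.0, 2.5, 12.0))  # 70.0
```

Because the model is linear, a path delay composes as the sum of such terms, each evaluated at the load its stage actually drives.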
1/28/04 ©UCB Spring 2004CS152 / Kubiatowicz
Lec3.10
Storage Element’s Timing Model
Clk
D Q
° Setup Time: Input must be stable BEFORE trigger clock edge
° Hold Time: Input must REMAIN stable after trigger clock edge
° Clock-to-Q time:
• Output cannot change instantaneously at the trigger clock edge
• Similar to delay in logic gates, two components:
- Internal Clock-to-Q
- Load dependent Clock-to-Q
[Waveform: D is "don't care" outside the setup/hold window around the clock edge, where it must be stable; Q is unknown for the clock-to-Q delay after the edge, then takes the sampled value.]
1/28/04 ©UCB Spring 2004CS152 / Kubiatowicz
Lec3.11
Clocking Methodology
[Figure: registers and combinational logic blocks alternating in a chain, all clocked by the same Clk.]
° All storage elements are clocked by the same clock edge
° The combination logic blocks:
• Inputs are updated at each clock tick
• All outputs MUST be stable before the next clock tick
1/28/04 ©UCB Spring 2004CS152 / Kubiatowicz
Lec3.12
Critical Path & Cycle Time
[Figure: the same register/logic chain on a common Clk; the cycle time is set by the slowest register-to-register path.]
° Critical path: the slowest path between any two storage devices
° Cycle time is a function of the critical path
° must be greater than:
Clock-to-Q + Longest Path through Combination Logic + Setup
Register:
An Array of Flip-Flops
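A minimal sketch of this cycle-time bound, using hypothetical delay numbers:

```python
# Cycle time must exceed clock-to-Q + longest combinational path + setup.
# All delay values (ns) are hypothetical.
t_clk_to_q = 0.5
t_setup = 0.25
path_delays_ns = [2.0, 3.25, 1.5]   # combinational delay for each register pair

t_critical = max(path_delays_ns)    # the critical (slowest) path
t_min_cycle = t_clk_to_q + t_critical + t_setup
print(t_min_cycle)  # 4.0
```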
Combinational Logic
43Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Flip Flops have internal delays ...
D Q
CLK
Value of D is sampled on the positive clock edge. Q outputs the sampled value for the rest of the cycle.
D
Q
t_setup
t_clk-to-Q
44Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Flip-Flop delays eat into “time budget”
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the
versus dependence and source-to-body bias is used
to electrically limit transistor in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry
branch target buffer improves branch performance in a pipeline
deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. [The excerpt continues as quoted
earlier in these notes.]
Spring 2003 EECS150 – Lec10-Timing Page 7
Example
• Parallel to serial converter:
a
b
T ≥ time(clk→Q) + time(mux) + time(setup)
T ≥ τclk→Q + τmux + τsetup
clk
ALU “time budget”
Spring 2003 EECS150 – Lec10-Timing Page 8
General Model of Synchronous Circuit
• In general, for correct operation:
for all paths.
• How do we enumerate all paths?
– Any circuit input or register output to any register input or circuit
output.
– “setup time” for circuit outputs depends on what it connects to
– “clk-Q time” for circuit inputs depends on from where it comes.
[Figure: input -> CL -> reg -> CL -> reg -> output, with an optional feedback path; both registers share the clock input.]
T ≥ time(clk→Q) + time(CL) + time(setup)
T ≥ τclk→Q + τCL + τsetup
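The path enumeration described above can be sketched as a max over all launch-to-capture paths; the endpoints and delay numbers (ns) below are hypothetical, with circuit inputs and outputs carrying their own clk-to-Q and setup values as the bullets note:

```python
# Sketch: bound the clock period over every launch -> capture path.
# All endpoint names and delay numbers (ns) are hypothetical.
paths = [
    # (launch clk-to-Q, combinational delay, capture setup)
    (0.5, 2.0, 0.25),    # regA -> regB
    (0.5, 3.0, 0.25),    # regB -> regC
    (0.25, 1.75, 0.25),  # circuit input -> regC ("clk-Q" set by the driver)
    (0.5, 2.5, 0.5),     # regC -> circuit output ("setup" set by the receiver)
]
t_min = max(clk_q + cl + setup for clk_q, cl, setup in paths)
print(t_min)  # 3.75
```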
45Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Clock skew also eats into “time budget”
Spring 2003 EECS150 – Lec10-Timing Page 18
Clock Skew (cont.)
• If clock period T = TCL + Tsetup + Tclk→Q, circuit will fail.
• Therefore:
1. Control clock skew
a) Careful clock distribution. Equalize path delay from clock source to all clock loads by controlling wires delay and buffer delay.
b) don’t “gate” clocks.
2. T " TCL+Tsetup+Tclk!Q + worst case skew.
• Most modern large high-performance chips (microprocessors) control end to end clock skew to a few tenths of a nanosecond.
[Figure: register pair separated by CL; the capture register's clock CLK' is a delayed copy of CLK (clock skew from distribution delay).]
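Rule 2 above can be sketched numerically (all values hypothetical):

```python
# Sketch of rule 2 above: worst-case skew adds to the cycle-time bound.
# All values (ns) are hypothetical.
t_cl, t_setup, t_clk_to_q = 3.0, 0.25, 0.5
worst_case_skew = 0.25
t_min = t_cl + t_setup + t_clk_to_q + worst_case_skew
print(t_min)  # 4.0
```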
Spring 2003 EECS150 – Lec10-Timing Page 19
Clock Skew (cont.)
• Note reversed buffer.
• In this case, clock skew actually provides extra time (adds
to the effective clock period).
• This effect has been used to help run circuits at higher
clock rates. Risky business!
[Figure: the same register pair with the clock buffer direction reversed, so the skew now works in the opposite direction.]
As T →0, which circuit
fails first?
46Tuesday, February 11, 14
UC Regents Fall 2006 © UCBCS 152 L5: Timing
Clocks have dedicated wires (low skew)
Spartan-3 FPGA Family: Functional Description
30 www.xilinx.com DS099-2 (v1.3) August 24, 2004 Preliminary Product Specification
width of the die. In turn, the horizontal spine branches out into a subsidiary clock interconnect that accesses the CLBs.
2. The clock input of either DCM on the same side of the die — top or bottom — as the BUFGMUX element in use.
A Global clock input is placed in a design using either a BUFGMUX element or the BUFG (Global Clock Buffer) element. For the purpose of minimizing the dynamic power dissipation of the clock network, the Xilinx development software automatically disables all clock line segments that a design does not use.
Figure 18: Spartan-3 Clock Network (Top View)
[Figure: DCM pairs and BUFGMUX elements at the top and bottom of the die feed the eight global clocks (GCLK0–GCLK7) onto top and bottom spines and a horizontal spine; branch counts are array dependent.]
From: Xilinx Spartan-3 data sheet. Virtex is similar.
“Clock tree”
Flip flop clock inputs are the “leaves” of the “tree”.
47Tuesday, February 11, 14
Gold wires form clock tree.
Die photo: Xilinx Virtex Pro
48Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
the total wire delay is similar to the total buffer delay. A patented tuning algorithm [16] was required to tune the more than 2000 tunable transmission lines in these sector trees to achieve low skew, visualized as the flatness of the grid in the 3D visualizations. Figure 8 visualizes four of the 64 sector trees containing about 125 tuned wires driving 1/16th of the clock grid. While symmetric H-trees were desired, silicon and wiring blockages often forced more complex tree structures, as shown. Figure 8 also shows how the longer wires are split into multiple-fingered transmission lines interspersed with Vdd and ground shields (not shown) for better inductance control [17, 18]. This strategy of tunable trees driving a single grid results in low skew among any of the 15 200 clock pins on the chip, regardless of proximity.
From the global clock grid, a hierarchy of short clock routes completed the connection from the grid down to the individual local clock buffer inputs in the macros. These clock routing segments included wires at the macro level from the macro clock pins to the input of the local clock buffer, wires at the unit level from the macro clock pins to the unit clock pins, and wires at the chip level from the unit clock pins to the clock grid.
Design methodology and results
This clock-distribution design method allows a highly productive combination of top-down and bottom-up design perspectives, proceeding in parallel and meeting at the single clock grid, which is designed very early. The trees driving the grid are designed top-down, with the maximum wire widths contracted for them. Once the contract for the grid had been determined, designers were insulated from changes to the grid, allowing necessary adjustments to the grid to be made for minimizing clock skew even at a very late stage in the design process. The macro, unit, and chip clock wiring proceeded bottom-up, with point tools at each hierarchical level (e.g., macro, unit, core, and chip) using contracted wiring to form each segment of the total clock wiring. At the macro level, short clock routes connected the macro clock pins to the local clock buffers. These wires were kept very short, and duplication of existing higher-level clock routes was avoided by allowing the use of multiple clock pins. At the unit level, clock routing was handled by a special tool, which connected the macro pins to unit-level pins, placed as needed in pre-assigned wiring tracks. The final connection to the fixed
Figure 6. Schematic diagram of global clock generation and distribution.
[Figure: PLL (with bypass path), reference clock in/out, clock distribution, clock out.]
Figure 7. 3D visualization of the entire global clock network. The x and y coordinates are chip x, y, while the z axis is used to represent delay, so the lowest point corresponds to the beginning of the clock distribution and the final clock grid is at the top. Widths are proportional to tuned wire width, and the three levels of buffers appear as vertical lines.
[Figure: labeled features include the grid, tuned sector trees, sector buffers, and buffer levels 1 and 2.]
Figure 8. Visualization of four of the 64 sector trees driving the clock grid, using the same representation as Figure 7. The complex sector trees and multiple-fingered transmission lines used for inductance control are visible at this scale.
J. D. WARNOCK ET AL. IBM J. RES. & DEV. VOL. 46 NO. 1 JANUARY 2002
32
Clock Tree Delays, IBM “Power” CPU
49Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Clock Tree Delays, IBM Power
clock grid was completed with a tool run at the chip level, connecting unit-level pins to the grid. At this point, the clock tuning and the bottom-up clock routing process still have a great deal of flexibility to respond rapidly to even late changes. Repeated practice routing and tuning were performed by a small, focused global clock team as the clock pins and buffer placements evolved to guarantee feasibility and speed the design process.
Measurements of jitter and skew can be carried out using the I/Os on the chip. In addition, approximately 100 top-metal probe pads were included for direct probing of the global clock grid and buffers. Results on actual POWER4 microprocessor chips show long-distance skews ranging from 20 ps to 40 ps (cf. Figure 9). This is improved from early test-chip hardware, which showed as much as 70 ps skew from across-chip channel-length variations [19]. Detailed waveforms at the input and output of each global clock buffer were also measured and compared with simulation to verify the specialized modeling used to design the clock grid. Good agreement was found. Thus, we have achieved a “correct-by-design” clock-distribution methodology. It is based on our design experience and measurements from a series of increasingly fast, complex server microprocessors. This method results in a high-quality global clock without having to use feedback or adjustment circuitry to control skews.
Circuit design
The cycle-time target for the processor was set early in the project and played a fundamental role in defining the pipeline structure and shaping all aspects of the circuit design as implementation proceeded. Early on, critical timing paths through the processor were simulated in detail in order to verify the feasibility of the design point and to help structure the pipeline for maximum performance. Based on this early work, the goal for the rest of the circuit design was to match the performance set during these early studies, with custom design techniques for most of the dataflow macros and logic synthesis for most of the control logic—an approach similar to that used previously [20]. Special circuit-analysis and modeling techniques were used throughout the design in order to allow full exploitation of all of the benefits of the IBM advanced SOI technology.
The sheer size of the chip, its complexity, and the number of transistors placed some important constraints on the design which could not be ignored in the push to meet the aggressive cycle-time target on schedule. These constraints led to the adoption of a primarily static-circuit design strategy, with dynamic circuits used only sparingly in SRAMs and other critical regions of the processor core. Power dissipation was a significant concern, and it was a key factor in the decision to adopt a predominantly static-circuit design approach. In addition, the SOI technology,
including uncertainties associated with the modeling of the floating-body effect [21–23] and its impact on noise immunity [22, 24–27] and overall chip decoupling capacitance requirements [26], was another factor behind the choice of a primarily static design style. Finally, the size and logical complexity of the chip posed risks to meeting the schedule; choosing a simple, robust circuit style helped to minimize overall risk to the project schedule with most efficient use of CAD tool and design resources. The size and complexity of the chip also required rigorous testability guidelines, requiring almost all cycle boundary latches to be LSSD-compatible for maximum dc and ac test coverage.
Another important circuit design constraint was the limit placed on signal slew rates. A global slew rate limit equal to one third of the cycle time was set and enforced for all signals (local and global) across the whole chip. The goal was to ensure a robust design, minimizing the effects of coupled noise on chip timing and also minimizing the effects of wiring-process variability on overall path delay. Nets with poor slew also were found to be more sensitive to device process variations and modeling uncertainties, even where long wires and RC delays were not significant factors. The general philosophy was that chip cycle-time goals also had to include the slew-limit targets; it was understood from the beginning that the real hardware would function at the desired cycle time only if the slew-limit targets were also met.
The following sections describe how these design constraints were met without sacrificing cycle time. The latch design is described first, including a description of the local clocking scheme and clock controls. Then the circuit design styles are discussed, including a description
Figure 9. Global clock waveforms showing 20 ps of measured skew.
[Plot: Volts (V), 0.0–1.5, vs. Time (ps), 0–2500, with the 20 ps skew annotated.]
IBM J. RES. & DEV. VOL. 46 NO. 1 JANUARY 2002 J. D. WARNOCK ET AL.
33
50Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Some Flip Flops have “hold” time ...
D
t_setup
CLK
t_hold
D must stay
stable here
D Q
CLK
Does flip-flop hold time affect operation of this
circuit? Under what conditions?
t_inv
What is the intended function of this circuit?
For correct operation: t_clk-to-Q + t_inv > t_hold.
51Tuesday, February 11, 14
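The hold-time condition above can be checked the same way (all numbers hypothetical):

```python
# Sketch of the hold check for the flip-flop-plus-inverter circuit above.
# All values (ns) are hypothetical.
t_clk_to_q, t_inv, t_hold = 0.5, 0.25, 0.5
hold_ok = t_clk_to_q + t_inv > t_hold  # new data must arrive after the hold window
print(hold_ok)  # True
```

Note that hold checks compare against the shortest path and do not depend on the clock period, so a hold violation cannot be fixed by slowing the clock.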
UC Regents Fall 2008 © UCBCS 194-6 L6: Timing
[Figure: single-cycle processor datapath: PC register; instruction memory (Addr/Data); register file (rs1, rs2, ws, rd1, rd2, wd, RegWr, WE); immediate Extend (ExtOp); ALUsrc mux; 32-bit ALU (ALUctr); data memory (Addr, Din, Dout, MemWr, WE); MemToReg mux; Equal comparator; PC+4 and branch adders with PCSrc mux; control lines from combinational logic; all clocked by Clk. Instruction fields: op (31:26), rs (25:21), rt (20:16), immediate (15:0).]
Searching for processor critical path
52Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Searching for processor critical path
[Slide background: the same IEEE JSSC vol. 36, no. 11, Nov. 2001 page excerpted earlier.]
Timing Analysis
What is the smallest T that produces correct operation?
Must consider all connected register pairs.
Q. Why might I suspect this one? A. Very long wire on the path.
53Tuesday, February 11, 14
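The slide's question can be made concrete with a small sketch (all delay numbers below are hypothetical, not from any real design): for every connected register pair, the period T must cover clock-to-Q delay plus worst-case combinational delay plus setup time, so the smallest workable T is the maximum of that sum over all pairs.

```python
# Minimum clock period: for each register-to-register path,
# T must satisfy  T >= t_clk_to_q + t_comb + t_setup.
# All delay values below are hypothetical, for illustration only.

T_CLK_TO_Q = 0.05   # ns, delay from clock edge to valid Q output
T_SETUP    = 0.04   # ns, setup time at the capturing register

# Worst-case combinational delay (ns) between each connected register pair.
paths = {
    ("PC", "IR"):          0.60,
    ("IR", "RegFile"):     0.85,
    ("RegFile", "ALUout"): 1.20,   # long wire: the suspect path
    ("ALUout", "RegFile"): 0.90,
}

def min_clock_period(paths):
    """Smallest T that produces correct operation over all register pairs."""
    return max(T_CLK_TO_Q + t_comb + T_SETUP for t_comb in paths.values())

T = min_clock_period(paths)
critical = max(paths, key=paths.get)
print(f"minimum T = {T:.2f} ns, set by path {critical}")
```

Shaving delay off any non-critical pair changes nothing; only the maximum term matters, which is why the lecture focuses on the critical path.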
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Combinational paths for IBM Power 4 CPU
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
netlist. Of these, 121 713 were top-level chip global nets, and 21 711 were processor-core-level global nets. Against this model 3.5 million setup checks were performed in late mode at points where clock signals met data signals in latches or dynamic circuits. The total number of timing checks of all types performed in each chip run was 9.8 million. Depending on the configuration of the timing run and the mix of actual versus estimated design data, the amount of real memory required was in the range of 12 GB to 14 GB, with run times of about 5 to 6 hours to the start of timing-report generation on an RS/6000* Model S80 configured with 64 GB of real memory. Approximately half of this time was taken up by reading in the netlist, timing rules, and extracted RC networks, as well as building and initializing the internal data structures for the timing model. The actual static timing analysis typically took 2.5–3 hours. Generation of the entire complement of reports and analysis required an additional 5 to 6 hours to complete. A total of 1.9 GB of timing reports and analysis were generated from each chip timing run. This data was broken down, analyzed, and organized by processor core and GPS, individual unit, and, in the case of timing contracts, by unit and macro. This was one component of the 24-hour-turnaround time achieved for the chip-integration design cycle. Figure 26 shows the results of iterating this process: a histogram of the final nominal path delays obtained from static timing for the POWER4 processor.
The POWER4 design includes LBIST and ABIST (Logic/Array Built-In Self-Test) capability to enable full-frequency ac testing of the logic and arrays. Such testing on pre-final POWER4 chips revealed that several circuit macros ran slower than predicted from static timing. The speed of the critical paths in these macros was increased in the final design. Typical fast ac LBIST laboratory test results measured on POWER4 after these paths were improved are shown in Figure 27.
Summary
The 174-million-transistor, >1.3-GHz POWER4 chip, containing two microprocessor cores and an on-chip memory subsystem, is a large, complex, high-frequency chip designed by a multi-site design team. The performance and schedule goals set at the beginning of the project were met successfully. This paper describes the circuit and physical design of POWER4, emphasizing aspects that were important to the project's success in the areas of design methodology, clock distribution, circuits, power, integration, and timing.
Figure 25
POWER4 timing flow. This process was iterated daily during the physical design phase to close timing.
(Flow-diagram labels, condensed: VIM; timer files, reports, asserts; Spice; GL/1; Chipbench/EinsTimer; extraction; non-uplift timing; noise impact on timing; uplift analysis; capacitance adjust; core or chip wiring; analysis/update of wires and buffers; per-step turnaround times of <12 hr to <48 hr. Notes: executed 2–3 months prior to tape-out; fully extracted data from routed designs; hierarchical extraction; custom logic handled separately, with Dracula and Harmony; extraction done for early and late; extracted units flat or hierarchical; incrementally extracted RLMs; custom NDRs; VIMs.)
Figure 26
Histogram of the POWER4 processor path delays.
(x-axis: timing slack (ps), −40 to 280 in 20-ps steps; y-axis: late-mode timing checks (thousands), 0 to 200.)
Most wires have hundreds of picoseconds to spare.
The critical path
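The histogram's quantity, timing slack, is the required arrival time minus the actual arrival time at each late-mode check. A minimal sketch of how such slacks could be computed and bucketed (invented arrival times, not POWER4 data):

```python
# Timing slack at a late-mode check: required time minus actual arrival.
# Large positive slack = margin to spare; slack near zero = critical path.
# All numbers here are invented for illustration, not POWER4 data.

CYCLE_PS = 770  # roughly a 1.3-GHz cycle, in picoseconds

# Arrival times (ps) at a handful of hypothetical timing-check points.
arrivals = [520, 610, 765, 430, 700, 768]

slacks = [CYCLE_PS - t for t in arrivals]

def bucket(slacks, width=50):
    """Histogram slacks into buckets of `width` ps, in the style of Fig. 26."""
    hist = {}
    for s in slacks:
        lo = (s // width) * width   # lower edge of this slack's bucket
        hist[lo] = hist.get(lo, 0) + 1
    return dict(sorted(hist.items()))

print(bucket(slacks))  # most checks have ample slack; a couple are nearly zero
```

Timing closure iterates on exactly the leftmost buckets: the checks whose slack is near (or below) zero.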
54Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Power 4: Timing Estimation, Closure
Timing Estimation: Predicting a processor's clock rate early in the project.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
55Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
Power 4: Timing Estimation, Closure
Timing Closure: Meeting (or exceeding!) the timing estimate.
From “The circuit and physical design of the POWER4 microprocessor”, IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.
56Tuesday, February 11, 14
UC Regents Fall 2013 © UCBCS 250 L3: Timing
1600 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 36, NO. 11, NOVEMBER 2001
Fig. 1. Process SEM cross section.
The process threshold voltage was raised from [1] to limit standby power.
Circuit design and architectural pipelining ensure low voltage
performance and functionality. To further limit standby current
in handheld ASSPs, a longer poly target takes advantage of the threshold-voltage versus gate-length dependence, and source-to-body bias is used to electrically limit transistor leakage in standby mode. All core
nMOS and pMOS transistors utilize separate source and bulk
connections to support this. The process includes cobalt disili-
cide gates and diffusions. Low source and drain capacitance, as
well as 3-nm gate-oxide thickness, allow high performance and
low-voltage operation.
III. ARCHITECTURE
The microprocessor contains 32-kB instruction and data
caches as well as an eight-entry coalescing writeback buffer.
The instruction and data cache fill buffers have two and four
entries, respectively. The data cache supports hit-under-miss
operation and lines may be locked to allow SRAM-like oper-
ation. Thirty-two-entry fully associative translation lookaside
buffers (TLBs) that support multiple page sizes are provided
for both caches. TLB entries may also be locked. A 128-entry branch target buffer improves branch performance in a pipeline deeper than earlier high-performance ARM designs [2], [3].
A. Pipeline Organization
To obtain high performance, the microprocessor core utilizes
a simple scalar pipeline and a high-frequency clock. In addition
to avoiding the potential power waste of a superscalar approach,
functional design and validation complexity is decreased at the
expense of circuit design effort. To avoid circuit design issues,
the pipeline partitioning balances the workload and ensures that
no one pipeline stage is tight. The main integer pipeline is seven
stages, memory operations follow an eight-stage pipeline, and
when operating in thumb mode an extra pipe stage is inserted
after the last fetch stage to convert thumb instructions into ARM
instructions. Since thumb mode instructions [11] are 16 b, two
instructions are fetched in parallel while executing thumb instructions. A simplified diagram of the processor pipeline is shown in Fig. 2, where the state boundaries are indicated by gray.
Fig. 2. Microprocessor pipeline organization.
Features that allow the microarchitecture to achieve high
speed are as follows.
The shifter and ALU reside in separate stages. The ARM in-
struction set allows a shift followed by an ALU operation in a
single instruction. Previous implementations limited frequency
by having the shift and ALU in a single stage. Splitting this op-
eration reduces the critical ALU bypass path by approximately
1/3. The extra pipeline hazard introduced when an instruction is
immediately followed by one requiring that the result be shifted
is infrequent.
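The frequency argument in this paragraph can be sketched numerically (stage delays are invented for illustration): with shift and ALU in one stage, the cycle must cover both in series; splitting them lets the period track only the slower of the two, at the cost of an occasional one-cycle hazard.

```python
# Why splitting the shifter and ALU into separate stages raises frequency.
# Stage delays are hypothetical, for illustration only (ns).

T_SHIFT, T_ALU, T_OVERHEAD = 0.40, 0.45, 0.10  # overhead = clk-to-Q + setup

# Combined stage: one cycle must fit the shift followed by the ALU.
T_combined = T_SHIFT + T_ALU + T_OVERHEAD       # 0.95 ns

# Split stages: the cycle tracks only the slower of the two stages.
T_split = max(T_SHIFT, T_ALU) + T_OVERHEAD      # 0.55 ns

# Effective time per instruction with the split design, charging one
# bubble cycle to the (infrequent) shift-then-immediately-use hazard.
hazard_rate = 0.03
T_effective = T_split * (1 + hazard_rate)

print(f"combined {T_combined:.2f} ns, split {T_split:.2f} ns "
      f"(~{T_combined / T_effective:.2f}x faster despite hazards)")
```

As long as the hazard is rare, the cycle-time win dominates the occasional bubble, which is the trade the paragraph above describes.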
Decoupled instruction fetch. A two-instruction-deep queue is
implemented between the second fetch and instruction decode
pipe stages. This allows stalls generated later in the pipe to be
deferred by one or more cycles in the earlier pipe stages, thereby
allowing instruction fetches to proceed when the pipe is stalled,
and also relieves stall speed paths in the instruction fetch and
branch prediction units.
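A toy cycle-by-cycle model (a hypothetical sketch, not the actual 80200 control logic) shows the benefit: while decode is stalled, fetch keeps filling the queue until it is full, so the stall is deferred in the earlier pipe stages.

```python
from collections import deque

# Toy model of a two-instruction-deep queue between fetch and decode.
# While decode stalls, fetch keeps running until the queue fills,
# deferring the stall in the earlier pipe stages. Hypothetical sketch.

def simulate(decode_stall_cycles, queue_depth=2, cycles=6):
    """Return how many instructions fetch completes in `cycles` cycles."""
    queue = deque()
    fetched = 0
    for cycle in range(cycles):
        # Decode consumes one instruction when not stalled.
        if cycle not in decode_stall_cycles and queue:
            queue.popleft()
        # Fetch proceeds whenever the queue has room.
        if len(queue) < queue_depth:
            queue.append(f"inst{fetched}")
            fetched += 1
    return fetched

# Decode stalls on cycles 1 and 2; fetch still makes forward progress.
with_queue    = simulate({1, 2}, queue_depth=2)
without_queue = simulate({1, 2}, queue_depth=1)  # degenerate: no slack
print(with_queue, without_queue)
```

With the two-deep queue, fetch completes one more instruction over the same six cycles because the decode stall never reaches it.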
Deferred register dependency stalls. While register depen-
dencies are checked in the RF stage, stalls due to these hazards
are deferred until the X1 stage. All the necessary operands are
then captured from result-forwarding busses as the results are
returned to the register file.
One of the major goals of the design was to minimize the en-
ergy consumed to complete a given task. Conventional wisdom
has been that shorter pipelines are more efficient due to re-
Floorplanning: Essential to meet timing.
(Intel XScale 80200)
57Tuesday, February 11, 14
58Tuesday, February 11, 14
UC Regents Fall 2006 © UCBCS 152 L6: Performance
CPU time: Proportional to Clock Period
Q. What ultimately limits an architect's ability to reduce clock period?

Time/Program ∝ Time/One Clock Period
A. Clock-to-Q, setup times, 2-D floorplanning geometry.
Q. How can architects (not technologists) reduce clock period?
A. Shorten the machine's critical path.
Rationale: We measure each instruction’s
execution time in “number of cycles”. By shortening the period for
each cycle, we shorten execution time.
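This rationale is the classic performance equation, Time/Program = (Instructions/Program) × (Cycles/Instruction) × (Time/Cycle). A quick sketch with invented numbers:

```python
# CPU time = instruction count * CPI * clock period.
# Shortening the clock period shortens execution time proportionally,
# as long as instruction count and CPI stay fixed. Invented numbers.

def cpu_time(instructions, cpi, period_ns):
    """Execution time in seconds for a program of `instructions` instructions."""
    return instructions * cpi * period_ns * 1e-9

base    = cpu_time(2_000_000_000, 1.5, 1.0)   # 1.0-ns period
shorter = cpu_time(2_000_000_000, 1.5, 0.8)   # 0.8-ns period

print(f"{base:.1f} s -> {shorter:.1f} s: a 20% shorter period "
      f"gives a 20% shorter run time")
```

The catch, covered in the pipelining lectures, is that architectural tricks that shorten the period can raise CPI, so the two factors must be traded off together.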
59Tuesday, February 11, 14
On Thursday
Pipeline design - with enough detail to do a design.
Have fun in section!
60Tuesday, February 11, 14