Farhan Mohamed Ali Jigar Vora
Sonali Kapoor Avni Jhunjhunwala
1st May, 2006
Final Presentation
MAD MAC 525
Design Manager: Zack Menegakis
Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC) which is revolutionizing graphics
Agenda
• Marketing – Jigar
• Project and Algorithm Description – Farhan
• Implementation Part I – Farhan
• Implementation Part II – Sonali
• Floorplan – Sonali
• Layout – Avni
• Verification – Avni
• Design Specifications – Avni
• Conclusion – Jigar
Marketing – Jigar
Purpose
MAD MAC 525 accelerates FP16 blending to enable true HDR graphics
Huh??
Marketing Description Implementing Floorplan Layout SpecificationsVerify
Beauty of High Dynamic Range
With HDR rendering, pixel intensity can extend beyond the range of traditional graphics
Nature doesn’t have a limited pixel intensity and neither should Computer Graphics
In other words:
• Bright things can be really bright
• Dark things can be really dark
• And the details can be seen in both
Applications of HDR
Target Market
Target Market Segment
• Graphics chip manufacturers
• High speed DSP manufacturers
• CPU co-processors
Potential Customers
Design Comparison
• Top 180nm graphics chip is the NVIDIA NV16
• Highest speed only 250MHz
• 9-bit integer precision
As games become more advanced, they need faster graphics chips
Conclusion:
Market Needs a FAST MAD MAC
Description and Implementation I – Farhan
Project Description
• Multiply Accumulate unit (MAC)
• Executes the function AB+C on 16-bit floating point inputs
• Format: 1-bit sign, 5-bit exponent and 10-bit significand
• Multiply and add proceed in parallel to greatly speed up operation
• Rounding is performed only once, so the result is more accurate than separate multiply and add operations
• Also known as:
  • Fused Multiply Add (FMA)
  • Multiply Add (MAD/MADD) in graphics shader programs
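To make the single-rounding point concrete, here is a minimal Python sketch. It is not the MAD MAC hardware: it leans on Python's `struct` half-precision format to round to FP16, and the helper names (`fp16_fields`, `round_fp16`, `mul_then_add`, `fused_mul_add`) are made up for illustration. The operands are chosen so that every intermediate value is exact in a Python float.

```python
import struct
from fractions import Fraction

def fp16_fields(x):
    """Split a value (rounded to FP16) into sign, 5-bit exponent, 10-bit fraction."""
    bits = struct.unpack("<H", struct.pack("<e", x))[0]
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

def round_fp16(x):
    """Round to the nearest FP16 value (ties to even), returned as a float."""
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]

def mul_then_add(a, b, c):
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return round_fp16(round_fp16(a * b) + c)

def fused_mul_add(a, b, c):
    """AB+C computed exactly, then rounded once.
    Assumption: the exact result fits in a Python float, as it does below."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return round_fp16(exact)

a, b, c = 1.0009765625, 0.9990234375, -1.0   # 1 + 2^-10, 1 - 2^-10, -1
print(fp16_fields(a))          # (0, 15, 1)
print(mul_then_add(a, b, c))   # 0.0
print(fused_mul_add(a, b, c))  # -9.5367431640625e-07
```

With separate operations the product 1 - 2^-20 rounds to 1.0 and the answer collapses to zero; with a single rounding the small difference survives.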
Algorithm
FP Multiply (A*B)
• Multiply significands
• Add exponents
• Normalize
• Round
FP Add (A+B)
• Align smaller number to larger number
• Add significands
• Normalize
• Round
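The multiply steps above can be sketched on raw FP16 bit patterns as follows. This is an illustrative software model, not the project's circuit: it assumes normal (non-zero, non-denormal, finite) inputs and results, uses round-to-nearest-even, and the function names are hypothetical.

```python
import struct

BIAS = 15  # FP16 exponent bias

def to_bits(x):
    """FP16 bit pattern of a value, using Python's half-precision packing."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def fp16_mul(a_bits, b_bits):
    """Multiply two normal FP16 numbers given as 16-bit patterns."""
    sign = ((a_bits ^ b_bits) >> 15) & 1
    ea, eb = (a_bits >> 10) & 0x1F, (b_bits >> 10) & 0x1F
    if not (1 <= ea <= 30 and 1 <= eb <= 30):
        raise ValueError("sketch handles normal numbers only")
    # Multiply significands (11 bits each, with the hidden leading 1).
    prod = (0x400 | (a_bits & 0x3FF)) * (0x400 | (b_bits & 0x3FF))
    # Add exponents (remove one copy of the bias).
    exp = ea + eb - BIAS
    # Normalize: the product lies in [2^20, 2^22), so at most one extra shift.
    shift = 10
    if prod >> 21:
        shift, exp = 11, exp + 1
    sig = prod >> shift
    # Round to nearest, ties to even, using the bits shifted out.
    rem, half = prod & ((1 << shift) - 1), 1 << (shift - 1)
    if rem > half or (rem == half and sig & 1):
        sig += 1
        if sig == 0x800:          # rounding overflowed the significand
            sig, exp = 0x400, exp + 1
    if not 1 <= exp <= 30:
        raise OverflowError("result outside the normal FP16 range")
    return (sign << 15) | (exp << 10) | (sig & 0x3FF)

print(hex(fp16_mul(to_bits(1.5), to_bits(1.5))))  # 0x4080, i.e. 2.25
```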
Algorithm
FP Multiply-Add (AB+C)
• Align sig C based on exp A+B-C
• Multiply significands A and B
• Add sig A*B result to aligned sig C
• Normalize
• Round
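A rough software model of these five steps is sketched below. It is an assumption-laden illustration rather than the MAD MAC datapath: it uses Python's unbounded integers where hardware would use a fixed-width shifter with sticky bits, it handles only normal inputs and results, and all names are hypothetical.

```python
import struct

def to_bits(x):
    """FP16 bit pattern of a value, using Python's half-precision packing."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def unpack_normal(bits):
    """Return (sign, biased exponent, 11-bit significand) of a normal FP16 number."""
    sign, exp, frac = bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF
    if not 1 <= exp <= 30:
        raise ValueError("sketch handles normal numbers only")
    return sign, exp, 0x400 | frac

def fp16_fma(a_bits, b_bits, c_bits):
    sa, ea, ma = unpack_normal(a_bits)
    sb, eb, mb = unpack_normal(b_bits)
    sc, ec, mc = unpack_normal(c_bits)
    # Power-of-two weight of the least significant bit of A*B and of C.
    prod_lsb = (ea - 25) + (eb - 25)
    c_lsb = ec - 25
    # Align: the shift distance depends on exp A + exp B - exp C.
    shift = prod_lsb - c_lsb
    # Multiply significands, then apply the signs.
    prod = -(ma * mb) if sa ^ sb else ma * mb
    addend = -mc if sc else mc
    # Add the product to the aligned C (exact, thanks to big integers).
    if shift >= 0:
        total, lsb = (prod << shift) + addend, c_lsb
    else:
        total, lsb = prod + (addend << -shift), prod_lsb
    if total == 0:
        return 0
    sign, mag = int(total < 0), abs(total)
    # Normalize to an 11-bit significand.
    msb = mag.bit_length() - 1
    exp = msb + lsb + 15
    drop = msb - 10
    if drop > 0:
        sig, rem, half = mag >> drop, mag & ((1 << drop) - 1), 1 << (drop - 1)
        # Round once: nearest, ties to even.
        if rem > half or (rem == half and sig & 1):
            sig += 1
            if sig == 0x800:
                sig, exp = 0x400, exp + 1
    else:
        sig = mag << -drop
    if not 1 <= exp <= 30:
        raise OverflowError("result outside the normal FP16 range")
    return (sign << 15) | (exp << 10) | (sig & 0x3FF)

print(hex(fp16_fma(to_bits(1.5), to_bits(1.5), to_bits(0.25))))  # 0x4100, i.e. 2.5
```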
Block Diagram
[Figure: block diagram with inputs A, B and C; blocks Multiplier, Exp Calc, Align, Adder, Leading 0 Anticipator, Normalize, Round and Ovf Checker; output Y]
Implementation
• Design target: 300MHz
• Speed is the design goal
• Ambitious target?
How we planned to achieve this
• Fast logic: parallelize ops as much as possible
• Pipelining
Implementation Adder
Carry Select vs Carry Lookahead tree
Implementation Adder
• Han-Carlson based carry lookahead adder
• 6 lookahead logic stages for a 32-bit adder
• Less logic and wiring than a Kogge-Stone adder
• Fewer logic stages than a Brent-Kung adder
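The stage count can be checked with a small bit-level model of a Han-Carlson prefix network. This is a sketch of the textbook structure (one odd/even merge stage, a Kogge-Stone tree over the odd bits, one final merge stage), not the team's schematic; the function name and return values are illustrative.

```python
def han_carlson_add(a, b, width=32, carry_in=0):
    """Return (sum, carry_out, prefix_stages) for a width-bit addition."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]  # bit propagate
    g = [((a >> i) & (b >> i)) & 1 for i in range(width)]  # bit generate
    gp = list(zip(g, p))  # (generate, propagate) of the group ending at bit i

    def combine(high, low):
        # Prefix operator: merge a group with the adjacent lower group.
        return (high[0] | (high[1] & low[0]), high[1] & low[1])

    # Stage 1: each odd bit absorbs its even neighbour.
    prev = gp[:]
    for i in range(1, width, 2):
        gp[i] = combine(prev[i], prev[i - 1])
    stages = 1
    # Middle stages: Kogge-Stone style doubling, on the odd bits only.
    dist = 2
    while dist < width:
        prev = gp[:]
        for i in range(1, width, 2):
            if i - dist >= 0:
                gp[i] = combine(prev[i], prev[i - dist])
        stages += 1
        dist *= 2
    # Last stage: each even bit picks up the finished odd bit below it.
    prev = gp[:]
    for i in range(2, width, 2):
        gp[i] = combine(prev[i], prev[i - 1])
    stages += 1
    # Sum logic: XOR each propagate bit with the carry into that position.
    total, carry = 0, carry_in
    for i in range(width):
        total |= (p[i] ^ carry) << i
        carry = gp[i][0] | (gp[i][1] & carry_in)
    return total, carry, stages

print(han_carlson_add(0xFFFFFFFF, 1))  # (0, 1, 6)
```

For 32 bits the model uses 1 + 4 + 1 = 6 prefix stages, matching the figure quoted on the slide.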
Implementation Multiplier
Carry-Save Multiplier
• Avoids having ripple carry in every stage
• Enables regular and compact layout
• Easy to pipeline
• Final 10-bit add stage using carry lookahead adder
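The carry-save idea can be sketched in a few lines: each partial product is folded into separate sum and carry vectors with bitwise full adders, and carries are propagated only once, in a final addition. The 11-bit default width (10-bit significand plus hidden bit) and the function name are assumptions made for this illustration.

```python
def carry_save_multiply(a, b, width=11):
    """Multiply two unsigned width-bit integers using carry-save accumulation."""
    sum_vec, carry_vec = 0, 0
    for i in range(width):
        # Partial product for bit i of b.
        partial = (a << i) if (b >> i) & 1 else 0
        # Bitwise full adder (3:2 compressor): no carry ripples along the row.
        new_sum = sum_vec ^ carry_vec ^ partial
        new_carry = ((sum_vec & carry_vec) | (sum_vec & partial) | (carry_vec & partial)) << 1
        sum_vec, carry_vec = new_sum, new_carry
    # Final stage: a single carry-propagate addition (a lookahead adder in hardware).
    return sum_vec + carry_vec

print(carry_save_multiply(1536, 1536))  # 2359296
```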
Implementation
Leading Zero Anticipator
• Predicts the number of shifts to do in normalize
• Normalize begins with zero delay
• Operates in parallel with the adder, so normalize shifts can be predicted to within 1 shift to the left or right
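The "predict, then correct by one" idea can be illustrated with a deliberately simplified stand-in. The sketch below covers only the addition of two non-negative integers, where the leading 1 of the sum sits either at the leading 1 of `a | b` or one position higher. A real leading zero anticipator also covers subtraction with cancellation, which this sketch does not attempt. All names are hypothetical.

```python
def predicted_msb(a, b):
    """Predict the leading-1 position of a + b from the operands alone.
    The result is either exact or one position too low."""
    return (a | b).bit_length() - 1

def normalize_sum(a, b, target_msb):
    """Shift a + b so its leading 1 lands on bit target_msb.
    Returns (normalized value, left shift applied, whether a correction was needed)."""
    if a < 0 or b < 0 or (a | b) == 0 or (a | b) >> target_msb:
        raise ValueError("operands must be non-negative, not both zero, below 2**target_msb")
    # The shift amount is known without waiting for the adder's carries.
    shift = target_msb - predicted_msb(a, b)
    total = (a + b) << shift
    # One-position correction when the sum carried into the next bit.
    corrected = bool(total >> (target_msb + 1))
    if corrected:
        total >>= 1
        shift -= 1
    return total, shift, corrected

print(normalize_sum(0b0011, 0b0001, 7))  # (128, 5, True)
print(normalize_sum(0b0100, 0b0001, 7))  # (160, 5, False)
```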
Implementation Latches
Pulse Latches
• Practically eliminates setup time
• 16 transistors per pulse generator
• Simplified version of those used in a certain high speed CPU
[Figure: Clock pulse generator]
Implementation II and Floorplan – Sonali
Design Decision: Pass Logic
• Extensive use of pass logic
  • Reduces transistor count
  • Reduces area
• Transistor count reduced from 20,200 to 12,800
  • Examples: Normalize 3400 -> 942, Align 1500 -> 530
• Ensure all pass logic is buffered
Design Decision: Pipelining
• Initially planned 6 pipeline stages
• Reduced to 4 pipeline stages
  • Adder: fast carry lookahead architecture
  • Multiplier: ripple carry to carry lookahead
Pipeline Stages
[Figure: pipeline diagram with Reg A, Reg B, Reg C, Multiplier, ExpCalc, Align C, Adder, Ld Zero, Normalize, Round and Output]
Schematics
Multiplier
[Figure: multiplier schematic showing inputs, pipeline registers and outputs]
Schematic
Adder
[Figure: adder schematic showing inputs, six look-ahead logic stages, sum logic and outputs]
Floorplan Evolution
Initial Floorplan
[Figure: initial floorplan with Reg A, Reg B, Reg C, Multiplier, ExpCalc, Align C, three pipeline registers, Adder, Ld Zero, Normalize, Round, Overflow checker and Reg Y]
Floorplan Evolution
Final Floorplan
[Figure: final floorplan with Reg A, Reg B, Reg C, Exponents, Align, Ld zero, Adder, Multiplier, Normalize, Round, Ovf and Output]
Layout, Verification & Specification – Avni
Layout Decisions
• 3 cell heights: 6.03, 5.04 and 3.55
• Uniform width vdd and ground rails
• Wider vdd and ground rails in power hungry modules
• Max of 8 latches per clock pulse generator
• Uniform metal directionality within each block
Final Layout
Final Layout
MULTIPLIER
Multiplier
Height: 191.6 Width: 206.38 Area: 20,388
[Figure: multiplier layout showing inputs, pipeline registers, outputs and a bit slice]
Final Layout
MULTIPLIER
ADDER
Adder
Height: 122.9 Width: 110.2 Area: 13,202
[Figure: adder layout showing the adder and incrementer]
Final Layout
[Figure: final layout with labelled blocks Exponents, Align, Ld zero, Adder, Multiplier, Normalize, Round and Ovf, plus the two input sides and the output]
Layer Masks
• Active: 14.04%
• Poly: 9.25%
• Metal 1: 34.08%
• Metal 2: 18.00%
• Metal 3: 14.99%
• Metal 4: 6.23%
Verification Of Design
• Behavioral and Structural Verilog
  • Extensive testing (unable to find C or Matlab code)
• Schematic and Layout testing
  • Analog simulations: compare output with behavioral
• Full Chip Verification
Design Specifications
• Critical path delay = 2.25 ns
• Clock speed = 400 MHz
• Pipeline stages = 4
• Height by width = 195.26 um * 303.255 um
• Area = 59,214 um^2
• Aspect ratio = 1:1.55
• Transistor density = 0.22
• Total pin count = 67
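The derived figures in this list can be cross-checked from the raw ones. The short script below is only a sanity check. It takes the total transistor count of 12,820 from the per-module table and treats transistor density as transistors per square micron, which is an assumption that happens to reproduce the quoted 0.22.

```python
HEIGHT_UM, WIDTH_UM = 195.26, 303.255
CRITICAL_PATH_NS = 2.25
CLOCK_MHZ = 400
TRANSISTORS = 12_820  # total from the per-module table

area_um2 = HEIGHT_UM * WIDTH_UM
aspect = WIDTH_UM / HEIGHT_UM
density = TRANSISTORS / area_um2                   # assumed unit: transistors per um^2
max_clock_mhz = 1000 / CRITICAL_PATH_NS            # fastest clock the critical path allows
slack_ns = 1000 / CLOCK_MHZ - CRITICAL_PATH_NS     # margin left in a 400 MHz cycle

print(f"area          = {area_um2:,.0f} um^2")     # 59,214
print(f"aspect ratio  = 1:{aspect:.2f}")           # 1:1.55
print(f"density       = {density:.2f}")            # 0.22
print(f"max clock     = {max_clock_mhz:.0f} MHz")  # 444
print(f"slack @400MHz = {slack_ns:.2f} ns")        # 0.25
```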
Power (mW)
Module                  Schematic   Layout      Schematic   Layout
                        (400 MHz)   (400 MHz)   (100 MHz)   (100 MHz)
Multiplier w/ pipeline  2.281       2.354       0.6168      0.6297
Exponents               0.3514      0.4094      0.0875      0.1029
Align                   0.0782      0.0926      0.0278      0.0324
Adder                   4.471       4.896       1.118       1.232
Leading 0               0.1313      0.1722      0.033       0.0433
Normalize               0.5865      0.6238      0.1481      0.1692
Round                   0.6339      0.6782      0.1593      0.1709
OvfCheck                0.1632      0.1666      0.0408      0.04165
Total                   12.25       13.008      3.065       3.297
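Dynamic power in CMOS logic scales roughly linearly with clock frequency, so the 400 MHz totals should be close to four times the 100 MHz totals. The snippet below checks that against the Total row and converts each figure to energy per clock cycle. The dictionary layout and function names are illustrative.

```python
TOTAL_POWER_MW = {
    "schematic": {400: 12.25, 100: 3.065},
    "layout": {400: 13.008, 100: 3.297},
}

def power_ratio(view):
    """Ratio of total power at 400 MHz to total power at 100 MHz."""
    return TOTAL_POWER_MW[view][400] / TOTAL_POWER_MW[view][100]

def energy_per_cycle_pj(view, freq_mhz):
    """Energy per clock cycle in picojoules (mW / MHz = nJ, times 1000 = pJ)."""
    return TOTAL_POWER_MW[view][freq_mhz] / freq_mhz * 1000

for view in TOTAL_POWER_MW:
    print(view, f"ratio = {power_ratio(view):.2f}",
          f"energy/cycle @400MHz = {energy_per_cycle_pj(view, 400):.1f} pJ")
```

Both ratios come out within a few percent of 4, which is consistent with power that is dominated by switching activity.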
Module                  Area     Transistor  Transistor  Schematic   Layout
                        (um^2)   Count       Density     Delay (ns)  Delay (ns)
Multiplier w/ pipeline  20,388   4,496       0.22        3.38 / 1.9  N/A / 2.25
Exponents               5,163    738         0.14        1.01        1.2
Align                   3,995    500         0.13        0.480       0.637
Adder                   13,202   3,174       0.24        1.34        1.7
Leading 0               1,253    364         0.29        0.506       0.551
Normalize               3,190    942         0.3         0.407       0.437
Round                   1,802    494         0.28        0.864       0.986
OvfCheck                200      70          0.35        0.453       0.475
Registers, etc          N/A      2,038       N/A         0.179       0.193
Total                   59,214   12,820      0.22        -           -
Conclusion – Jigar
Graphics – HDR Rendering, Blending and Shader ops
• Fastest 180nm GPU: 250 MHz (9-bit Int)
• MAD MAC 525: 400 MHz (16-bit FP)
Everyone Needs a MAD MAC
DSPs – Computing Vector Dot-Products in Digital Filters
Everyone Needs a MAD MAC
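As an illustration of the DSP use case, a dot product is just a chain of multiply-accumulate operations, and an FIR filter is a dot product over a sliding window. The sketch below uses ordinary Python floats rather than FP16, and `mac`, `dot` and `fir_filter` are names invented for the example.

```python
def mac(a, b, acc):
    """One multiply-accumulate step: a*b + acc (the operation a MAC unit performs)."""
    return a * b + acc

def dot(xs, ws):
    """Dot product as a chain of MAC operations."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = mac(x, w, acc)
    return acc

def fir_filter(samples, taps):
    """FIR filter: y[n] = sum over k of taps[k] * samples[n - k], for full windows only."""
    k = len(taps)
    reversed_taps = taps[::-1]
    return [dot(samples[n - k + 1 : n + 1], reversed_taps)
            for n in range(k - 1, len(samples))]

print(dot([1, 2, 3], [4, 5, 6]))             # 32.0
print(fir_filter([1, 2, 3, 4], [0.5, 0.5]))  # [1.5, 2.5, 3.5]
```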
Enables Fast Division, Square Root
• Eliminates extra Hardware to handle such computation
• Available in many new CPUs such as STI’s Cell
Everyone Needs a MAD MAC
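One common way a multiply-add unit replaces a dedicated divider is Newton-Raphson iteration on the reciprocal, where each step is two multiply-adds. The sketch below shows the idea in Python with ordinary floats. The `fma` helper is a stand-in that is not actually fused, and the function names and iteration count are assumptions for illustration.

```python
import math

def fma(a, b, c):
    """Stand-in for a hardware fused multiply-add (not truly fused here)."""
    return a * b + c

def reciprocal(d, iterations=4):
    """Approximate 1/d using only multiply-add steps."""
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(abs(d))            # abs(d) = m * 2**e with 0.5 <= m < 1
    x = 48.0 / 17.0 - 32.0 / 17.0 * m    # linear first guess for 1/m
    for _ in range(iterations):
        err = fma(-m, x, 1.0)            # err = 1 - m*x
        x = fma(x, err, x)               # x = x + x*err, error roughly squares
    return math.copysign(math.ldexp(x, -e), d)

def divide(n, d):
    """Division as a multiplication by the reciprocal."""
    return n * reciprocal(d)

print(divide(1.0, 3.0))
print(reciprocal(-8.0))
```

Square root can be handled with a similar iteration on the reciprocal square root.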
Future Enhancements
16 to 32 Bits
Newer process technology
Possible modifications for low power apps
Everyone Wants A MAD MAC 525