An Integrated Quad-Core AMD Opteron™ Microprocessor
S. Searles1, J. Dorsey1, M. Ciraula2, S. Johnson1, N. Bujanos1, D. Wu1, M. Braganza1, S. Meyers1,
E. Fang3, R. Kumar3
1AMD, Austin, TX; 2AMD, Fort Collins, CO; 3AMD, Sunnyvale, CA
Outline
Technology Highlights
Major Features
Clock/Voltage Domains and PLLs
Cache Design Choices
DDR DRAM I/O
HyperTransport™ (HT) I/O
Thermal Monitoring
Mixed Signal Design Study (DDR)
Technology
65nm, SOI CMOS, 0.8-1.4V
Dual strain liners + eSiGe
Floating and Body tied
11 Cu layers + low-k
1X, 1.3X, 2X, 4X thick Cu
FO4 inv delay, 15ps

Layer             Pitch
M11               800nm
M10               800nm
M9                400nm
M8                400nm
M7                400nm
M6                400nm
M5                270nm
M4                270nm
M3                270nm
M2                200nm
M1                200nm
Contacted Poly    270nm
Cross Sections
[Figure: NMOS and PMOS transistor cross sections]
Notable Attributes
• Monolithic, Quad x86 cores + Northbridge
• Symmetric cores, same DRAM/HT latency
• Enhanced 128b FPU
• Shared 32-way, 2MB L3
• Each core has an exclusive 16-way, 0.5MB L2
• Cores have 2-way, 64kB L1 I and D Caches
• 2 DDR channels, 4 DIMM/Channel, DDR2+3
• Enhanced HT links for 2 to 8-way MP
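The cache sizes and associativities listed above fix the number of sets once a line size is chosen. The sketch below works that arithmetic through. The 64-byte line size is an assumption (the slides do not state it), and `num_sets` is a hypothetical helper name.

```python
# Assumption: 64-byte cache lines. The slides give size and ways only.
LINE_BYTES = 64


def num_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Return the number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder:
        raise ValueError("size is not a whole number of sets")
    return sets


# (size in bytes, ways) as listed on the slide.
CACHES = {
    "L1": (64 * 1024, 2),
    "L2": (512 * 1024, 16),
    "L3": (2 * 1024 * 1024, 32),
}

SETS = {name: num_sets(size, ways) for name, (size, ways) in CACHES.items()}
```

Under the 64-byte assumption, L1 and L2 each come out at 512 sets and the shared L3 at 1024 sets.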
Die Overview
[Figure: die photo. Labeled blocks: four cores (the labeled core shows PLL, Thermal, 128-bit FPU, L1 Data Cache, L1 Instr Cache, Load/Store, Fetch/Decode/Branch, Execution, L2 Ctl, and 512kB L2 Cache; the others are marked Core 2, Core 3, Core 4), 2MB Shared L3 Cache, Northbridge, DDR PHY, HT PHY links 1 to 4, Fuses (two), Slow I/O (two)]
• 463 Million Transistors
• 283mm²
• 2.0GHz & up
• 0.8 to 1.4V
Core/L2 Overview
[Figure: floorplan of one core and its L2. Labels: Core, L2]
Voltage/Clock Domains
– VDDCORE: 0.8V-1.4V
  • Core and L2: 2.0GHz and up
– VDDNB: 0.8V-1.4V
  • Northbridge and L3: 75% of core
– VLDT: 1.2V
  • HyperTransport links
– VDDIO: 1.8V (VTT: 0.9V)
  • DDR I/O
– VDDA: 2.5V
  • PLLs (10 across the die) + Thermal
Multiple supplies for power optimization and isolation
Clock Domains
Regional clock domains: lower power, tighter skew (12ps)
Per core, independent PLL running at 2+GHz
Independent Northbridge (NB) PLL running at ¾*core
L3 uses the NB clock but at arbitrary phase
DDR and HT interfaces have independent PLLs
Source Synchronous clocked Buses connect I/O to NB.
NB-2-core & core-2-NB communication: 16 entry FIFO.
1. Asynchronous: Max performance: ~2 entry pointer separation.
2. Synchronous: Provides deterministic operation for ATE.
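As a rough behavioral illustration of the 16-entry crossing FIFO, the sketch below models a writer and a reader that share a circular buffer, with the reader held off until the writer is about 2 entries ahead. This is only a software model under stated assumptions: the class name `CrossingFifo`, the use of free-running counters, and the reading of "pointer separation" as a start-up gap are all hypothetical and not taken from the slides.

```python
class CrossingFifo:
    """Behavioral model of a clock-domain-crossing FIFO (not hardware)."""

    def __init__(self, depth=16, separation=2):
        self.depth = depth
        self.separation = separation  # assumed start-up gap between pointers
        self.slots = [None] * depth
        self.writes = 0               # free-running write count
        self.reads = 0                # free-running read count
        self.reader_running = False

    def occupancy(self):
        return self.writes - self.reads

    def write(self, item):
        if self.occupancy() >= self.depth:
            raise OverflowError("FIFO full")
        self.slots[self.writes % self.depth] = item
        self.writes += 1

    def read(self):
        """Return the next item, or None if the reader must wait."""
        if not self.reader_running:
            # Hold the reader until the writer leads by `separation` entries.
            if self.occupancy() < self.separation:
                return None
            self.reader_running = True
        if self.occupancy() == 0:
            return None  # nothing to read yet
        item = self.slots[self.reads % self.depth]
        self.reads += 1
        return item
```

A smaller separation lowers latency, while a larger one leaves more room for the two clocks to wander relative to each other. The slide's figure of about 2 entries is the only value taken from the source.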
Domain Crossing
• Domains have different PLLs (wander) and voltages
• Wide operating range: 0.7 to 1.5V on either side
• Improved latency by embedding into a dynamic FIFO
• Precharge to VDD_B, evaluate with VDD_A
Power OK Distribution
• PowerOK asserted: valid voltage is present on all domains
• PowerOK signal originates in the VDDIO domain of the I/Os
• Propagates around the chip in VDDNB
• PowerOK passes from NB to all domains
• Power sniffer cell is designed with hysteresis to prevent spurious switching on power up.
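To illustrate why hysteresis suppresses spurious switching during a noisy supply ramp, here is a minimal behavioral sketch of a two-threshold detector. The class name `PowerSniffer` and both threshold values are hypothetical and chosen only for illustration. The slides do not give the actual trip points or circuit.

```python
class PowerSniffer:
    """Two-threshold (hysteresis) supply detector, behavioral model only."""

    def __init__(self, rise_threshold=0.7, fall_threshold=0.5):
        # Assumed thresholds in volts; not taken from the slides.
        if fall_threshold >= rise_threshold:
            raise ValueError("fall threshold must be below rise threshold")
        self.rise_threshold = rise_threshold
        self.fall_threshold = fall_threshold
        self.power_ok = False

    def sample(self, volts):
        """Update and return the PowerOK state for one supply sample."""
        if not self.power_ok and volts >= self.rise_threshold:
            self.power_ok = True
        elif self.power_ok and volts <= self.fall_threshold:
            self.power_ok = False
        return self.power_ok


def trace(samples):
    """Run a fresh detector over a list of supply samples."""
    sniffer = PowerSniffer()
    return [sniffer.sample(v) for v in samples]
```

With these assumed thresholds, a ramp that crosses 0.7V and then dips to 0.65V keeps PowerOK asserted. A single-threshold comparator at 0.7V would toggle on the same dip.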
Noise Filtered Ref Clock
• All PLLs run off of one 200MHz Reference Clock
– Distributed by a binary tree of specially filtered repeaters
– Low-pass power-supply filter, 2MHz pole
– Reduces feed-through jitter at PLL output: higher Fmax
– 500ps skew target at 0.8V: deterministic test
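For a sense of scale, the attenuation of a supply-noise filter with a 2MHz pole can be estimated with the standard first-order low-pass magnitude, |H(f)| = 1/sqrt(1 + (f/f_p)^2). Treating the repeater supply filter as a single-pole RC is an assumption; the slides state only the pole frequency.

```python
import math


def lowpass_gain_db(freq_hz, pole_hz=2e6):
    """Magnitude of an ideal single-pole low-pass filter, in dB."""
    return -10.0 * math.log10(1.0 + (freq_hz / pole_hz) ** 2)


# Supply-noise frequencies of interest (chosen for illustration).
GAINS_DB = {f: lowpass_gain_db(f) for f in (2e6, 20e6, 200e6)}
```

Under this model, noise at the pole frequency is attenuated by about 3 dB, and noise near the 200MHz reference frequency by about 40 dB.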
PLL Features
• Accepts 100 or 200 MHz Ref Clock in
• Feedback divider from 1 to 64 (all codes)
• Spine divider (1,2,3,4,6,128,512) post VCO
• Deterministic Spine divider changes (P-state)
• Duty cycle correction on output to spine
• Clock shrink/stretch for speedpath debug
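The divider options above suggest how spine frequencies might be composed. The sketch below assumes a plain integer-N relationship, f_spine = f_ref × N / D, with no reference pre-divider; that relationship is an assumption, as are the helper names. Only the divider values come from the slide.

```python
SPINE_DIVIDERS = (1, 2, 3, 4, 6, 128, 512)  # from the slide
FEEDBACK_RANGE = range(1, 65)               # feedback divider 1 to 64


def spine_freq_mhz(ref_mhz, feedback_div, spine_div):
    """Spine clock under the assumed model f = ref * N / D."""
    if feedback_div not in FEEDBACK_RANGE:
        raise ValueError("feedback divider out of range")
    if spine_div not in SPINE_DIVIDERS:
        raise ValueError("unsupported spine divider")
    return ref_mhz * feedback_div / spine_div


def find_settings(target_mhz, ref_mhz):
    """List (feedback, spine) pairs that hit target_mhz exactly."""
    return [
        (n, d)
        for d in SPINE_DIVIDERS
        for n in FEEDBACK_RANGE
        if ref_mhz * n == target_mhz * d
    ]
```

For example, with a 200MHz reference this model reaches 2000MHz with a feedback divider of 10 and a spine divider of 1, and reaches a 75% Northbridge clock of 1500MHz with pairs such as (15, 2). Which pairs the real VCO range permits is not stated in the slides.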
PLL VCO
Positive feedback delay control
Center frequency is R dominated
Constant swing output
[Figure: VCO delay-stage schematic. Labels: Ip/In, Qp/Qn, FILTp/FILTn, OUTp/OUTn, INp/INn, I, Rload (two)]
PLL Closed Loop Response
[Figure: PLL closed-loop response, gain Hmag(f) in dB (-60 to 12) and phase Hphase(f) (-180 to 0) versus frequency, 10kHz to 1GHz]
L1 Cache Bit Slice
• Single Ended
• Sink and Source for Writes
• Local and Global Bit lines
[Figure: L1 bit-slice schematic. Labels: Domino Pull Down, Precharge Left, Precharge Right, Data, DataX, BL, BLX, 166T Cells (two groups), 1.06um²]
L1 Timing: Read Modified Write
• Single Ended for stability
• Precharge is self timed off Write
• Fuse control
• Allows ECC on partial writes
[Figure: timing diagram with Read, Precharge, and Write phases. Signals: Clock, Word Line, Precharge, Data, DataX, BL, BLX]
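The reason a read-modify-write sequence allows ECC on partial writes is that check bits cover the whole protected word, so the untouched bytes must be read before new check bits can be computed. The sketch below illustrates the sequence with a single parity bit standing in for a real ECC code. The 64-bit word width, the parity stand-in, and the function names are all assumptions, not details from the slides.

```python
WORD_BITS = 64  # assumed protected-word width
WORD_MASK = (1 << WORD_BITS) - 1


def parity(word):
    """Single parity bit, used here only as a stand-in for real ECC."""
    return bin(word & WORD_MASK).count("1") & 1


def partial_write(stored_word, stored_check, byte_index, new_byte):
    """Read-modify-write of one byte, returning (new_word, new_check)."""
    # 1. Read: verify the stored word against its check bit.
    if parity(stored_word) != stored_check:
        raise ValueError("stored word fails its check")
    # 2. Modify: merge the new byte into the old word.
    shift = 8 * byte_index
    merged = (stored_word & ~(0xFF << shift)) | ((new_byte & 0xFF) << shift)
    merged &= WORD_MASK
    # 3. Write: store the merged word with freshly computed check bits.
    return merged, parity(merged)
```

A write that replaced only the target byte without the read step would leave the stored check bits stale for the rest of the word.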
L2 & L3 Cache Bit Slice
• Like L1, Single Ended for stability
• 3 Level Bit Line
[Figure: L2/L3 bit-slice schematic. Labels: PCH<1:0>, COLSEL<7:0>, PCHL<1:0>, WR, WRX, SuperBL, <3:0> buses, two RAM arrays of 32 rows x 4 columns, 0.81um²]
L3 Timing
• Flexible tiling methodology: liquid L3
• Asynchronous timing for flexible layout
[Figure: L3 read timing over CLK cycles 0 to 5. Signals: CLK, RdEn, RdIndex, HitWay, RdData, CaptureClk. Interval labels: 1.5 Cycles, 1.5 Cycles, 1 Cycle]
L2 & L3 Read Disturb Stability
• Product Reliability requires > 5 sigma margin
• Single-ended vs. small-swing: better margin
[Figure: L2/L3 cell read-disturb margin in sigma (0 to 7) versus supply voltage (0.7V to 1.3V)]
L2 & L3 Small Swing Study
• Study small swing (sense-amplifier-based)
• Increased ΔVT & ΔL variation + SOI Hysteresis
[Figure labels: No SOI Memory, With SOI Memory]
no flood (no flip) 7.9σ
flood (no flip) 7.2σ
no flood (flip) 5.48σ
flood (flip) 4.53σ
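To put the sigma figures in perspective, a margin of n sigma can be converted to a per-cell failure probability and then scaled by the array size. The sketch below assumes a one-sided Gaussian tail, one independent cell per bit, and no ECC or redundancy. None of these assumptions is stated in the slides, and the helper names are hypothetical.

```python
import math

L3_BITS = 2 * 1024 * 1024 * 8  # data bits in a 2MB array, ignoring tags/ECC


def tail_probability(sigma):
    """One-sided Gaussian tail probability beyond `sigma` deviations."""
    return 0.5 * math.erfc(sigma / math.sqrt(2.0))


def expected_failing_cells(sigma, n_cells=L3_BITS):
    """Expected number of failing cells under the independence assumption."""
    return n_cells * tail_probability(sigma)


# Sigma values quoted on the slide.
STUDY = {s: expected_failing_cells(s) for s in (7.9, 7.2, 5.48, 4.53)}
```

Under these assumptions a margin of exactly 5 sigma corresponds to a per-cell tail probability of roughly 2.9e-7, or about 5 expected failing cells in 2MB, which is consistent with the slides treating 5 sigma as a floor and 4.53 sigma as inadequate.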
DDR DRAM I/O
Supports DDR2 & 3, 1.4 to 1.9V, 400 to 1600Mb/s
2 independent (R/W,W/R,R/R,W/W), 64b channels (+ECC)
4 DIMM/chan (U,R,SO), per-DIMM trained delay settings.
44 DLL/chan with dynamic delay update (per burst).
[Figure: DDR PHY block diagram. Labels: Rptr, Wptr, DQ DLL, Pipeline Control, DQ[7:0], DQS/DM[3:0], DLL x2, MemVref, PLL]
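The data-rate range above translates directly into a unit interval (UI) and a peak channel bandwidth. The sketch below works through that arithmetic for a 64-bit channel, excluding the ECC bits. The function names are hypothetical, and peak bandwidth here ignores refresh, turnaround, and other protocol overhead.

```python
def unit_interval_ps(rate_mbps):
    """Unit interval in picoseconds for a per-pin data rate in Mb/s."""
    return 1e6 / rate_mbps


def peak_bandwidth_gbytes(rate_mbps, data_bits=64, channels=1):
    """Peak transfer rate in GB/s, counting data bits only (no ECC)."""
    return rate_mbps * data_bits / 8 / 1000 * channels


# Both ends of the supported range plus DDR 800.
RATES = {r: (unit_interval_ps(r), peak_bandwidth_gbytes(r)) for r in (400, 800, 1600)}
```

At 800Mb/s the unit interval is 1250ps and one channel peaks at 6.4GB/s; at 1600Mb/s the unit interval halves to 625ps.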
DDR DRAM Read Eye
• DDR 800 (1250ps UI)
• Real Motherboard
• Under the Socket
• +/– 300mV margin
[Figure: read eye diagram, vertical axis -600 to 600, horizontal axis 0 to 100% of the unit interval]
DDR DRAM Write Eye
[Figure: two write eye diagrams]
DDR 800, no Channel, ATE Trigger on gold clock
DDR 800, at the DIMM, real traffic, Trigger on 0V Strobe crossing
HyperTransport™ I/O
• Backward Compatible (legacy) to HT1
– 400, 800, 1200, 1600 and 2000 MT/s
• Supports all HT3 speeds
– 2.4, 2.8, 3.2, 4.0, 4.4, 4.8 and 5.2 GT/s
• Supports various power saving modes
• Tolerates large HT3 common mode variation
• Linear TX equalization (de-emphasis)
• Non-linear RX equalization (DFE)
HT Block Diagram
• Clock forwarded
• Simple DLL CDR
• SSC tolerant
HT Driver with Linear Eq
• Primary Voltage mode Driver
• Eq done with current sources
• Lower power when > 3dB of boost
[Figure: driver schematic. Labels: Data (two), De-emph enable (two), Out, OutX]
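Linear TX equalization by de-emphasis can be modeled as a 2-tap FIR filter: bits that follow a transition are sent at full swing, and repeated bits are sent at reduced swing. The sketch below is a generic textbook model, not the circuit on the slide. The function name, the ±1 symbol convention, and the handling of the first bit are assumptions.

```python
def de_emphasize(symbols, deemph_db):
    """Apply 2-tap de-emphasis to a list of +1/-1 symbols.

    Transition bits keep amplitude 1; repeated bits are scaled by
    k = 10 ** (-deemph_db / 20).
    """
    k = 10.0 ** (-deemph_db / 20.0)
    c0 = (1.0 + k) / 2.0  # main tap
    c1 = (1.0 - k) / 2.0  # post-cursor tap (subtracted)
    out = []
    # Assumption: treat the first bit as if it follows a transition.
    prev = -symbols[0] if symbols else 0.0
    for s in symbols:
        out.append(c0 * s - c1 * prev)
        prev = s
    return out
```

With 6 dB of de-emphasis, a repeated bit is driven at roughly half the amplitude of a transition bit. The low-frequency content is attenuated relative to the high-frequency content, which is what the slide refers to as boost.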
HT RX with DFE
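As a reminder of what a decision feedback equalizer does, the sketch below models a 1-tap DFE: the previous decision, scaled by a tap weight, is subtracted from the incoming sample before slicing. The channel model (a single post-cursor of 0.6), the tap count, and all names are assumptions chosen for illustration; the slides do not describe the receiver at this level.

```python
def channel(symbols, post_cursor):
    """Toy channel: each sample is the symbol plus trailing ISI."""
    out = []
    prev = 0.0
    for s in symbols:
        out.append(s + post_cursor * prev)
        prev = s
    return out


def dfe_receive(samples, tap):
    """1-tap DFE. Returns (decisions, slicer-input magnitudes)."""
    decisions = []
    margins = []
    prev_decision = 0.0
    for y in samples:
        corrected = y - tap * prev_decision  # cancel post-cursor ISI
        d = 1.0 if corrected >= 0 else -1.0
        decisions.append(d)
        margins.append(abs(corrected))
        prev_decision = d
    return decisions, margins
```

With the assumed post-cursor of 0.6, a plain slicer (tap set to 0) still decides correctly but its worst-case input shrinks to about 0.4, while a matched tap of 0.6 restores it to about 1.0. The feedback is non-linear because it uses sliced decisions rather than the noisy samples.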
Thermal: Chip Wide Block Diagram
[Figure: chip-wide thermal monitoring. Labels: many Remote Sensor blocks, TCEN (five instances), TCON, CORE 0, CORE 1, CORE 2, CORE 3, NorthBridge]
Thermal Sensor Locations
[Figure: die map of sensor locations. Labels: External Diode, Therm Center (two), Remote Diode Sensor]
Thermal: Block Diagram
[Figure: thermal sensor block diagram. Labels: sensor inputs 0, 1, … 7; Reference I & V; bias currents I and I*9; ΔVbe & Scaler; ΣΔ A/D; CMP/Latch; Control Logic; Temp[8:0]; Local Osc (20MHz); 2.5V Regulators; Regulated Supply for Digital Circuits; Regulated Supply for Analog Circuits; Level Shifters]
198μV = 1°C
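The I and I*9 bias labels suggest the usual two-current diode measurement, where ΔVbe = n·(kT/q)·ln(N) for a current ratio N. The sketch below evaluates that textbook formula. Applying it to this sensor is an assumption, as are the ideality factor of 1 and the helper names.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C


def delta_vbe(temp_kelvin, current_ratio=9.0, ideality=1.0):
    """Difference in Vbe between two bias currents, in volts."""
    thermal_voltage = K_BOLTZMANN * temp_kelvin / Q_ELECTRON
    return ideality * thermal_voltage * math.log(current_ratio)


def slope_uv_per_kelvin(current_ratio=9.0, ideality=1.0):
    """Temperature sensitivity of delta-Vbe in microvolts per kelvin."""
    return delta_vbe(1.0, current_ratio, ideality) * 1e6
```

For an ideal 9:1 ratio this gives about 189μV per kelvin, slightly below the 198μV per °C quoted on the slide. The difference might come from the scaler stage or from a non-unity ideality factor, but the slides do not say.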
Design Study (DS): DDR Phy
• Advice for Mixed-Signal IC designers
• Technology does not care about you
• When you overload functions, beware
• Always have software hooks (use all modes)
• Poly (or other) fuses are your friends
• Put Defaults in metal (one layer to change)
• Use servo loops to set bias conditions
• Use DACs for the servo loop reference
DS DDR: Register Space
• Use scalar bits if you have fewer than 50 bits
• Otherwise, use a simple bus and register
• Physical locality to where the bits are used
• Extensible: Easy to add bits and functions
• Add in write masks and broadcasts
• It will save your bacon
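DS DDR: Registers
[Figure: register slice schematic. Labels: D Q ck, Decode, RdDataOut, WrData, RdStrobe, WrStrobe, Address, RdDataIn, NibbleMask, Register; bus widths of 16 and 4 are marked]
[Figure: bus timing. Address/Mask sequence: Write1, Write2, Read1, Read2, Write3, Null, Read3. Signals: RdStrobe, WrStrobe, RdData (RdDat1, RdDat2, RdDat3), WrData (WrDat1, WrDat2, WrDat3)]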
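A simple register bus with write masks and broadcast, as recommended in the register-space advice, might behave like the sketch below. It models 16-bit registers with a 4-bit nibble mask, matching the widths marked in the figure. The class and method names and the broadcast semantics (write every register at once) are hypothetical.

```python
class RegisterBus:
    """Behavioral model of a small addressed register bank."""

    def __init__(self, num_registers):
        self.regs = [0] * num_registers

    @staticmethod
    def _merge(old, data, nibble_mask):
        """Replace only the nibbles selected by the 4-bit mask."""
        bit_mask = 0
        for nib in range(4):
            if (nibble_mask >> nib) & 1:
                bit_mask |= 0xF << (4 * nib)
        return ((old & ~bit_mask) | (data & bit_mask)) & 0xFFFF

    def write(self, address, data, nibble_mask=0xF):
        self.regs[address] = self._merge(self.regs[address], data, nibble_mask)

    def broadcast(self, data, nibble_mask=0xF):
        """Write the same masked data to every register."""
        for address in range(len(self.regs)):
            self.write(address, data, nibble_mask)

    def read(self, address):
        return self.regs[address]
```

The mask lets software change one field without a read-modify-write round trip, and broadcast lets one write update the same field in every copy of a repeated block.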
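DS DDR: Servo Loop
• Use feedback to deal with PVT (servo)
• Use DACs to set the servo reference
[Figure: servo loop schematic. Labels: Rref, amplifier, 4-bit IDAC, VTT, VDDIO, Iref, Rload (two), Register, Itail]
[Figure: IDAC schematic with 15·W devices and binary-weighted 8·W, 4·W, 2·W, 1·W legs switched by C[3], C[2], C[1], C[0]]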
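The two ideas on this slide, a DAC-set reference and a feedback loop that tracks it, can be sketched numerically. The model below assumes the 4-bit IDAC mirrors Iref with a ratio of code/15 (suggested by the 15·W and 8/4/2/1·W device sizes) and that the loop integrates the error until Itail × Rload equals the target voltage. Both the transfer function and the loop dynamics are assumptions, and all component values are placeholders.

```python
def idac_current(code, iref):
    """Output of a binary-weighted 4-bit current DAC, assumed iref*code/15."""
    if not 0 <= code <= 15:
        raise ValueError("code must fit in 4 bits")
    weights = (1, 2, 4, 8)  # legs switched by C[0]..C[3]
    units = sum(w for bit, w in enumerate(weights) if (code >> bit) & 1)
    return iref * units / 15.0


def servo_tail_current(v_target, rload, gain=0.5, steps=50):
    """Integrating loop: adjust Itail until Itail * rload == v_target."""
    itail = 0.0
    for _ in range(steps):
        error = v_target - itail * rload
        itail += gain * error / rload
    return itail


def settle(code, iref, rref, rload):
    """Reference voltage from the IDAC, then the tail current that meets it."""
    v_target = idac_current(code, iref) * rref
    return servo_tail_current(v_target, rload)
```

Because the loop drives the error to zero, the settled swing depends on the DAC code and the reference resistor rather than on the load resistor's process, voltage, and temperature drift, which is the point of the slide's advice.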
Summary
• Monolithic, Quad x86 cores
• 65nm SOI CMOS, 11 Cu layers
• Integrated Northbridge
• Enhanced 128b FPU
• 2MB shared L3
• 2 independent DDR channels, DDR2 & 3
• Enhanced HT links for 2 to 8-way MP
• Comprehensive thermal monitoring