25
1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara, CA) ISSCC 2006 Instructor: Dr. S. M. Fakhraie Provided by: Nayere Ghobadi Fall 2006 Advanced VLSI Class Presentation

1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

Embed Size (px)

Citation preview

Page 1: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

1

A Dual-Core Multi-Threaded Xeon® Processor

with 16MB L3 CacheStefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan

Chang (Intel, Santa Clara, CA)ISSCC 2006

Instructor:

Dr. S. M. Fakhraie

Provided by: Nayere Ghobadi

Fall 2006

Advanced VLSI Class Presentation

Page 2: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

2

Outline

Multi-core processors Cache Xeon processors Dual-Core Multi-Threaded Xeon Processor

Features 16MB L3 cache Clock Generation and Distribution Voltage supplies Processor Package Front-side bus (FSB) Protection Temperature sensing

Summary and conclusion

Page 3: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

3

Multi-core Processors

Is one that combines two or more independent processors into a single package, often a single IC.

Exhibit some form of thread-level parallelism (TLP).

Diagram of an Intel Core 2 dual core processor (from[6])

Page 4: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

4

Multi-core Processors Cont.

Advantages:

1. Signals don’t have to travel off-chip, so cache coherency circuitry can operate at a much higher clock rate.

2. Require much less space than multi-chip designs.

3. Slightly less power than two coupled single-core processors.

Page 5: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

5

Multi-core Processors Cont.

Disadvantages:

1. In addition to OS support, adjustments to existing software are required to maximize utilization of the computing resources provided by multi-core processors.

2. Drive production yields down and they are more difficult to manage thermally.

Page 6: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

6

Cache

A temporary storage area where frequently accessed data can be stored for rapid access.

If the processor finds the desired memory location in the cache, This situation is known as a cache hit, otherwise it is cache miss. The proportion of accesses that result in a cache hit is known as the hit rate.

Diagram of a CPU memory cache (from[6])

Page 7: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

7

Cache Cont.

Multi-level caches:

There is a tradeoff between cache latency and hit rate.

Larger caches have better hit rates but longer latency.

So many computers use multiple levels of cache, with small fast caches backed up by larger slower caches.

Page 8: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

8

Xeon Processors

The Xeon is intel's brand name for its server-class PC microprocessor intended for multiple-processor machines.

Generally have more cache and support larger multiprocessor configurations than their desktop counterparts.

Xeon processor and logo (from[6])

Page 9: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

9

Dual-Core Multi-Threaded Xeon ProcessorFeatures

Two 64b cores. Each core has:

1. Two threads2. A unified 1MB L2 cache

16MB unified L3 cache A simple direct interface

between core and front-side bus (FSB) for minimizing:

1. L3 cache latency.2. External bus latency.

Block diagram (from[1])

Page 10: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

10

Dual-Core Multi-Threaded Xeon ProcessorFeatures Cont.

Caching FSB controller for handling:

1. Core arbitration.2. L3 cache accesses.3. External bus requests.

The processor die is 435mm2 with 1.328B transistors.

Operates at more than 3.0GHz from a 1.25V core supply.

Die micrograph (from[1])

Page 11: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

11

Dual-Core Multi-Threaded Xeon ProcessorFeatures Cont.

The worst-case power dissipation is 165W (power dissipation on a typical server workload is 110W).

65nm process Technology.

Eight copper interconnect layers.

Low-k carbon-doped oxide (k=2.9) inter-level dielectric.

65nm process technology summary

(from[1])

Page 12: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

12

16MB L3 Cache

6T SRAM Cell Read:

Precharge both bitlines high Raise wordline One of the two bitlines will be

pulled down by the cell Write:

Drive one bitline high, the other low

Raise wordline Bitlines overpower cell with

new value

bit bit_b

word

6T memory cell (from[5])

Page 13: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

13

16MB L3 Cache Cont.

256 data sub-arrays (64kB each). Each data sub-array stores 32 bits.

32 redundancy sub-arrays (68kB each). Each redundancy sub-array store 34 bits.

Is Composed of 6T memory-cells with the size of 0.624µm2.

Physical address is 40b wide. Only 0.8% of all array blocks are

powered up for each cache access for reducing active power.

L3 cache block (from[1])

Page 14: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

14

16MB L3 Cache Cont.

Sleep circuit Active mode:

Virtual Vss =Vss Full voltage swing.

Sleep mode: Virtual Vss = 250mV. Reducing the leakage by 2X.

Shut-off mode: NMOS shut-off device is turned off. Virtual Vss = Vcc/2. Reducing the leakage by 4X.

L3 cache sleep circuit and shut-off mode (from[1])

Page 15: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

15

Clock Generation and Distribution

The critical clocking features of this processor are: 1. multiple clock domains with different frequencies.2. dedicated core and uncore voltage domains.

Separate PLLs and clock distribution trees for each core and the associated L2 cache.

A third PLL for the uncore half-frequency clock. De-skew circuits controlled by on-die fuses reduce the

uncore clock skew to less than 11ps.

Page 16: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

16

Clock Generation and Distribution Cont.

System clock (BCLK) = 200MHz.

Cores clock (MCLK) = BCLK×N. MCLK can be more than 3.0GHz at a 1.25V core supply voltage (Vcore).

Uncore clock (SCLK) = 1/2MCLK. Using a separate uncore voltage supply (Vcache).

FSB clock (ZCLK) = BCLK×4 (quad pumping)Clock distribution map (from[2])

Page 17: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

17

Voltage Supplies

Three voltage supplies are used for:1. Two cores.

2. L3 cache together with the associated control logic.

3. The FSB I/O circuits.

Level shifters are used between voltage domains. A custom tool checks for presence and correct

connectivity of level shifters on all signals that cross voltage domain boundaries.

Page 18: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

18

Voltage Supplies Cont.

Voltage domains and power breakdown (from[1])

Page 19: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

19

Processor Package

The processor is flip-chip or Controlled Collapse Chip Connection (C4).

The processor die has 13164 C4 solder bumps. Is attached to a 12-layer (4-4-4) organic package with

an integrated heat spreader. The package has 604 pins. 238 pins are signal pins

and the rest are power and ground. The chip-level power distribution consists of a uniform

M8-M7 grid synchronized with the C4 power and ground bump array.

Page 20: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

20

Front-Side Bus (FSB)

Operates at 800MT/s. A symmetric pre-driver design for controlling the edge

rate to meet timing and signal integrity requirements:1. Dividing the FSB output (VOL to VOH) into six voltage

levels.

2. Each driven by an output driver segment with different RON value.

3. When a segment is enabled, it forms a parallel resistance to the previously enabled segments.

4. A new voltage level is generated, thus creating a stair-case-like waveform in every transition.

Page 21: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

21

Front-Side Bus (FSB) Cont.

Symmetric I/O pre-driver circuit (form[1])

Page 22: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

22

Protection

Using bit interleaving for adjacent cache lines To prevent multiple bit errors caused by a single upset event in the same cache line.

L3 data and tag arrays and L2 data array have Error-correction code (ECC) protection

L2 tag has parity checking. A dynamic 32-entries cache line disable mechanism

protects the L3 cache from erratic bits and infant mortality failures.

Page 23: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

23

Temperature Sensing

Three diodes for temperature sensing: One in each core. routed to an on-package

temperature-monitor chip. provide temperature data to the system for fan speed control.

One between the two cores. is routed to pins for system use.

A temperature sensor near the hot spot in each core, provides a digital temperature readout that is used in conjunction with operating-system power-state requests to make informed throttle and boost decisions.

Page 24: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

24

Summary and Conclusion

Dual-core multi-threaded Xeon processor in 65nm process Technology.

The processor is flip-chip (C4). Has Two 64b cores. Each core has Two threads and A

unified 1MB L2 cache. Has 16MB unified L3 cache Operates at more than 3.0GHz from a 1.25V core

supply Three voltage supplies The processor FSB Operates at 800MT/s

Page 25: 1 A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache Stefan Rusu, Simon Tam, Harry Muljono, David Ayers, Jonathan Chang (Intel, Santa Clara,

25

References

[1] S. Rusu, S. Tam, “A Dual-Core Multi-Threaded Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p118, 2006.

[2] S. Tam, J. Leung, “Clock Generation and Distribution of a Dual-Core Xeon® Processor with 16MB L3 Cache”, IEEE ISSCC Tech. Digest, p382, 2006.

[3] “Dual-Core Intel® Xeon® Processor 7100 Series Datasheet”, Intel Corporation, September 2006.

[4] S. Tam, et al., “Clock Generation and Distribution for the Third Generation Itanium® Processor,” Symp. VLSI Circuits, pp. 9-12, Jun., 2003.

[5] N. H. E. Weste, D. Harris, “ CMOS VLSI Design” ,Pearson Education Inc., 2005.

[6] Wikipedia, The free encyclopedia . Available: http://en.wikipedia.org/