
Computer Structure

Advanced Topics

Lihu Rappoport and Adi Yoaz


Intel® Core™ μArch

Tick/Tock Development Model

Haswell builds upon innovations in the 2nd and 3rd Generation Intel® Core™ i3/i5/i7 Processors

(Sandy Bridge and Ivy Bridge)

TOCK – Nehalem (45nm Process Technology): 1st Generation Intel® Core™, new Intel® μarch (Nehalem)
TICK – Westmere (32nm Process Technology): 1st Generation Intel® Core™, Intel® μarch (Nehalem)
TOCK – Sandy Bridge (32nm Process Technology): 2nd Generation Intel® Core™, new Intel® μarch (Sandy Bridge)
TICK – Ivy Bridge (22nm Process Technology): 3rd Generation Intel® Core™, Intel® μarch (Sandy Bridge)
TOCK – Haswell (22nm Process Technology): 4th Generation Intel® Core™, new Intel® μarch (Haswell)

Foil taken from IDF 2012


5th Generation Intel® Core™ Processor
· 14nm Process Technology
· Broadwell micro-architecture


Haswell Core at a Glance

Next generation branch prediction

• Improves performance and saves wasted work

Improved front-end

• Initiate TLB and cache misses speculatively

• Handle cache misses in parallel to hide latency

• Leverages improved branch prediction

Deeper buffers

• Extract more instruction parallelism

• More resources when running a single thread

More execution units, shorter latencies

• Power down when not in use

More load/store bandwidth

• Better prefetching, better cache line split latency & throughput, double L2 bandwidth

• New modes save power without losing performance

No pipeline growth

• Same branch misprediction latency

• Same L1/L2 cache latency

Foil taken from IDF 2012

[Front-end pipeline diagram, stages 0–7: Branch Prediction and ITLB → I-cache Tag / µop-Cache Tag → I-cache Data / µop-Cache Data → Decode → µop Queue → µop Allocation → Out-of-Order Execution]


Branch Prediction Unit
· Predict branch targets

– Direct Calls and Jumps – target provided by a Target Array

– Indirect Calls and Jumps – predicted either as having a fixed target or as having targets that vary based on execution path

– Returns – predicted by a 16-entry Return Stack Buffer (RSB)

· For conditional branches – predict if taken or not

· The BPU makes predictions for 32 bytes at a time
– Twice the width of the fetch engine
– Enables taken branches to be predicted with no penalty

From the Optimization Manual / IDF 2012



Instruction TLB
· Initiate TLB and cache misses speculatively
– Handle cache misses in parallel to hide latency
– Leverages improved branch prediction


From the Optimization Manual / IDF 2012


Instruction Fetch and Pre-decode

• 32KB, 8-way I-Cache
– Fetches aligned 16 bytes per cycle
– Typical programs average ~4 bytes per instruction
– A misaligned target into the line, or a taken branch out of the line, reduces the effective number of instruction bytes fetched
  In typical integer code there is a taken branch every ~10 instructions, which translates into a partial fetch every 3–4 cycles

• The Pre-Decode Unit
– Determines the length of each instruction
– Decodes all prefixes associated with instructions
– Marks various properties of instructions for the decoders, for example "is branch"

[Diagram: 32KB L1 I-Cache → Pre-decode → Instruction Queue → 4 Decoders (+ MSROM) → µop Queue]

From the Optimization Manual


Instruction Pre-decode and IQ

· The pre-decode unit writes 6 instructions/cycle into the IQ
– If a fetch line contains more than 6 instructions
  It continues to pre-decode 6 instructions/cycle until all instructions in the fetch line are written into the IQ
  The subsequent fetch line starts pre-decode only after the current fetch completes
– A fetch line of 7 instructions takes 2 cycles to pre-decode
  An average of 3.5 inst/cycle (still higher than the IPC of most apps)

· Length-changing prefixes (LCPs) change the instruction length
– Each LCP within the fetch line takes 3 cycles to pre-decode

· The Instruction Queue is 18 instructions deep


From the Optimization Manual


Instruction Decode and μop Queue

• 4 decoders, which decode instructions into μops
– The 1st decoder can decode all instructions of up to 4 μops
– The other 3 decoders handle common single-μop instructions

• Instructions with more than 4 μops generate their μops from the MSROM

• Up to 4 μops/cycle are delivered into the μop Queue
– Buffers 56 μops (28 μops/thread when running 2 threads)
– Decouples the front end and the out-of-order engine
– Helps hide bubbles introduced between the various sources of μops in the front end


From the Optimization Manual


Macro-Fusion

· The IQ sends up to 5 instructions/cycle to the decoders

· Macro-fusion merges two adjacent instructions into a single μop
– A macro-fused instruction executes with a single dispatch
  Reduces latency and frees execution resources
– Increased decode, rename and retire bandwidth
– Power savings from representing more work in fewer bits

· The 1st instruction of the pair modifies the flags
– CMP, TEST, ADD, SUB, AND, INC, DEC

· The 2nd instruction of the pair is a conditional branch

· These pairs are common in many types of applications – see the compiled-loop example below
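A minimal C sketch (my own illustrative example, not from the slides): both tests below typically compile to a flags-setting instruction immediately followed by a conditional branch, exactly the fusable pairs listed above.

```c
/* Sketch: the a[i] != 0 test and the loop-closing i < n test each
   compile to CMP/TEST followed by a conditional branch -- pairs the
   decoders merge into a single macro-fused µop. */
int count_nonzero(const int *a, int n) {
    int count = 0;
    for (int i = 0; i < n; i++) {   /* cmp i,n ; jl  -> fusable pair */
        if (a[i] != 0)              /* cmp ; je      -> fusable pair */
            count++;
    }
    return count;
}
```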


From the Optimization Manual


Stack Pointer Tracker

· PUSH, POP, CALL, and RET implicitly update ESP
– They add or subtract an offset, which would otherwise require a dedicated μop
– The Stack Pointer Tracker performs these implicit ESP updates in the front end, tracking an offset Δ

Example:
PUSH EAX   ESP = ESP - 4 ; STORE [ESP], EAX  →  STORE [ESP-4], EAX   (Δ = Δ - 4, Δ = -4)
PUSH EBX   ESP = ESP - 4 ; STORE [ESP], EBX  →  STORE [ESP-8], EBX   (Δ = Δ - 4, Δ = -8)
INC ESP    ESP = ADD ESP, 1                  →  need to sync ESP: ESP = SUB ESP, 8 ; ESP = ADD ESP, 1   (Δ = 0)

From the Optimization Manual and ITJ

· Provides the following benefits (a toy software model follows below)
– Improves decode BW: PUSH, POP and RET become single-μop instructions
– Conserves rename, execution and retire resources
– Removes dependencies on ESP – stack operations can execute in parallel
– Saves power
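A toy C model of the idea (my own simplification, not the hardware design): PUSH/POP only update a front-end offset Δ, and an instruction that explicitly touches ESP first forces a sync µop that folds Δ back into ESP.

```c
/* Toy model: delta accumulates the implicit ESP updates of PUSH/POP;
   sync_esp() models the inserted µop that folds delta back into the
   real ESP before an explicit ESP update such as INC ESP. */
typedef struct { long esp; long delta; } sp_tracker;

static void do_push(sp_tracker *t) { t->delta -= 4; } /* no ESP-update µop */
static void do_pop (sp_tracker *t) { t->delta += 4; } /* no ESP-update µop */

static void sync_esp(sp_tracker *t) {  /* inserted sync µop: ESP += Δ */
    t->esp  += t->delta;
    t->delta = 0;
}

static void do_inc_esp(sp_tracker *t) {
    sync_esp(t);   /* Δ must be 0 before an explicit ESP update */
    t->esp += 1;   /* the INC ESP µop itself */
}
```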


Micro-Fusion
· Fuses multiple μops from the same instruction into a single μop
– An instruction which decodes into a single micro-fused μop can be handled by all decoders
– Improves the instruction bandwidth delivered from decode to retirement and saves power
– A micro-fused μop is dispatched multiple times in the OOO engine, as it would be if it were not micro-fused

· Micro-fused instructions (see the example below)
– Stores are comprised of 2 μops, store-address and store-data, fused into a single μop
– Load + op instructions, e.g., FADD DOUBLE PTR [RDI+RSI*8]
– Load + jump, e.g., JMP [RDI+200]
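For instance (a hedged C sketch, names mine), a read-modify add usually compiles to a single load+op instruction:

```c
/* x += a[i] typically becomes  add rax, [rdi + rcx*8]  -- one x86
   instruction whose load and add µops travel micro-fused through
   decode and are dispatched separately in the OOO engine. */
long accumulate(const long *a, long n) {
    long x = 0;
    for (long i = 0; i < n; i++)
        x += a[i];
    return x;
}
```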

From the Optimization Manual


Decoded μop-Cache

• Caches the μops coming out of the decoders
– Up to 1.5K μops (32 sets × 8 ways × 6 μops/way)
– The next time they are needed, μops are taken from the μop Cache
– ~80% hit rate for most applications
– Included in the I-Cache and iTLB; flushed on a context switch

• Higher bandwidth and lower latency
– More cycles sustaining 4 instructions/cycle
  Each cycle provides μops for instructions mapped to 32 bytes
  Able to 'stitch' across taken branches in the control flow

[Diagram: Branch Prediction Unit → 32KB L1 I-Cache → Pre-decode → Instruction Queue → Decoders → µop Queue, with the µop-Cache bypassing the fetch/decode path]

From the Optimization Manual


Decoded μop-Cache

• The Decoded μop Cache lets the normal front end sleep
– Decode one time instead of many times

• Branch misprediction penalty reduced
– The correct path is also the most efficient path

Save power while increasing performance

Foil taken from IDF 2011



Loop Stream Detector (LSD)

· The LSD detects small loops that fit in the μop queue
– The μop queue streams the loop, allowing the front end to sleep
– Until a branch misprediction inevitably ends it

· Loops qualify for LSD replay if all of the following conditions are met (a qualifying loop is sketched below)
– Up to 28 μops, with up to 8 taken branches, in up to 8 32-byte chunks
– All μops are also resident in the μop-Cache
– No CALL or RET
– No mismatched stack operations (e.g., more PUSHes than POPs)
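A sketch of a loop that would typically qualify (assuming the compiler keeps the body this tight; the example is mine):

```c
/* A short counted loop: a handful of µops, a single taken branch,
   no CALL/RET, no stack imbalance -- after the first iterations it
   can be replayed from the µop queue while fetch and decode sleep. */
long dot(const long *a, const long *b, long n) {
    long s = 0;
    for (long i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}
```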


From the Optimization Manual


The Renamer

· Moves up to 4 μops/cycle from the μop Queue to the OOO engine
– Renames the architectural sources and destinations of the μops to micro-architectural sources and destinations
– Allocates resources to the μops, e.g., load or store buffers
– Binds the μop to an appropriate dispatch port
– Up to 2 branches each cycle

· Some μops are executed to completion during rename, effectively costing no execution bandwidth
– A subset of register-to-register MOV and FXCHG
– Zero-Idioms
– NOP

[Diagram: µop Queue → in-order Allocate/Rename/Retire (with Load Buffers, Store Buffers, Reorder Buffers) → OOO Scheduler]

From the Optimization Manual


Dependency Breaking Idioms
· Zero-Idiom – an instruction that zeroes a register
– Regardless of the input data, the output data is always 0
– E.g., XOR REG, REG and SUB REG, REG
– No μop dependency on its sources

· Zero-Idioms are detected and removed by the Renamer
– They do not consume execution resources and have zero execution latency

· Zero-Idioms remove partial register dependencies
– Improve instruction parallelism

· Ones-Idiom – an instruction that sets a register to "all 1s"
– Regardless of the input data, the output is always "all 1s"
– E.g., CMPEQ XMM1, XMM1
  No μop dependency on its sources, as with the zero idiom; can execute as soon as it finds a free execution port
– As opposed to the Zero-Idiom, the Ones-Idiom μop must execute (see the example below)
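A small intrinsics sketch (my example; requires SSE2) showing how compilers deliberately emit both idioms:

```c
#include <immintrin.h>

/* _mm_setzero_si128() is emitted as PXOR xmm,xmm -- a zero idiom,
   eliminated at rename with zero latency.  Comparing a register
   with itself for equality is the ones idiom: it must execute, but
   carries no dependency on its source. */
__m128i zero_and_ones(void) {
    __m128i zero = _mm_setzero_si128();         /* zero idiom */
    __m128i ones = _mm_cmpeq_epi32(zero, zero); /* ones idiom */
    return _mm_xor_si128(zero, ones);           /* all 1s     */
}
```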

From the Optimization Manual


Out-of-Order Cluster

• The Scheduler queues μops until all source operands are ready
– Schedules and dispatches ready μops to the available execution units in as close to first-in-first-out (FIFO) order as possible

• Physical Register File (PRF)
– Used instead of a centralized Retirement Register File
  A single copy of every datum, with no movement after calculation
– Allows a significant increase in buffer sizes
  Dataflow window ~33% larger

• Retirement
– Retires μops in order and handles faults and exceptions


Foil taken from IDF 2011


Intel® Advanced Vector Extensions
· Vectors are a natural data type for many apps
– Extends the SSE FP instruction set to 256-bit operand size
– Extends all 16 XMM registers to 256 bits

• New, non-destructive source syntax (see the intrinsics example below)
– VADDPS ymm1, ymm2, ymm3

• New operations to enhance vectorization
– Broadcasts, masked load & store

[Diagram: XMM0 (128 bits) extended to YMM0 (256 bits, AVX)]

Wide vectors + non-destructive source: more work with fewer instructions. Extending the existing state is area- and power-efficient.
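A hedged intrinsics example (mine; compile with AVX enabled): the compiler emits the three-operand VADDPS, leaving both source registers intact.

```c
#include <immintrin.h>

/* Maps to VADDPS ymm, ymm, ymm: a 256-bit FP add written to a
   separate destination -- neither source is overwritten. */
__m256 add8f(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);
}
```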

Foil taken from IDF 2011


Execution Cluster

The scheduler sees a matrix:
• 3 "ports" to 3 "stacks" of execution units
– General-Purpose Integer
– SIMD (Vector) Integer
– SIMD Floating Point

• Challenge: double the output of one of these stacks in a manner that is invisible to the others

          GPR       | SIMD INT           | SIMD FP
Port 0:   ALU       | VI MUL, VI Shuffle | FP MUL, DIV, Blend
Port 1:   ALU       | VI ADD             | FP ADD
Port 5:   ALU, JMP  | VI Shuffle         | FP Shuf, FP Bool, Blend

Foil taken from IDF 2011

Foil taken from IDF 2011


Execution Cluster – 256-bit Vectors
Solution:
• Repurpose existing data paths to dual-use
• SIMD integer and legacy SIMD FP use the legacy stack style
• Intel® AVX utilizes both 128-bit execution stacks
• Double FLOPs
– 256-bit Multiply + 256-bit ADD + 256-bit Load per clock


"Cool" implementation of Intel AVX: 256-bit Multiply + 256-bit ADD + 256-bit Load per clock…
Double your FLOPs with great energy efficiency


Haswell AVX2: FMA & Peak FLOPS

[Chart: peak cache bytes/clock vs. peak FLOPS/clock, growing from Banias and Merom through Sandy Bridge to Haswell]

Haswell:
· 2 new FMA units provide 2× peak FLOPs/cycle
– Fused Multiply-Add

· 2× cache bandwidth to feed the wide vector units
– 32-byte load/store for the L1
– 2× L2 bandwidth

· 5-cycle FMA latency, the same as an FP multiply

Latency (clks)   Prior Gen   Haswell   Ratio
MulPS, PD            5           5
AddPS, PD            3           3
Mul+Add / FMA        8           5       1.6

Foil taken from IDF 2012

              Instruction Set    SP/cyc   DP/cyc
Nehalem       SSE (128-bit)         8        4
Sandy Bridge  AVX (256-bit)        16        8
Haswell       AVX2 & FMA           32       16
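A short intrinsics sketch (mine; compile with FMA enabled) of the fused multiply-add the table refers to. With 8 SP lanes × 2 flops × 2 units, this is how the 32 SP FLOPs/cycle peak is reached:

```c
#include <immintrin.h>

/* Maps to a VFMADD*PS instruction: a*b + c computed as one fused
   µop with 5-cycle latency; two such units give Haswell its 2x
   peak FLOPs per cycle. */
__m256 fma8f(__m256 a, __m256 b, __m256 c) {
    return _mm256_fmadd_ps(a, b, c);
}
```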


Haswell Execution Units

Unified Reservation Station, dispatching to 8 ports:
Port 0: Integer ALU & Shift; FMA / FP Multiply; Vector Int Multiply; Vector Logicals; Vector Shifts; Divide; Branch
Port 1: Integer ALU & LEA; FMA / FP Mult / FP Add; Vector Int ALU; Vector Logicals
Port 2: Load & Store Address
Port 3: Load & Store Address
Port 4: Store Data
Port 5: Integer ALU & LEA; Vector Shuffle; Vector Int ALU; Vector Logicals
Port 6: Integer ALU & Shift; Branch
Port 7: Store Address

• 2×FMA: doubles peak FLOPs; two FP multiplies benefit legacy code
• New AGU for stores (Port 7): leaves Ports 2 & 3 open for loads
• New branch unit (Port 6): reduces Port 0 conflicts; a 2nd EU for branch-heavy code
• 4th ALU: great for integer workloads; frees Ports 0 & 1 for vector work

Intel® Microarchitecture (Haswell)

Foil taken from IDF 2012


Memory Cluster

· The L1 D$ handles 2× 256-bit loads + 1× 256-bit store per cycle

· The ML$ can service one cache line (64 bytes) each cycle

· 72 load buffers keep load μops from allocation till retirement
– Re-dispatch blocked loads

· 42 store buffers keep store μops from allocation until the store value is written to the L1 D$
– or, for non-temporal stores, until it is written to the line fill buffers

[Diagram: 32KB L1 D$ with two Load/Store Address ports and one Store Address port, Store Data, Fill Buffers, Memory Control; 256KB ML$; 32×3 bytes/cycle between the L1 D$ and the core]

From the Optimization Manual


L1 Data Cache
· Handles 2× 256-bit loads + 1× 256-bit store per cycle

· Maintains requests which cannot be completed
– Cache misses
– Unaligned accesses that split across cache lines
– Data not ready to be forwarded from a preceding store
– Loads blocked due to cache line replacement

· Handles up to 10 outstanding cache misses in the LFBs
– Continues to service incoming stores and loads

· The L1 D$ is a write-back, write-allocate cache
– Stores that hit in the L1 D$ do not update the L2/LLC/memory
– Stores that miss the L1 D$ allocate a cache line

From the Optimization Manual


Stores
· Stores to memory are executed in two phases

· At EXE
– Fill the store buffers with the linear + physical address and with the data
– Once the store address and data are known, forward the store data to load operations that need it

· After the store retires – the completion phase
– First, the line must be in the L1 D$, in E or M MESI state
  Otherwise, fetch it using a Read for Ownership request from: L1 D$ → L2$ → LLC → L2$ and L1 D$ in other cores → memory
– Read the data from the store buffers and write it to the L1 D$ in M state
  Done at retirement to preserve the order of memory writes
– Release the store buffer entry taken by the store
  Affects performance only if the store buffer becomes full (allocation stalls); loads needing the store data get it when the store executes

From the Optimization Manual


Store Forwarding
· Forward data directly from a store to a load that needs it
– Instead of having the store write the data to the DCU and then the load read the data from the DCU

· Store-to-load forwarding conditions (illustrated in the sketch below)
– The store is the last store to that address prior to the load
  Requires the addresses of all stores older than the load to be known
– The store contains all of the data being loaded
– The load is from a write-back memory type
– Neither the load nor the store is a non-temporal access
– The load is not a 4- or 8-byte load that crosses an 8-byte boundary, relative to the preceding 16- or 32-byte store
– The load does not cross a 16-byte boundary of a 32-byte store
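A C sketch (my example) of the "store contains all loaded data" condition:

```c
#include <stdint.h>
#include <string.h>

/* Forwarding succeeds: the 8-byte load is fully covered by the
   preceding 8-byte store, so the data comes straight from the
   store buffer. */
uint64_t fwd_ok(uint64_t *p, uint64_t v) {
    *p = v;
    return *p;
}

/* Forwarding fails: the 8-byte load spans two 4-byte stores, so no
   single store contains all the loaded bytes; the load waits until
   the stores are written to the L1 D$. */
uint64_t fwd_blocked(uint32_t *p) {
    p[0] = 1;
    p[1] = 2;
    uint64_t x;
    memcpy(&x, p, sizeof x);   /* 8-byte load over both stores */
    return x;
}
```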

From the Optimization Manual


Memory Disambiguation
· A load may depend on a preceding store
– A load is blocked until all preceding store addresses are known

· Predict which loads do not depend on any previous stores
– Forward data from a store or the L1 D$ even when not all previous store addresses are known
– The prediction is verified, and if a conflict is detected, the load and all succeeding instructions are re-fetched
– Always assumes a dependency between loads and earlier stores that have the same address bits 0:11

· The following loads are not disambiguated
– Loads that cross the 16-byte boundary
– 32-byte Intel AVX loads that are not 32-byte aligned

From the Optimization Manual


Data Prefetching
· Two hardware prefetchers load data into the L1 D$

· DCU prefetcher (the streaming prefetcher)
– Triggered by an ascending access to very recently loaded data
– Fetches the next line, assuming a streaming load

· Instruction-pointer (IP)-based stride prefetcher (see the loop sketch below)
– Tracks individual load instructions, detecting a regular stride
  Prefetch address = current address + stride
  Detects strides of up to 2K bytes, both forward and backward
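A loop shape (a hedged example of mine) that the IP-based stride prefetcher handles well:

```c
/* The load in this loop advances by a constant stride of
   stride * sizeof(float) bytes per iteration; as long as that is
   under ~2KB, the IP-based prefetcher can learn the stride and
   fetch ahead of use. */
float sum_strided(const float *a, long n, long stride) {
    float s = 0.0f;
    for (long i = 0; i < n; i += stride)
        s += a[i];
    return s;
}
```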

From the Optimization Manual


Core Cache Size/Latency/Bandwidth

Metric                 Nehalem                 Sandy Bridge            Haswell
L1 Instruction Cache   32K, 4-way              32K, 8-way              32K, 8-way
L1 Data Cache          32K, 8-way              32K, 8-way              32K, 8-way
  Fastest load-to-use  4 cycles                4 cycles                4 cycles
  Load bandwidth       16 Bytes/cycle          32 Bytes/cycle (banked) 64 Bytes/cycle
  Store bandwidth      16 Bytes/cycle          16 Bytes/cycle          32 Bytes/cycle
L2 Unified Cache       256K, 8-way             256K, 8-way             256K, 8-way
  Fastest load-to-use  10 cycles               11 cycles               11 cycles
  Bandwidth to L1      32 Bytes/cycle          32 Bytes/cycle          64 Bytes/cycle
L1 Instruction TLB     4K: 128, 4-way;         4K: 128, 4-way;         4K: 128, 4-way;
                       2M/4M: 7/thread         2M/4M: 8/thread         2M/4M: 8/thread
L1 Data TLB            4K: 64, 4-way;          4K: 64, 4-way;          4K: 64, 4-way;
                       2M/4M: 32, 4-way;       2M/4M: 32, 4-way;       2M/4M: 32, 4-way;
                       1G: fractured           1G: 4, 4-way            1G: 4, 4-way
L2 Unified TLB         4K: 512, 4-way          4K: 512, 4-way          4K+2M shared: 1024, 8-way

All caches use 64-byte lines.
Foil taken from IDF 2012


Buffer Sizes

                       Nehalem     Sandy Bridge   Haswell
Out-of-order Window    128         168            192
In-flight Loads        48          64             72
In-flight Stores       32          36             42
Scheduler Entries      36          54             60
Integer Register File  N/A         160            168
FP Register File       N/A         144            168
Allocation Queue       28/thread   28/thread      56

Extract more parallelism in every generation

Foil taken from IDF 2012


Haswell Core μArch

[Full-core block diagram: Branch Prediction Unit → 32KB L1 I-Cache → Pre-decode → Instruction Queue → Decoders, with µop Cache, µop Queue and LSD; in-order Allocate/Rename/Retire with Load, Store and Reorder Buffers; OOO Scheduler dispatching to Ports 0–7 (ALUs, Shifts, LEAs, JMPs, MUL, DIV, FMAs, Shuffle and vector units; Loads on Ports 2/3, Store Data on Port 4, Store Address on Port 7); 32KB L1 D$ with Fill Buffers and Memory Control; 256KB ML$]

From the Optimization Manual


Hyper-Threading Technology


Thread-Level Parallelism
· Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors

· Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors

· Thread-level parallelism (TLP) – threads can be from
– the same application
– different applications running simultaneously
– operating system services

· Increasing single-thread performance becomes harder
– and is less and less power-efficient

· Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die

From the Optimization Manual


Multi-Threading Schemes
· Multi-threading: a single processor executes multiple threads

· Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory

· Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses

· A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mispredictions and long dependencies

· Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously, without switching
– Makes the most effective use of processor resources
  Maximizes performance vs. transistor count and power

From the Optimization Manual


Hyper-Threading Technology
· Two logical processors in each physical processor
– Sharing most execution/cache resources using SMT
– Look like two processors to SW (OS and apps)

· Each logical processor executes a software thread
– Threads execute simultaneously on one physical processor

· Each LP maintains its own architectural state
– Complete set of architectural registers: general-purpose registers, control registers, machine state registers, debug registers
– Instruction pointers

· Each LP has its own interrupt controller
– Interrupts sent to a specific LP are handled only by it

From the Optimization Manual


Two Important Goals

· When one thread is stalled, the other thread can continue to make progress
– Independent progress is ensured by either:
  Partitioning buffering queues and limiting the number of entries each thread can use
  Duplicating buffering queues

· A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active

From the Optimization Manual


Single-Task and Multi-Task Modes
· MT-mode (Multi-task mode)
– Two active threads, with some resources partitioned as described earlier

· ST-mode (Single-task mode)
– ST0 / ST1 – only thread 0 / 1 is active
– Resources that are partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources

· Moving the processor between modes:

[State diagram: MT ↔ ST0/ST1 ↔ Low Power. A thread executing HALT moves MT to ST0 or ST1; the remaining thread executing HALT moves to Low Power; an Interrupt moves back]

From the Optimization Manual


Thread Optimization

The OS should implement two optimizations:
· Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do
– This so-called "idle loop" can consume significant execution resources that could otherwise be used by the other active logical processor

· On a multi-processor system
– The OS views logical processors similarly to physical processors
  But it can still differentiate, and prefer to schedule a thread on a new physical processor rather than on the 2nd logical processor of the same physical processor
– Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
– This allows SW threads to use different physical resources when possible

From the Optimization Manual


Physical Resource Sharing Schemes
· Replicated resources
– Each thread has its own resource

· Partitioned resources
– The resource is partitioned in MT mode, and combined in ST mode

· Competitively-shared resources
– Both threads compete for the entire resource

· HT-unaware resources
– Both threads use the resource

· Alternating resources
– Alternate between the active threads
– If one thread is idle, the active thread uses the resource continuously

From the Optimization Manual


Front End Resource Sharing

· The Instruction Pointer is replicated

· Branch prediction resources are either duplicated or shared
– E.g., the return stack buffer is duplicated

· iTLB
– The large-page iTLB is replicated
– The small-page iTLB is partitioned

· The I$ is competitively-shared

· The decoder logic is shared and its use is alternated
– If only one thread needs the decode logic, it gets the full decode bandwidth

· The μop-Cache is partitioned

· The μop Queue is partitioned
– 56 μops in ST mode, 28 μops/thread in MT mode

From the Optimization Manual


Back End Resource Sharing
· Out-of-Order Execution
– Register renaming tables are replicated
– The reservation stations are competitively-shared
– Execution units are HT-unaware
– The re-order buffers are partitioned
– Retirement logic alternates between threads

· Memory
– The load buffers and store buffers are partitioned
– The DTLB and STLB are competitively-shared
– The cache hierarchy and fill buffers are competitively-shared

From the Optimization Manual


Power Management


Processor Power Components
· The power consumed by a processor consists of
– Dynamic power: the power for toggling transistors and lines from 0→1 or 1→0
  P_dynamic = αCV²f, where α – activity, C – capacitance, V – voltage, f – frequency
– Leakage power: the leakage of transistors under voltage
  A function of Z – total size of all transistors, V – voltage, t – temperature

· Peak power must not exceed the thermal constraints
– Power generates heat, and heat must be dissipated to keep transistors within the allowed temperature
– Peak power determines peak frequency (and thus peak performance)
– It also affects form factor, cooling solution cost, and acoustic noise

· Average power
– Determines battery life (for mobile devices), the electricity bill, and the air-conditioning bill
– Average power = Total Energy / Total Time, including low-activity and idle time (~90% idle time for client)


Performance per Watt
· In small form-factor devices the thermal budget limits performance
– Old target: get max performance
– New target: get max performance at a given power envelope → performance per Watt

· Increasing f also requires increasing V (~linearly)
– Dynamic Power = αCV²f ≈ Kf³, so X% more performance costs ~3X% more power (worked out below)
– A power-efficient feature must be better than 1:3 performance : power
  Otherwise it is better to just increase frequency (and voltage)

· Vmin is the minimal operating voltage
– Once at Vmin, reducing frequency no longer reduces voltage
– At this point a feature is power-efficient only if it is better than 1:1 performance : power

· Active energy efficiency tradeoff
– Energy_active = Power_active × Time_active ∝ Power_active / Perf_active
– Energy-efficient feature: better than 1:1 performance : power
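A worked instance of the cubic rule (using the slide's approximation that voltage scales ~linearly with frequency):

```latex
P_{dyn} = \alpha C V^2 f \propto f^3
\qquad\Rightarrow\qquad
\frac{P(1.1f)}{P(f)} = 1.1^3 \approx 1.33
```

i.e., a 10% frequency (performance) gain costs roughly 33% more power – the 1:3 performance : power ratio quoted above.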


Platform Power
· Processor average power is <10% of the platform:
Display (panel + inverter) 33%, CPU 10%, Power Supply 10%, MCH 9%, Misc. 8%, GFX 8%, HDD 8%, CLK 5%, ICH 3%, DVD 2%, LAN 2%, Fan 2%


Managing Power
· Typical CPU usage varies over time
– Bursts of high utilization & long idle periods (~90% of the time in client)

· Optimize power and energy consumption
– High power when high performance is needed
– Low power at low activity or idle

· Enhanced Intel SpeedStep® Technology
– Multiple voltage/frequency operating points
– The OS changes frequency to meet performance needs and minimize power
– Referred to as processor Performance states = P-states

· The OS notifies the CPU when no tasks are ready for execution
– The CPU enters a sleep state, called a C-state
– Using the MWAIT instruction, with the C-state level as an argument (sketched below)
– Tradeoff between power and latency: deeper sleep → more power savings, but longer to wake
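A minimal, heavily simplified ring-0 sketch (assumption: privileged code on an x86 CPU with MONITOR/MWAIT support; a real OS idle loop adds feature checks and interrupt handling). The requested C-state level is passed as a hint in EAX:

```c
/* MONITOR arms an address range (address in RAX); MWAIT then stops
   the core until a write to that range or an interrupt arrives,
   with the requested C-state passed as a hint in EAX.
   `cstate_hint` is a placeholder for the platform-specific hint
   encoding. */
static inline void idle_mwait(const void *monitor_addr,
                              unsigned int cstate_hint) {
    __asm__ volatile("monitor" :: "a"(monitor_addr), "c"(0), "d"(0));
    __asm__ volatile("mwait"   :: "a"(cstate_hint), "c"(0));
}
```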


P-states
· Operating frequencies are called P-states = Performance states
– P0 is the highest frequency
– P1, P2, P3, … are lower frequencies
– Pn is the min-Vcc point = the energy-efficient point

· DVFS = Dynamic Voltage and Frequency Scaling
– Power = CV²f; f ≈ KV → Power ~ f³
– Program execution time ~ 1/f
– E = P×t → E ~ f² (derivation below), so Pn is the most energy-efficient point
– Going up/down the cubic power curve: a high power cost to achieve extra frequency, and large power savings for a small frequency reduction
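Filling in the one-line derivation behind E ~ f²:

```latex
E = P \cdot t \;\propto\; f^3 \cdot \frac{1}{f} \;=\; f^2
```

So, until Vmin is reached, halving the frequency roughly quarters the energy of a fixed task, which is why Pn (the minimum-voltage point) is the most energy-efficient P-state.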

[Graph: cubic power-vs-frequency curve with operating points P0, P1, P2, …, Pn]


C-States: C0
· C0: CPU active state
[Power breakdown: Leakage + Clock Distribution + Local Clocks and Logic = active core power]


C-States: C1
· C0: CPU active state
· C1: Halt state
– Stop the core pipeline
– Stop most core clocks
– No instructions are executed
– Caches respond to external snoops
[Remaining power: Leakage + Clock Distribution]


C-States: C3
· C0, C1: as above
· C3 state:
– Stop the remaining core clocks
– Flush the internal core caches
[Remaining power: Leakage]


C-States: C6
· C0, C1, C3: as above
· C6 state:
– The processor saves its architectural state
– Turn off the power gates, eliminating leakage
– Core power goes to ~0


Putting it all together

· CPU running at max power and frequency
· Periodically enters C1
[Power-vs-time trace: the CPU runs in C0/P0 near peak power, periodically dipping into C1]


Putting it all together

· Going into an idle period
– Gradually enters deeper C-states
– Controlled by the OS
[Power-vs-time trace: from C0/P0 down through C1, C2, C3, C4]


Putting it all together

· Tracking CPU utilization history
– The OS identifies low activity
– Switches the CPU to a lower P-state
[Power-vs-time trace: C0/P0 bursts and idle C-states, then running at C0/P1]


Putting it all together

· CPU enters the idle state again
[Power-vs-time trace: C0/P1 activity followed by another descent through C1–C4]


Putting it all together

· Further lowering the P-state
· DVD playback runs at the lowest P-state
[Power-vs-time trace: DVD playback at C0/P2, with idle C-states in between]


Voltage and Frequency Domains
· Two independent variable power planes
– CPU cores, ring and LLC
  Embedded power gates – each core can be turned off individually
  Cache power gating – turn off portions of, or all of, the cache in deeper sleep states
– Graphics processor
  Can be varied or turned off when not active

· Shared frequency for all IA32 cores and the ring

· Independent frequency for the processor graphics

· Fixed programmable power plane for the System Agent
– Optimizes SA power consumption
– System-on-Chip functionality and PCU logic
– Periphery: DDR, PCIe, Display

[Diagram: per-core VCC Core planes behind embedded power gates, an ungated VCC Core plane, VCC Graphics, VCC SA, and VCC Periphery]


Turbo Mode
· P1 is the guaranteed frequency
– CPU and GFX under simultaneous heavy load at worst-case conditions
– Actual power has a high dynamic range

· P0 is the max possible frequency – the Turbo frequency
– The P1-to-P0 range spans a significant frequency range (GHz)
  Exploited by single-threaded or lightly loaded applications, and by GFX<>CPU balancing
– The OS treats P0 as any other P-state, requesting it when it needs more performance
– The P1-to-P0 range is fully HW-controlled
  Frequency transitions are handled completely in HW; the PCU keeps the silicon within the existing operating limits
– Systems are designed to the same specs, with or without Turbo Mode

· Pn is the energy-efficient state
– Below Pn, operation is controlled by the Thermal-State (T-states & throttling)

[Diagram: frequency scale from LFM through Pn…P1 (OS-visible, OS-controlled states) up to P0/1C, the HW-controlled "Turbo" range; T-state & throttle sit below the P-states]


Turbo Mode

[Bar chart: per-core frequency for a lightly threaded workload, no Turbo. Power gating gives zero power for the inactive cores while the active cores run at nominal frequency]


Turbo Mode

[Bar chart: lightly threaded workload with Turbo. Power gating gives zero power for the inactive cores, and their thermal budget is used to increase the frequency of the active cores]



Turbo Mode

[Bar chart: all cores active, running workloads < TDP. Turbo increases the frequency of all cores within the thermal headroom]


Turbo Mode

[Bar chart: lightly threaded workload with the active cores < TDP. Power gating gives zero power for the inactive cores, and Turbo increases the frequency of the active cores within the thermal headroom]


Thermal Capacitance

Classic model: steady-state thermal resistance – the design guide for steady state
[Graph: classic model response – temperature vs. time]

New model: steady-state thermal resistance AND dynamic thermal capacitance – a more realistic response to power changes
[Graph: temperature rises as energy is delivered to the thermal solution; the thermal solution response is calculated in real time]

Foil taken from IDF 2011


Intel® Turbo Boost Technology 2.0
[Chart: power vs. time. During sleep/low-power periods the system builds up thermal budget; after idle periods it uses the accumulated "energy budget" to run above TDP (C0/P0 Turbo) for a few seconds, enhancing responsiveness; in steady-state conditions the power stabilizes at TDP, the sustained power]

Foil taken from IDF 2011


Core and Graphics Power Budgeting
• Cores and Graphics are integrated on the same die, with separate voltage/frequency controls and tight HW control
• Full package power specifications are available for sharing
• The power budget can shift between Cores and Graphics
[Chart: Graphics Power vs. Core Power. Applications range between heavy-graphics and heavy-CPU workloads; the realistic concurrent max power is below the sum of the individual max powers, and (since Sandy Bridge) Turbo may exceed it for short periods within the total package power]

Foil taken from IDF 2011