Computer Architecture 2011 – Advanced Topics 1
Computer Architecture
Advanced Topics
Performance per Watt
· Mobile: smaller form factors decrease the power budget
– Power generates heat, which must be dissipated to keep transistors within the allowed temperature
– This limits the processor’s peak power consumption
· Change the target
– Old target: get maximum performance
– New target: get maximum performance within a given power envelope
· Performance via frequency increase
– Power = C·V²·f, but increasing f also requires increasing V
– X% more performance costs ~3X% more power (assuming performance is linear with frequency)
· A power-efficient feature must beat a 1:3 performance-to-power ratio
– Otherwise it is better to just increase frequency (and voltage)
– U-arch performance features should be power efficient
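The 1:3 rule above can be checked with a toy model. It assumes, purely for illustration, that voltage must scale linearly with frequency, so P = C·V²·f grows as f³:

```python
# Toy model of P = C * V^2 * f with the illustrative assumption that
# voltage scales linearly with frequency (so P grows roughly as f^3).
def power(freq_ghz, volts_per_ghz=1.0, capacitance=1.0):
    volts = volts_per_ghz * freq_ghz
    return capacitance * volts ** 2 * freq_ghz   # P = C * V^2 * f

base = power(1.00)
faster = power(1.10)                  # +10% frequency (= +10% performance here)
print(f"{faster / base - 1:+.1%}")    # ~ +33% power: X% performance costs ~3X% power
```

The cube law only approximates "3X%" for small X; at +10% frequency the exact figure is 1.1³ − 1 ≈ 33.1%.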
Higher Performance vs. Longer Battery Life
· Processor average power is <10% of the platform
– The processor reduces power in periods of low processor activity
– The processor enters lower power states in idle periods
· Average power includes low-activity periods and idle time
– Typical: 1W – 3W
· Max power is limited by heat dissipation
– Typical: 20W – 100W
· Decision
– Optimize for performance when active
– Optimize for battery life when idle
Typical notebook platform power breakdown:
– Display (panel + inverter): 33%
– CPU: 10%
– Power supply: 10%
– Intel® MCH: 9%
– Misc.: 8%
– GFX: 8%
– HDD: 8%
– CLK: 5%
– Intel® ICH: 3%
– DVD: 2%
– LAN: 2%
– Fan: 2%
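Since the CPU is roughly 10% of platform power in the breakdown above, a rough battery-life estimate follows directly. The battery capacity below is a hypothetical number chosen for illustration:

```python
# Illustrative battery-life estimate from the platform breakdown above.
# Assumptions (hypothetical): a 53 Wh battery; CPU ~10% of platform power.
def battery_hours(battery_wh, cpu_avg_w, cpu_share=0.10):
    platform_w = cpu_avg_w / cpu_share    # scale CPU power up to the whole platform
    return battery_wh / platform_w

print(battery_hours(53, 2.0))   # 2 W average CPU -> ~20 W platform -> ~2.65 hours
```

This is why the slide optimizes for battery life at idle: average power, not peak power, sets runtime.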
Leakage Power
· The power consumed by a processor consists of
– Active power: used to switch transistors
– Leakage power: leakage of transistors under voltage
· Leakage power is a function of
– The number of transistors and their size
– Operating voltage
– Die temperature
· Leakage power reduction
– The LLC (Last Level Cache) is built with low-leakage transistors (2/3 of the die transistors)
  Low-leakage transistors are slower, increasing cache access latency
  The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology reduces voltage and frequency on low processor activity
Enhanced SpeedStep® Technology
· The “Basic” SpeedStep® technology had
– 2 operating points
– Non-transparent switches
· The “Enhanced” version provides
– Multiple voltage/frequency operating points
  For example, the 1.6GHz Pentium M processor operates
  from 600MHz @ 0.956V to 1.6GHz @ 1.484V
– Transparent switches
– Frequent switches
· Benefits
– Higher power efficiency
  2.7X lower frequency → ~2X performance loss but >2X energy gain
– Outstanding battery life
[Chart: Voltage, Frequency, Power. Frequency (GHz) and typical power (Watts) plotted against voltage (0.8V to 1.6V). Across the operating range, frequency varies 2.7X while power varies 6.1X, for an efficiency ratio of 2.3.]
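The efficiency ratio annotated on the chart can be reproduced from its two endpoints, assuming performance tracks frequency so that the time per task stretches by the frequency drop:

```python
# Energy efficiency from the chart annotations above (values read off the figure).
freq_drop, power_drop = 2.7, 6.1     # max vs. min operating point
# Energy per task = power * time: time grows by freq_drop, power falls by power_drop,
# so energy per task improves by power_drop / freq_drop.
efficiency_ratio = power_drop / freq_drop
print(round(efficiency_ratio, 1))    # 2.3
```

This is why lowering voltage and frequency together beats simply idling at full voltage: energy per unit of work, not just power, improves.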
2nd Generation Intel® Core™: Sandy Bridge
2nd Gen Intel® Core™ Microarchitecture: Overview
· Integrated memory controller: 2ch DDR3
· High-bandwidth Last Level Cache
· Next-generation processor graphics and media
· Next-generation Intel® Turbo Boost Technology
· Intel® Hyper-Threading Technology: 4 cores / 8 threads or 2 cores / 4 threads
· Integrates CPU, graphics, memory controller, and PCI Express* on a single chip
· Embedded DisplayPort
· Substantial performance improvement
· Intel® Advanced Vector Extensions (Intel® AVX)
· High-bandwidth, low-latency modular core/graphics interconnect
· Discrete graphics support: 1x16 or 2x8
[Die diagram: four Core+LLC slices, Graphics, and the System Agent (IMC, display, DMI, PCI Express*) on a single chip, with 2ch DDR3, x16 PCIe, a PECI interface to an embedded controller, a notebook DP port, and the PCH.]
Core Block Diagram
[Block diagram: an in-order front end (branch prediction, 32KB L1 instruction cache, pre-decode, instruction queue, 4 decoders, 1.5K-uop decoded-uop cache) feeds in-order allocate/rename/retire (zeroing idioms; load, store, and reorder buffers), then an out-of-order scheduler with six execution ports: port 0 (ALU, SIMD INT MUL, DIV, FP MUL), port 1 (ALU, SIMD INT ALU, FP ADD), port 5 (ALU, branch, FP shuffle), ports 2/3 (load / store address), port 4 (store data). The data cache unit holds a 32KB L1 data cache with 48 bytes/cycle to the L2 cache (MLC) fill buffers.]
Front End
Instruction fetch and decode:
• 32KB 8-way associative ICache
• 4 decoders, up to 4 instructions / cycle
• Micro-fusion
– Bundles multiple instruction events into a single uop
• Macro-fusion
– Fuses instruction pairs into a complex uop
• Decode pipeline supports 16 bytes per cycle
[Diagram: branch prediction unit, 32KB L1 I-cache, pre-decode, instruction queue, 4 decoders.]
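Macro-fusion as described above can be sketched behaviorally: a compare-like instruction followed by a conditional jump is emitted as one fused uop. The mnemonics and fusion rule here are a simplified illustration, not the exact hardware pairing rules:

```python
# Behavioral sketch of macro-fusion: CMP/TEST + conditional jump -> one fused uop.
# The fusable set and pairing rule are simplified for illustration.
FUSABLE_FIRST = {"cmp", "test"}

def fuse(instrs):
    uops, i = [], 0
    while i < len(instrs):
        nxt = instrs[i + 1] if i + 1 < len(instrs) else ""
        if instrs[i] in FUSABLE_FIRST and nxt.startswith("j"):
            uops.append(instrs[i] + "+" + nxt)   # emit a single complex uop
            i += 2
        else:
            uops.append(instrs[i])
            i += 1
    return uops

print(fuse(["add", "cmp", "jne", "mov"]))   # ['add', 'cmp+jne', 'mov']
```

Four instructions become three uops, which is how fusion raises effective decode and retirement throughput without widening the machine.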
Decoded Uop Cache
• An instruction cache that holds uops instead of instruction bytes (~1.5K uops)
– ~80% hit rate for most applications
• Higher instruction bandwidth and lower latency
– The decoded uop cache can deliver 32 bytes / cycle
– More cycles sustaining 4 instructions/cycle
– Able to ‘stitch’ across taken branches in the control flow
[Diagram: the decoded uop cache sits beside the legacy pipeline (branch prediction unit, 32KB L1 I-cache, pre-decode, instruction queue, decoders).]
Branch Prediction Unit
New branch predictor:
• Twice as many targets
• Much more effective storage for history
• Much longer history for data-dependent behaviors
Save Power while Increasing Performance
• The decoded uop cache lets the normal front end sleep
– Decode once instead of many times
• Branch mispredictions are reduced substantially
– The correct path is also the most efficient path
“Out of Order” Part of the Machine
• Receives uops from the front end
• Sends them to execution units when they are ready
• Retires them in program order
• Increases performance by finding more instruction-level parallelism
– Increasing the depth and width of the machine implies larger buffers
– More data storage, more data movement, more power
Sandy Bridge Out-of-Order (OOO) Cluster
• Method: a Physical Register File (PRF) instead of a centralized retirement register file
– A single copy of every datum
– No movement after calculation
• Allows a significant increase in buffer sizes
– Dataflow window ~33% larger
                        Nehalem   Sandy Bridge
Load buffers               48          64
Store buffers              32          36
RS (scheduler) entries     36          54
PRF integer               N/A         160
PRF floating-point        N/A         144
ROB entries               128         168
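The growth figures in the table above can be checked with a line of arithmetic; the ROB growth is what bounds the "~33% larger dataflow window" claim:

```python
# Buffer growth from Nehalem to Sandy Bridge (numbers from the table above).
sizes = {  # name: (Nehalem, Sandy Bridge)
    "load_buffers":  (48, 64),
    "store_buffers": (32, 36),
    "rs_entries":    (36, 54),
    "rob_entries":   (128, 168),
}
growth = {name: (new - old) / old for name, (old, new) in sizes.items()}
print(growth["rob_entries"])   # 0.3125: the ROB-bounded dataflow window grows ~31%
```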
[Diagram: allocate/rename/retire with load, store, and reorder buffers feeding the scheduler; a combined FP/INT vector PRF and a separate integer PRF.]
The PRF has better than linear performance/power
Key enabler for Intel® AVX
Intel® Advanced Vector Extensions
• Vectors are a natural data type for many applications
• Extends the SSE FP instruction set to a 256-bit operand size
– Intel AVX extends all 16 XMM registers to 256 bits (each XMM register is the lower 128 bits of a YMM register)
• New, non-destructive source syntax
– VADDPS ymm1, ymm2, ymm3
• New operations to enhance vectorization
– Broadcasts
– Masked loads & stores
Wider vectors and non-destructive sources specify more work with fewer instructions
Extending the existing state is area and power efficient
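The two new operation classes named above can be modeled behaviorally. This sketch captures only the arithmetic result of a masked load and a broadcast, not the faulting or memory-ordering details of the real instructions:

```python
# Behavioral sketch of two AVX operation classes (arithmetic semantics only).
def masked_load(mem, mask):
    # Masked load: lanes with a clear mask bit are not read and yield zero.
    return [value if bit else 0.0 for value, bit in zip(mem, mask)]

def broadcast(scalar, lanes=8):
    # Broadcast: replicate one scalar into every lane of a 256-bit vector
    # (8 single-precision lanes by default).
    return [scalar] * lanes

print(masked_load([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0]))   # [1.0, 0.0, 3.0, 0.0]
print(broadcast(5.0, 4))                                  # [5.0, 5.0, 5.0, 5.0]
```

Masked loads let a compiler vectorize loops whose trip count is not a multiple of the vector width, which is why they "enhance vectorization."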
Execution Cluster
The scheduler sees a matrix: 3 ports (0, 1, 5) to 3 “stacks” of execution units:
• General-purpose integer
• SIMD (vector) integer
• SIMD floating point
Challenge: double the output of one of these stacks in a manner that is invisible to the others
[Diagram: port 0 (ALU, VI MUL, VI Shuffle, DIV, FP MUL, Blend), port 1 (ALU, VI ADD, VI Shuffle, FP ADD), port 5 (ALU, JMP, FP Shuffle, FP Boolean, Blend).]
Execution Cluster (cont.)
Solution:
• Repurpose existing data paths to dual-use
• SIMD integer and legacy SIMD FP use the legacy stack style
• Intel® AVX utilizes both 128-bit execution stacks
• Double the FLOPs
– 256-bit multiply + 256-bit add + 256-bit load per clock
[Diagram: the 128-bit units on ports 0, 1, and 5 (FP MUL, FP ADD, FP Blend, FP Shuffle, FP Boolean) are paired across the integer and FP stacks to execute full 256-bit AVX operations.]
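The "double the FLOPs" claim above works out as simple arithmetic: a 256-bit register holds eight single-precision floats, and one multiply port plus one add port issue per clock:

```python
# Peak single-precision FLOPs per cycle implied by the slide above (one core).
vector_bits, sp_bits = 256, 32
lanes = vector_bits // sp_bits        # 8 single-precision floats per 256-bit register
flops_per_cycle = 2 * lanes           # one 256-bit MUL plus one 256-bit ADD per clock
print(flops_per_cycle)                # 16: twice the 8 FLOPs/cycle of 128-bit SSE
```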
Memory Cluster
• The memory unit can service two memory requests per cycle
– 16 bytes of load and 16 bytes of store per cycle
• Goal: maintain the historic bytes/FLOP ratio of SSE for Intel® AVX
[Diagram: memory control; one load port and one store-address port plus store data; 32KB 8-way L1 data cache; 32 bytes/cycle to the 256KB L2 cache (MLC) fill buffers.]
Memory Cluster (cont.)
• Solution: dual-use the existing connections
– Make the load/store pipes symmetric
• The memory unit services three data accesses per cycle
– 2 read requests of up to 16 bytes AND 1 store of up to 16 bytes
– An internal sequencer deals with queued requests
• The second load port is one of the highest-performance features
– Required to keep Intel® AVX fed; linear power/performance
[Diagram: memory control; two load/store-address ports plus store data; 32KB 8-way L1 data cache; 48 bytes/cycle to the 256KB L2 cache (MLC) fill buffers.]
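The "maintain the historic bytes/FLOP ratio" goal can be checked numerically. The FLOP counts below assume one 128-bit (SSE) or 256-bit (AVX) multiply plus add per cycle, as on the execution-cluster slides:

```python
# Bytes-per-FLOP check for the memory-cluster goal (illustrative arithmetic).
def bytes_per_flop(load_bytes_per_cycle, flops_per_cycle):
    return load_bytes_per_cycle / flops_per_cycle

sse = bytes_per_flop(16, 8)    # one 16B load/cycle vs. 128-bit MUL+ADD (8 SP FLOPs)
avx = bytes_per_flop(32, 16)   # two 16B loads/cycle vs. 256-bit MUL+ADD (16 SP FLOPs)
print(sse, avx)                # 2.0 2.0 -> the historic ratio is maintained
```

Doubling the FLOPs without the second load port would have halved the ratio and starved AVX code.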
Putting It Together: Sandy Bridge Microarchitecture
[Full-core diagram: branch prediction, 32KB L1 instruction cache, pre-decode, instruction queue, 4 decoders, and the 1.5K-uop cache feed in-order allocate/rename/retire (zeroing idioms; load, store, and reorder buffers). The out-of-order scheduler dispatches to port 0 (ALU, VI MUL, VI Shuffle, DIV, AVX FP Blend, AVX FP MUL), port 1 (ALU, VI ADD, VI Shuffle, AVX FP ADD), port 5 (ALU, JMP, AVX/FP Shuffle, AVX/FP Boolean, AVX FP Blend), and ports 2/3/4 (load, store address, store data). The 32KB L1 data cache transfers 48 bytes/cycle to the L2 cache (MLC) fill buffers.]
Other Architectural Extensions
• Cryptography instruction throughput enhancements
– Increased throughput for AES instructions
• Arithmetic throughput enhancements
– ADC (add with carry) throughput doubled
– Multiply (64-bit multiplicands with a 128-bit product)
  ~25% speedup on existing RSA binaries
• State save/restore enhancements
– New state added in Intel® AVX
– HW monitors the features used by applications
  Only saves/restores state that is used
System Agent, Ring Architecture and Other Innovations in 2nd Generation Intel® Core™ Microarchitecture (formerly codenamed Sandy Bridge)
[Die diagram: four Core+LLC slices, Graphics, and the System Agent (IMC, display, DMI, PCI Express) with 2ch DDR3, x16 PCIe, a PECI interface to an embedded controller, a notebook DP port, and DMI to the 2011 PCH.]
Integration: Optimization Opportunities
• Dynamically redistribute power between cores & graphics
• Tight power-management control of all components, providing better granularity and deeper idle/sleep states
• Three separate power/frequency domains: System Agent (fixed), cores+ring, graphics (variable)
• High-BW Last Level Cache, shared among cores and graphics
– Significant performance boost; saves memory bandwidth and power
• Integrated memory controller and PCI Express ports
– Tightly integrated with the core/graphics/LLC domain
– Provides low latency & low power; removes intermediate busses
• Bandwidth is balanced across the whole machine, from cores/graphics all the way to the memory controller
• Modular uArch for optimal cost/power/performance
– Derivative products done with minimal effort/time
Scalable Ring On-die Interconnect
• Ring-based interconnect between the cores, graphics, Last Level Cache (LLC) and System Agent domain
• Composed of 4 rings
– 32-byte data ring, request ring, acknowledge ring and snoop ring
– Fully pipelined at core frequency/voltage: bandwidth, latency and power scale with the number of cores
• The massive ring wire routing runs over the LLC with no area impact
• An access on the ring always picks the shortest path to minimize latency
• Distributed arbitration; the ring protocol handles coherency, ordering, and the core interface
• Scalable to servers with a large number of processors
High bandwidth, low latency, modular
Cache Box
• Interface block
– Between core/graphics/media and the ring
– Between the cache controller and the ring
– Implements the ring logic, arbitration, and the cache controller
– Communicates with the System Agent for LLC misses, external snoops, and non-cacheable accesses
• Full cache pipeline in each cache box
– Physical addresses are hashed at the source to prevent hot spots and increase bandwidth
– Maintains coherency and ordering for the addresses that are mapped to it
– The LLC is fully inclusive, with “Core Valid Bits” eliminating unnecessary snoops to the cores
• Runs at core voltage/frequency, scales with the number of cores
Distributed coherency & ordering; scalable bandwidth, latency & power
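Two of the cache-box mechanisms above can be sketched behaviorally. The hash function and bit layout here are illustrative choices, not Intel's actual implementation:

```python
# Sketch of two cache-box mechanisms: slice selection by address hashing,
# and snoop filtering via Core Valid Bits. Bit choices are illustrative.
NUM_SLICES = 4        # one LLC slice (cache box) per core, as in the 4-core die
LINE_BYTES = 64

def llc_slice(phys_addr):
    # Hash the physical address at the source so consecutive cache lines
    # spread across slices, preventing hot spots on one cache box.
    return (phys_addr // LINE_BYTES) % NUM_SLICES

def cores_to_snoop(core_valid_bits, num_cores=4):
    # Inclusive LLC: a clear Core Valid Bit proves that core cannot hold
    # the line, so the snoop to it is skipped entirely.
    return [c for c in range(num_cores) if core_valid_bits & (1 << c)]

print(llc_slice(0x40), llc_slice(0x80))   # adjacent lines land on slices 1 and 2
print(cores_to_snoop(0b0101))             # [0, 2]: only cores 0 and 2 are snooped
```

A real design uses a stronger hash than a simple modulo so that strided access patterns also spread evenly.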
LLC Sharing
• The LLC is shared among all cores, graphics and media
– The graphics driver controls which streams are cached/coherent
– Any agent can access all data in the LLC, independent of who allocated the line, after memory range checks
• A controlled LLC way-allocation mechanism prevents thrashing between the cores and graphics
• Multiple coherency domains
– IA domain (fully coherent via cross-snoops)
– Graphics domain (graphics virtual caches, flushed to the IA domain by the graphics engine)
– Non-coherent domain (display data, flushed to memory by the graphics engine)
Much higher graphics performance, DRAM power savings, and more DRAM BW available for the cores
System Agent
• Contains the PCI Express, DMI, memory controller and display engine
• Contains the Power Control Unit
– A programmable uController that handles all power management and reset functions in the chip
• Smart integration with the ring
– Provides the cores/graphics/media with high-BW, low-latency access to DRAM/IO for best performance
– Handles IO-to-cache coherency
• Separate voltage and frequency from the ring/cores; display integration for better battery life
• Extensive power and thermal management for PCI Express and DDR
Smart I/O integration
Hyper-Threading Technology
Thread-Level Parallelism
· Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
· Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
· Thread-level parallelism (TLP) – the threads can come from
– the same application
– different applications running simultaneously
– operating system services
· Increasing single-thread performance becomes harder
– and is less and less power efficient
· Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
Multi-Threading
· Multi-threading: a single processor executes multiple threads
· Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
· Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses
· A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mispredictions and long dependencies
· Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously, without switching
– Makes the most effective use of processor resources
  Maximizes performance vs. transistor count and power
Hyper-Threading (HT) Technology
· HT is SMT
– Makes a single physical processor appear as 2 logical processors (threads)
· Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
· Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
· The OS views logical processors (threads) as physical processors
– It schedules threads to logical processors as in a multiprocessor system
· From a micro-architecture perspective
– The threads share a single set of physical resources:
  caches, execution units, branch predictors, control logic, and buses
Two Important Goals
· When one thread is stalled, the other thread can continue to make progress
– Independent progress is ensured by either
  Partitioning buffering queues and limiting the number of entries each thread can use, or
  Duplicating buffering queues
· A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active
Front End
· Each thread manages its own next-instruction pointer
· The threads arbitrate for uop cache access every cycle (ping-pong)
– If both want to access the uop cache, access is granted in alternating cycles
– If one thread is stalled, the other thread gets the full uop cache bandwidth
· Uop cache entries are tagged with a thread ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
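The per-cycle ping-pong arbitration described above can be sketched as a small function: alternate when both threads are ready, and give a lone ready thread every cycle. The interface is an illustrative simplification:

```python
# Sketch of per-cycle ping-pong arbitration between two HT threads
# for uop cache access (interface simplified for illustration).
def arbitrate(ready, last_winner):
    """ready: (bool, bool) per thread; returns the winning thread id or None."""
    if ready[0] and ready[1]:
        return 1 - last_winner        # both want access: alternate cycles
    if ready[0]:
        return 0                      # thread 1 stalled: thread 0 gets full bandwidth
    if ready[1]:
        return 1
    return None                       # neither thread fetches this cycle

winner, schedule = 0, []
for _ in range(4):                    # both threads ready for four cycles
    winner = arbitrate((True, True), winner)
    schedule.append(winner)
print(schedule)                       # [1, 0, 1, 0]: strict alternation
```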
Front End (cont.)
· Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked per thread
– The large global history array is shared
  Its entries are tagged with a logical processor ID
· Each thread has its own ITLB
· Both threads share the same decoder logic
– If only one thread needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoders is duplicated
· The uop queue is hard-partitioned
– Allows both logical processors to make independent forward progress regardless of FE stalls (e.g., a uop cache miss) or EXE stalls
Out-of-Order Execution
· The ROB and MOB are hard-partitioned
– Enforces fairness and prevents deadlocks
· The allocator ping-pongs between the threads
– A thread is selected for allocation if
  its uop queue is not empty,
  its buffers (ROB, RS) are not full, and
  it is the thread’s turn, or the other thread cannot be selected
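The allocator's selection rule above can be written out directly. The field names are illustrative, but the logic follows the three conditions as stated:

```python
# Sketch of the allocator's thread-selection rule (field names illustrative).
def eligible(t):
    # A thread can allocate only if it has uops and free buffer entries.
    return t["uops_queued"] > 0 and not t["rob_full"] and not t["rs_full"]

def select(threads, turn):
    """threads: dict of per-thread state; turn: thread id (0 or 1) whose turn it is."""
    other = 1 - turn
    if eligible(threads[turn]):
        return turn                   # it is this thread's turn and it can proceed
    if eligible(threads[other]):
        return other                  # turn-holder blocked: the other thread goes
    return None                       # neither thread can allocate this cycle

t0 = {"uops_queued": 4, "rob_full": False, "rs_full": False}
t1 = {"uops_queued": 0, "rob_full": False, "rs_full": False}   # empty uop queue
print(select({0: t0, 1: t1}, turn=1))   # 0: thread 1 has nothing to allocate
```

The fallback to the other thread is what lets one thread use the full allocation bandwidth when its sibling is stalled, matching the "two important goals" slide.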
Out-of-Order Execution (cont.)
· Registers are renamed to a shared physical register pool
– Physical registers store results until retirement
· After allocation and renaming, uops are placed in one of 2 queues
– A memory instruction queue and a general instruction queue
  The two queues are hard-partitioned
– Uops are read from the queues and sent to the scheduler using ping-pong
· The schedulers are oblivious to threads
– They schedule uops based on dependencies and execution-resource availability, regardless of thread
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness, the number of active entries a thread can have in each scheduler’s queue is limited
· The forwarding logic compares physical register numbers
– Forwards results to other uops without knowledge of threads
Out-of-Order Execution (cont.)
· Memory is largely thread-oblivious
– The L1 data cache, L2 cache and L3 cache are thread-oblivious
  All use physical addresses
– The DTLB is shared
  Each DTLB entry includes a thread ID as part of the tag
· Retirement ping-pongs between the threads
– If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread
Single-Task and Multi-Task Modes
· MT-mode (multi-task mode)
– Two active threads, with some resources partitioned as described earlier
· ST-mode (single-task mode)
– There are two flavors of ST-mode
  single-task thread 0 (ST0) – only thread 0 is active
  single-task thread 1 (ST1) – only thread 1 is active
– Resources that were partitioned in MT-mode are recombined to give the single active logical processor use of all of the resources
· Moving the processor between modes
[State diagram: from MT, a HALT by one thread moves to the ST-mode of the remaining thread (ST0 or ST1); a HALT by that remaining thread enters a low-power state; an interrupt wakes the processor back into ST-mode, and waking the second thread returns it to MT.]
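The mode transitions above can be modeled as a small state machine. The event names are illustrative labels for "thread N executes HALT" and "interrupt targeting thread N":

```python
# Sketch of the ST/MT mode transitions (event names illustrative).
def next_mode(mode, event):
    transitions = {
        ("MT", "halt0"): "ST1",              # thread 0 HALTs: only thread 1 active
        ("MT", "halt1"): "ST0",
        ("ST0", "halt0"): "LOW_POWER",       # the last active thread HALTs
        ("ST1", "halt1"): "LOW_POWER",
        ("LOW_POWER", "interrupt0"): "ST0",  # an interrupt wakes one thread
        ("LOW_POWER", "interrupt1"): "ST1",
        ("ST0", "interrupt1"): "MT",         # the second thread wakes: back to MT
        ("ST1", "interrupt0"): "MT",
    }
    return transitions.get((mode, event), mode)   # other events: stay put

mode = "MT"
for event in ["halt1", "halt0", "interrupt0"]:
    mode = next_mode(mode, event)
print(mode)   # ST0: both threads halted, then thread 0 was woken by an interrupt
```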
Operating System and Applications
· An HT processor appears to the OS and application SW as 2 processors
– The OS manages logical processors as it does physical processors
· The OS should implement two optimizations:
· Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that could otherwise be used by the other active logical processor
· On a multi-processor system
– Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible