Computer Architecture 2011 – Advanced Topics 1
Computer Architecture
Advanced Topics
Performance per Watt
· Mobile: smaller form factors decrease the power budget
– Power generates heat, which must be dissipated to keep transistors within the allowed temperature
– This limits the processor’s peak power consumption
· Change the target
– Old target: get maximum performance
– New target: get maximum performance within a given power envelope
· Performance via frequency increase
– Power = C·V²·f, but increasing f also requires increasing V
– X% more performance costs ~3X% more power (assuming performance is linear with frequency)
· A power-efficient feature must beat a 1:3 performance-to-power ratio
– Otherwise it is better to just increase frequency (and voltage)
– U-arch performance features should be power efficient
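The 1:3 rule above can be checked with a toy model. It assumes, purely for illustration, that voltage must scale linearly with frequency, so P = C·V²·f grows as f³:

```python
# Toy model of P = C * V^2 * f with the illustrative assumption that
# voltage scales linearly with frequency (so P grows roughly as f^3).
def power(freq_ghz, volts_per_ghz=1.0, capacitance=1.0):
    volts = volts_per_ghz * freq_ghz
    return capacitance * volts ** 2 * freq_ghz   # P = C * V^2 * f

base = power(1.00)
faster = power(1.10)                  # +10% frequency (= +10% performance here)
print(f"{faster / base - 1:+.1%}")    # ~ +33% power: X% performance costs ~3X% power
```

The cube law only approximates "3X%" for small X; at +10% frequency the exact figure is 1.1³ − 1 ≈ 33.1%.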
Higher Performance vs. Longer Battery Life
· Processor average power is <10% of the platform
– The processor reduces power in periods of low processor activity
– The processor enters lower power states in idle periods
· Average power includes low-activity periods and idle time
– Typical: 1W – 3W
· Max power is limited by heat dissipation
– Typical: 20W – 100W
· Decision
– Optimize for performance when active
– Optimize for battery life when idle
Typical notebook platform power breakdown:
– Display (panel + inverter): 33%
– CPU: 10%
– Power supply: 10%
– Intel® MCH: 9%
– Misc.: 8%
– GFX: 8%
– HDD: 8%
– CLK: 5%
– Intel® ICH: 3%
– DVD: 2%
– LAN: 2%
– Fan: 2%
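Since the CPU is roughly 10% of platform power in the breakdown above, a rough battery-life estimate follows directly. The battery capacity below is a hypothetical number chosen for illustration:

```python
# Illustrative battery-life estimate from the platform breakdown above.
# Assumptions (hypothetical): a 53 Wh battery; CPU ~10% of platform power.
def battery_hours(battery_wh, cpu_avg_w, cpu_share=0.10):
    platform_w = cpu_avg_w / cpu_share    # scale CPU power up to the whole platform
    return battery_wh / platform_w

print(battery_hours(53, 2.0))   # 2 W average CPU -> ~20 W platform -> ~2.65 hours
```

This is why the slide optimizes for battery life at idle: average power, not peak power, sets runtime.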
Leakage Power
· The power consumed by a processor consists of
– Active power: used to switch transistors
– Leakage power: leakage of transistors under voltage
· Leakage power is a function of
– The number of transistors and their size
– Operating voltage
– Die temperature
· Leakage power reduction
– The LLC (Last Level Cache) is built with low-leakage transistors (2/3 of the die transistors)
  Low-leakage transistors are slower, increasing cache access latency
  The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology reduces voltage and frequency on low processor activity
Enhanced SpeedStep® Technology
· The “Basic” SpeedStep® technology had
– 2 operating points
– Non-transparent switches
· The “Enhanced” version provides
– Multiple voltage/frequency operating points
  For example, the 1.6GHz Pentium M processor operates
  from 600MHz @ 0.956V to 1.6GHz @ 1.484V
– Transparent switches
– Frequent switches
· Benefits
– Higher power efficiency
  2.7X lower frequency → ~2X performance loss but >2X energy gain
– Outstanding battery life
[Chart: Voltage, Frequency, Power. Frequency (GHz) and typical power (Watts) plotted against voltage (0.8V to 1.6V). Across the operating range, frequency varies 2.7X while power varies 6.1X, for an efficiency ratio of 2.3.]
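The efficiency ratio annotated on the chart can be reproduced from its two endpoints, assuming performance tracks frequency so that the time per task stretches by the frequency drop:

```python
# Energy efficiency from the chart annotations above (values read off the figure).
freq_drop, power_drop = 2.7, 6.1     # max vs. min operating point
# Energy per task = power * time: time grows by freq_drop, power falls by power_drop,
# so energy per task improves by power_drop / freq_drop.
efficiency_ratio = power_drop / freq_drop
print(round(efficiency_ratio, 1))    # 2.3
```

This is why lowering voltage and frequency together beats simply idling at full voltage: energy per unit of work, not just power, improves.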
2nd Generation Intel® Core™: Sandy Bridge
2nd Gen Intel® Core™ Microarchitecture: Overview
· Integrated memory controller: 2ch DDR3
· High-bandwidth Last Level Cache
· Next-generation processor graphics and media
· Next-generation Intel® Turbo Boost Technology
· Intel® Hyper-Threading Technology: 4 cores / 8 threads or 2 cores / 4 threads
· Integrates CPU, graphics, memory controller, and PCI Express* on a single chip
· Embedded DisplayPort
· Substantial performance improvement
· Intel® Advanced Vector Extensions (Intel® AVX)
· High-bandwidth, low-latency modular core/graphics interconnect
· Discrete graphics support: 1x16 or 2x8
[Die diagram: four Core+LLC slices, Graphics, and the System Agent (IMC, display, DMI, PCI Express*) on a single chip, with 2ch DDR3, x16 PCIe, a PECI interface to an embedded controller, a notebook DP port, and the PCH.]
Core Block Diagram
[Block diagram: an in-order front end (branch prediction, 32KB L1 instruction cache, pre-decode, instruction queue, 4 decoders, 1.5K-uop decoded-uop cache) feeds in-order allocate/rename/retire (zeroing idioms; load, store, and reorder buffers), then an out-of-order scheduler with six execution ports: port 0 (ALU, SIMD INT MUL, DIV, FP MUL), port 1 (ALU, SIMD INT ALU, FP ADD), port 5 (ALU, branch, FP shuffle), ports 2/3 (load / store address), port 4 (store data). The data cache unit holds a 32KB L1 data cache with 48 bytes/cycle to the L2 cache (MLC) fill buffers.]
Front End
Instruction fetch and decode:
• 32KB 8-way associative ICache
• 4 decoders, up to 4 instructions / cycle
• Micro-fusion
– Bundles multiple instruction events into a single uop
• Macro-fusion
– Fuses instruction pairs into a complex uop
• Decode pipeline supports 16 bytes per cycle
[Diagram: branch prediction unit, 32KB L1 I-cache, pre-decode, instruction queue, 4 decoders.]
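Macro-fusion as described above can be sketched behaviorally: a compare-like instruction followed by a conditional jump is emitted as one fused uop. The mnemonics and fusion rule here are a simplified illustration, not the exact hardware pairing rules:

```python
# Behavioral sketch of macro-fusion: CMP/TEST + conditional jump -> one fused uop.
# The fusable set and pairing rule are simplified for illustration.
FUSABLE_FIRST = {"cmp", "test"}

def fuse(instrs):
    uops, i = [], 0
    while i < len(instrs):
        nxt = instrs[i + 1] if i + 1 < len(instrs) else ""
        if instrs[i] in FUSABLE_FIRST and nxt.startswith("j"):
            uops.append(instrs[i] + "+" + nxt)   # emit a single complex uop
            i += 2
        else:
            uops.append(instrs[i])
            i += 1
    return uops

print(fuse(["add", "cmp", "jne", "mov"]))   # ['add', 'cmp+jne', 'mov']
```

Four instructions become three uops, which is how fusion raises effective decode and retirement throughput without widening the machine.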
Decoded Uop Cache
• An instruction cache that holds uops instead of instruction bytes (~1.5K uops)
– ~80% hit rate for most applications
• Higher instruction bandwidth and lower latency
– The decoded uop cache can deliver 32 bytes / cycle
– More cycles sustaining 4 instructions/cycle
– Able to ‘stitch’ across taken branches in the control flow
[Diagram: the decoded uop cache sits beside the legacy pipeline (branch prediction unit, 32KB L1 I-cache, pre-decode, instruction queue, decoders).]
Branch Prediction Unit
New branch predictor:
• Twice as many targets
• Much more effective storage for history
• Much longer history for data-dependent behaviors
Save Power while Increasing Performance
• The decoded uop cache lets the normal front end sleep
– Decode once instead of many times
• Branch mispredictions are reduced substantially
– The correct path is also the most efficient path
“Out of Order” Part of the Machine
• Receives uops from the front end
• Sends them to execution units when they are ready
• Retires them in program order
• Increases performance by finding more instruction-level parallelism
– Increasing the depth and width of the machine implies larger buffers
– More data storage, more data movement, more power
Sandy Bridge Out-of-Order (OOO) Cluster
• Method: a Physical Register File (PRF) instead of a centralized retirement register file
– A single copy of every datum
– No movement after calculation
• Allows a significant increase in buffer sizes
– Dataflow window ~33% larger
                        Nehalem   Sandy Bridge
Load buffers               48          64
Store buffers              32          36
RS (scheduler) entries     36          54
PRF integer               N/A         160
PRF floating-point        N/A         144
ROB entries               128         168
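The growth figures in the table above can be checked with a line of arithmetic; the ROB growth is what bounds the "~33% larger dataflow window" claim:

```python
# Buffer growth from Nehalem to Sandy Bridge (numbers from the table above).
sizes = {  # name: (Nehalem, Sandy Bridge)
    "load_buffers":  (48, 64),
    "store_buffers": (32, 36),
    "rs_entries":    (36, 54),
    "rob_entries":   (128, 168),
}
growth = {name: (new - old) / old for name, (old, new) in sizes.items()}
print(growth["rob_entries"])   # 0.3125: the ROB-bounded dataflow window grows ~31%
```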
[Diagram: allocate/rename/retire with load, store, and reorder buffers feeding the scheduler; a combined FP/INT vector PRF and a separate integer PRF.]
The PRF has better than linear performance/power
Key enabler for Intel® AVX
Intel® Advanced Vector Extensions
• Vectors are a natural data type for many applications
• Extends the SSE FP instruction set to a 256-bit operand size
– Intel AVX extends all 16 XMM registers to 256 bits (each XMM register is the lower 128 bits of a YMM register)
• New, non-destructive source syntax
– VADDPS ymm1, ymm2, ymm3
• New operations to enhance vectorization
– Broadcasts
– Masked loads & stores
Wider vectors and non-destructive sources specify more work with fewer instructions
Extending the existing state is area and power efficient
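The two new operation classes named above can be modeled behaviorally. This sketch captures only the arithmetic result of a masked load and a broadcast, not the faulting or memory-ordering details of the real instructions:

```python
# Behavioral sketch of two AVX operation classes (arithmetic semantics only).
def masked_load(mem, mask):
    # Masked load: lanes with a clear mask bit are not read and yield zero.
    return [value if bit else 0.0 for value, bit in zip(mem, mask)]

def broadcast(scalar, lanes=8):
    # Broadcast: replicate one scalar into every lane of a 256-bit vector
    # (8 single-precision lanes by default).
    return [scalar] * lanes

print(masked_load([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0]))   # [1.0, 0.0, 3.0, 0.0]
print(broadcast(5.0, 4))                                  # [5.0, 5.0, 5.0, 5.0]
```

Masked loads let a compiler vectorize loops whose trip count is not a multiple of the vector width, which is why they "enhance vectorization."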
Execution Cluster
The scheduler sees a matrix: 3 ports (0, 1, 5) to 3 “stacks” of execution units:
• General-purpose integer
• SIMD (vector) integer
• SIMD floating point
Challenge: double the output of one of these stacks in a manner that is invisible to the others
[Diagram: port 0 (ALU, VI MUL, VI Shuffle, DIV, FP MUL, Blend), port 1 (ALU, VI ADD, VI Shuffle, FP ADD), port 5 (ALU, JMP, FP Shuffle, FP Boolean, Blend).]
Execution Cluster (cont.)
Solution:
• Repurpose existing data paths to dual-use
• SIMD integer and legacy SIMD FP use the legacy stack style
• Intel® AVX utilizes both 128-bit execution stacks
• Double the FLOPs
– 256-bit multiply + 256-bit add + 256-bit load per clock
[Diagram: the 128-bit units on ports 0, 1, and 5 (FP MUL, FP ADD, FP Blend, FP Shuffle, FP Boolean) are paired across the integer and FP stacks to execute full 256-bit AVX operations.]
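The "double the FLOPs" claim above works out as simple arithmetic: a 256-bit register holds eight single-precision floats, and one multiply port plus one add port issue per clock:

```python
# Peak single-precision FLOPs per cycle implied by the slide above (one core).
vector_bits, sp_bits = 256, 32
lanes = vector_bits // sp_bits        # 8 single-precision floats per 256-bit register
flops_per_cycle = 2 * lanes           # one 256-bit MUL plus one 256-bit ADD per clock
print(flops_per_cycle)                # 16: twice the 8 FLOPs/cycle of 128-bit SSE
```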
Memory Cluster
• The memory unit can service two memory requests per cycle
– 16 bytes of load and 16 bytes of store per cycle
• Goal: maintain the historic bytes/FLOP ratio of SSE for Intel® AVX
[Diagram: memory control; one load port and one store-address port plus store data; 32KB 8-way L1 data cache; 32 bytes/cycle to the 256KB L2 cache (MLC) fill buffers.]
Memory Cluster (cont.)
• Solution: dual-use the existing connections
– Make the load/store pipes symmetric
• The memory unit services three data accesses per cycle
– 2 read requests of up to 16 bytes AND 1 store of up to 16 bytes
– An internal sequencer deals with queued requests
• The second load port is one of the highest-performance features
– Required to keep Intel® AVX fed; linear power/performance
[Diagram: memory control; two load/store-address ports plus store data; 32KB 8-way L1 data cache; 48 bytes/cycle to the 256KB L2 cache (MLC) fill buffers.]
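The "maintain the historic bytes/FLOP ratio" goal can be checked numerically. The FLOP counts below assume one 128-bit (SSE) or 256-bit (AVX) multiply plus add per cycle, as on the execution-cluster slides:

```python
# Bytes-per-FLOP check for the memory-cluster goal (illustrative arithmetic).
def bytes_per_flop(load_bytes_per_cycle, flops_per_cycle):
    return load_bytes_per_cycle / flops_per_cycle

sse = bytes_per_flop(16, 8)    # one 16B load/cycle vs. 128-bit MUL+ADD (8 SP FLOPs)
avx = bytes_per_flop(32, 16)   # two 16B loads/cycle vs. 256-bit MUL+ADD (16 SP FLOPs)
print(sse, avx)                # 2.0 2.0 -> the historic ratio is maintained
```

Doubling the FLOPs without the second load port would have halved the ratio and starved AVX code.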
Putting It Together: Sandy Bridge Microarchitecture
[Full-core diagram: branch prediction, 32KB L1 instruction cache, pre-decode, instruction queue, 4 decoders, and the 1.5K-uop cache feed in-order allocate/rename/retire (zeroing idioms; load, store, and reorder buffers). The out-of-order scheduler dispatches to port 0 (ALU, VI MUL, VI Shuffle, DIV, AVX FP Blend, AVX FP MUL), port 1 (ALU, VI ADD, VI Shuffle, AVX FP ADD), port 5 (ALU, JMP, AVX/FP Shuffle, AVX/FP Boolean, AVX FP Blend), and ports 2/3/4 (load, store address, store data). The 32KB L1 data cache transfers 48 bytes/cycle to the L2 cache (MLC) fill buffers.]
Other Architectural Extensions
• Cryptography instruction throughput enhancements
– Increased throughput for AES instructions
• Arithmetic throughput enhancements
– ADC (add with carry) throughput doubled
– Multiply (64-bit multiplicands with a 128-bit product)
  ~25% speedup on existing RSA binaries
• State save/restore enhancements
– New state added in Intel® AVX
– HW monitors the features used by applications
  Only saves/restores state that is used
System Agent, Ring Architecture and Other Innovations in 2nd Generation Intel® Core™ Microarchitecture (formerly codenamed Sandy Bridge)
[Die diagram: four Core+LLC slices, Graphics, and the System Agent (IMC, display, DMI, PCI Express) with 2ch DDR3, x16 PCIe, a PECI interface to an embedded controller, a notebook DP port, and DMI to the 2011 PCH.]
Integration: Optimization Opportunities
• Dynamically redistribute power between cores & graphics
• Tight power-management control of all components, providing better granularity and deeper idle/sleep states
• Three separate power/frequency domains: System Agent (fixed), cores+ring, graphics (variable)
• High-BW Last Level Cache, shared among cores and graphics
– Significant performance boost; saves memory bandwidth and power
• Integrated memory controller and PCI Express ports
– Tightly integrated with the core/graphics/LLC domain
– Provides low latency & low power; removes intermediate busses
• Bandwidth is balanced across the whole machine, from cores/graphics all the way to the memory controller
• Modular uArch for optimal cost/power/performance
– Derivative products done with minimal effort/time
Scalable Ring On-die Interconnect
• Ring-based interconnect between the cores, graphics, Last Level Cache (LLC) and System Agent domain
• Composed of 4 rings
– 32-byte data ring, request ring, acknowledge ring and snoop ring
– Fully pipelined at core frequency/voltage: bandwidth, latency and power scale with the number of cores
• The massive ring wire routing runs over the LLC with no area impact
• An access on the ring always picks the shortest path to minimize latency
• Distributed arbitration; the ring protocol handles coherency, ordering, and the core interface
• Scalable to servers with a large number of processors
High bandwidth, low latency, modular
Cache Box
• Interface block
– Between core/graphics/media and the ring
– Between the cache controller and the ring
– Implements the ring logic, arbitration, and the cache controller
– Communicates with the System Agent for LLC misses, external snoops, and non-cacheable accesses
• Full cache pipeline in each cache box
– Physical addresses are hashed at the source to prevent hot spots and increase bandwidth
– Maintains coherency and ordering for the addresses that are mapped to it
– The LLC is fully inclusive, with “Core Valid Bits” eliminating unnecessary snoops to the cores
• Runs at core voltage/frequency, scales with the number of cores
Distributed coherency & ordering; scalable bandwidth, latency & power
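Two of the cache-box mechanisms above can be sketched behaviorally. The hash function and bit layout here are illustrative choices, not Intel's actual implementation:

```python
# Sketch of two cache-box mechanisms: slice selection by address hashing,
# and snoop filtering via Core Valid Bits. Bit choices are illustrative.
NUM_SLICES = 4        # one LLC slice (cache box) per core, as in the 4-core die
LINE_BYTES = 64

def llc_slice(phys_addr):
    # Hash the physical address at the source so consecutive cache lines
    # spread across slices, preventing hot spots on one cache box.
    return (phys_addr // LINE_BYTES) % NUM_SLICES

def cores_to_snoop(core_valid_bits, num_cores=4):
    # Inclusive LLC: a clear Core Valid Bit proves that core cannot hold
    # the line, so the snoop to it is skipped entirely.
    return [c for c in range(num_cores) if core_valid_bits & (1 << c)]

print(llc_slice(0x40), llc_slice(0x80))   # adjacent lines land on slices 1 and 2
print(cores_to_snoop(0b0101))             # [0, 2]: only cores 0 and 2 are snooped
```

A real design uses a stronger hash than a simple modulo so that strided access patterns also spread evenly.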
LLC Sharing
• The LLC is shared among all cores, graphics and media
– The graphics driver controls which streams are cached/coherent
– Any agent can access all data in the LLC, independent of who allocated the line, after memory range checks
• A controlled LLC way-allocation mechanism prevents thrashing between the cores and graphics
• Multiple coherency domains
– IA domain (fully coherent via cross-snoops)
– Graphics domain (graphics virtual caches, flushed to the IA domain by the graphics engine)
– Non-coherent domain (display data, flushed to memory by the graphics engine)
Much higher graphics performance, DRAM power savings, and more DRAM BW available for the cores
System Agent
• Contains the PCI Express, DMI, memory controller and display engine
• Contains the Power Control Unit
– A programmable uController that handles all power management and reset functions in the chip
• Smart integration with the ring
– Provides the cores/graphics/media with high-BW, low-latency access to DRAM/IO for best performance
– Handles IO-to-cache coherency
• Separate voltage and frequency from the ring/cores; display integration for better battery life
• Extensive power and thermal management for PCI Express and DDR
Smart I/O integration
Hyper-Threading Technology
Thread-Level Parallelism
· Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
· Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
· Thread-level parallelism (TLP) – the threads can come from
– the same application
– different applications running simultaneously
– operating system services
· Increasing single-thread performance becomes harder
– and is less and less power efficient
· Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die
Multi-Threading
· Multi-threading: a single processor executes multiple threads
· Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
· Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses
· A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mispredictions and long dependencies
· Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously, without switching
– Makes the most effective use of processor resources
  Maximizes performance vs. transistor count and power
Hyper-Threading (HT) Technology
· HT is SMT
– Makes a single physical processor appear as 2 logical processors (threads)
· Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
· Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
· The OS views logical processors (threads) as physical processors
– It schedules threads to logical processors as in a multiprocessor system
· From a micro-architecture perspective
– The threads share a single set of physical resources:
  caches, execution units, branch predictors, control logic, and buses
Two Important Goals
· When one thread is stalled, the other thread can continue to make progress
– Independent progress is ensured by either
  Partitioning buffering queues and limiting the number of entries each thread can use, or
  Duplicating buffering queues
· A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active
Front End
· Each thread manages its own next-instruction pointer
· The threads arbitrate for uop cache access every cycle (ping-pong)
– If both want to access the uop cache, access is granted in alternating cycles
– If one thread is stalled, the other thread gets the full uop cache bandwidth
· Uop cache entries are tagged with a thread ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
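The per-cycle ping-pong arbitration described above can be sketched as a small function: alternate when both threads are ready, and give a lone ready thread every cycle. The interface is an illustrative simplification:

```python
# Sketch of per-cycle ping-pong arbitration between two HT threads
# for uop cache access (interface simplified for illustration).
def arbitrate(ready, last_winner):
    """ready: (bool, bool) per thread; returns the winning thread id or None."""
    if ready[0] and ready[1]:
        return 1 - last_winner        # both want access: alternate cycles
    if ready[0]:
        return 0                      # thread 1 stalled: thread 0 gets full bandwidth
    if ready[1]:
        return 1
    return None                       # neither thread fetches this cycle

winner, schedule = 0, []
for _ in range(4):                    # both threads ready for four cycles
    winner = arbitrate((True, True), winner)
    schedule.append(winner)
print(schedule)                       # [1, 0, 1, 0]: strict alternation
```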
Front End (cont.)
· Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked per thread
– The large global history array is shared
  Its entries are tagged with a logical processor ID
· Each thread has its own ITLB
· Both threads share the same decoder logic
– If only one thread needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoders is duplicated
· The uop queue is hard-partitioned
– Allows both logical processors to make independent forward progress regardless of FE stalls (e.g., a uop cache miss) or EXE stalls
Out-of-Order Execution
· The ROB and MOB are hard-partitioned
– Enforces fairness and prevents deadlocks
· The allocator ping-pongs between the threads
– A thread is selected for allocation if
  its uop queue is not empty,
  its buffers (ROB, RS) are not full, and
  it is the thread’s turn, or the other thread cannot be selected
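The allocator's selection rule above can be written out directly. The field names are illustrative, but the logic follows the three conditions as stated:

```python
# Sketch of the allocator's thread-selection rule (field names illustrative).
def eligible(t):
    # A thread can allocate only if it has uops and free buffer entries.
    return t["uops_queued"] > 0 and not t["rob_full"] and not t["rs_full"]

def select(threads, turn):
    """threads: dict of per-thread state; turn: thread id (0 or 1) whose turn it is."""
    other = 1 - turn
    if eligible(threads[turn]):
        return turn                   # it is this thread's turn and it can proceed
    if eligible(threads[other]):
        return other                  # turn-holder blocked: the other thread goes
    return None                       # neither thread can allocate this cycle

t0 = {"uops_queued": 4, "rob_full": False, "rs_full": False}
t1 = {"uops_queued": 0, "rob_full": False, "rs_full": False}   # empty uop queue
print(select({0: t0, 1: t1}, turn=1))   # 0: thread 1 has nothing to allocate
```

The fallback to the other thread is what lets one thread use the full allocation bandwidth when its sibling is stalled, matching the "two important goals" slide.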
Out-of-Order Execution (cont.)
· Registers are renamed to a shared physical register pool
– Physical registers store results until retirement
· After allocation and renaming, uops are placed in one of 2 queues
– A memory instruction queue and a general instruction queue
  The two queues are hard-partitioned
– Uops are read from the queues and sent to the scheduler using ping-pong
· The schedulers are oblivious to threads
– They schedule uops based on dependencies and execution-resource availability, regardless of thread
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness, the number of active entries a thread can have in each scheduler’s queue is limited
· The forwarding logic compares physical register numbers
– Forwards results to other uops without knowledge of threads
Out-of-Order Execution (cont.)
· Memory is largely thread-oblivious
– The L1 data cache, L2 cache and L3 cache are thread-oblivious
  All use physical addresses
– The DTLB is shared
  Each DTLB entry includes a thread ID as part of the tag
· Retirement ping-pongs between the threads
– If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread
Single-Task and Multi-Task Modes
· MT-mode (multi-task mode)
– Two active threads, with some resources partitioned as described earlier
· ST-mode (single-task mode)
– There are two flavors of ST-mode
  single-task thread 0 (ST0) – only thread 0 is active
  single-task thread 1 (ST1) – only thread 1 is active
– Resources that were partitioned in MT-mode are recombined to give the single active logical processor use of all of the resources
· Moving the processor between modes
[State diagram: from MT, a HALT by one thread moves to the ST-mode of the remaining thread (ST0 or ST1); a HALT by that remaining thread enters a low-power state; an interrupt wakes the processor back into ST-mode, and waking the second thread returns it to MT.]
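The mode transitions above can be modeled as a small state machine. The event names are illustrative labels for "thread N executes HALT" and "interrupt targeting thread N":

```python
# Sketch of the ST/MT mode transitions (event names illustrative).
def next_mode(mode, event):
    transitions = {
        ("MT", "halt0"): "ST1",              # thread 0 HALTs: only thread 1 active
        ("MT", "halt1"): "ST0",
        ("ST0", "halt0"): "LOW_POWER",       # the last active thread HALTs
        ("ST1", "halt1"): "LOW_POWER",
        ("LOW_POWER", "interrupt0"): "ST0",  # an interrupt wakes one thread
        ("LOW_POWER", "interrupt1"): "ST1",
        ("ST0", "interrupt1"): "MT",         # the second thread wakes: back to MT
        ("ST1", "interrupt0"): "MT",
    }
    return transitions.get((mode, event), mode)   # other events: stay put

mode = "MT"
for event in ["halt1", "halt0", "interrupt0"]:
    mode = next_mode(mode, event)
print(mode)   # ST0: both threads halted, then thread 0 was woken by an interrupt
```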
Operating System and Applications
· An HT processor appears to the OS and application SW as 2 processors
– The OS manages logical processors as it does physical processors
· The OS should implement two optimizations:
· Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that could otherwise be used by the other active logical processor
· On a multi-processor system
– Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible