Transcript
Page 1: Computer Architecture 2010 – Advanced Topics 1 Computer Architecture Advanced Topics

Computer Architecture 2010 – Advanced Topics 1

Computer Architecture

Advanced Topics

Page 2:

Pentium® M Processor

Page 3:

From Pentium® M Processor

· Intel’s 1st processor designed for mobility
  – Achieve best performance at given power and thermal constraints
  – Achieve longest battery life

              Banias               Dothan               Sandy Bridge
Transistors   77M                  140M                 995M / 506M (55M / core)
Process       130 nm               90 nm                32 nm
Die size      84 mm²               85 mm²               216 mm² (4C + GT2), 131 mm² (2C + GT1)
Peak power    24.5 W               21 W                 17 – 90 W
Frequency     1.7 GHz              2.1 GHz              2.8 – 3.8 – 4.4 GHz
L1 cache      32KB I$ + 32KB D$    32KB I$ + 32KB D$    32KB I$ + 32KB D$
L2 cache      1MB                  2MB                  256KB (per core) + 3-8MB L3

src: http://www.anandtech.com

Page 4:

Example: Sandy Bridge

· Use Moore’s Law and process improvements for:
  – Power/Performance
  – Integration
      Reduce communication
      Reduce latencies
      (cost in complexity)
· More performance and efficiency via:
  – Speed Step
  – Memory Hierarchy
  – Multi-Core
  – Multi-Thread
  – Out-of-Order Execution
  – Predictors
  – Multi-Operand (vector) Instructions
  – Custom Processing

src: http://www.anandtech.com

Page 5:

Performance per Watt

· Mobile’s smaller form factor decreases the power budget
  – Power generates heat, which must be dissipated to keep transistors within the allowed temperature
  – Limits the processor’s peak power consumption
· Change the target
  – Old target: get max performance
  – New target: get max performance at a given power envelope (performance per Watt)
· Performance via frequency increase
  – Power = CV²f, but increasing f also requires increasing V
  – X% performance costs 3X% power (assuming performance is linear with frequency)
· A power-efficient feature: better than 1:3 performance : power
  – Otherwise it is better to just increase frequency
  – All Banias u-arch features (aimed at performance) are power efficient
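
The 1:3 rule above follows from Power = CV²f once voltage is assumed to rise in proportion to frequency, which makes power grow roughly with the cube of frequency. A minimal sketch of that arithmetic, under this idealized scaling assumption (the function name `relative_power` is illustrative and not from the slides):

```python
def relative_power(freq_scale: float) -> float:
    """Dynamic power relative to baseline when frequency is scaled by
    freq_scale. P = C * V^2 * f with C unchanged."""
    voltage_scale = freq_scale  # assumption: V must track f linearly
    return voltage_scale ** 2 * freq_scale


# A 1% frequency gain (and, by the slide's assumption, a 1% performance gain):
increase = relative_power(1.01) - 1.0
print(f"power increase: {increase:.2%}")  # close to 3%
```

A feature that buys 1% performance for less than about 3% power is therefore a better trade than raising the clock.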

Page 6:

Higher Performance vs. Longer Battery Life

· Processor average power is <10% of the platform
  – The processor reduces power in periods of low processor activity
  – The processor enters lower power states in idle periods
· Average power includes low-activity periods and idle time
  – Typical: 1W – 3W
· Max power limited by heat dissipation
  – Typical: 20W – 100W
· Decision
  – Optimize for performance when active
  – Optimize for battery life when idle

[Chart: platform power breakdown, yesterday’s numbers]
Display (panel + inverter)   33%
CPU                          10%
Power Supply                 10%
Intel® MCH                    9%
Misc.                         8%
GFX                           8%
HDD                           8%
CLK                           5%
Intel® ICH                    3%
DVD                           2%
LAN                           2%
Fan                           2%

src: http://www.anandtech.com

Page 7:

Higher Performance vs. Longer Battery Life

· High dynamic range
  – Long periods of idle with peaks of activity
  – Minimize power when idle
  – Adequate performance when active
  – Quick transitions
· Max power limited by heat dissipation
  – Typical: 3W (cell), 6W (tablet), 15W (small PC), 60W (mainstream PC), 150W+ (desktop)
  – How can the design fit all?
· Decision
  – Optimize for user experience when active (adequate performance)
  – Optimize for battery life when idle

Today

src: http://www.anandtech.com

Page 8:

Static Power

· The power consumed by a processor consists of
  – Active power: used to switch transistors
  – Static power: leakage of transistors under voltage
· Static power is a function of
  – Number of transistors and their type
  – Operating voltage
  – Die temperature
· Leakage is growing dramatically in new process technologies
· Pentium® M reduces static power consumption
  – The L2 cache is built with low-leakage transistors (2/3 of the die transistors)
      Low-leakage transistors are slower, increasing cache access latency
      The significant power saved justifies the small performance loss
  – Enhanced SpeedStep® technology
      Reduces voltage and temperature on low processor activity

Page 9:

Less is More

· Fewer instructions per task
  – Advanced branch prediction reduces the number of wrong instructions executed
  – SSE instructions reduce the number of instructions architecturally
· Fewer uops per instruction
  – Uop fusion
  – Dedicated stack engine
· Fewer transistor switches per micro-op
  – Efficient bus
  – Various lower-level optimizations
· Less energy per transistor switch
  – Enhanced SpeedStep® technology

Power-awareness top to bottom

Page 10:

Improved Branch Predictor

· Pentium® M employs best-in-class branch prediction
  – Bimodal predictor, Global predictor, Loop detector
  – Indirect branch predictor
· Reduces number of wrong instructions executed
  – Saves energy spent executing wrong instructions
· Loop predictor
  – Analyzes branches for loop behavior
      Moving in one direction (taken or NT) a fixed number of times
      Ended with a single movement in the opposite direction
  – Detects the exact loop count
  – Loop predicted accurately

[Figure: loop predictor. A Count value is incremented (+1) or reset (0) and compared (=) with a Limit value to produce the Prediction]
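
The loop predictor described above can be sketched in a few lines. This is a simplified model, not the Pentium M implementation: it assumes the loop direction is "taken" and the exit is "not taken", tracks a single branch, and the names `LoopPredictor`, `limit` and `count` are illustrative.

```python
class LoopPredictor:
    """Learns a loop's trip count and predicts the exit iteration."""

    def __init__(self):
        self.limit = None  # learned number of consecutive taken outcomes
        self.count = 0     # taken outcomes seen in the current loop run

    def predict(self) -> bool:
        # Predict the exit (not taken) exactly when Count reaches Limit.
        if self.limit is not None and self.count == self.limit:
            return False
        return True

    def update(self, taken: bool) -> None:
        if taken:
            self.count += 1          # "+1" path in the figure
        else:
            self.limit = self.count  # remember the observed trip count
            self.count = 0           # "0" path in the figure


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a list of branch outcomes (True = taken)."""
    wrong = 0
    for outcome in outcomes:
        if predictor.predict() != outcome:
            wrong += 1
        predictor.update(outcome)
    return wrong
```

With a loop that is taken five times and then exits, only the first exit is mispredicted; once the limit is learned, later runs of the same loop are predicted exactly.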

Page 11:

Indirect Branch Predictor

· Indirect jumps are widely used in object-oriented code (C++, Java)
· Targets are data dependent
  – Resolved at execution, hence a high misprediction penalty
· Initially, allocate an indirect branch only in the target array (TA)
  – If the TA mispredicts, allocate in the iTA according to global history
      Multiple targets allocated for a given branch
  – Indirects with a single target are predicted by the TA, saving iTA space
· Use the iTA if the TA indicates an indirect branch and the iTA hits

[Figure: the Branch IP indexes the Target Array, which outputs hit, indirect branch and target; the Branch IP combined with global history indexes the iTA, which outputs hit and target; the Predicted Target is the iTA target when both hit on an indirect branch, otherwise the Target Array target]
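
The TA/iTA allocation policy above can be modelled roughly as follows. This is a sketch under simplifying assumptions: both arrays are unbounded dictionaries rather than finite hashed tables, global history is passed in as a plain integer, and the class and method names are hypothetical.

```python
class IndirectPredictor:
    """Toy model of a target array (TA) backed by an indirect target array (iTA)."""

    def __init__(self):
        self.ta = {}   # branch_ip -> target
        self.ita = {}  # (branch_ip, global_history) -> target

    def predict(self, branch_ip, history):
        # Use the iTA only if the TA knows the branch and the iTA hits.
        if branch_ip in self.ta and (branch_ip, history) in self.ita:
            return self.ita[(branch_ip, history)]
        return self.ta.get(branch_ip)  # None models a TA miss

    def update(self, branch_ip, history, actual_target):
        if branch_ip not in self.ta:
            # First sighting: allocate in the TA only.
            self.ta[branch_ip] = actual_target
        elif self.predict(branch_ip, history) != actual_target:
            # Misprediction: allocate in the iTA under the current history.
            self.ita[(branch_ip, history)] = actual_target
```

A branch that always jumps to the same target never consumes an iTA entry, while a branch with several targets gets one iTA entry per history pattern that the TA gets wrong.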

Page 12:

Dedicated Stack Engine

· PUSH, POP, CALL, RET update ESP (add or sub an offset)
  – Would otherwise use a dedicated add uop
· Track the ESP offset at the front-end
  – ID maintains the offset in ESP_delta (+/- Osize)
  – Eliminates the need for uops updating ESP
  – Patch displacements of stack operations
· In some cases, the actual ESP value is needed
  – For example: add eax, esp, 3
  – A sync uop is inserted before the instruction
      if ESP_delta != 0: ESP = ESP + ESP_delta
  – Reset ESP_delta
· ESP_delta recovered on jump misprediction

Page 13:

ESP Tracking Example

Instruction   Without stack engine   With stack engine      Δ (initially Δ = 0)
PUSH eax      ESP = ESP - 4          STORE [ESP-4], EAX     Δ = Δ - 4, so Δ = -4
              STORE [ESP], EAX
PUSH ebx      ESP = ESP - 4          STORE [ESP-8], EBX     Δ = Δ - 4, so Δ = -8
              STORE [ESP], EBX
INC eax       EAX = ADD EAX, 1       EAX = ADD EAX, 1       Δ = -8
INC esp       ESP = ADD ESP, 1       ESP = SUB ESP, 8       Sync ESP! Δ = 0
                                     ESP = ADD ESP, 1       Δ = 0
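
The example above can be replayed with a small decoder model. This is a sketch only: it supports just the two instructions in the example (PUSH and INC), assumes a 4-byte operand size, and the function name and uop string format are illustrative.

```python
def decode_with_stack_engine(instructions):
    """Translate (opcode, register) pairs into uop strings while tracking
    the ESP offset (delta) in the front-end instead of emitting ESP updates."""
    delta = 0
    uops = []
    for op, reg in instructions:
        name = reg.upper()
        if op == "PUSH":
            delta -= 4  # no ESP-update uop; patch the displacement instead
            uops.append(f"STORE [ESP{delta:+d}], {name}")
        elif op == "INC":
            if name == "ESP" and delta != 0:
                # The real ESP value is needed: insert a sync uop first.
                verb = "SUB" if delta < 0 else "ADD"
                uops.append(f"ESP = {verb} ESP, {abs(delta)}")
                delta = 0
            uops.append(f"{name} = ADD {name}, 1")
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return uops, delta


program = [("PUSH", "eax"), ("PUSH", "ebx"), ("INC", "eax"), ("INC", "esp")]
for uop in decode_with_stack_engine(program)[0]:
    print(uop)
```

The four instructions produce five uops here instead of the six needed without the stack engine, and the sync uop appears only because INC esp reads ESP directly.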

Page 14:

Uop Fusion

· The Instruction Decoder breaks an instruction into uops
  – A conventional uop consists of a single operation operating on two sources
· An instruction requires multiple uops when
  – the instruction operates on more than two sources, or
  – the nature of the operation requires a sequence of operations
· Uop fusion: in some cases the decoder fuses 2 uops into one uop
  – A short field is added to the uop to support fusing of specific uop pairs
· Uop fusion reduces the number of uops by 10%
  – Increases performance by effectively widening rename and retire bandwidth
  – More instructions can be decoded by all decoders
· The same task is accomplished by processing fewer uops
  – Decreases the energy required to complete a given task

Page 15:

A 2-uop Load-Op

add eax,[ebp+4*esi+8]
  LD: tmp = load[ebp+4*esi+8]
  OP: eax = eax + tmp

· Load-op with 3 reg. operands
· Decoded into 2 uops
  – LD: read data from mem
  – OP: reg ← reg op data
· The LD and OP are inherently serial
  – OP dispatched only when LD completes

[Figure: the Decoder emits LD and OP as separate uops; the Scheduler dispatches LD to the MEU and OP to the ALU]

Page 16:

A 1-uop Load-Op

add eax,[ebp+4*esi+8]
  LD + OP: eax = eax + load[ebp+4*esi+8]

· Decoded into 1 uop
  – The fused uop has a 3rd source: a new field in the uop holds the index register
  – Increases decode BW
· Increases alloc BW and ROB/RS effective size
· Dispatched twice
  – OP dispatched after LD
· The fused uop retires after both LD and OP complete
  – Increases retire BW

[Figure: the Decoder emits a single fused LD + OP uop; the Scheduler dispatches LD to the Cache and then OP to the ALU]
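
The difference between the 2-uop and 1-uop load-op can be summarized by counting what each instruction costs in the front-end and back-end. The sketch below is a bookkeeping model only, with a hypothetical function name: a fused load-op occupies one allocated entry but is still dispatched twice.

```python
def count_uops(instructions, fusion):
    """instructions: list of bools, True where the instruction is a load-op.
    Returns (allocated_entries, dispatches)."""
    allocated = 0
    dispatches = 0
    for is_load_op in instructions:
        if is_load_op:
            allocated += 1 if fusion else 2  # fused pair shares one entry
            dispatches += 2                  # LD and OP still execute separately
        else:
            allocated += 1
            dispatches += 1
    return allocated, dispatches


stream = [True, False, False, True]
print(count_uops(stream, fusion=False))
print(count_uops(stream, fusion=True))
```

Fusion shrinks the number of entries that decode, allocation and retirement must handle, while the execution units see the same amount of work.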

Page 17:

Enhanced SpeedStep™ Technology

· The “Basic” SpeedStep™ Technology had
  – 2 operating points
  – Non-transparent switch
· The “Enhanced” version provides
  – Multiple voltage/frequency operating points. The Pentium M processor 1.6GHz operation ranges:
      From 600MHz @ 0.956V
      To 1.6GHz @ 1.484V
  – Transparent switch
  – Frequent switches
· Benefits
  – Higher power efficiency
      2.7X lower frequency, 2X performance loss, >2X energy gain
  – Outstanding battery life
  – Excellent thermal mgmt.

[Chart: Voltage, Frequency, Power. X-axis: Voltage (Volt), 0.8 to 1.6; left Y-axis: Frequency (GHz), 0.0 to 4.0; right Y-axis: Typical Power (Watts), 0 to 20. Annotations: frequency 2.7X, power 6.1X, efficiency ratio = 2.3]
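
The chart's ratios can be cross-checked against the two operating points quoted on the slide. The sketch below applies the idealized dynamic-power relation Power = CV²f, which gives a power ratio somewhat above the chart's 6.1X; the chart reports typical power, which need not follow CV²f exactly. The function and constant names are illustrative.

```python
LOW = (0.6, 0.956)    # (frequency in GHz, voltage in V), from the slide
HIGH = (1.6, 1.484)


def dynamic_power_ratio(high, low):
    """Ratio of C*V^2*f between two operating points (C cancels out)."""
    (f_hi, v_hi), (f_lo, v_lo) = high, low
    return (v_hi / v_lo) ** 2 * (f_hi / f_lo)


freq_ratio = HIGH[0] / LOW[0]
power_ratio = dynamic_power_ratio(HIGH, LOW)
chart_efficiency = 6.1 / 2.7  # power ratio over frequency ratio, as annotated

print(f"frequency ratio:  {freq_ratio:.2f}")
print(f"CV^2f power ratio: {power_ratio:.2f}")
print(f"chart efficiency: {chart_efficiency:.2f}")
```

The efficiency ratio of about 2.3 is simply the 6.1X power reduction divided by the 2.7X frequency reduction.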

Page 18:

Trace Cache (Pentium® 4 Processor)

Page 19:

Trace Cache

· Decoding several IA-32 inst/clock at high frequency is difficult
  – Instructions have a variable length and have many different options
  – Takes several pipe stages
      Adds to the branch misprediction penalty
· Trace cache: cache uops of previously decoded instructions
  – Decoding is only needed for instructions that miss the TC
· The TC is the primary (L1) instruction cache
  – Holds 12K uops
  – 8-way set associative with LRU replacement
· The TC has its own branch predictor (Trace BTB)
  – Predicts branches that hit in the TC
  – Directs where instruction fetching needs to go next in the TC

Page 20:

Traces

· Instruction cache fetch bandwidth is limited to a basic block
  – Cannot provide instructions across a taken branch in the same cycle
· The TC builds traces: program-ordered sequences of uops
  – Allows the target of a branch to be included in the same TC line as the branch itself
· Traces have variable length
  – Broken into trace lines, six uops per trace line
  – There can be many trace lines in a single trace

[Figure: a trace line containing several jmp uops, with arrows marking a jump into the line and a jump out of the line]
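
Splitting a variable-length trace into fixed-size trace lines can be illustrated as below. This is only a sketch of the packing step: the uops are plain strings in program (execution) order, so a branch target simply follows its branch, and `build_trace_lines` is a hypothetical name.

```python
def build_trace_lines(trace, uops_per_line=6):
    """Split a program-ordered list of uops into trace lines."""
    return [
        trace[i:i + uops_per_line]
        for i in range(0, len(trace), uops_per_line)
    ]


# 14 uops in execution order; "jmp" uops sit mid-line with their targets after them.
trace = ["u0", "u1", "jmp", "t0", "t1", "t2", "t3", "jmp",
         "v0", "v1", "v2", "v3", "jmp", "w0"]
lines = build_trace_lines(trace)
print([len(line) for line in lines])
```

Because the trace follows the executed path, a single line can hold a taken branch together with the uops at its target, which an ordinary instruction cache line cannot do.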

Page 21:

Hyper-Threading Technology (Pentium® 4 Processor)

Based on: Hyper-Threading Technology Architecture and Micro-architecture, Intel Technology Journal

Page 22:

Thread-Level Parallelism

· Multiprocessor systems have been used for many years
  – There are known techniques to exploit multiprocessors
· Software trends
  – Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
· Thread-level parallelism (TLP): threads can be from
  – the same application
  – different applications running simultaneously
  – operating system services
· Increasing single-thread performance becomes harder
  – and is less and less power efficient
· Chip Multi-Processing (CMP)
  – Two (or more) processors are put on a single die

Page 23:

Multi-Threading

· Multi-threading: a single processor executes multiple threads
· Time-slice multithreading
  – The processor switches between software threads after a fixed period
  – Can effectively minimize the effects of long latencies to memory
· Switch-on-event multithreading
  – Switch threads on long-latency events such as cache misses
  – Works well for server applications that have many cache misses
· A deficiency of both time-slice MT and switch-on-event MT
  – They do not cover for branch mispredictions and long dependencies
· Simultaneous multi-threading (SMT)
  – Multiple threads execute on a single processor simultaneously w/o switching
  – Makes the most effective use of processor resources
      Maximizes performance vs. transistor count and power

Page 24:

Hyper-threading (HT) Technology

· HT is SMT
  – Makes a single processor appear as 2 logical processors (threads)
· Each thread keeps its own architectural state
  – General-purpose registers
  – Control and machine state registers
· Each thread has its own interrupt controller
  – Interrupts sent to a specific logical processor are handled only by it
· OS views logical processors (threads) as physical processors
  – Schedules threads to logical processors as in a multiprocessor system
· From a micro-architecture perspective
  – Threads share a single set of physical resources
      caches, execution units, branch predictors, control logic, and buses

Page 25:

Two Important Goals

· When one thread is stalled the other thread can continue to make progress
  – Independent progress ensured by either
      Partitioning buffering queues and limiting the number of entries each thread can use
      Duplicating buffering queues
· A single active thread running on a processor with HT runs at the same speed as without HT
  – Partitioned resources are recombined when only one thread is active

Page 26:

Front End

· Each thread manages its own next-instruction-pointer
· Threads arbitrate TC access every cycle (Ping-Pong)
  – If both want to access the TC, access is granted in alternating cycles
  – If one thread is stalled, the other thread gets the full TC bandwidth
· TC entries are tagged with thread-ID
  – Dynamically allocated as needed
  – Allows one logical processor to have more entries than the other

[Figure: front-end pipeline for the TC Hit and TC Miss cases]
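
The ping-pong arbitration described above can be sketched as a pure function of the current requests and the previous grant. The function name and the two-thread encoding (0 and 1) are assumptions made for illustration.

```python
def arbitrate_tc(wants, last_granted):
    """wants: (bool, bool) for thread 0 and thread 1.
    Returns the thread granted TC access this cycle, or None if neither asks."""
    want0, want1 = wants
    if want0 and want1:
        return 1 - last_granted  # both request: alternate cycles
    if want0:
        return 0                 # only one requests: it gets full bandwidth
    if want1:
        return 1
    return None


# Both threads request for three cycles, then thread 1 stalls for two.
requests = [(True, True)] * 3 + [(True, False)] * 2
grants = []
last = 1
for wants in requests:
    granted = arbitrate_tc(wants, last)
    grants.append(granted)
    if granted is not None:
        last = granted
print(grants)
```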

Page 27:

Front End (cont.)

· Branch prediction structures are either duplicated or shared
  – The return stack buffer is duplicated
  – Global history is tracked for each thread
  – The large global history array is shared
      Entries are tagged with a logical processor ID
· Each thread has its own ITLB
· Both threads share the same decoder logic
  – If only one needs the decode logic, it gets the full decode bandwidth
  – The state needed by the decoder is duplicated
· Uop queue is hard partitioned
  – Allows both logical processors to make independent forward progress regardless of FE stalls (e.g., TC miss) or EXE stalls

Page 28:

Out-of-order Execution

· ROB and MOB are hard partitioned
  – Enforce fairness and prevent deadlocks
· Allocator ping-pongs between the threads
  – A thread is selected for allocation if
      Its uop-queue is not empty
      Its buffers (ROB, RS) are not full
      It is the thread’s turn, or the other thread cannot be selected
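
The allocator's selection rule can be written down directly from the three conditions listed above. This is a sketch with hypothetical names; queue occupancy and buffer fullness are passed in as simple per-thread values rather than modelled structures.

```python
def select_thread(turn, uop_queue_len, buffers_full):
    """turn: thread (0 or 1) whose turn it is.
    uop_queue_len: per-thread uop-queue occupancy.
    buffers_full: per-thread flag, True if that thread's ROB/RS share is full.
    Returns the thread to allocate for this cycle, or None."""

    def can_allocate(thread):
        return uop_queue_len[thread] > 0 and not buffers_full[thread]

    if can_allocate(turn):
        return turn
    other = 1 - turn
    if can_allocate(other):
        return other  # the thread whose turn it is cannot be selected
    return None


print(select_thread(0, [3, 2], [False, False]))
print(select_thread(0, [0, 2], [False, False]))
print(select_thread(1, [3, 2], [False, True]))
```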

Page 29:

Out-of-order Execution (cont.)

· Registers renamed to a shared physical register pool
  – Store results until retirement
· After allocation and renaming, uops are placed in one of 2 queues
  – Memory instruction queue and general instruction queue
      The two queues are hard partitioned
  – Uops are read from the queues and sent to the scheduler using ping-pong
· The schedulers are oblivious to threads
  – Schedule uops based on dependencies and exe. resource availability
      Regardless of their thread
  – Uops from the two threads can be dispatched in the same cycle
  – To avoid deadlock and ensure fairness
      Limit the number of active entries a thread can have in each scheduler’s queue
· Forwarding logic compares physical register numbers
  – Forward results to other uops without thread knowledge

Page 30:

Out-of-order Execution (cont.)

· Memory is largely thread oblivious
  – L1 Data Cache, L2 Cache, L3 Cache are thread oblivious
      All use physical addresses
  – DTLB is shared
      Each DTLB entry includes a thread ID as part of the tag
· Retirement ping-pongs between threads
  – If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread

Page 31:

Single-task And Multi-task Modes

· MT-mode (Multi-task mode)
  – Two active threads, with some resources partitioned as described earlier
· ST-mode (Single-task mode)
  – There are two flavors of ST-mode
      single-task thread 0 (ST0): only thread 0 is active
      single-task thread 1 (ST1): only thread 1 is active
  – Resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources
· Moving the processor between modes

[Figure: state diagram with states MT, ST0, ST1 and Low Power. MT to ST0 when thread 1 executes HALT; MT to ST1 when thread 0 executes HALT; ST0 to Low Power when thread 0 executes HALT; ST1 to Low Power when thread 1 executes HALT; an Interrupt wakes a halted thread]
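
The mode transitions can be modelled by tracking which logical processors are active and deriving the mode from that set. The HALT transitions follow the diagram; treating an interrupt as simply reactivating the targeted thread is an assumption of this sketch, as are the function names.

```python
def mode(active):
    """Map the set of active threads to the processor mode."""
    if active == {0, 1}:
        return "MT"
    if active == {0}:
        return "ST0"
    if active == {1}:
        return "ST1"
    return "LowPower"


def step(active, event, thread):
    """Return the new set of active threads after an event on `thread`."""
    active = set(active)
    if event == "HALT":
        active.discard(thread)
    elif event == "INTERRUPT":
        active.add(thread)  # assumption: an interrupt wakes the halted thread
    else:
        raise ValueError(f"unknown event: {event}")
    return active


state = {0, 1}
for event, thread in [("HALT", 1), ("HALT", 0), ("INTERRUPT", 1), ("INTERRUPT", 0)]:
    state = step(state, event, thread)
    print(event, thread, "->", mode(state))
```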

Page 32:

Operating System And Applications

· An HT processor appears to the OS and application SW as 2 processors
  – The OS manages logical processors as it does physical processors

The OS should implement two optimizations:

· Use HALT if only one logical processor is active
  – Allows the processor to transition to either the ST0 or ST1 mode
  – Otherwise the OS would execute on the idle logical processor a sequence of instructions that repeatedly checks for work to do
  – This so-called “idle loop” can consume significant execution resources that could otherwise be used by the other active logical processor
· On a multi-processor system
  – Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
  – Allows SW threads to use different physical resources when possible
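
The second optimization amounts to a placement order: fill one logical processor on every physical processor before doubling up. A minimal sketch, assuming two logical processors per physical processor and a hypothetical function name; a real OS scheduler weighs many more factors.

```python
def place_threads(n_threads, n_physical, logical_per_physical=2):
    """Return a list of (physical, logical) slots, one per software thread,
    spreading threads across physical processors first."""
    capacity = n_physical * logical_per_physical
    if n_threads > capacity:
        raise ValueError("more threads than logical processors")
    placement = []
    for i in range(n_threads):
        physical = i % n_physical   # spread across physical processors first
        logical = i // n_physical   # then use the second logical processor
        placement.append((physical, logical))
    return placement


print(place_threads(3, n_physical=2))
```

With two physical processors, the first two threads land on different physical processors and only the third shares execution resources with another thread.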