Computer Architecture 2008 – Advanced Topics

Achieve best performance at given power and thermal constraints
Achieve longest battery life
Mobile’s smaller form-factor decreases power budget
Power generates heat, which must be dissipated to keep transistors within allowed temperature
Limits the processor’s peak power consumption
Change the target
New target: get max performance at a given power envelope
Performance per Watt
Rule of thumb: gaining X% performance by frequency (and voltage) scaling costs about 3X% power
Assume performance is linear with frequency
A power efficient feature – better than a 1:3 performance : power ratio
Otherwise it is better to just increase frequency
All Banias u-arch features (aimed at performance) are power efficient
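The 1:3 rule above can be written as a simple check (a toy illustration; the threshold is the slides' rule of thumb, not a measured constant):

```python
def is_power_efficient(perf_gain_pct, power_cost_pct):
    # A feature beats plain frequency scaling only if it costs
    # less than ~3% power per 1% of performance gained.
    return power_cost_pct < 3 * perf_gain_pct

# e.g., a feature giving 5% performance for 10% power passes,
# while one costing 20% power for the same 5% gain does not
```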
The processor reduces power in periods of low processor activity
The processor enters lower power states in idle periods
Average power includes low-activity periods and idle-time
Typical: 1W – 3W
Typical: 20W – 100W
Optimize for battery life when idle
Active power: used to switch transistors
Static power: leakage of transistors under voltage
Static power is a function of
Number of transistors and their type
Operating voltage
Die temperature
Pentium® M reduces static power consumption
The L2 cache is built with low-leakage transistors (2/3 of the die transistors)
Low-leakage transistors are slower, increasing cache access latency
The significant power saved justifies the small performance loss
Enhanced SpeedStep® technology
SSE instructions architecturally reduce the number of instructions
Fewer uops per instruction
More efficient bus
Enhanced SpeedStep® technology
Indirect branch predictor
Loop predictor
Detects loops: the branch moves in one direction (taken or NT) a fixed number of times, ending with a single move in the opposite direction
Detects the exact loop count
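A loop predictor of this kind can be sketched as follows (a minimal model for a single branch; real hardware tracks many branches in finite tables):

```python
class LoopPredictor:
    """Toy loop-exit predictor: learns the trip count of a branch
    that goes one direction a fixed number of times, then flips once."""

    def __init__(self):
        self.learned_count = None  # trip count seen in the previous run
        self.current_count = 0     # taken streak so far in this run

    def predict(self):
        # Predict taken until the learned trip count is reached
        if self.learned_count is None:
            return True
        return self.current_count < self.learned_count

    def update(self, taken):
        if taken:
            self.current_count += 1
        else:
            # Loop exit: remember the exact count for the next run
            self.learned_count = self.current_count
            self.current_count = 0
```

After one full pass through a loop that iterates three times, the predictor predicts all four outcomes of the next pass correctly, including the exit.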
Indirect jumps are widely used in object-oriented code (C++, Java)
Targets are data dependent
Initially, allocate indirect branch only in target array (TA)
If TA mispredicts allocate in iTA according to global history
Multiple targets allocated for a given branch
Indirects with a single target predicted by TA, saving iTA space
Use iTA if TA indicates indirect branch + iTA hits
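The TA/iTA interplay described above might look like this in miniature (dictionaries stand in for the fixed-size hardware arrays, and the allocation policy is a simplification):

```python
class IndirectPredictor:
    """Toy two-level indirect predictor: a Target Array (TA) keeps one
    target per branch; branches the TA mispredicts also allocate
    per-history targets in the indirect Target Array (iTA)."""

    def __init__(self):
        self.ta = {}   # branch PC -> last observed target
        self.ita = {}  # (branch PC, global history) -> target

    def predict(self, pc, history):
        # Use the iTA only when the TA knows the branch and the iTA hits;
        # single-target indirects are served by the TA alone.
        if pc in self.ta and (pc, history) in self.ita:
            return self.ita[(pc, history)]
        return self.ta.get(pc)

    def update(self, pc, history, target):
        if self.ta.get(pc) not in (None, target):
            # TA mispredicted: allocate in iTA keyed by global history
            self.ita[(pc, history)] = target
        self.ta[pc] = target
```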
Dedicated Stack Engine
PUSH, POP, CALL, RET update ESP (add or sub an offset)
Use a dedicated add uop
Track the ESP offset at the front-end
ID maintains offset in ESP_delta (+/- Osize)
Eliminates need for uops updating ESP
Patch displacements of stack operations
In some cases, ESP actual value is needed
For example: add eax, esp, 3
A sync uop is inserted before the instruction
if ESP_delta != 0
ESP = ESP + ESP_delta
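A sketch of the ESP_delta bookkeeping, assuming 32-bit stack operations (Osize = 4); the real engine also patches the displacements of stack-relative memory accesses, which is omitted here:

```python
class StackEngine:
    """Toy model of the dedicated stack engine: the front-end tracks a
    speculative ESP offset so PUSH/POP/CALL/RET need no ESP-update uop;
    a sync uop folds the delta back when the real ESP value is needed."""

    OSIZE = 4  # operand size in bytes (assumed 32-bit stack ops)

    def __init__(self, esp):
        self.esp = esp   # architectural ESP
        self.delta = 0   # ESP_delta tracked at the front-end

    def push(self):
        self.delta -= self.OSIZE  # no add uop issued, just bookkeeping

    def pop(self):
        self.delta += self.OSIZE

    def sync(self):
        # Inserted before instructions that read ESP directly
        # (e.g. add eax, esp, 3), but only if ESP_delta != 0
        if self.delta != 0:
            self.esp += self.delta
            self.delta = 0
        return self.esp
```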
The Instruction Decoder breaks an instruction into uops
A conventional uop consists of a single operation operating on two sources
An instruction requires multiple uops when
the instruction operates on more than two sources, or
the nature of the operation requires a sequence of operations
Uop fusion: in some cases the decoder fuses 2 uops into one uop
A short field added to the uop to support fusing of specific uop pairs
Uop fusion reduces the number of uops by 10%
Increases performance by effectively widening rename, and retire bandwidth
More instructions can be decoded by all decoders
The same task is accomplished by processing fewer uops
Decreases the energy required to complete a given task
Load-op with 3 register operands
Decoded into 2 uops:
LD: tmp ← read data from mem
OP: eax ← eax + tmp
The LD and OP are inherently serial
OP is dispatched only when LD completes
With uop fusion: decoded into 1 uop
The fused uop has a 3rd source – a new field in the uop holds the index register
Increase decode BW
Increase retire BW
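The decode-bandwidth effect of fusion can be illustrated with a toy uop counter (the instruction names and the stream are invented for illustration):

```python
def decode(insts, fusion=True):
    """Count uops produced for an instruction stream: a 'load_op'
    decodes into two uops (LD then OP) unless uop fusion merges them
    into one fused uop carrying the extra (index register) source."""
    uops = 0
    for inst in insts:
        if inst == "load_op":
            uops += 1 if fusion else 2
        else:
            uops += 1
    return uops

stream = ["load_op", "add", "load_op", "mov"]
```

The same four instructions cost six uops without fusion but only four with it, so the same rename/retire bandwidth covers more instructions per cycle.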
2 operating points
From 600MHz @ 0.956V
To 1.6GHz @ 1.484V
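Using the classic dynamic-power relation P ≈ C·V²·f, one can estimate how far apart these two operating points are in switching power (leakage and capacitance changes are ignored in this sketch):

```python
def dynamic_power_ratio(f1, v1, f2, v2):
    """Dynamic (switching) power scales roughly as C * V^2 * f, so the
    ratio between two operating points is (v2/v1)^2 * (f2/f1)."""
    return (v2 / v1) ** 2 * (f2 / f1)

# The two SpeedStep points above: 600 MHz @ 0.956 V vs 1.6 GHz @ 1.484 V
ratio = dynamic_power_ratio(0.6, 0.956, 1.6, 1.484)
```

With these numbers, the 1.6 GHz point draws roughly 6.4× the dynamic power of the 600 MHz point for about 2.7× the frequency, which is why dropping to the low operating point pays off so well on battery.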
Instructions have variable length and many different options, so decoding takes several pipe-stages
Trace-cache: cache uops of previously decoded instructions
Decoding is only needed for instructions that miss the TC
The TC is the primary (L1) instruction cache
Holds 12K uops
The TC has its own branch predictor (Trace BTB)
Predicts branches that hit in the TC
Directs where instruction fetching needs to go next in the TC
An instruction cache's fetch bandwidth is limited to a basic block
It cannot provide instructions across a taken branch in the same cycle
The TC builds traces: program-ordered sequences of uops
Allows the target of a branch to be included in the same TC line as the branch itself
Traces have variable length
There can be many trace lines in a single trace
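Trace construction can be sketched as follows: follow the predicted path and pack uops from successive basic blocks into one trace line (the block map, PCs, and the 6-uop line size are illustrative assumptions, not the P4's actual parameters):

```python
def build_trace(fetch_pc, blocks, max_uops=6):
    """Toy trace builder: follow predicted control flow and pack uops
    from successive basic blocks into one trace line, so the target of
    a taken branch sits in the same line as the branch itself.
    `blocks` maps a block's start PC to (uops, predicted_next_pc)."""
    trace, pc = [], fetch_pc
    while pc is not None and len(trace) < max_uops:
        uops, next_pc = blocks[pc]
        trace.extend(uops[:max_uops - len(trace)])  # fill up to the line size
        pc = next_pc
    return trace
```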
Intel Technology Journal
There are known techniques to exploit multiprocessors
Software trends
Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
Thread-level parallelism (TLP) – threads can be from the same application or from different applications
Extracting ever more ILP from a single thread gives diminishing returns, and is less and less power efficient
Chip Multi-Processing (CMP)
Two (or more) processors are put on a single die
Time-slice multithreading
The processor switches between software threads after a fixed period
Can effectively minimize the effects of long latencies to memory
Switch-on-event multithreading
Switch threads on long latency events such as cache misses
Works well for server applications that have many cache misses
A deficiency of both time-slice MT and switch-on-event MT
They do not cover for branch mis-predictions and long dependencies
Simultaneous multi-threading (SMT)
Multiple threads execute on a single processor simultaneously w/o switching
Makes the most effective use of processor resources
Maximizes performance vs. transistor count and power
Makes a single processor appear as 2 logical processors = threads
Each thread keeps its own architectural state
General-purpose registers
Each thread has its own interrupt controller
Interrupts sent to a specific logical processor are handled only by it
OS views logical processors (threads) as physical processors
Schedule threads to logical processors as in a multiprocessor system
From a micro-architecture perspective, the two threads share most resources:
caches, execution units, branch predictors, control logic, and buses
Two Important Goals
When one thread is stalled the other thread can continue to make progress
Independent progress ensured by either
Partitioning buffering queues and limiting the number of entries each thread can use
Duplicating buffering queues
A single active thread running on a processor with HT runs at the same speed as without HT
Partitioned resources are recombined when only one thread is active
Threads arbitrate TC access every cycle (Ping-Pong)
If both want to access the TC – access granted in alternating cycles
If one thread is stalled, the other thread gets the full TC bandwidth
TC entries are tagged with thread-ID
Dynamically allocated as needed
Allows one logical processor to have more entries than the other
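The cycle-by-cycle ping-pong with fallback to full bandwidth can be modeled like this (a toy arbiter; `wants` is a hypothetical predicate saying whether a thread requests access in a given cycle):

```python
def arbitrate(cycles, wants):
    """Toy ping-pong arbiter as used for TC fetch (and, similarly, for
    allocation and retirement): threads alternate cycles when both
    request, but a thread gets every cycle while the other is stalled."""
    grants, last = [], 1
    for c in range(cycles):
        t0, t1 = wants(c, 0), wants(c, 1)
        if t0 and t1:
            last = 1 - last      # both want access: alternate cycles
            grants.append(last)
        elif t0:
            grants.append(0)     # thread 1 stalled: thread 0 gets full BW
        elif t1:
            grants.append(1)     # thread 0 stalled: thread 1 gets full BW
        else:
            grants.append(None)  # no requester this cycle
    return grants
```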
The return stack buffer is duplicated
Global history is tracked for each thread
The large global history array is shared
Entries are tagged with a logical processor ID
Each thread has its own ITLB
Both threads share the same decoder logic
if only one needs the decode logic, it gets the full decode bandwidth
The state needed by the decoders is duplicated
Uop queue is hard partitioned
Allows both logical processors to make independent forward progress regardless of FE stalls (e.g., TC miss) or EXE stalls
Enforce fairness and prevent deadlocks
Allocator ping-pongs between the threads
A thread is selected for allocation if
Its uop-queue is not empty
its buffers (ROB, RS) are not full
It is the thread’s turn, or the other thread cannot be selected
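The selection rule in the three bullets above can be written down directly (the flag names are invented for this sketch; real eligibility involves several buffer types, not a single bit):

```python
def select_for_allocation(turn, threads):
    """Toy allocator thread-select: a thread is chosen if its uop queue
    is non-empty, its buffers (ROB, RS) have room, and it is its turn
    or the other thread cannot be selected."""
    def eligible(t):
        return not threads[t]["queue_empty"] and not threads[t]["buffers_full"]

    other = 1 - turn
    if eligible(turn):
        return turn          # it is this thread's turn and it can allocate
    if eligible(other):
        return other         # the turn-holder is blocked, the other runs
    return None              # neither thread can allocate this cycle
```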
Store results until retirement
After allocation and renaming, uops are placed in one of 2 queues
Memory instruction queue and general instruction queue
The two queues are hard partitioned
Uops are read from the queues and sent to the schedulers using ping-pong
The schedulers are oblivious to threads
Schedule uops based on dependencies and exe. resources availability
Regardless of their thread
Uops from the two threads can be dispatched in the same cycle
To avoid deadlock and ensure fairness
Limit the number of active entries a thread can have in each scheduler’s queue
Forwarding logic compares physical register numbers
Forward results to other uops without thread knowledge
L1 Data Cache, L2 Cache, L3 Cache are thread oblivious
All use physical addresses
DTLB is shared
Each DTLB entry includes a thread ID as part of the tag
Retirement ping-pongs between threads
If one thread is not ready to retire uops all retirement bandwidth is dedicated to the other thread
Two active threads, with some resources partitioned as described earlier
ST-mode (Single-task mode)
single-task thread 0 (ST0) – only thread 0 is active
single-task thread 1 (ST1) – only thread 1 is active
Resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources
Moving the processor between modes
Operating System And Applications
An HT processor appears to the OS and application SW as 2 processors
The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
Use HALT if only one logical processor is active
Allows the processor to transition to either the ST0 or ST1 mode
Otherwise the OS would execute on the idle logical processor a sequence of instructions that repeatedly checks for work to do
This so-called “idle loop” can consume significant execution resources that could otherwise be used by the other active logical processor
On a multi-processor system,
Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
Allows SW threads to use different physical resources when possible
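That scheduling preference can be sketched as a small helper (the sibling map and CPU numbering are hypothetical; a real OS gets this topology from the firmware):

```python
def pick_logical_processor(idle_logical, siblings):
    """Toy HT-aware placement: prefer an idle logical processor whose
    sibling (the other thread on the same physical core) is also idle,
    so software threads land on different physical processors first."""
    idle = set(idle_logical)
    for cpu in idle_logical:
        if siblings[cpu] in idle:   # the whole physical core is free
            return cpu
    # Otherwise fall back to any idle logical processor
    return idle_logical[0] if idle_logical else None
```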