Computer Architecture 2010 – Advanced Topics

  • Slide 2
  • Pentium M Processor
  • Slide 3
  • From Pentium M Processor. Intel's 1st processor designed for mobility: achieve the best performance at given power and thermal constraints, and achieve the longest battery life.

                    Banias     Dothan     Sandy Bridge
    Transistors     77M        140M       995M / 506M (55M per core)
    Process         130nm      90nm       32nm
    Die size        84 mm²     85 mm²     216 mm² (4C+GT2), 131 mm² (2C+GT1)
    Peak power      24.5 W     21 W       17-90 W
    Frequency       1.7 GHz    2.1 GHz    2.8-3.8 GHz (up to 4.4 GHz)
    L1 cache        32KB I$ + 32KB D$ (all generations)
    L2 cache        1MB        2MB        256KB per core + 3-8MB L3
  • Slide 4
  • Example: Sandy Bridge. Use Moore's Law and process improvements for power/performance and integration: reduce communication and reduce latencies (at a cost in complexity). More performance and efficiency via: SpeedStep, memory hierarchy, multi-core, multi-thread, out-of-order execution, predictors, multi-operand (vector) instructions, custom processing.
  • Slide 5
  • Performance per Watt. A mobile device's smaller form factor decreases the power budget. Power generates heat, which must be dissipated to keep transistors within the allowed temperature; this limits the processor's peak power consumption. Change the target: the old target was to get max performance; the new target is to get max performance at a given power envelope, i.e., performance per watt. Consider performance via frequency increase: Power = CV²f, but increasing f also requires increasing V, so X% performance costs about 3X% power (assuming performance is linear with frequency). A power-efficient feature must therefore do better than a 1:3 performance:power ratio; otherwise it is better to just increase frequency. All Banias u-arch features (aimed at performance) are power efficient by this measure.
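The 1:3 rule above can be checked numerically. This is a minimal sketch assuming, as the slide does, that performance scales linearly with frequency and that supply voltage must scale proportionally with frequency; the capacitance value is arbitrary.

```python
C = 1.0  # effective switched capacitance (arbitrary units)

def power(f, v):
    """Dynamic power: P = C * V^2 * f."""
    return C * v * v * f

# Baseline operating point (arbitrary units).
f0, v0 = 1.00, 1.00
# Raise frequency by 10%; voltage must rise ~10% with it.
f1, v1 = 1.10, 1.10

perf_gain = f1 / f0 - 1                          # ~10% performance
power_cost = power(f1, v1) / power(f0, f0) - 1   # ~33% power (cubic in f)
print(f"{perf_gain:.0%} performance costs {power_cost:.0%} power")
```

Since P = CV²f and V tracks f, power grows roughly with f³, so a small X% frequency gain costs about 3X% power, which is where the 1:3 break-even ratio comes from.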
  • Slide 6
  • Higher Performance vs. Longer Battery Life. Processor average power is
  • Slide 17
  • Enhanced SpeedStep Technology. The basic SpeedStep technology had 2 operating points and a non-transparent switch. The Enhanced version provides multiple voltage/frequency operating points with transparent, frequent switches. The Pentium M processor's 1.6GHz part operates in a range from 600MHz @ 0.956V to 1.6GHz @ 1.484V. Benefits: higher power efficiency (2.7x lower frequency gives a 2x performance loss but about 6.1x lower power, so a >2x energy gain; efficiency ratio = 6.1 / 2.7 ≈ 2.3), outstanding battery life, and excellent thermal management.
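The ratios on this slide follow from the dynamic-power model on the previous one. A sketch, using the two operating points the slide gives; note the pure CV²f model yields roughly 6.4x rather than the slide's 6.1x, which presumably reflects measurement rather than the idealized formula.

```python
# Compare the two operating points of the 1.6GHz Pentium M using
# P = C * V^2 * f (the constant C cancels out of every ratio).
f_low, v_low = 600e6, 0.956    # 600 MHz @ 0.956 V
f_high, v_high = 1.6e9, 1.484  # 1.6 GHz @ 1.484 V

freq_ratio = f_high / f_low                              # ~2.7x lower frequency
power_ratio = (v_high**2 * f_high) / (v_low**2 * f_low)  # ~6.4x lower power
efficiency_ratio = power_ratio / freq_ratio              # > 2, so worth it

print(f"frequency ratio:  {freq_ratio:.1f}x")
print(f"power ratio:      {power_ratio:.1f}x")
print(f"efficiency ratio: {efficiency_ratio:.1f}")
```

Because the power drop (6x+) far exceeds the performance drop (2.7x at worst), switching to the low operating point is a net energy win even when work takes longer.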
  • Slide 18
  • Trace Cache (Pentium 4 Processor)
  • Slide 19
  • Trace Cache. Decoding several IA-32 instructions per clock at high frequency is difficult: instructions have variable length and many different options, decoding takes several pipe stages, and it adds to the branch misprediction penalty. The trace cache (TC) caches uops of previously decoded instructions, so decoding is needed only for instructions that miss the TC. The TC is the primary (L1) instruction cache: it holds 12K uops and is 8-way set associative with LRU replacement. The TC has its own branch predictor (the Trace BTB), which predicts branches that hit in the TC and directs where instruction fetching needs to go next in the TC.
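The organization named above (8-way set associative, LRU) can be sketched in a few lines. This is an illustrative model only; the set count and the tag/index split here are arbitrary, not the Pentium 4's actual layout.

```python
from collections import OrderedDict

class SetAssocCache:
    """Toy set-associative cache with LRU replacement."""
    def __init__(self, num_sets=16, ways=8):
        self.ways = ways
        # One OrderedDict per set: keys are tags, ordered oldest -> newest.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        index = addr % len(self.sets)
        tag = addr // len(self.sets)
        s = self.sets[index]
        if tag in s:
            s.move_to_end(tag)       # hit: mark as most recently used
            return True
        if len(s) >= self.ways:      # miss with full set: evict the LRU way
            s.popitem(last=False)
        s[tag] = None                # fill the line
        return False

cache = SetAssocCache()
hits = [cache.access(a) for a in [0, 16, 0, 32, 0]]
print(hits)   # [False, False, True, False, True]
```

Addresses 0, 16, and 32 all map to set 0 here, so the example exercises one set: the repeated address 0 hits because LRU keeps it resident while new tags fill other ways.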
  • Slide 20
  • Traces. An instruction cache's fetch bandwidth is limited to a basic block: it cannot provide instructions across a taken branch in the same cycle. The TC instead builds traces: program-ordered sequences of uops. This allows the target of a branch to be included in the same TC line as the branch itself. Traces have variable length and are broken into trace lines, six uops per trace line; there can be many trace lines in a single trace. [Figure: a trace spanning several trace lines, with jmp instructions jumping into and out of the lines.]
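Trace-line packing, as described above, can be sketched as follows. The uop names are illustrative; the point is that a branch and its target sit adjacent in program order, so both can land in the same six-uop line.

```python
def build_trace_lines(uops, line_size=6):
    """Break a program-ordered uop sequence into fixed-size trace lines."""
    return [uops[i:i + line_size] for i in range(0, len(uops), line_size)]

# A trace may cross taken branches: the branch target's uops simply follow
# the branch uop in program order, unlike in a conventional I-cache.
trace = ["add", "cmp", "jmp L1", "mov", "sub", "jmp L2", "xor", "ret"]
lines = build_trace_lines(trace)
print(lines)
# [['add', 'cmp', 'jmp L1', 'mov', 'sub', 'jmp L2'], ['xor', 'ret']]
```

Here "mov" is the target of "jmp L1" yet occupies the very next slot in the same trace line, which is exactly the fetch-bandwidth win the slide describes.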
  • Slide 21
  • Hyper-Threading Technology (Pentium 4 Processor). Based on "Hyper-Threading Technology Architecture and Microarchitecture," Intel Technology Journal.
  • Slide 22
  • Thread-Level Parallelism. Multiprocessor systems have been used for many years, and there are known techniques to exploit multiprocessors. Software trends: applications consist of multiple threads or processes that can be executed in parallel on multiple processors. Thread-level parallelism (TLP): threads can come from the same application, from different applications running simultaneously, or from operating system services. Increasing single-thread performance becomes harder and is less and less power efficient. Chip Multi-Processing (CMP): two (or more) processors are put on a single die.
  • Slide 23
  • Multi-Threading. Multi-threading: a single processor executes multiple threads. Time-slice multithreading: the processor switches between software threads after a fixed period; this can effectively minimize the effects of long latencies to memory. Switch-on-event multithreading: switch threads on long-latency events such as cache misses; this works well for server applications that have many cache misses. A deficiency of both time-slice MT and switch-on-event MT is that they do not cover branch mispredictions and long dependencies. Simultaneous multi-threading (SMT): multiple threads execute on a single processor simultaneously, without switching; this makes the most effective use of processor resources and maximizes performance vs. transistor count and power.
  • Slide 24
  • Hyper-Threading (HT) Technology. HT is SMT: it makes a single processor appear as 2 logical processors (threads). Each thread keeps its own architectural state: general-purpose registers, and control and machine state registers. Each thread has its own interrupt controller, so interrupts sent to a specific logical processor are handled only by it. The OS views logical processors (threads) as physical processors and schedules threads to logical processors as in a multiprocessor system. From a micro-architecture perspective, the threads share a single set of physical resources: caches, execution units, branch predictors, control logic, and buses.
  • Slide 25
  • Two Important Goals. First, when one thread is stalled, the other thread can continue to make progress. Independent progress is ensured either by partitioning buffering queues and limiting the number of entries each thread can use, or by duplicating buffering queues. Second, a single active thread running on a processor with HT runs at the same speed as without HT: partitioned resources are recombined when only one thread is active.
  • Slide 26
  • Front End. Each thread manages its own next-instruction pointer. The threads arbitrate TC access every cycle (ping-pong): if both want to access the TC, access is granted in alternating cycles; if one thread is stalled, the other thread gets the full TC bandwidth. TC entries are tagged with a thread ID and dynamically allocated as needed, which allows one logical processor to have more entries than the other.
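The ping-pong arbitration above can be sketched as a simple priority toggle. This is an illustrative model of the policy, not the actual hardware logic.

```python
def arbitrate(cycle, wants_access):
    """wants_access: (thread0_ready, thread1_ready). Returns winner or None.

    Priority alternates each cycle; a stalled thread forfeits its slot so
    the other thread gets the full bandwidth."""
    preferred = cycle % 2
    if wants_access[preferred]:
        return preferred
    other = 1 - preferred
    if wants_access[other]:
        return other
    return None

# Both threads ready: strict alternation.
print([arbitrate(c, (True, True)) for c in range(4)])   # [0, 1, 0, 1]
# Thread 1 stalled: thread 0 wins every cycle.
print([arbitrate(c, (True, False)) for c in range(4)])  # [0, 0, 0, 0]
```

The same ping-pong-with-fallback pattern recurs later in the pipeline, at uop-queue reads and at retirement.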
  • Slide 27
  • Front End (cont.). Branch prediction structures are either duplicated or shared. The return stack buffer is duplicated. Global history is tracked for each thread; the large global history array is shared, with entries tagged with a logical processor ID. Each thread has its own ITLB. Both threads share the same decoder logic: if only one needs the decode logic, it gets the full decode bandwidth; the state needed by the decoders is duplicated. The uop queue is hard partitioned, which allows both logical processors to make independent forward progress regardless of front-end stalls (e.g., a TC miss) or execution stalls.
  • Slide 28
  • Out-of-order Execution. The ROB and MOB are hard partitioned to enforce fairness and prevent deadlocks. The allocator ping-pongs between the threads. A thread is selected for allocation if its uop queue is not empty, its buffers (ROB, RS) are not full, and it is the thread's turn (or the other thread cannot be selected).
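The allocator's selection rule can be written down directly from the three conditions above. A sketch of the policy only, with the structures reduced to booleans:

```python
def select_thread(turn, queue_nonempty, buffers_free):
    """Pick a thread for allocation this cycle, or None.

    A thread is eligible if its uop queue is non-empty and its buffers
    (ROB, RS) are not full; the ping-pong turn breaks ties, and an
    ineligible preferred thread yields to the other."""
    eligible = [t for t in (0, 1) if queue_nonempty[t] and buffers_free[t]]
    if not eligible:
        return None
    if turn in eligible:
        return turn
    return eligible[0]

print(select_thread(0, (True, True), (True, True)))    # 0: thread 0's turn
print(select_thread(0, (False, True), (True, True)))   # 1: thread 0 is empty
```

Yielding a stalled thread's turn is what keeps allocation bandwidth fully utilized while the hard-partitioned buffers prevent one thread from starving the other.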
  • Slide 29
  • Out-of-order Execution (cont.). Registers are renamed to a shared physical register pool, which stores results until retirement. After allocation and renaming, uops are placed in one of 2 queues: a memory instruction queue and a general instruction queue. The two queues are hard partitioned. Uops are read from the queues and sent to the schedulers using ping-pong. The schedulers are oblivious to threads: they schedule uops based on dependencies and execution-resource availability, regardless of thread, and uops from the two threads can be dispatched in the same cycle. To avoid deadlock and ensure fairness, the number of active entries a thread can have in each scheduler's queue is limited. The forwarding logic compares physical register numbers and forwards results to other uops without thread knowledge.
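A thread-oblivious scheduler with a per-thread entry cap can be sketched as below. The cap value and the uop encoding are illustrative assumptions, not hardware values; the point is that dispatch looks only at operand readiness while the cap enforces fairness.

```python
MAX_ENTRIES_PER_THREAD = 2   # illustrative fairness cap, not the real value

def schedule(queue, ready_regs):
    """queue: list of (thread, dest, srcs). Dispatch uops whose sources
    are ready, regardless of thread, honoring a per-thread entry cap."""
    active = {0: 0, 1: 0}
    dispatched = []
    for thread, dest, srcs in queue:
        if active[thread] >= MAX_ENTRIES_PER_THREAD:
            continue                      # fairness cap reached for this thread
        active[thread] += 1               # occupies a scheduler entry
        if all(s in ready_regs for s in srcs):
            dispatched.append(dest)       # thread-oblivious dispatch
    return dispatched

q = [(0, "r3", ["r1"]), (1, "r7", ["r5"]), (0, "r4", ["r3"])]
print(schedule(q, {"r1", "r5"}))   # ['r3', 'r7']: uops from both threads
```

Uops from both threads dispatch in the same pass, while the third uop waits on "r3", which has not yet written back; forwarding by physical register number would wake it without any thread knowledge.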
  • Slide 30
  • Out-of-order Execution (cont.). The memory subsystem is largely thread-oblivious: the L1 data cache, L2 cache, and L3 cache all use physical addresses and do not distinguish threads. The DTLB is shared, with each DTLB entry including a thread ID as part of the tag. Retirement ping-pongs between the threads; if one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread.
  • Slide 31
  • Single-task And Multi-task Modes. MT-mode (multi-task mode): two active threads, with some resources partitioned as described earlier. ST-mode (single-task mode) has two flavors: single-task thread 0 (ST0), where only thread 0 is active, and single-task thread 1 (ST1), where only thread 1 is active. Resources that were partitioned in MT-mode are recombined to give the single active logical processor use of all of the resources. [State diagram: from MT-mode, a HALT by thread 1 moves to ST0 and a HALT by thread 0 moves to ST1; a HALT by the remaining thread enters low power; an interrupt returns to MT-mode.]
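The mode transitions in the slide's diagram can be sketched as a small table-driven state machine. This is an illustrative reconstruction; in particular, which ST-mode a low-power wakeup returns to is an assumption here (taken as the thread the interrupt targets), not something the slide states.

```python
TRANSITIONS = {
    ("MT",  "halt_t1"):   "ST0",        # only thread 0 remains active
    ("MT",  "halt_t0"):   "ST1",        # only thread 1 remains active
    ("ST0", "halt_t0"):   "LOW_POWER",  # last active thread halts
    ("ST1", "halt_t1"):   "LOW_POWER",
    ("ST0", "interrupt"): "MT",         # second thread re-activated
    ("ST1", "interrupt"): "MT",
    ("LOW_POWER", "interrupt"): "ST0",  # assumption: wakes the serviced thread
}

def step(state, event):
    # Events with no listed transition leave the mode unchanged.
    return TRANSITIONS.get((state, event), state)

state = "MT"
for ev in ["halt_t1", "halt_t0", "interrupt"]:
    state = step(state, ev)
print(state)   # ST0: thread 0 woken back up by the interrupt
```

Encoding the diagram as a dictionary makes the recombine/partition points explicit: resources partition on any transition into MT and recombine on any transition out of it.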
  • Slide 32
  • Operating System And Applications. An HT processor appears to the OS and application software as 2 processors. The OS manages logical processors as it does physical processors, and it should implement two optimizations. First, use HALT if only one logical processor is active, which allows the processor to transition to either the ST0 or ST1 mode. Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do; this so-called idle loop can consume significant execution resources that could otherwise be used by the other active logical processor. Second, on a multi-processor system, schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor; this allows software threads to use different physical resources when possible.