Computer Architecture 2010 – Advanced Topics



Computer Architecture Advanced Topics

Pentium® M Processor

From Pentium® M Processor:
Intel's 1st processor designed for mobility
- Achieve best performance at given power and thermal constraints
- Achieve longest battery life

             | Banias            | Dothan            | Sandy Bridge
Transistors  | 77M               | 140M              | 995M / 506M (55M / core)
Process      | 130nm             | 90nm              | 32nm
Die size     | 84 mm²            | 85 mm²            | 216 mm² (4C+GT2), 131 mm² (2C+GT1)
Peak power   | 24.5 W            | 21 W              | 17–90 W
Frequency    | 1.7 GHz           | 2.1 GHz           | 2.8–3.8 GHz (up to 4.4 GHz)
L1 cache     | 32KB I$ + 32KB D$ | 32KB I$ + 32KB D$ | 32KB I$ + 32KB D$
L2 cache     | 1MB               | 2MB               | 256KB per core + 3–8MB L3

src: http://www.anandtech.com

Example: Standby Bridge

Use Moore's Law and process improvements to:
- Power/performance
- Integration: reduce communication, reduce latencies (cost in complexity)

More performance and efficiency via:
- SpeedStep
- Memory hierarchy
- Multi-core
- Multi-thread
- Out-of-order execution
- Predictors
- Multi-operand (vector) instructions
- Custom processing

src: http://www.anandtech.com

Performance per Watt

A mobile's smaller form factor decreases its power budget. Power generates heat, which must be dissipated to keep transistors within the allowed temperature; this limits the processor's peak power consumption.

Change the target:
- Old target: get max performance
- New target: get max performance at a given power envelope

Performance per Watt

Performance via frequency increase: Power = CV²f, but increasing f also requires increasing V, so an X% performance gain costs roughly 3X% power (assuming performance is linear with frequency). A power-efficient feature must therefore beat a 1:3 performance : power ratio; otherwise it is better to just increase frequency. All Banias u-arch features (aimed at performance) are power efficient.

Higher Performance vs. Longer Battery Life

A 2X reduction in processor average power is a 2X energy gain:
- Outstanding battery life
- Excellent thermal management
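The 1:3 rule above follows directly from the dynamic-power relation. A back-of-the-envelope sketch (illustrative numbers, not from the slides):

```python
# Dynamic power is P = C * V^2 * f, and raising f typically requires raising
# V in proportion, so power grows roughly with the cube of frequency.
def relative_power(freq_gain, voltage_tracks_freq=True):
    """Relative power for a relative frequency increase (0.10 = +10%)."""
    f = 1.0 + freq_gain
    v = f if voltage_tracks_freq else 1.0
    return (v ** 2) * f          # P = C V^2 f, with C normalized to 1

# A 10% frequency (and voltage) bump costs ~33% more power: about 3x the
# performance gain, i.e. the 1:3 performance:power ratio from the slide.
print(round(relative_power(0.10), 3))  # 1.331
```

If voltage could stay fixed, the same +10% frequency would cost only +10% power, which is why voltage scaling dominates the cost.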

(Figure: 2.7X vs. 6.1X, efficiency ratio = 2.3)

Trace Cache (Pentium 4 Processor)

Trace Cache

Decoding several IA-32 instructions per clock at high frequency is difficult:
- Instructions have variable length and many different options
- Decoding takes several pipe stages
- Adds to the branch mis-prediction penalty

Trace cache: caches uops of previously decoded instructions. Decoding is only needed for instructions that miss the TC.

The TC is the primary (L1) instruction cache:
- Holds 12K uops
- 8-way set associative with LRU replacement
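As a rough illustration of the organization above, here is a minimal set-associative cache with true LRU replacement (the set count and tag handling are assumptions for illustration, not Pentium 4 details):

```python
from collections import OrderedDict

# Minimal sketch of an N-way set-associative cache with true LRU replacement.
class SetAssocCache:
    def __init__(self, num_sets=256, ways=8):
        self.ways = ways
        # One OrderedDict per set: key order tracks recency (front = LRU).
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, tag, set_index):
        """Look up a tag; return True on hit, install the line on a miss."""
        s = self.sets[set_index]
        if tag in s:
            s.move_to_end(tag)        # hit: mark as most-recently-used
            return True
        if len(s) >= self.ways:
            s.popitem(last=False)     # miss with full set: evict the LRU way
        s[tag] = True
        return False

c = SetAssocCache(num_sets=4, ways=8)
print(c.access(0x1, 0))  # False (cold miss, line installed)
print(c.access(0x1, 0))  # True (hit)
```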

The TC has its own branch predictor (Trace BTB):
- Predicts branches that hit in the TC
- Directs where instruction fetching needs to go next in the TC

Traces

An instruction cache's fetch bandwidth is limited to a basic block: it cannot provide instructions across a taken branch in the same cycle.

The TC builds traces: program-ordered sequences of uops. This allows the target of a branch to be included in the same TC line as the branch itself.
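The trace-building idea can be sketched as follows; this is an illustrative simplification, not the actual Pentium 4 fill algorithm (the six-uop trace-line size is described in these slides):

```python
# Illustrative sketch: gather uops in program order along the predicted path,
# crossing taken branches, then split the sequence into six-uop trace lines.
def build_trace(path_uops, line_size=6):
    """path_uops: program-ordered uops collected along the predicted path."""
    return [path_uops[i:i + line_size]
            for i in range(0, len(path_uops), line_size)]

# 8 uops spanning a taken branch fill two trace lines (6 uops + 2 uops),
# so the branch target sits in the same trace as the branch itself.
trace = build_trace([f"uop{i}" for i in range(8)])
print([len(line) for line in trace])  # [6, 2]
```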

Traces have variable length:
- Broken into trace lines, six uops per trace line
- There can be many trace lines in a single trace
- A trace can be entered or exited via jumps into and out of a line

(Figure: trace lines with jumps into and out of the line)

Hyper-Threading Technology (Pentium 4 Processor)

Based on "Hyper-Threading Technology Architecture and Microarchitecture", Intel Technology Journal.

Thread-Level Parallelism

Multiprocessor systems have been used for many years, and there are known techniques to exploit them.

Software trendsApplications consist of multiple threads or processes that can be executed in parallel on multiple processors

Thread-level parallelism (TLP): threads can come from
- the same application
- different applications running simultaneously
- operating system services

Increasing single-thread performance becomes harder, and is less and less power efficient.

Chip Multi-Processing (CMP): two (or more) processors are put on a single die.

Multi-Threading

Multi-threading: a single processor executes multiple threads.

Time-slice multithreading: the processor switches between software threads after a fixed period. Can effectively minimize the effects of long latencies to memory.

Switch-on-event multithreading: switch threads on long-latency events, such as cache misses. Works well for server applications that have many cache misses.

A deficiency of both time-slice MT and switch-on-event MT: they do not cover branch mis-predictions and long dependencies.

Simultaneous multi-threading (SMT): multiple threads execute on a single processor simultaneously, without switching.
- Makes the most effective use of processor resources
- Maximizes performance vs. transistor count and power

Hyper-Threading (HT) Technology

HT is SMT: it makes a single processor appear as 2 logical processors (threads).

Each thread keeps its own architectural state:
- General-purpose registers
- Control and machine state registers

Each thread has its own interrupt controller; interrupts sent to a specific logical processor are handled only by it.

The OS views logical processors (threads) as physical processors, and schedules threads to logical processors as in a multiprocessor system.

From a micro-architecture perspective, threads share a single set of physical resources: caches, execution units, branch predictors, control logic, and buses.

Two Important Goals

When one thread is stalled, the other thread can continue to make progress. Independent progress is ensured by either:
- Partitioning buffering queues and limiting the number of entries each thread can use
- Duplicating buffering queues
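The first option, hard partitioning, can be sketched as a queue with a fixed per-thread entry limit (sizes and the interface are illustrative):

```python
# Sketch of a hard-partitioned buffering queue: each thread owns a fixed
# share of the entries, so one thread can never starve the other.
class PartitionedQueue:
    def __init__(self, total_entries=12, num_threads=2):
        self.limit = total_entries // num_threads
        self.queues = [[] for _ in range(num_threads)]

    def push(self, thread, item):
        """Enqueue for `thread`; fail only if that thread's partition is full."""
        q = self.queues[thread]
        if len(q) >= self.limit:
            return False
        q.append(item)
        return True

q = PartitionedQueue(total_entries=4)
print(q.push(0, "a"), q.push(0, "b"), q.push(0, "c"))  # True True False
print(q.push(1, "x"))  # True -- thread 1's partition is unaffected
```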

A single active thread running on a processor with HT runs at the same speed as without HT: partitioned resources are recombined when only one thread is active.

Front End

Each thread manages its own next-instruction-pointer. Threads arbitrate TC access every cycle (ping-pong):
- If both want to access the TC, access is granted in alternating cycles
- If one thread is stalled, the other thread gets the full TC bandwidth

TC entries are tagged with thread-ID and dynamically allocated as needed, which allows one logical processor to have more entries than the other.
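The per-cycle ping-pong arbitration described above can be sketched as follows (interface and names are illustrative, not Intel's):

```python
# Alternate TC access between the two threads each cycle, but hand a stalled
# thread's slot to the other thread so no bandwidth is wasted.
def arbitrate(cycle, stalled):
    """Return the thread id granted TC access this cycle, or None."""
    preferred = cycle % 2            # alternate between threads 0 and 1
    other = 1 - preferred
    if not stalled[preferred]:
        return preferred
    if not stalled[other]:
        return other                 # a stalled thread yields its slot
    return None

# While thread 1 is stalled, thread 0 gets the full TC bandwidth:
grants = [arbitrate(c, {0: False, 1: True}) for c in range(4)]
print(grants)  # [0, 0, 0, 0]
```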

(Figure: front-end paths on TC hit and TC miss)

Front End (cont.)

Branch prediction structures are either duplicated or shared:
- The return stack buffer is duplicated
- Global history is tracked for each thread
- The large global history array is shared; its entries are tagged with a logical processor ID

Each thread has its own ITLB

Both threads share the same decoder logic. If only one thread needs the decode logic, it gets the full decode bandwidth. The state needed by the decoder is duplicated.

The uop queue is hard partitioned. This allows both logical processors to make independent forward progress regardless of FE stalls (e.g., a TC miss) or EXE stalls.

Out-of-Order Execution

The ROB and MOB are hard partitioned to enforce fairness and prevent deadlocks. The allocator ping-pongs between the threads. A thread is selected for allocation if:
- Its uop-queue is not empty
- Its buffers (ROB, RS) are not full
- It is the thread's turn, or the other thread cannot be selected
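The allocator's selection rule above can be sketched as follows (field names are illustrative):

```python
# A thread is eligible if its uop queue is non-empty and its ROB/RS have room;
# prefer the thread whose turn it is, else take the other eligible thread.
def select_for_allocation(turn, threads):
    """Return the thread id selected for allocation this cycle, or None."""
    def eligible(t):
        return t["uopq"] > 0 and not t["rob_full"] and not t["rs_full"]
    if eligible(threads[turn]):           # it is this thread's turn
        return turn
    if eligible(threads[1 - turn]):       # the other thread takes the slot
        return 1 - turn
    return None

t0 = {"uopq": 4, "rob_full": False, "rs_full": False}
t1 = {"uopq": 0, "rob_full": False, "rs_full": False}  # nothing to allocate
print(select_for_allocation(turn=1, threads=[t0, t1]))  # 0
```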

Out-of-Order Execution (cont.)

Registers are renamed to a shared physical register pool, which stores results until retirement. After allocation and renaming, uops are placed in one of 2 queues: a memory instruction queue and a general instruction queue. The two queues are hard partitioned. Uops are read from the queues and sent to the schedulers using ping-pong.

The schedulers are oblivious to threads: they schedule uops based on dependencies and execution resource availability, regardless of thread. Uops from the two threads can be dispatched in the same cycle. To avoid deadlock and ensure fairness, the number of active entries a thread can have in each scheduler's queue is limited.

The forwarding logic compares physical register numbers, and forwards results to other uops without thread knowledge.

Out-of-Order Execution (cont.)

Memory is largely oblivious to threads:
- The L1 data cache, L2 cache, and L3 cache are thread-oblivious; all use physical addresses
- The DTLB is shared; each DTLB entry includes a thread ID as part of the tag

Retirement ping-pongs between the threads. If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread.

Single-Task and Multi-Task Modes

MT-mode (multi-task mode): two active threads, with some resources partitioned as described earlier.

ST-mode (single-task mode): there are two flavors of ST-mode:
- Single-task thread 0 (ST0): only thread 0 is active
- Single-task thread 1 (ST1): only thread 1 is active

Resources that were partitioned in MT-mode are re-combined to give the single active logical processor use of all of the resources.

Moving the processor between modes:
- MT goes to ST0 when thread 1 executes HALT, and to ST1 when thread 0 executes HALT
- ST0 goes to a low-power state when thread 0 executes HALT; likewise ST1 when thread 1 executes HALT
- An interrupt to a halted logical processor brings the processor back to an active mode

Operating System and Applications

An HT processor appears to the OS and application SW as 2 processors. The OS manages logical processors as it does physical processors.
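The mode transitions described above can be sketched as a small state machine (state and event names follow the slide; this is a simplification):

```python
# Mode transition table: (current mode, event) -> next mode.
TRANSITIONS = {
    ("MT",  "halt_t1"): "ST0",        # thread 1 halts: only thread 0 active
    ("MT",  "halt_t0"): "ST1",
    ("ST0", "halt_t0"): "LOW_POWER",  # both threads now halted
    ("ST1", "halt_t1"): "LOW_POWER",
    ("ST0", "int_t1"):  "MT",         # interrupt re-activates thread 1
    ("ST1", "int_t0"):  "MT",
    ("LOW_POWER", "int_t0"): "ST0",
    ("LOW_POWER", "int_t1"): "ST1",
}

def step(state, event):
    """Apply one event; unknown events leave the mode unchanged."""
    return TRANSITIONS.get((state, event), state)

print(step("MT", "halt_t1"))   # ST0
print(step("ST0", "int_t1"))   # MT
```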

The OS should implement two optimizations:

Use HALT if only one logical processor is active. This allows the processor to transition to either the ST0 or ST1 mode. Otherwise, the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do. This so-called idle loop can consume significant execution resources that could otherwise be used by the other active logical processor.

On a multi-processor system, schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor. This allows SW threads to use different physical resources when possible.