Last lecture Some misc. stuff An older real processor Class review/overview

Last lecture

Some misc. stuff

An older real processor

Class review/overview.

Misc. Status issues

• Saturday (4/18) @9pm
  – Project due.
• Sunday 4/19:
  – Review session 4-6pm (DOW 1017)
  – Pizza House 7-9pm (RSVP request posted, please respond by tomorrow)
• Tuesday (4/21):
  – Project talk groups
    • Sign up if you haven’t yet done so!
• Tuesday (4/21) – Written report due (via e-mail) @9pm
• Wednesday (4/22) – Review session 3-4:30pm (1500 EECS)
• Thursday (4/23)
  – Exam at 4pm in our classroom. That’s 4pm sharp, not Michigan time!

Office hour changes

• Friday 4/17 moved to 10:30am-12 in my office (4632 Beyster)

• Tuesday 4/21 office hours moved to 3:15-4:45 in my office

Stuff still to do

• Oral report
  – Don’t forget to be there for the whole hour (or longer if your group is during class time)
  – PowerPoint or other slides
    • Either bring a portable or a USB stick
• Written report
  – Due 9pm Tuesday via e-mail.

AMD 64-bit core

Most taken from http://www.chip-architect.com/

Bit-interleaved busses running “North-South”

Integer Decode/Dispatch

• 3 types of instructions
  – Direct path
    • RISC-like
  – Vector path
    • Broken into smaller instructions via microcode.
  – Double
    • 128-bit instructions which can be broken into 2 independent 64-bit instructions are called Double
    • Others are done via microcode
    • Most 128-bit SSE and SSE2 instructions are made into doubles.

RS

• Each cycle an instruction is issued into one of 3 lanes.
  – Each lane has
    • 8 RSs
    • 1 ALU
    • 1 AGU (Address Generation Unit)
  – Each RS sees broadcasts from all ALUs, AGUs, L/S units etc.

Rename

• Break the physical register file into 2 parts (sort of like P6 scheme with ARF/RoB)
  – 72 in-flight instructions are kept in the RoB
• The other structure is the IFFRF: Integer Future File and Register File
  – 16 registers of committed state
  – 16 “future registers”
  – 8 scratch-pad registers

Future file

• In the P6 scheme we had to look in 3 places for the data
  – The PRF
  – The RoB
  – The CDB (later)
• Here we look in the FF or the CDB-like-things later.
  – The FF holds the speculative value if it is known.
  – At execution complete instructions check to see if they were the last thing to dispatch that writes to a given physical register.
    • This is done by tagging the FF with the RoB number.
  – If they were the last to have that AR as a destination, they update the FF.

How do we use the FF?

• At dispatch we:
  – Check the FF for source operands
  – Reserve a spot in the RoB
  – Place our tag (RoB number) in the FF
  – Mark the FF entry as invalid
• At EX complete we:
  – Send RoB number and data to the CDB
  – Send data to the RoB
  – Update FF if tag matches
• At retire
  – Update ARF value (from RoB)
• At mispredict
  – Copy ARF value into FF.
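The four events above (dispatch, EX complete, retire, mispredict) can be sketched as a small behavioural model. This is only an illustration under stated assumptions, not the actual AMD design: the names FutureFileModel, FFEntry and RobEntry are hypothetical, the RoB is a plain dictionary keyed by RoB number, tags never wrap, and reservation-station wakeup from the CDB is left out.

```python
from dataclasses import dataclass
from typing import Optional

NUM_ARCH_REGS = 16  # the slides give 16 committed and 16 "future" registers


@dataclass
class FFEntry:
    value: int = 0
    valid: bool = True         # True when `value` is the newest known value
    tag: Optional[int] = None  # RoB number of the last dispatched writer


@dataclass
class RobEntry:
    dest: int
    value: Optional[int] = None  # written once, at EX complete


class FutureFileModel:
    """Hypothetical toy model of the future-file scheme described above."""

    def __init__(self):
        self.arf = [0] * NUM_ARCH_REGS                       # committed state
        self.ff = [FFEntry() for _ in range(NUM_ARCH_REGS)]  # speculative state
        self.rob = {}                                        # RoB number -> RobEntry
        self.next_tag = 0                                    # assumption: tags never wrap

    def dispatch(self, dest, srcs):
        # Check the FF for source operands: a value if valid, else the tag to wait on.
        operands = [("value", self.ff[s].value) if self.ff[s].valid
                    else ("tag", self.ff[s].tag) for s in srcs]
        # Reserve a spot in the RoB.
        tag = self.next_tag
        self.next_tag += 1
        self.rob[tag] = RobEntry(dest)
        # Place our tag in the FF and mark the entry invalid.
        self.ff[dest].tag = tag
        self.ff[dest].valid = False
        return tag, operands

    def ex_complete(self, tag, data):
        # Send data to the RoB.
        self.rob[tag].value = data
        # Update the FF only if we are still the last dispatched writer.
        entry = self.ff[self.rob[tag].dest]
        if entry.tag == tag:
            entry.value, entry.valid = data, True
        return tag, data  # what would be broadcast on the CDB

    def retire(self, tag):
        # Update the ARF value from the RoB.
        entry = self.rob.pop(tag)
        self.arf[entry.dest] = entry.value

    def mispredict(self):
        # Clear in-flight state and copy the ARF values into the FF.
        self.rob.clear()
        self.ff = [FFEntry(value=v) for v in self.arf]
```

In this sketch, two back-to-back writers of the same register show the tag check at work: the older writer finishes, finds its tag no longer matches, and leaves the FF alone, so only the younger writer's result becomes the speculative value.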

What did the FF buy us?

• P6-like advantages
  – No free-list for PRF
  – Can just clear the RAT on mis-predict.
• But no need to access the RoB looking for data
  – RoB data only written once (EX complete) and only read once (Commit)
• Some pain
  – Early branch resolution looks hard

Re-Order-Buffer Tag definition

[Figure: tag bit layout, bit 7 down to bit 0, with fields labelled wrap bit, Instruction In Flight Number, re-order buffer index 0...23, sub-index 0..2]

1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched.
2) A value 0..23 that identifies the “cycle” in which the instruction was dispatched. The “cycle counter” wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches.

ROB: An 8-bit descriptor for 72 entries
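The tag fields can be packed and compared as in the following sketch. The exact bit positions are an assumption read off the figure labels (wrap bit in bit 7, the 0..23 index in bits 6..2, the lane sub-index in bits 1..0), and the function names are hypothetical. The age comparison also assumes that lane 0 holds the oldest instruction of a dispatch group and that both tags belong to instructions currently in flight.

```python
ROB_ROWS = 24  # "cycle counter" values 0..23
LANES = 3      # sub-index 0..2


def encode_tag(wrap, index, lane):
    """Pack (wrap bit, row index, lane) into an 8-bit tag (assumed layout)."""
    if wrap not in (0, 1) or not 0 <= index < ROB_ROWS or not 0 <= lane < LANES:
        raise ValueError("field out of range")
    return (wrap << 7) | (index << 2) | lane


def decode_tag(tag):
    """Return (wrap bit, row index, lane) from an 8-bit tag."""
    return (tag >> 7) & 0x1, (tag >> 2) & 0x1F, tag & 0x3


def is_older(tag_a, tag_b):
    """True if tag_a was dispatched before tag_b (both assumed in flight)."""
    wrap_a, index_a, lane_a = decode_tag(tag_a)
    wrap_b, index_b, lane_b = decode_tag(tag_b)
    if wrap_a == wrap_b:
        # Same pass through the buffer: smaller (index, lane) is older.
        return (index_a, lane_a) < (index_b, lane_b)
    # Different wrap bits: the counter wrapped between the two dispatches,
    # so the instruction with the larger index is the older one.
    return (index_a, lane_a) > (index_b, lane_b)
```

The wrap bit is what lets a simple comparison give the right answer when row 23 is older than row 0; without it, ordering two in-flight instructions would need the current head pointer.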

More on the RoB

• What is basically happening is that we have three RoBs
  – Each one size 24
  – We cycle through each one so that none get ahead of the other.
  – Reduces read/write ports!
  – “Banking”

Mispredictions

• It looks like they wait until retirement to resolve all exceptions.
  – Mispredictions are treated as exceptions!

• They just clear everything and have the retired registers overwrite the speculative ones in the IFFRF

More details.

• Each x86 instruction can launch both an ALU and an AGU operation
  – Because x86 has lots of memory operations this makes sense.
• ALUs broadcast result tag one cycle early
  – So RS can launch data to the ALU before data arrives.


Class summary

• Major topics
  – ILP in hardware (Out-of-order processors)
    • How they work AND why we use them
  – Caches and Virtual Memory
  – Multi-processor
  – ILP in software (Compiler, IA-64)
  – Power
• Less major topics
  – Memory disambiguation
  – Branch prediction
    • Direction and target
  – Advanced OoO issues
    • Superscalar, instruction scheduling, multi-threading, etc.

The big questions

• What is computer architecture?

• What are the metrics of performance?

• What are the techniques we use to maximize these metrics?

ILP in hardware (1/2)

• ILP definitions
  – Hazards vs dependencies
    • Data, Name and Control dependencies
  – What ILP means and finding it.
• Dynamic Scheduling
  – Tomasulo’s (three versions!)
    • You can be promised a question on this!
• Branch Prediction
  – Local, global, hybrid/correlating
    • Tournament and gshare
  – BTBs
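As a reminder of how gshare works, here is a minimal sketch of a direction predictor that XORs the branch PC with a global history register to index a table of 2-bit saturating counters. The class name, the default table size and the initial counter value are illustrative choices rather than anything fixed by the course material, and the target side (the BTB) is not modelled.

```python
class GShare:
    """Toy gshare direction predictor (illustrative parameters)."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                         # global branch history register
        self.counters = [1] * (1 << index_bits)  # 2-bit counters, weakly not-taken

    def _index(self, pc):
        # gshare: XOR the PC with the global history, keep the low index bits.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 predict taken, 0 and 1 predict not taken.
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def count_mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong
```

A strictly alternating branch is a useful test case: a single 2-bit counter with no history cannot track it, while this sketch settles on two history patterns that each train their own counter.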

ILP in hardware (2/2)

• Multiple Issue
  – Static
    • Static Superscalar
    • VLIW
  – Dynamic superscalar
• Speculation
  – Branch, data
• ILP limit studies

ILP in hardware: Questions

• True or False
  1. The original T-algorithm only allows reordering within basic blocks.
  2. In P6, if it weren’t for precise interrupts, it would be okay to retire instructions out-of-order as long as they had finished executing and a branch isn’t skipped over.
  3. ILP in hardware is limited in scope due to the “instruction window” which is basically the size of the RS.

Quick idea: SMT

• One processor, two threads.

Caching (1/2)

• There is a huge amount of stuff associated with caching. The important stuff:
  – Locality
    • Temporal/Spatial
    • 3’Cs model
    • Stack distance model
  – Nuts-and-bolts
    • Replacement policies (LRU, pseudo-LRU)
    • Performance (hit rate, Thit; Tmiss, average access time)
    • Write back/Write thru
    • Block size
  – Basic improvement
    • Multi-level cache
    • Critical word first
    • Write buffers
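The performance terms above combine in the usual average-access-time formula: hit time plus miss rate times miss penalty. The sketch below applies it to one level and then to a two-level hierarchy, where the L1 miss penalty is itself the average access time of the L2. Function and parameter names are illustrative, and the L2 miss rate is assumed to be a local miss rate (misses per L2 access).

```python
def average_access_time(hit_time, miss_rate, miss_penalty):
    """Average access time = Thit + miss_rate * Tmiss (same time unit throughout)."""
    return hit_time + miss_rate * miss_penalty


def two_level_access_time(l1_hit, l1_miss_rate, l2_hit, l2_local_miss_rate, memory_time):
    """L1 backed by L2 backed by memory; the L1 miss penalty is the L2's average time."""
    l1_miss_penalty = average_access_time(l2_hit, l2_local_miss_rate, memory_time)
    return average_access_time(l1_hit, l1_miss_rate, l1_miss_penalty)
```

For instance, with a 1-cycle L1 hit, a 5% L1 miss rate, a 10-cycle L2 hit, a 20% local L2 miss rate and a 100-cycle memory, the L1 miss penalty is 10 + 0.2 * 100 = 30 cycles, so the average access time is 1 + 0.05 * 30 = 2.5 cycles.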

Caching (2/2)

• Non-standard caches
  – Hash
  – Victim
  – Skew
• Misc.
  – Virtual addresses and caching
  – Impact of prefetching
  – Latency hiding with OO execution

Cache: Questions (1/2)

• Changing __________ has an impact on compulsory misses.

• A victim cache is more likely to help with ________ than ________ though it can help both (3’Cs)

• At least _____ bits are required to keep exact track of LRU in a 5-way associative cache.

Cache question (2/2)

• A ____________ cache has a number of sets equal to the number of lines in the cache.

• A fully-associative cache with N lines will miss an access that has a stack distance of ________ (state the largest range you can).

Multi-processor

• Amdahl’s law as it applies to MP.
• Bus-based multi-processor
  – Snooping
  – MESI
  – Bus transaction types (BRL etc.)
• Distributed-shared
  – Directory schemes
• Synchronization
  – Critical sections
  – Spin-locks
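For the first bullet, Amdahl’s law applied to multiprocessors says that if a fraction p of the work can be spread over n processors and the rest stays serial, the speedup is 1 / ((1 - p) + p / n). A minimal sketch, with illustrative function names:

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup when `parallel_fraction` of the work scales across `processors`."""
    if not 0.0 <= parallel_fraction <= 1.0 or processors < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and processors >= 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def amdahl_limit(parallel_fraction):
    """Upper bound on speedup as the processor count grows without bound."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)
```

With 90% of the work parallel, 10 processors give a speedup of about 5.3, and no number of processors can exceed 10.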

Multi-processor: Question

• Under the MESI protocol what is the advantage of having a distinct clean and dirty exclusive state?

Software techniques for ILP (1/2)

• Pipeline scheduling
  – Reordering instructions in a basic block to remove pipe stalls
  – Loop unrolling
• Static information passed to processor
  – Static branch prediction
  – Static dependence information
• Loop issues
  – Detecting loop dependencies
  – Software pipelining
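To make the loop unrolling transformation concrete, the sketch below shows a dot product before and after unrolling by four, with separate accumulators so the four multiply-adds in the body do not depend on each other, plus a cleanup loop for leftover iterations. It is written in Python only to show the shape of the transformation a compiler would apply to machine code; the function names are illustrative and no speedup should be expected from the Python itself.

```python
def dot_rolled(a, b):
    """Original loop: one multiply-add and one loop test per element."""
    total = 0
    for i in range(len(a)):
        total += a[i] * b[i]
    return total


def dot_unrolled4(a, b):
    """Same computation unrolled by 4 with independent accumulators."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0
    i = 0
    # Main body: four independent multiply-adds per loop test.
    while i + 4 <= n:
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    total = s0 + s1 + s2 + s3
    # Cleanup loop: handles the 0..3 elements left when n is not a multiple of 4.
    while i < n:
        total += a[i] * b[i]
        i += 1
    return total
```

With integer inputs the two versions return identical results; with floating point the different summation order can change the rounding slightly.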

Software techniques for ILP (2/2)

• Global code scheduling
  – Predicated instruction and CMOV
  – Memory reference speculation
  – Issues with preserving exception behavior
• IA-64 as a case study of hardware support for software ILP techniques
  – Speculative loads
  – Advanced loads
  – Software pipelining optimizations

Software techniques for ILP: Questions

• What is the most significant disadvantage of loop unrolling?

• Using CMOV re-write the following code snippet, removing the branch. Don’t change exception behavior and assume DIV only causes an exception if R3=0

      BNE R1 R2 skip
      R1=R2/R3
skip: nop

Power

• Understand why it’s important
• Power vs. Energy
• How it’s related to the existence of multi-core
• Understand voltage scaling issues
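A rough way to connect these bullets is the standard first-order CMOS model, which the slides do not spell out and is assumed here: dynamic power is roughly activity * C * V^2 * f, and energy is power multiplied by time. The sketch below uses hypothetical function names, ignores leakage, and the 0.7 voltage factor in the usage note is an assumed illustrative value, not a measured one.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order dynamic power: activity * C * V^2 * f (leakage ignored)."""
    return activity * capacitance * voltage ** 2 * frequency


def energy(power, seconds):
    """Energy is power integrated over time; constant power assumed here."""
    return power * seconds


def relative_power(cores, voltage_scale, frequency_scale):
    """Power of `cores` scaled cores relative to one unscaled core."""
    baseline = dynamic_power(1.0, 1.0, 1.0)
    scaled = cores * dynamic_power(1.0, voltage_scale, frequency_scale)
    return scaled / baseline
```

Under this model, relative_power(2, 0.7, 0.5) compares two cores at half the frequency and an assumed 0.7x supply voltage against one full-speed core: the same nominal throughput on perfectly parallel work for roughly half the power. That is the usual argument for multi-core, and it holds only as far as the voltage can actually be lowered along with the frequency.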
