Last lecture
• Some misc. stuff
• An older real processor
• Class review/overview.
Misc. Status issues
• Saturday (4/18) @9pm
  – Project due.
• Sunday 4/19:
  – Review session 4-6pm (DOW 1017)
  – Pizza House 7-9pm (RSVP request posted, please respond by tomorrow)
• Tuesday (4/21):
  – Project talk groups
    • Sign up if you haven’t yet done so!
• Tuesday (4/21) – Written report due (via e-mail) @9pm
• Wednesday (4/22) – Review session 3-4:30pm (1500 EECS)
• Thursday (4/23)
  – Exam at 4pm in our classroom. That’s 4pm sharp, not Michigan time!
Office hour changes
• Friday 4/17 moved to 10:30am-12 in my office (4632 Beyster)
• Tuesday 4/21 office hours moved to 3:15-4:45 in my office
Stuff still to do
• Oral report
  – Don’t forget to be there for the whole hour (or longer if your group is during class time)
  – PowerPoint or other slides
    • Either bring portable or USB stick
• Written report
  – Due 9pm Tuesday via e-mail.
AMD 64-bit core
Most taken from http://www.chip-architect.com/
Bit-interleaved busses running “North-South”
Integer
Decode/Dispatch
• 3 types of instructions
  – Direct path
    • RISC-like
  – Vector path
    • Broken into smaller instructions via microcode.
  – Double
    • 128-bit instructions which can be broken into 2 independent 64-bit instructions (called Doubles)
    • Others are done via microcode
    • Most 128-bit SSE and SSE2 instructions are made into Doubles.
RS
• Each cycle an instruction is issued into one of 3 lanes.
  – Each lane has
    • 8 RSs
    • 1 ALU
    • 1 AGU (Address Generation Unit)
  – Each RS sees broadcasts from all ALUs, AGUs, L/S units etc.
Rename
• Break the physical register file into 2 parts (sort of like P6 scheme with ARF/RoB)
  – 72 in-flight instructions are kept in the RoB
• The other structure is the IFFRF: Integer Future File and Register File
  – 16 registers of committed state
  – 16 “future registers”
  – 8 scratch-pad registers
Future file
• In the P6 scheme we had to look 3 places for the data
  – The PRF
  – The RoB
  – The CDB (later)
• Here we look in the FF or the CDB-like-things later.
  – The FF holds the speculative value if it is known.
  – At execution complete instructions check to see if they were the last thing to dispatch that writes to a given physical register.
    • This is done by tagging the FF with the RoB number.
  – If they were the last to have that AR as a destination, they update the FF.
How do we use the FF?
• At dispatch we:
  – Check the FF for source operands
  – Reserve a spot in the RoB
  – Place our tag (RoB number) in the FF
  – Mark the FF entry as invalid
• At EX complete we:
  – Send RoB number and data to the CDB
  – Send data to the RoB
  – Update FF if tag matches
• At retire
  – Update ARF value (from RoB)
• At mispredict
  – Copy ARF value into FF.
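The four steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not the actual AMD design: the class name `FutureFileCore`, its fields, and the dictionary used for the RoB are hypothetical, the CDB broadcast to waiting RSs is left out, and in-order retirement is left to the caller.

```python
class FutureFileCore:
    """Toy model of dispatch / EX complete / retire / mispredict with a future file."""

    def __init__(self, num_regs=16):
        self.arf = [0] * num_regs          # committed (architectural) state
        self.ff_value = [0] * num_regs     # speculative values
        self.ff_valid = [True] * num_regs  # False while a writer is still in flight
        self.ff_tag = [None] * num_regs    # RoB number of the last dispatched writer
        self.rob = {}                      # RoB number -> [dest register, value or None]
        self.next_rob = 0                  # hypothetical, unbounded RoB numbering

    def dispatch(self, dest, srcs):
        # Check the FF for source operands: a value if valid, else the tag to wait on.
        operands = [
            ("value", self.ff_value[s]) if self.ff_valid[s] else ("tag", self.ff_tag[s])
            for s in srcs
        ]
        # Reserve a spot in the RoB, place our tag in the FF, mark the entry invalid.
        tag = self.next_rob
        self.next_rob += 1
        self.rob[tag] = [dest, None]
        self.ff_tag[dest] = tag
        self.ff_valid[dest] = False
        return tag, operands

    def complete(self, tag, value):
        # Send data to the RoB (the CDB broadcast is not modeled here).
        dest = self.rob[tag][0]
        self.rob[tag][1] = value
        # Update the FF only if we are still the last writer of this register.
        if self.ff_tag[dest] == tag:
            self.ff_value[dest] = value
            self.ff_valid[dest] = True

    def retire(self, tag):
        # Update the ARF value from the RoB. The caller must retire in program order.
        dest, value = self.rob.pop(tag)
        self.arf[dest] = value

    def mispredict(self):
        # Clear everything in flight and copy the ARF values into the FF.
        self.rob.clear()
        self.ff_value = list(self.arf)
        self.ff_valid = [True] * len(self.arf)
        self.ff_tag = [None] * len(self.arf)
```

Note that `complete` never reads the RoB for operand data, which is the point of the FF: the RoB entry is written once at EX complete and read once at retire.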
What did the FF buy us?
• P6-like advantages
  – No free-list for PRF
  – Can just clear the RAT on mis-predict.
• But no need to access the RoB looking for data
  – RoB data only written once (EX complete) and only read once (Commit)
• Some pain
  – Early branch resolution looks hard
Re-Order-Buffer Tag definition
ROB: An 8-bit descriptor for 72 entries
[Figure: tag bit layout, bit 7 to bit 0. Labels: wrap bit; Instruction In Flight Number; re-order buffer index 0..23; sub-index 0..2]
1) A sub-index 0, 1 or 2 which identifies from which of the three lanes the instruction was dispatched.
2) A value 0..23 that identifies the “cycle” in which the instruction was dispatched. The “cycle counter” wraps to 0 after reaching 23.
3) A wrap bit. When two instructions have different wrap bits then the cycle counter has wrapped between the dispatches.
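As a rough sketch of how such a tag could be packed and compared for age: the exact bit positions are an assumption here (wrap bit in bit 7, index in bits 6..2, sub-index in bits 1..0), as is the convention that lane 0 is oldest within a cycle. The function names are hypothetical.

```python
ROB_ROWS = 24   # "cycle counter" values 0..23
LANES = 3       # sub-index values 0..2

def pack_tag(wrap, index, lane):
    """Pack the three fields into 8 bits (assumed layout: 1 + 5 + 2 bits)."""
    assert wrap in (0, 1) and 0 <= index < ROB_ROWS and 0 <= lane < LANES
    return (wrap << 7) | (index << 2) | lane

def unpack_tag(tag):
    """Return (wrap, index, lane) from an 8-bit tag."""
    return (tag >> 7) & 1, (tag >> 2) & 0x1F, tag & 0x3

def next_row(wrap, index):
    """Advance the cycle counter, flipping the wrap bit when it wraps to 0."""
    if index == ROB_ROWS - 1:
        return wrap ^ 1, 0
    return wrap, index + 1

def is_older(tag_a, tag_b):
    """True if tag_a was dispatched before tag_b (both assumed still in flight)."""
    wrap_a, idx_a, lane_a = unpack_tag(tag_a)
    wrap_b, idx_b, lane_b = unpack_tag(tag_b)
    if wrap_a == wrap_b:
        return (idx_a, lane_a) < (idx_b, lane_b)
    # Different wrap bits: the counter wrapped between the two dispatches,
    # so the entry with the larger index is the older one.
    return (idx_a, lane_a) > (idx_b, lane_b)
```

One wrap bit is enough because at most 24 rows are in flight at once, so two live tags can never be more than one wrap apart.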
More on the RoB
• What is basically happening is that we have three RoBs
  – Each one size 24
  – We cycle through each one so that none get ahead of the other.
  – Reduces read/write ports!
  – “Banking”
Mispredictions
• It looks like they wait until retirement to resolve all exceptions.
  – Mispredictions are treated as exceptions!
• They just clear everything and have the retired registers overwrite the speculative ones in the IFFRF
More details.
• Each x86 instruction can launch both an ALU and an AGU operation
  – Because x86 has lots of memory operations this makes sense.
• ALUs broadcast result tag one cycle early
  – So RS can launch data to the ALU before data arrives.
Class summary
• Major topics
  – ILP in hardware (Out-of-order processors)
    • How they work AND why we use them
  – Caches and Virtual Memory
  – Multi-processor
  – ILP in software (Compiler, IA-64)
  – Power
• Less major topics
  – Memory disambiguation
  – Branch prediction
    • Direction and target
  – Advanced OoO issues
    • Superscalar, instruction scheduling, multi-threading, etc.
The big questions
• What is computer architecture?
• What are the metrics of performance?
• What are the techniques we use to maximize these metrics?
ILP in hardware (1/2)
• ILP definitions
  – Hazards vs dependencies
    • Data, Name and Control dependencies
  – What ILP means and finding it.
• Dynamic Scheduling
  – Tomasulo’s (three versions!)
    • You can be promised a question on this!
• Branch Prediction
  – Local, global, hybrid/correlating
    • Tournament and gshare
  – BTBs
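As a reminder of how gshare works, here is a minimal sketch: the global history is XORed with the branch PC to index a table of 2-bit saturating counters. The class name, the table size, and the use of the raw PC (real designs usually drop the low PC bits first) are assumptions for illustration.

```python
class Gshare:
    """Minimal gshare direction predictor with 2-bit saturating counters."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0                         # global branch history register
        self.counters = [1] * (1 << index_bits)  # all start weakly not-taken

    def _index(self, pc):
        # gshare: XOR the PC with the global history to pick a counter.
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Counter values 2 and 3 mean "predict taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the history.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

A branch that strictly alternates taken/not-taken defeats a single 2-bit counter, but here the two history patterns map to two different counters, so the predictor learns it after a short warm-up.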
ILP in hardware (2/2)
• Multiple Issue
  – Static
    • Static Superscalar
    • VLIW
  – Dynamic superscalar
• Speculation
  – Branch, data
• ILP limit studies
ILP in hardware: Questions
• True or False
  1. The original T-algorithm only allows reordering within basic blocks
  2. In P6, if it weren’t for precise interrupts, it would be okay to retire instructions out-of-order as long as they had finished executing and a branch isn’t skipped over.
  3. ILP in hardware is limited in scope due to the “instruction window” which is basically the size of the RS.
Quick idea: SMT
• One processor, two threads.
Caching (1/2)
• There is a huge amount of stuff associated with caching. The important stuff:
  – Locality
    • Temporal/Spatial
    • 3’Cs model
    • Stack distance model
  – Nuts-and-bolts
    • Replacement policies (LRU, pseudo-LRU)
    • Performance (hit rate, Thit, Tmiss, average access time)
    • Write back/Write thru
    • Block size
  – Basic improvement
    • Multi-level cache
    • Critical word first
    • Write buffers
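The average access time calculation, including the multi-level case, can be sketched as below. This assumes the usual form (hit time plus miss rate times miss penalty), with the miss term treated as the extra time beyond a hit and the L2 miss rate taken as a local miss rate; the function names and the numbers in the usage note are made up for illustration.

```python
def average_access_time(t_hit, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty."""
    return t_hit + miss_rate * miss_penalty

def two_level_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_time):
    """The L1 miss penalty is itself the average access time of the L2."""
    l1_penalty = average_access_time(l2_hit, l2_miss_rate, mem_time)
    return average_access_time(l1_hit, l1_miss_rate, l1_penalty)
```

For example, `two_level_access_time(1, 0.05, 10, 0.2, 100)` gives an L1 miss penalty of 30 cycles and an average of 2.5 cycles per access.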
Caching (2/2)
• Non-standard caches
  – Hash
  – Victim
  – Skew
• Misc.
  – Virtual addresses and caching
  – Impact of prefetching
  – Latency hiding with OO execution
Cache: Questions (1/2)
• Changing __________ has an impact on compulsory misses.
• A victim cache is more likely to help with ________ than ________ though it can help both (3’Cs)
• At least _____ bits are required to keep exact track of LRU in a 5-way associative cache.
Cache: Questions (2/2)
• A ____________ cache has a number of sets equal to the number of lines in the cache.
• A fully-associative cache with N lines will miss an access that has a stack distance of ________ (state the largest range you can).
Multi-processor
• Amdahl’s law as it applies to MP.
• Bus-based multi-processor
  – Snooping
  – MESI
  – Bus transaction types (BRL etc.)
• Distributed-shared
  – Directory schemes
• Synchronization
  – Critical sections
  – Spin-locks
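A quick refresher on Amdahl’s law in the MP setting, as a small sketch. It assumes the parallel fraction speeds up perfectly across the processors and ignores communication and synchronization costs; the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup = 1 / (serial fraction + parallel fraction / processor count)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

With 90% of the work parallel, 10 processors give a speedup of about 5.3, and no number of processors can push it past 10.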
Multi-processor: Question
• Under the MESI protocol what is the advantage of having a distinct clean and dirty exclusive state?
Software techniques for ILP (1/2)
• Pipeline scheduling
  – Reordering instructions in a basic block to remove pipe stalls
  – Loop unrolling
• Static information passed to processor
  – Static branch prediction
  – Static dependence information
• Loop issues
  – Detecting loop dependencies
  – Software pipelining
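To recall what loop unrolling looks like as a source-level transformation, here is a small sketch. Python is used only to show the shape of the code a compiler would produce; the function names and the unroll factor of 4 are arbitrary choices for illustration.

```python
def sum_rolled(values):
    """Original loop: one add and one loop test per element."""
    total = 0
    for v in values:
        total += v
    return total

def sum_unrolled_by_4(values):
    """Unrolled by 4, with four accumulators so the adds are independent."""
    s0 = s1 = s2 = s3 = 0
    n = len(values)
    main = n - n % 4                 # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        s0 += values[i]
        s1 += values[i + 1]
        s2 += values[i + 2]
        s3 += values[i + 3]
    total = s0 + s1 + s2 + s3
    for i in range(main, n):         # clean-up loop for the leftover iterations
        total += values[i]
    return total
```

The unrolled body has one loop test per four elements, and the separate accumulators give a scheduler four independent dependence chains to interleave.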
Software techniques for ILP (2/2)
• Global code scheduling
  – Predicated instruction and CMOV
  – Memory reference speculation
  – Issues with preserving exception behavior
• IA-64 as a case study of hardware support for software ILP techniques
  – Speculative loads
  – Advanced loads
  – Software pipelining optimizations
Software techniques for ILP: Questions
• What is the most significant disadvantage of loop unrolling?
• Using CMOV re-write the following code snippet, removing the branch. Don’t change exception behavior and assume DIV only causes an exception if R3=0
        BNE R1 R2 skip
        R1=R2/R3
  skip: nop
Power
• Understand why it’s important
• Power vs. Energy
• How it’s related to the existence of multi-core
• Understand voltage scaling issues
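The power vs. energy distinction and the effect of voltage scaling can be sketched with the usual first-order dynamic power model, P = a·C·V²·f. This ignores leakage, and the example in the usage note further assumes frequency can be scaled down in proportion to voltage; the function names and numbers are illustrative only.

```python
def dynamic_power(capacitance, voltage, frequency, activity=1.0):
    """First-order dynamic power: activity * C * V^2 * f (leakage ignored)."""
    return activity * capacitance * voltage ** 2 * frequency

def energy(power, seconds):
    """Energy is power integrated over time; here power is constant."""
    return power * seconds
```

Halving both voltage and frequency cuts power by 8x. A fixed task then takes twice as long, so the energy for that task drops by 4x rather than 8x. That gap is one reason several slower cores can beat one fast core on power.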