Processor Level Parallelism
Improving the Pipeline
• Pipelined processor
  – Ideal speedup = number of stages
  – Branches and conflicts (hazards) mean diminishing returns beyond a certain point
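A common textbook refinement of "ideal speedup = number of stages" divides by one plus the average stall cycles per instruction. The sketch below assumes that model: balanced stages, no pipeline-register overhead, and an unpipelined machine that needs `stages` cycles per instruction. The function name is hypothetical.

```python
def pipeline_speedup(stages: int, stall_cycles_per_instr: float) -> float:
    """Speedup of a pipelined CPU over an unpipelined one.

    Assumed model: speedup = stages / (1 + average stall cycles per instruction).
    Stalls come from branches and other hazards (conflicts).
    """
    if stages < 1:
        raise ValueError("need at least one stage")
    if stall_cycles_per_instr < 0:
        raise ValueError("stall cycles cannot be negative")
    return stages / (1.0 + stall_cycles_per_instr)


if __name__ == "__main__":
    print(pipeline_speedup(5, 0.0))    # 5.0: ideal case, speedup = stages
    print(pipeline_speedup(10, 0.0))   # 10.0
    # Doubling the depth buys nothing if stalls rise to 1 cycle per instruction.
    print(pipeline_speedup(10, 1.0))   # 5.0
```

The last line illustrates the "limited returns" point: a deeper pipeline only pays off if stalls per instruction stay low.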
ILP
• Instruction Level Parallelism
  – Ability to run multiple instructions at the same time
Superscalar
• Superscalar: capable of running multiple instructions at a time
  – Multiple execution units
    • Widen the slowest part of the pipeline
Superscalar
• Multi-issue: start multiple instructions per clock
  – Parallel pipes
Superscalar
• Multi-issue pipeline feeding multiple execution units
Superscalar
• Issue: dependency checking just got MUCH harder…
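To see why dependencies get harder, here is a minimal sketch of the check a multi-issue front end might perform before starting several instructions in the same cycle. It assumes in-order issue with no register renaming, so only read-after-write (RAW) and write-after-write (WAW) conflicts inside the group block issue. `Instr`, `can_dual_issue`, `can_issue_group`, and `pairwise_checks` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    """A toy instruction: dest <- op(srcs). dest is None for e.g. stores."""
    op: str
    dest: Optional[str]
    srcs: tuple[str, ...]


def can_dual_issue(first: Instr, second: Instr) -> bool:
    """True if `second` may issue in the same cycle as the earlier `first`.

    RAW: second reads a register that first writes.
    WAW: both write the same register.
    WAR is assumed harmless because both read their operands at issue.
    """
    if first.dest is None:
        return True
    raw = first.dest in second.srcs
    waw = second.dest == first.dest
    return not (raw or waw)


def can_issue_group(group: list[Instr]) -> bool:
    """Every earlier/later pair in the issue group must be independent."""
    return all(
        can_dual_issue(group[i], group[j])
        for i in range(len(group))
        for j in range(i + 1, len(group))
    )


def pairwise_checks(width: int) -> int:
    """Comparisons needed per cycle: grows quadratically with issue width."""
    return width * (width - 1) // 2


if __name__ == "__main__":
    add = Instr("add", "r1", ("r2", "r3"))
    sub = Instr("sub", "r4", ("r1", "r5"))   # reads r1: RAW on add
    mul = Instr("mul", "r6", ("r2", "r7"))   # independent of add
    print(can_dual_issue(add, sub))          # False
    print(can_dual_issue(add, mul))          # True
    print(pairwise_checks(2), pairwise_checks(4), pairwise_checks(8))  # 1 6 28
```

A 2-wide machine needs one comparison per cycle; an 8-wide one needs 28, before even considering renaming or out-of-order issue.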
Superscalar Pro/Con
• Good
  – The hardware solves everything:
    • Hardware handles scheduling, registers, etc.
    • Compiler can still help matters
  – Binary compatibility
    • New hardware issues old instructions in a more efficient way
• Bad
  – Complex hardware
  – Limited scalability
VLIW
• VLIW: Very Long Instruction Word
  – One instruction contains multiple ops
VLIW
• Instructions are VERY large
  – 240 bits?
  – Wasted space addressed by bundles
    • No dependencies within a bundle
Who does work?
• Compiler assembles the long instructions
  – Reorders at compile time
• Compiler has more time and more information than the hardware
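A minimal sketch of the compiler's job: greedily packing straight-line code into fixed-width bundles, moving independent ops earlier and padding unused slots with NOPs. It assumes every op takes one bundle to produce its result and that registers are in single-assignment form (each written once, never read before that write), so only read-after-write order matters. `Op` and `pack_bundles` are hypothetical names, and real VLIW/EPIC compilers use far more elaborate schedulers.

```python
from dataclasses import dataclass

NOP = "nop"


@dataclass(frozen=True)
class Op:
    text: str              # printable form, e.g. "add r3, r1, r2"
    dest: str
    srcs: tuple[str, ...]


def pack_bundles(ops: list[Op], width: int = 3) -> list[list[str]]:
    """Place each op in the earliest bundle after its producers with a free slot."""
    if width < 1:
        raise ValueError("width must be at least 1")
    ready_at: dict[str, int] = {}   # register -> first bundle that may read it
    read: set[str] = set()          # registers read so far
    bundles: list[list[str]] = []
    for op in ops:
        if op.dest in ready_at or op.dest in read:
            raise ValueError(f"{op.dest} reused; rename registers first")
        slot = max((ready_at.get(r, 0) for r in op.srcs), default=0)
        while slot < len(bundles) and len(bundles[slot]) >= width:
            slot += 1               # bundle full: try the next one
        while len(bundles) <= slot:
            bundles.append([])
        bundles[slot].append(op.text)
        ready_at[op.dest] = slot + 1
        read.update(op.srcs)
    # Pad each bundle to full width: this is the wasted space.
    return [b + [NOP] * (width - len(b)) for b in bundles]


if __name__ == "__main__":
    program = [
        Op("add r3, r1, r2", "r3", ("r1", "r2")),
        Op("mul r4, r3, r3", "r4", ("r3", "r3")),   # depends on add
        Op("sub r5, r1, r6", "r5", ("r1", "r6")),
        Op("ld r7, 0(r8)", "r7", ("r8",)),
    ]
    for bundle in pack_bundles(program):
        print(bundle)
```

Here `sub` and `ld` are hoisted above `mul` into the first bundle, and the second bundle carries two NOPs; the hardware then issues each bundle without checking dependencies inside it.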
VLIW Uses
• Itanium:
  – EPIC: Explicitly Parallel Instruction Computing
  – 3-instruction bundles
VLIW Pro/Con
• Good
  – Simple hardware
    • Add new functional units with no new scheduling hardware
  – Better optimization in the compiler
• Bad
  – Binary compatibility: the compiler builds for one specific hardware implementation
  – Good compilers are HARD to write
ARM 15
• Modern CPU:
Processor Parallelism
• Process Parallelism : Run multiple instruction streams simultaneously
Process vs Thread
• Process: a running program
  – Own memory space
  – Has at least one thread
Process vs Thread
• Thread: an instruction sequence
  – Own registers/stack
  – Shares memory with the other threads in its process
Threaded Code
• Demo…
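A small stand-in illustration (not the original demo) of the process/thread split: threads in one process share memory, while each thread keeps its own local variables in its own call stack. `run_workers` is a hypothetical name. Note that in the default CPython build the global interpreter lock makes these threads interleave rather than execute Python code truly in parallel; the memory-sharing behavior is the point here.

```python
import threading


def run_workers(num_threads: int, increments: int) -> int:
    """Start threads that each count privately, then add to one shared total."""
    total = [0]                    # shared: every thread sees this same list
    lock = threading.Lock()        # serializes the read-modify-write below

    def worker() -> None:
        local_count = 0            # private: lives in this thread's own frame
        for _ in range(increments):
            local_count += 1
        with lock:
            total[0] += local_count

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                   # wait for every thread to finish
    return total[0]


if __name__ == "__main__":
    print(run_workers(4, 1000))    # 4000
```

Without the lock, two threads could read the same old value of `total[0]` and one update would be lost.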
Context Switching
• Four threads running in a 4-wide pipeline
  – Can't always fill all 4 issue slots
  – Bubbles from memory accesses, page faults, etc.
Context Switching
• Threads often have bubbles…
Multithreading
• Multithreading: alternate threads to maximize hardware use
  – Coarse: run until a stall, then switch
  – Fine: switch every cycle
  – Either one needs extra hardware
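A toy cycle-count model of the two policies on a single-issue core. Each thread is a made-up trace: entry `i` is how many stall cycles the thread must wait after issuing instruction `i`. Fine-grained picks the next ready thread round-robin every cycle; coarse-grained stays on one thread until it cannot issue, then switches and pays an assumed `switch_penalty` of idle cycles to drain and refill the pipeline. `simulate` and its parameters are hypothetical.

```python
from typing import Optional


def simulate(threads: list[list[int]], policy: str, switch_penalty: int = 0) -> int:
    """Cycles needed to issue every instruction under 'fine' or 'coarse'."""
    if policy not in ("fine", "coarse"):
        raise ValueError(f"unknown policy: {policy}")
    n = len(threads)
    pc = [0] * n          # index of each thread's next instruction
    ready = [0] * n       # first cycle each thread may issue again
    cycle = 0
    current = 0           # thread the coarse policy is running
    last = n - 1          # thread that issued most recently (fine policy)

    def has_work(t: int) -> bool:
        return pc[t] < len(threads[t])

    def can_issue(t: int) -> bool:
        return has_work(t) and ready[t] <= cycle

    def next_ready(start: int, count: int) -> Optional[int]:
        # First ready thread among start+1 .. start+count (mod n).
        for k in range(1, count + 1):
            t = (start + k) % n
            if can_issue(t):
                return t
        return None

    while any(has_work(t) for t in range(n)):
        if policy == "fine":
            chosen = next_ready(last, n)
        elif can_issue(current):
            chosen = current
        else:
            chosen = next_ready(current, n - 1)   # look at other threads only
            if chosen is not None:
                cycle += switch_penalty           # cost of a thread switch
                current = chosen
        if chosen is None:
            cycle += 1                            # bubble: nothing can issue
            continue
        ready[chosen] = cycle + 1 + threads[chosen][pc[chosen]]
        pc[chosen] += 1
        last = chosen
        cycle += 1
    return cycle


if __name__ == "__main__":
    one = [0, 3, 0]                               # 3 instrs, 3-cycle stall after the 2nd
    print(simulate([one], "fine"))                # 6: one thread alone
    print(simulate([one, one], "fine"))           # 8 (vs 12 run back to back)
    print(simulate([one, one], "coarse"))         # 8 with a free switch
    print(simulate([one, one], "coarse", 2))      # 12: switch cost eats the gain
```

The numbers depend entirely on the assumed trace and penalty; the sketch only shows the mechanism by which each policy hides stall bubbles.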
Multithreading Superscalar
• A 2-instruction-wide pipeline with multithreading:
  – Still only one thread issues per cycle
[Figure: fine-grained vs. coarse-grained multithreading]
SMT
• SMT: Simultaneous Multithreading
  – a.k.a. Hyper-Threading (Intel)
• Issue ops from multiple threads in one cycle
• Maximize use of functional units
  – But must track which thread's registers each instruction goes with…
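A toy model of SMT issue: each cycle, up to `width` slots are filled from several threads, and every filled slot is tagged with its thread ID, which is the bookkeeping that lets results go to the right thread's registers. It assumes a made-up trace of how many ops become ready per thread per cycle, that unissued ready ops wait for a later cycle, and that priority rotates among threads for fairness. `smt_issue` and `utilization` are hypothetical names.

```python
def smt_issue(arrivals: list[list[int]], width: int) -> list[list[int]]:
    """Per cycle, the thread ID tag of each filled issue slot."""
    if width < 1:
        raise ValueError("width must be at least 1")
    n = len(arrivals)
    horizon = max((len(a) for a in arrivals), default=0)
    backlog = [0] * n                  # ready-but-unissued ops per thread
    schedule: list[list[int]] = []
    cycle = 0
    while cycle < horizon or any(backlog):
        for t in range(n):
            if cycle < len(arrivals[t]):
                backlog[t] += arrivals[t][cycle]
        slots: list[int] = []
        for k in range(n):
            t = (cycle + k) % n        # rotate which thread gets first pick
            take = min(backlog[t], width - len(slots))
            slots.extend([t] * take)   # tag each slot with its thread ID
            backlog[t] -= take
        schedule.append(slots)
        cycle += 1
    return schedule


def utilization(schedule: list[list[int]], width: int) -> float:
    """Fraction of all issue slots that were filled."""
    total = len(schedule) * width
    return sum(len(s) for s in schedule) / total if total else 0.0


if __name__ == "__main__":
    a = [2, 0, 1, 3]                   # ops thread 0 has ready each cycle
    b = [1, 2, 0, 2]                   # ops thread 1 has ready each cycle
    alone = smt_issue([a], 4)
    print(alone, utilization(alone, 4))   # [[0, 0], [], [0], [0, 0, 0]] 0.375
    both = smt_issue([a, b], 4)
    print(both, utilization(both, 4))     # [[0, 0, 1], [1, 1], [0], [1, 1, 0, 0], [0]] 0.55
```

Running both threads together fills 55% of the slots in 5 cycles, versus 8 cycles if each ran alone; a real core would also be limited by shared or split resources, as the next slide notes.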
SMT Challenges
• Resources must be duplicated or split
  – Split them too thin and performance suffers…
  – Duplicate everything and you aren't maximizing use of the hardware…
Intel vs AMD
• Variations on SMT
Getting Faster
• Pipelining helps to a point
• Superscalar/VLIW helps to a point
• SMT helps a bit
• Chips getting faster
• Only so much speedup possible
  – Power = heat
  – $\text{Power} \propto C V^2 f$
    • C = capacitance, how well it “stores” a charge
    • V = voltage
    • f = frequency, i.e., how fast the clock runs (e.g., 3 GHz)
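Because voltage enters squared, raising the clock gets expensive quickly if the voltage must rise too. A minimal sketch of the proportionality Power ∝ C·V²·f, expressed as ratios to a baseline chip and ignoring static (leakage) power. The function name is hypothetical, and the scenario where a 1.5× clock also needs 1.5× voltage is an illustrative assumption, since real voltage/frequency curves are chip-specific.

```python
def relative_power(c_scale: float = 1.0, v_scale: float = 1.0, f_scale: float = 1.0) -> float:
    """Dynamic power relative to a baseline, using Power ∝ C * V^2 * f.

    Each argument is the ratio of the new value to the baseline value.
    """
    return c_scale * v_scale ** 2 * f_scale


if __name__ == "__main__":
    print(relative_power(f_scale=1.5))                 # 1.5: clock up, same voltage
    print(relative_power(v_scale=1.5, f_scale=1.5))    # 3.375: voltage must rise too
```

In the assumed scenario, 50% more clock costs more than 3× the power, and that power leaves the chip as heat.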
Power Density Prediction circa 2000
[Figure: power density (W/cm², log scale) vs. year, 1970–2010, for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, P6, and Core 2, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: S. Borkar (Intel)]
Adapted from UC Berkeley "The Beauty and Joy of Computing"
Moore's Law Related Curves
Going Multi-core Helps Energy Efficiency
• Power of a typical integrated circuit: $P \propto C V^2 f$
  – C = capacitance, how well it “stores” a charge
  – V = voltage
  – f = frequency, i.e., how fast the clock runs (e.g., 3 GHz)
William Holt, HOT Chips 2005
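A worked version of the multi-core argument, under stated assumptions: the work splits perfectly across two cores, each core has the same switched capacitance C as the original single core, and running each core at half the clock lets the supply voltage drop by some factor s < 1 (s is a hypothetical parameter; the achievable value depends on the chip and process).

```latex
\begin{aligned}
P_{1} &\propto C\,V^{2} f \\
P_{2} &\propto 2 \cdot C\,(sV)^{2}\,\frac{f}{2} = s^{2}\,C\,V^{2} f \\
\frac{P_{2}}{P_{1}} &= s^{2}
\end{aligned}
```

Aggregate throughput is unchanged (two cores at f/2), so any voltage reduction cuts power by s²; for example, s = 0.8 would give 0.64 of the single-core power.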