Advanced Architecture
Yung-Yu Chuang, with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgewick and Kevin Wayne
Basic architecture
Basic microcomputer design
- clock synchronizes CPU operations
- control unit (CU) coordinates the sequence of execution steps
- ALU performs arithmetic and logic operations
Basic microcomputer design
- The memory storage unit holds instructions and data for a running program
- A bus is a group of wires that transfers data from one part to another (data, address, and control buses)
Clock
- synchronizes all CPU and bus operations
- machine (clock) cycle measures the time of a single operation
- clock is used to trigger events
- basic unit of time: at 1 GHz, one clock cycle = 1 ns
- an instruction can take multiple cycles to complete, e.g. multiply on the 8088 takes 50 cycles
Instruction execution cycle
- Fetch
- Decode
- Fetch operands
- Execute
- Store output
(driven by the program counter and the instruction queue)
Pipeline
Multi-stage pipeline
- Pipelining makes it possible for the processor to execute instructions in parallel
- Instruction execution is divided into discrete stages
- A non-pipelined processor (e.g. the 80386) wastes many cycles
Pipelined execution
- More efficient use of cycles, greater throughput of instructions (the 80486 started to use pipelining)
- For k stages and n instructions, the number of required cycles is k + (n - 1), compared with k * n without pipelining
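A worked example (numbers chosen for illustration, not from the slides): with k = 5 stages and n = 100 instructions, pipelined execution needs 5 + (100 - 1) = 104 cycles, versus 5 * 100 = 500 cycles without pipelining, a speedup of almost 5x.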
Pipelined execution
- Pipelining requires buffers; each buffer holds a single value
- Ideal scenario: equal work for each stage; sometimes this is not possible
- The slowest stage determines the flow rate of the entire pipeline
Pipelined execution
Some reasons for unequal work among stages:
- A complex step cannot be subdivided conveniently
- An operation takes a variable amount of time to execute, e.g. operand fetch time depends on where the operands are located: registers, cache, or memory
- Complexity depends on the type of operation: an add may take one cycle, a multiply may take several cycles
Pipelined execution
- Operand fetch of I2 takes three cycles, so the pipeline stalls for two cycles
- Stalls are caused by hazards and reduce overall throughput
Wasted cycles (pipelined)
- When one of the stages requires two or more clock cycles, clock cycles are again wasted
- For k stages and n instructions (with one two-cycle stage), the number of required cycles is k + (2n - 1)
Superscalar
- A superscalar processor has multiple execution pipelines. In the following, note that stage S4 has left and right pipelines (u and v)
- For k stages and n instructions, the number of required cycles is k + n
- Pentium: 2 pipelines; Pentium Pro: 3
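Worked comparison (illustrative numbers, continuing the example above): with k = 5 and n = 100, a pipeline whose slowest stage takes two cycles needs 5 + (2*100 - 1) = 204 cycles, while the superscalar version that duplicates the slow stage needs only 5 + 100 = 105 cycles.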
Pipeline stages
- Pentium 3: 10
- Pentium 4: 20~31
- Next-generation micro-architecture: 14
- ARM7: 3
Hazards
Three types of hazards:
- Resource hazards: two or more instructions use the same resource; also called structural hazards
- Data hazards: caused by data dependencies between instructions, e.g. a result produced by I1 is read by I2
- Control hazards: sequential execution suits pipelining by default; altering control flow (e.g., branching) causes problems, introducing control dependencies
Data hazards

    add r1, r2, #10    ; write r1
    sub r3, r1, #20    ; read r1

[Pipeline diagram: each instruction passes through fetch, decode, reg, ALU, wb; the sub stalls until the add writes r1 back]
Data hazards
- Forwarding: provides the output result as soon as possible

    add r1, r2, #10    ; write r1
    sub r3, r1, #20    ; read r1

[Pipeline diagram: with forwarding, the ALU result of the add is fed directly to the sub, shortening the stall]
Control hazards

    bz r1, target
    add r2, r4, 0
    ...
target:
    add r2, r3, 0

[Pipeline diagram: the instructions fetched after the branch must be discarded if the branch is taken]
Control hazards
- Branches alter control flow and require special attention in pipelining
- We need to throw away some instructions in the pipeline, depending on when we know the branch is taken
- Here the pipeline wastes three clock cycles, called the branch penalty
- Reducing the branch penalty: determine the branch decision early
Control hazards
- Delayed branch execution effectively reduces the branch penalty
- We always fetch the instruction following the branch; why throw it away?
- Place a useful instruction there to execute; this position is called the delay slot
    ; original                ; with delayed branch
    add R2,R3,R4              branch target
    branch target             add R2,R3,R4    ; delay slot
    sub R5,R6,R7              sub R5,R6,R7
    . . .                     . . .
Branch prediction
Three prediction strategies:
- Fixed: the prediction is fixed, e.g. branch-never-taken; not appropriate for loop structures
- Static: the strategy depends on the branch type: conditional branches are predicted not taken, loops always taken
- Dynamic: uses run-time history to make more accurate predictions
Branch prediction
Static prediction improves accuracy over fixed prediction:

Instruction type     | Distribution (%) | Prediction: branch taken? | Correct prediction (%)
Unconditional branch | 70*0.4 = 28      | Yes                       | 28
Conditional branch   | 70*0.6 = 42      | No                        | 42*0.6 = 25.2
Loop                 | 10               | Yes                       | 10*0.9 = 9
Call/return          | 20               | Yes                       | 20

Overall prediction accuracy = 28 + 25.2 + 9 + 20 = 82.2%
Branch prediction
- Dynamic branch prediction uses run-time history: the past n executions of the branch are used to make the prediction
- Simple strategy: predict the next outcome as the majority of the previous n executions
- Example: n = 3. If two or more of the last three branches were taken, the prediction is "branch taken" (see the sketch below)
- Depending on the type of mix, this gives more than 90% prediction accuracy
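A minimal C sketch of this majority-of-last-n strategy (illustrative only; the per-branch history table and its indexing by PC are assumptions, not details from the slides):

    #include <stdbool.h>

    #define TABLE_SIZE 1024                      /* per-branch entries (assumed size) */

    static unsigned char history[TABLE_SIZE];    /* low 3 bits = last three outcomes */

    /* Predict taken if two or more of the last three executions were taken. */
    bool predict(unsigned pc) {
        unsigned char h = history[pc % TABLE_SIZE] & 0x7;
        int taken = (h & 1) + ((h >> 1) & 1) + ((h >> 2) & 1);
        return taken >= 2;
    }

    /* After the branch resolves, shift the actual outcome into its history. */
    void update(unsigned pc, bool taken) {
        unsigned char *h = &history[pc % TABLE_SIZE];
        *h = (unsigned char)(((*h << 1) | (taken ? 1 : 0)) & 0x7);
    }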
Branch prediction
Impact of the past n branches on prediction accuracy (%), by type of mix:

 n | Compiler | Business | Scientific
 0 |   64.1   |   64.4   |   70.4
 1 |   91.9   |   95.2   |   86.6
 2 |   93.3   |   96.5   |   90.8
 3 |   93.7   |   96.6   |   91.0
 4 |   94.5   |   96.8   |   91.8
 5 |   94.7   |   97.0   |   92.0
Branch prediction
[State diagram: 2-bit prediction scheme with four states; 00 and 01 predict no branch, 10 and 11 predict branch. A taken branch moves the state toward 11 and a not-taken branch toward 00, so a single surprise outcome does not flip the prediction; see the sketch below.]
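A compact C sketch of this two-bit scheme (illustrative; the table size and indexing by PC are assumptions):

    #include <stdbool.h>

    #define TABLE_SIZE 1024

    /* One 2-bit saturating counter per entry:
       0, 1 = predict no branch; 2, 3 = predict branch. */
    static unsigned char counter[TABLE_SIZE];

    bool predict_2bit(unsigned pc) {
        return counter[pc % TABLE_SIZE] >= 2;
    }

    void update_2bit(unsigned pc, bool taken) {
        unsigned char *c = &counter[pc % TABLE_SIZE];
        if (taken)  { if (*c < 3) (*c)++; }   /* move toward 11 (strongly taken)     */
        else        { if (*c > 0) (*c)--; }   /* move toward 00 (strongly not taken) */
    }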
Multitasking
- The OS can run multiple programs at the same time, with multiple threads of execution within the same program
- A scheduler utility assigns a given amount of CPU time to each running program
- Rapid switching of tasks gives the illusion that all programs are running at once; the processor must support task switching
- Scheduling policy: e.g. round-robin, priority
Cache
SRAM vs DRAM

     | Tran. per bit | Access time | Needs refresh? | Cost | Applications
SRAM | 4 or 6        | 1X          | No             | 100X | cache memories
DRAM | 1             | 10X         | Yes            | 1X   | main memories, frame buffers
The CPU-Memory gap
The gap widens between DRAM, disk, and CPU speeds.

Access time (cycles): register 1, cache 1-10, memory 50-100, disk 20,000,000
Access-time trends (ns):

Year | Disk seek time | DRAM access time | SRAM access time | CPU cycle time
1980 |     87,000,000 |              375 |              300 |           1000
1985 |     75,000,000 |              200 |              150 |            166
1990 |     28,000,000 |              100 |               35 |             50
1995 |     10,000,000 |               70 |               15 |              6
2000 |      8,000,000 |               60 |                3 |              1
Memory hierarchies
Some fundamental and enduring properties of hardware and software:
- Fast storage technologies cost more per byte, have less capacity, and require more power (heat!)
- The gap between CPU and main memory speed is widening
- Well-written programs tend to exhibit good locality
These properties suggest an approach for organizing memory and storage systems known as a memory hierarchy.
Memory system in practice
[Diagram: the memory hierarchy, from smaller, faster, and more expensive (per byte) storage devices at the top to larger, slower, and cheaper (per byte) storage devices at the bottom]
Reading from memory
- Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g. a 33 MHz bus). The wasted clock cycles are called wait states.
- Pentium III cache hierarchy:
  - L1 Data: 16 KB, 4-way associative, write-through, 32 B lines, 1-cycle latency
  - L1 Instruction: 16 KB, 4-way associative, 32 B lines
  - L2 Unified: 128 KB-2 MB, 4-way associative, write-back, write-allocate, 32 B lines
  - Main memory: up to 4 GB
Cache memory
- High-speed, expensive static RAM both inside and outside the CPU
  - Level-1 cache: inside the CPU
  - Level-2 cache: outside the CPU
- Cache hit: the data to be read is already in cache memory
- Cache miss: the data to be read is not in cache memory. When does this happen? Compulsory, capacity, and conflict misses.
- Cache design: cache size, n-way associativity, block size, replacement policy
Caching in a memory hierarchy
- The storage device at level k+1 is partitioned into blocks; data is copied between levels in block-sized transfer units
- The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1
[Diagram: level k+1 holds blocks 0-15; level k caches a subset, e.g. blocks 8, 9, 14, and 3; blocks such as 4 and 10 are shown being copied between levels]
General caching concepts
- The program needs object d, which is stored in some block b
- Cache hit: the program finds b in the cache at level k, e.g. block 14
- Cache miss: b is not at level k, so the level k cache must fetch it from level k+1, e.g. block 12
- If the level k cache is full, some current block must be replaced (evicted). Which one is the victim?
  - Placement policy: where can the new block go? E.g., b mod 4 (see the sketch after this list)
  - Replacement policy: which block should be evicted? E.g., LRU
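A toy C sketch of the b mod 4 placement rule above (everything beyond that rule, names included, is an illustrative assumption; with this direct-mapped placement there is only one candidate set per block, so no replacement choice such as LRU arises):

    #include <string.h>

    #define NUM_SETS 4

    static int cached_block[NUM_SETS];   /* which block occupies each set; -1 = empty */

    void cache_init(void) {
        memset(cached_block, -1, sizeof cached_block);
    }

    /* Returns 1 on a hit, 0 on a miss (the block is installed on a miss). */
    int access_block(int b) {
        int set = b % NUM_SETS;          /* placement policy: b mod 4              */
        if (cached_block[set] == b)
            return 1;                    /* hit: b is already at level k           */
        cached_block[set] = b;           /* miss: fetch from level k+1, evicting   */
        return 0;                        /* whatever block occupied the set before */
    }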
Locality
- Principle of locality: programs tend to reuse data and instructions near those they have used recently, or that were recently referenced themselves
  - Temporal locality: recently referenced items are likely to be referenced in the near future
  - Spatial locality: items with nearby addresses tend to be referenced close together in time
- In general, programs with good locality run faster than programs with poor locality
- Locality is the reason caches and virtual memory are designed into architectures and operating systems. Another example: a web browser caches recently visited web pages.
Locality example

    sum = 0;
    for (i = 0; i < n; i++)
        sum += a[i];
    return sum;

- Data
  - Referencing array elements in succession (stride-1 reference pattern): spatial locality
  - Referencing sum each iteration: temporal locality
- Instructions
  - Referencing instructions in sequence: spatial locality
  - Cycling through the loop repeatedly: temporal locality
Locality example
Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?

    int sum_array_rows(int a[M][N])
    {
        int i, j, sum = 0;
        for (i = 0; i < M; i++)
            for (j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

Yes: stride-1 reference pattern.
Locality example
Does this function have good locality?

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

No: stride-N reference pattern.
Blocked matrix multiply performance
- Blocking (bijk and bikj) improves performance by a factor of two over the unblocked versions (ijk and jik)
- The blocked versions are relatively insensitive to array size (see the table below)
Cycles/iteration vs. array size (n):

 n   | kji   | jki   | kij   | ikj   | jik   | ijk   | bijk (bsize=25) | bikj (bsize=25)
 25  | 6.33  | 7.09  | 6.68  | 6.56  | 5.53  | 5.18  | 8.34            | 7.28
 50  | 18.42 | 15.34 | 10.99 | 9.01  | 6.55  | 5.60  | 8.07            | 7.14
 75  | 24.87 | 20.17 | 12.05 | 9.21  | 6.55  | 5.70  | 7.91            | 7.03
 100 | 27.66 | 24.27 | 11.77 | 9.92  | 7.95  | 7.79  | 8.25            | 7.38
 125 | 29.60 | 27.50 | 12.75 | 10.57 | 9.46  | 9.25  | 8.52            | 7.46
 150 | 42.12 | 41.43 | 13.39 | 11.54 | 11.62 | 11.27 | 8.52            | 7.57
 175 | 47.06 | 46.40 | 15.00 | 13.05 | 12.59 | 12.55 | 8.64            | 7.71
 200 | 48.22 | 46.96 | 16.71 | 14.40 | 13.39 | 13.16 | 8.75            | 7.84
 225 | 49.62 | 47.66 | 19.00 | 16.74 | 14.20 | 14.08 | 8.89            | 7.92
 250 | 50.81 | 48.39 | 20.93 | 19.33 | 15.20 | 14.85 | 9.02            | 8.01
 275 | 51.99 | 49.31 | 22.82 | 21.58 | 16.34 | 15.50 | 9.06            | 8.07
 300 | 53.13 | 49.91 | 24.81 | 23.30 | 17.15 | 16.00 | 9.11            | 8.10
 325 | 54.42 | 50.29 | 26.75 | 24.68 | 17.73 | 16.45 | 9.14            | 8.13
 350 | 55.15 | 50.40 | 27.47 | 25.19 | 18.15 | 16.84 | 9.16            | 8.15
 375 | 55.82 | 50.45 | 27.86 | 25.38 | 18.23 | 16.90 | 9.16            | 8.15
 400 | 56.10 | 50.36 | 27.93 | 25.35 | 18.17 | 16.93 | 9.18            | 8.17
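The slides do not show code for the blocked variants measured above; a minimal sketch of the bijk ordering in C (assuming, for simplicity, that bsize divides n evenly) could look like this:

    /* Blocked (bijk) matrix multiply: C += A * B for n x n matrices.
       Each iteration works on a bsize-wide slice of A and B, so the
       working set stays small enough to fit in cache. */
    void bijk(int n, int bsize, double A[n][n], double B[n][n], double C[n][n]) {
        for (int kk = 0; kk < n; kk += bsize)
            for (int jj = 0; jj < n; jj += bsize)
                for (int i = 0; i < n; i++)
                    for (int j = jj; j < jj + bsize; j++) {
                        double sum = C[i][j];
                        for (int k = kk; k < kk + bsize; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
    }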
Cache-conscious programming
- Make sure that memory is cache-aligned
- Split data into hot and cold parts (list example; see the sketch after this list)
- Use unions and bitfields to reduce size and increase locality
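A small C illustration of the hot/cold split for the list example (struct and field names are hypothetical):

    /* Hot fields (key, link) are packed together so more nodes fit per
       cache line during traversal; the bulky, rarely-used payload is
       kept cold behind a pointer and fetched only on a match. */
    struct node_cold {
        char payload[256];            /* cold: rarely accessed during search */
    };

    struct node {
        int key;                      /* hot: read on every comparison */
        struct node *next;            /* hot: followed on every step   */
        struct node_cold *cold;       /* cold data, fetched only on a hit */
    };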
RISC vs. CISC
Trade-offs of instruction sets
- Compilers bridge the semantic gap between high-level languages (C, C++, Lisp, Prolog, Haskell) and machine code
- Before 1980, the trend was to increase instruction complexity (one-to-one mapping to language constructs if possible) to bridge this gap and reduce fetches from memory; the selling points were the number of instructions and addressing modes (CISC)
- From 1980, RISC: simplify and regularize instructions to make advanced architecture (pipelining, caches, superscalar execution) easier to introduce, for better performance
RISC
- 1980, Patterson and Ditzel (Berkeley): RISC
- Features: fixed-length instructions, load-store architecture, register file
- Organization: hard-wired logic, single-cycle instructions, pipelining
- Pros: small die size, short development time, high performance
- Cons: low code density, not x86-compatible
RISC Design Principles
- Simple operations: simple instructions that can execute in one cycle
- Register-to-register operations: only load and store operations access memory; all other operations work register-to-register
- Simple addressing modes: only a few (1 or 2)
- Large number of registers: needed to support register-to-register operations and to minimize procedure call and return overhead
RISC Design Principles
- Fixed-length instructions: facilitate efficient instruction execution
- Simple instruction format: fixed boundaries for the various fields (opcode, source operands, ...)
CISC and RISC
- CISC: complex instruction set
  - large instruction set
  - high-level operations (simpler for the compiler?)
  - requires a microcode interpreter (which can take a long time)
  - examples: Intel 80x86 family
- RISC: reduced instruction set
  - small instruction set
  - simple, atomic instructions directly executed by hardware very quickly
  - easier to incorporate advanced architecture design
  - examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq), PowerPC, MIPS
CISC and RISC
                   | CISC (Intel 486) | RISC (MIPS R4000)
#instructions      | 235              | 94
Addr. modes        | 11               | 1
Inst. size (bytes) | 1-12             | 4
GP registers       | 8                | 32
Why RISC?
- Simple instructions are preferred: complex instructions are mostly ignored by compilers, due to the semantic gap
- Simple data structures: complex data structures are used relatively infrequently; better to support a few simple data types efficiently and synthesize complex ones
- Simple addressing modes: complex addressing modes lead to variable-length instructions, which lead to inefficient instruction decoding and scheduling
Why RISC? (cont'd)
- Large register set: efficient support for procedure calls and returns
  - Patterson and Sequin's study: procedure calls/returns are 12-15% of HLL statements, constitute 31-33% of machine language instructions, and generate nearly half (45%) of memory references
  - Tanenbaum's study: only 1.25% of calls have more than 6 arguments, and more than 93% have fewer than 6 local scalar variables, so activation records are small
  - A large register set can avoid memory references