Advanced Architecture Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgwick and Kevin Wayne


  • Advanced Architecture

    Yung-Yu Chuang with slides by S. Dandamudi, Peng-Sheng Chen, Kip Irvine, Robert Sedgwick and Kevin Wayne

  • Basic architecture

  • Basic microcomputer design

    The clock synchronizes CPU operations. The control unit (CU) coordinates the sequence of execution steps. The ALU performs arithmetic and logic operations.

  • Basic microcomputer design

    The memory storage unit holds instructions and data for a running program. A bus is a group of wires that transfers data from one part to another; there are data, address, and control buses.

  • Clock

    The clock synchronizes all CPU and bus operations. A machine (clock) cycle measures the time of a single operation, and the clock is used to trigger events. The clock cycle is the basic unit of time: at 1 GHz, one clock cycle = 1 ns. An instruction can take multiple cycles to complete; e.g., multiply on the 8088 takes 50 cycles.

  • Instruction execution cycle

    Fetch; decode; fetch operands; execute; store output. The program counter selects the next instruction to fetch, and fetched instructions wait in the instruction queue.
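    The cycle is easiest to see as a loop. Below is a minimal sketch in C for a toy accumulator machine; the opcodes, memory layout, and register names are invented for illustration and do not correspond to any real ISA:

        #include <stdio.h>

        enum { OP_HALT, OP_LOAD, OP_ADD, OP_STORE };      /* toy opcodes */

        int main(void)
        {
            /* program: acc = mem[10]; acc += mem[11]; mem[12] = acc; halt */
            int mem[256] = { OP_LOAD, 10, OP_ADD, 11, OP_STORE, 12,
                             OP_HALT, 0, 0, 0, 3, 4 };
            int pc = 0, acc = 0, running = 1;             /* pc = program counter */

            while (running) {
                int op  = mem[pc];                        /* fetch */
                int arg = mem[pc + 1];                    /* fetch operand */
                pc += 2;
                switch (op) {                             /* decode + execute */
                case OP_LOAD:  acc = mem[arg];  break;
                case OP_ADD:   acc += mem[arg]; break;
                case OP_STORE: mem[arg] = acc;  break;    /* store output */
                case OP_HALT:  running = 0;     break;
                }
            }
            printf("mem[12] = %d\n", mem[12]);            /* prints 7 */
            return 0;
        }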

  • Pipeline

  • Multi-stage pipeline

    Pipelining makes it possible for the processor to execute instructions in parallel. Instruction execution is divided into discrete stages. In a non-pipelined processor (for example, the 80386), many cycles are wasted.

  • Pipelined execution

    More efficient use of cycles, greater throughput of instructions (the 80486 started to use pipelining). For k stages and n instructions, the number of required cycles is k + (n - 1), compared to k * n without pipelining.
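    For example, with k = 5 stages and n = 100 instructions, pipelined execution needs 5 + (100 - 1) = 104 cycles, while non-pipelined execution needs 5 * 100 = 500.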

  • Pipelined execution

    Pipelining requires buffers, and each buffer holds a single value. The ideal scenario is equal work for each stage; sometimes that is not possible, and the slowest stage then determines the flow rate of the entire pipeline.

  • Pipelined execution

    Some reasons for unequal work across stages:
    A complex step cannot be subdivided conveniently.
    An operation takes a variable amount of time to execute; e.g., operand fetch time depends on where the operands are located (registers, cache, or memory).
    Complexity depends on the type of operation: an add may take one cycle, while a multiply may take several cycles.

  • Pipelined execution

    If the operand fetch of I2 takes three cycles, the pipeline stalls for two cycles. Such stalls are caused by hazards, and they reduce overall throughput.

  • Wasted cycles (pipelined)

    When one of the stages requires two or more clock cycles, clock cycles are again wasted. For k stages and n instructions, the number of required cycles is then k + (2n - 1).
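    With the same k = 5 and n = 100, this comes to 5 + (2 * 100 - 1) = 204 cycles: still far better than the 500 of serial execution, but roughly double the ideal 104.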

  • Superscalar

    A superscalar processor has multiple execution pipelines; for example, stage S4 can have left and right pipelines (u and v). For k stages and n instructions, the number of required cycles is k + n. The Pentium has 2 pipelines; the Pentium Pro has 3.

  • Pipeline stages

    Pentium III: 10; Pentium 4: 20-31; next-generation micro-architecture: 14; ARM7: 3.

  • Hazards

    There are three types of hazards:
    Resource hazards (also called structural hazards): two or more instructions use the same resource.
    Data hazards: caused by data dependencies between instructions, e.g. a result produced by I1 is read by I2.
    Control hazards: sequential execution suits pipelining by default; altering control flow (e.g., branching) causes problems by introducing control dependencies.

  • Data hazards

        add r1, r2, #10   ; write r1
        sub r3, r1, #20   ; read r1

    (Pipeline diagram: stages fetch, decode, reg, ALU, wb. The sub reads r1 before the add has written it back, so the pipeline stalls.)

  • Data hazards

    Forwarding: provide the output result to the waiting instruction as soon as it is available, instead of waiting for writeback.

        add r1, r2, #10   ; write r1
        sub r3, r1, #20   ; read r1

    (Pipeline diagram: with the ALU result forwarded to the sub, the stall is reduced.)

  • Control hazards

        bz   r1, target
        add  r2, r4, 0
        ...
    target:
        add  r2, r3, 0

    (Pipeline diagram: the instructions fetched after the bz must be discarded if the branch is taken.)

  • Control hazards

    Branches alter control flow and require special attention in pipelining: some instructions already in the pipeline must be thrown away. How many depends on when we know whether the branch is taken. In the diagram above, the pipeline wastes three clock cycles; this waste is called the branch penalty. One way to reduce the branch penalty is to determine the branch decision early.

  • Control hazards

    Delayed branch execution effectively reduces the branch penalty. We always fetch the instruction following the branch anyway, so why throw it away? Place a useful instruction there to execute; that position is called the delay slot.

        Before:                      After:
        add    R2,R3,R4              branch target
        branch target                add    R2,R3,R4   ; delay slot
        sub    R5,R6,R7              sub    R5,R6,R7
        . . .                        . . .

  • Branch prediction

    Three prediction strategies:
    Fixed: the prediction is fixed; example: branch-never-taken. Not appropriate for loop structures.
    Static: the strategy depends on the branch type; e.g., conditional branch: always not taken; loop: always taken.
    Dynamic: uses run-time history to make more accurate predictions.

  • Branch prediction

    Static prediction improves accuracy over fixed prediction:

    Instruction type       Distribution (%)   Prediction:     Correct
                                              branch taken?   prediction (%)
    Unconditional branch   70 * 0.4 = 28      Yes             28
    Conditional branch     70 * 0.6 = 42      No              42 * 0.6 = 25.2
    Loop                   10                 Yes             10 * 0.9 = 9
    Call/return            20                 Yes             20

    Overall prediction accuracy = 28 + 25.2 + 9 + 20 = 82.2%

  • Branch prediction

    Dynamic branch prediction uses run-time history: it takes the past n executions of the branch and makes the prediction from them. A simple strategy: predict the next outcome as the majority of the previous n branch executions. Example with n = 3: if two or more of the last three executions were taken, predict taken. Depending on the type of mix, this achieves more than 90% prediction accuracy.

  • Branch prediction

    Impact of the past n branches on prediction accuracy (%), by type of mix:

    n    Compiler   Business   Scientific
    0    64.1       64.4       70.4
    1    91.9       95.2       86.6
    2    93.3       96.5       90.8
    3    93.7       96.6       91.0
    4    94.5       96.8       91.8
    5    94.7       97.0       92.0

  • Branch prediction

    Two-bit saturating counter with four states:
    11 predict branch; 10 predict branch; 01 predict no branch; 00 predict no branch.
    Each taken branch moves the counter one state toward 11, and each not-taken branch moves it one state toward 00. Because the counter saturates, a single anomalous outcome does not flip the prediction.
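    A minimal sketch of this two-bit predictor in C; the state encoding follows the diagram above, while the type and function names are invented for illustration:

        /* Two-bit saturating counter: 00 and 01 predict not taken,
           10 and 11 predict taken. */
        typedef unsigned char counter2_t;              /* holds 0..3 */

        int predict_taken(counter2_t c)
        {
            return c >= 2;                             /* states 10 and 11 */
        }

        counter2_t update(counter2_t c, int was_taken)
        {
            if (was_taken) return c < 3 ? c + 1 : 3;   /* saturate at 11 */
            else           return c > 0 ? c - 1 : 0;   /* saturate at 00 */
        }

    Starting at 11, a loop branch that is taken nine times and then falls through mispredicts only the final iteration, and the counter (now at 10) still predicts taken the next time the loop is entered.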

  • Multitasking

    The OS can run multiple programs at the same time, and a single program can contain multiple threads of execution. A scheduler utility assigns a given amount of CPU time to each running program. Rapid switching of tasks gives the illusion that all programs are running at once; the processor must support task switching. Scheduling policies include round-robin and priority scheduling, as sketched below.
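    A round-robin policy can be sketched in a few lines of C; the task count, quantum, and run_for_quantum stand-in below are illustrative only, not a real OS interface:

        #include <stdio.h>

        #define NTASKS  3
        #define QUANTUM 10                      /* time units per slice */

        /* Stand-in for dispatching a task; a real OS would restore the
           task's registers and address space here. */
        static void run_for_quantum(int task, int units)
        {
            printf("task %d runs for %d units\n", task, units);
        }

        int main(void)
        {
            int current = 0;
            for (int slice = 0; slice < 9; slice++) {   /* 9 slices for the demo */
                run_for_quantum(current, QUANTUM);
                current = (current + 1) % NTASKS;       /* next task in turn */
            }
            return 0;
        }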

  • Cache

  • SRAM vs DRAM

             Tran. per bit   Access time   Needs refresh?   Cost   Applications
    SRAM     4 or 6          1X            No               100X   cache memories
    DRAM     1               10X           Yes              1X     main memories, frame buffers

  • The CPU-Memory gap

    The gap between DRAM, disk, and CPU speeds keeps widening.

    Typical access times:

    Storage    Access time (cycles)
    register   1
    cache      1-10
    memory     50-100
    disk       20,000,000

    Access and cycle times over two decades (ns):

    Year   Disk seek time   DRAM access   SRAM access   CPU cycle
    1980   87,000,000       375           300           1000
    1985   75,000,000       200           150           166
    1990   28,000,000       100           35            50
    1995   10,000,000       70            15            6
    2000    8,000,000       60            3             1

  • Memory hierarchies

    Some fundamental and enduring properties of hardware and software:
    Fast storage technologies cost more per byte, have less capacity, and require more power (heat!).
    The gap between CPU and main memory speed is widening.
    Well-written programs tend to exhibit good locality.
    Together these suggest an approach to organizing memory and storage systems known as a memory hierarchy.

  • Memory system in practice

    A pyramid: smaller, faster, and more expensive (per byte) storage devices sit at the top; larger, slower, and cheaper (per byte) storage devices sit below them.

  • Reading from memory

    Multiple machine cycles are required when reading from memory, because it responds much more slowly than the CPU (e.g., a 33 MHz bus). The wasted clock cycles are called wait states.

    Pentium III cache hierarchy:
    L1 data cache: 16 KB, 4-way associative, write-through, 32 B lines, 1-cycle latency.
    L1 instruction cache: 16 KB, 4-way associative, 32 B lines.
    L2 unified cache: 128 KB-2 MB, 4-way associative, write-back, write-allocate, 32 B lines.
    Main memory: up to 4 GB.

  • Cache memory

    High-speed, expensive static RAM both inside and outside the CPU:
    Level-1 cache: inside the CPU.
    Level-2 cache: outside the CPU.
    Cache hit: the data to be read is already in cache memory.
    Cache miss: the data to be read is not in cache memory; misses are classified as compulsory, capacity, or conflict.
    Cache design parameters: cache size, n-way associativity, block size, replacement policy.

  • Caching in a memory hierarchy

    The larger, slower, cheaper storage device at level k+1 is partitioned into blocks, and data is copied between levels in block-sized transfer units. The smaller, faster, more expensive device at level k caches a subset of the blocks from level k+1.

  • General caching concepts

    The program needs object d, which is stored in some block b.
    Cache hit: the program finds b in the cache at level k (e.g., block 14).
    Cache miss: b is not at level k, so the level-k cache must fetch it from level k+1 (e.g., block 12). If the level-k cache is full, some current block must be replaced (evicted); which one is the victim?
    Placement policy: where can the new block go? E.g., b mod 4.
    Replacement policy: which block should be evicted? E.g., LRU. (Both are sketched below.)
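    A minimal simulation of these policies in C, using the b mod 4 placement from the slide; the 4-line cache and the block trace are illustrative:

        #include <stdio.h>

        #define NLINES 4

        typedef struct { int valid, block; } line_t;

        /* Placement policy b mod NLINES: block b can live only in line
           b % NLINES, so a direct-mapped cache needs no separate
           replacement policy. */
        static int access_block(line_t cache[], int b)
        {
            line_t *l = &cache[b % NLINES];
            if (l->valid && l->block == b) return 1;   /* cache hit */
            l->valid = 1;                              /* miss: fetch from k+1, */
            l->block = b;                              /* evicting the old block */
            return 0;
        }

        int main(void)
        {
            line_t cache[NLINES] = {0};
            int trace[] = { 14, 12, 14, 8, 12 };       /* 8 and 12 collide (mod 4) */
            for (int i = 0; i < 5; i++)
                printf("block %2d: %s\n", trace[i],
                       access_block(cache, trace[i]) ? "hit" : "miss");
            return 0;
        }

    In a set-associative cache there would instead be a choice of victim within each set, and a replacement policy such as LRU would pick which line to evict.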

  • Locality

    Principle of locality: programs tend to reuse data and instructions near those they have used recently, or that were themselves recently referenced.
    Temporal locality: recently referenced items are likely to be referenced in the near future.
    Spatial locality: items with nearby addresses tend to be referenced close together in time.
    In general, programs with good locality run faster than programs with poor locality. Locality is the reason caches and virtual memory appear in architecture and operating system designs; another example is a web browser caching recently visited web pages.

  • Locality example

        sum = 0;
        for (i = 0; i < n; i++)
            sum += a[i];
        return sum;

    Data: array elements are referenced in succession (a stride-1 reference pattern): spatial locality. sum is referenced on each iteration: temporal locality.
    Instructions: referenced in sequence: spatial locality. The loop is cycled through repeatedly: temporal locality.

  • Locality example

    Being able to look at code and get a qualitative sense of its locality is important. Does this function have good locality?

        int sum_array_rows(int a[M][N])
        {
            int i, j, sum = 0;
            for (i = 0; i < M; i++)
                for (j = 0; j < N; j++)
                    sum += a[i][j];
            return sum;
        }

    Yes: this is a stride-1 reference pattern, because C stores arrays in row-major order.

  • Locality example

    Does this function have good locality?

        int sum_array_cols(int a[M][N])
        {
            int i, j, sum = 0;
            for (j = 0; j < N; j++)
                for (i = 0; i < M; i++)
                    sum += a[i][j];
            return sum;
        }

    No: this is a stride-N reference pattern; consecutive accesses are N elements apart in memory.

  • Blocked matrix multiply performance

    Blocking (bijk and bikj) improves performance by roughly a factor of two over the best unblocked versions (ijk and jik), and the blocked versions are relatively insensitive to array size:

    Cycles per iteration for each loop ordering as a function of array size n (blocked versions use bsize = 25):

    n     kji     jki     kij     ikj     jik     ijk     bijk    bikj
    25     6.33    7.09    6.68    6.56    5.53    5.18   8.34    7.28
    50    18.42   15.34   10.99    9.01    6.55    5.60   8.07    7.14
    75    24.87   20.17   12.05    9.21    6.55    5.70   7.91    7.03
    100   27.66   24.27   11.77    9.92    7.95    7.79   8.25    7.38
    125   29.60   27.50   12.75   10.57    9.46    9.25   8.52    7.46
    150   42.12   41.43   13.39   11.54   11.62   11.27   8.52    7.57
    175   47.06   46.40   15.00   13.05   12.59   12.55   8.64    7.71
    200   48.22   46.96   16.71   14.40   13.39   13.16   8.75    7.84
    225   49.62   47.66   19.00   16.74   14.20   14.08   8.89    7.92
    250   50.81   48.39   20.93   19.33   15.20   14.85   9.02    8.01
    275   51.99   49.31   22.82   21.58   16.34   15.50   9.06    8.07
    300   53.13   49.91   24.81   23.30   17.15   16.00   9.11    8.10
    325   54.42   50.29   26.75   24.68   17.73   16.45   9.14    8.13
    350   55.15   50.40   27.47   25.19   18.15   16.84   9.16    8.15
    375   55.82   50.45   27.86   25.38   18.23   16.90   9.16    8.15
    400   56.10   50.36   27.93   25.35   18.17   16.93   9.18    8.17
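    A sketch of the blocked (bijk) loop structure in C, assuming for brevity that n is a multiple of bsize (the measurements above used bsize = 25):

        /* C = C + A*B, blocked so that a bsize x bsize tile of B and a
           row strip of A stay in cache while they are reused.
           Assumes n % bsize == 0. */
        void bijk(int n, int bsize,
                  double A[n][n], double B[n][n], double C[n][n])
        {
            for (int kk = 0; kk < n; kk += bsize)
                for (int jj = 0; jj < n; jj += bsize)
                    for (int i = 0; i < n; i++)
                        for (int j = jj; j < jj + bsize; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + bsize; k++)
                                sum += A[i][k] * B[k][j];   /* tile of B reused */
                            C[i][j] = sum;
                        }
        }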

  • Cache-conscious programming

    Make sure that memory is cache-aligned.
    Split data into hot and cold parts (see the list example sketched below).
    Use unions and bitfields to reduce size and increase locality.
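    A sketch of the hot/cold split for the list example; the node layout and field names below are illustrative:

        /* Unsplit node: a traversal that only tests 'key' still drags the
           100-byte payload into the cache with every node it visits. */
        struct node_unsplit {
            int                  key;
            char                 payload[100];    /* rarely accessed (cold) */
            struct node_unsplit *next;
        };

        /* Split version: the hot part holds only what the traversal needs,
           so far more nodes fit in each cache block. */
        struct cold_part { char payload[100]; };

        struct node {
            int               key;                /* hot */
            struct node      *next;               /* hot */
            struct cold_part *cold;               /* pointer to cold data */
        };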

  • RISC vs. CISC

  • Trade-offs of instruction sets

    Compilers translate high-level languages (C, C++, Lisp, Prolog, Haskell) into machine code across a semantic gap. Before 1980, the trend was to increase instruction complexity (a one-to-one mapping to language constructs where possible) to bridge that gap and to reduce fetches from memory; the selling points were the number of instructions and the addressing modes (CISC). Around 1980 came RISC: simplify and regularize instructions so that advanced architectural techniques (pipelining, caches, superscalar execution) can deliver better performance.

  • RISC

    1980: Patterson and Ditzel (Berkeley), RISC.
    Features: fixed-length instructions, load-store architecture, register file.
    Organization: hard-wired logic, single-cycle instructions, pipelining.
    Pros: small die size, short development time, high performance.
    Cons: low code density, not x86 compatible.

  • RISC Design Principles

    Simple operations: simple instructions that can execute in one cycle.
    Register-to-register operations: only load and store operations access memory; all other operations work register-to-register.
    Simple addressing modes: only a few (1 or 2).
    A large number of registers: needed to support register-to-register operations and to minimize procedure call and return overhead.

  • RISC Design Principles

    Fixed-length instructions: facilitate efficient instruction execution.
    Simple instruction format: fixed boundaries for the various fields (opcode, source operands, ...).

  • CISC and RISC

    CISC (complex instruction set): a large instruction set with high-level operations (simpler for the compiler?); requires a microcode interpreter, which can take a long time. Example: the Intel 80x86 family.
    RISC (reduced instruction set): a small set of simple, atomic instructions that are directly executed by hardware very quickly; easier to incorporate into advanced architecture designs. Examples: ARM (Advanced RISC Machines), DEC Alpha (now Compaq), PowerPC, MIPS.

  • CISC and RISC

                              CISC (Intel 486)   RISC (MIPS R4000)
    # instructions            235                94
    Addressing modes          11                 1
    Instruction size (bytes)  1-12               4
    GP registers              8                  32

  • Why RISC?

    Simple instructions are preferred: complex instructions are mostly ignored by compilers, due to the semantic gap.
    Simple data structures: complex data structures are used relatively infrequently; it is better to support a few simple data types efficiently and synthesize complex ones from them.
    Simple addressing modes: complex addressing modes lead to variable-length instructions, which lead to inefficient instruction decoding and scheduling.

  • Why RISC? (cont'd)

    A large register set gives efficient support for procedure calls and returns. Patterson and Sequin's study: procedure calls/returns make up 12-15% of HLL statements, constitute 31-33% of machine-language instructions, and generate nearly half (45%) of memory references.
    Activation records are small. Tanenbaum's study: only 1.25% of calls have more than 6 arguments, and more than 93% have fewer than 6 local scalar variables. A large register set can therefore avoid these memory references.
