ARM Processor Architecture (I)
Speaker: Lung-Hao Chang (張龍豪)
Advisor: Prof. Andy Wu (吳安宇)
Graduate Institute of Electronics Engineering, National Taiwan University
Modified from National Chiao-Tung University IP Core Design course


  • SoC Consortium Course Material, SoC Design Laboratory, 03/10/2004

  • Outline

    Thumb instruction set
    ARM/Thumb interworking
    ARM organization
    Summary

  • Thumb instruction set

  • Thumb-ARM Difference

    The Thumb instruction set is a subset of the ARM instruction set, and its instructions operate on a restricted view of the ARM registers.

    Most Thumb instructions are executed unconditionally.
      All ARM instructions are executed conditionally.

    Many Thumb data processing instructions use a 2-address format, i.e. the destination register is the same as one of the source registers.
      ARM data processing instructions, with the exception of the 64-bit multiplies, use a 3-address format.

    Thumb instruction formats are less regular than ARM instruction formats, as a result of the dense encoding.

  • Registers Access in Thumb

    Not all registers are directly accessible in Thumb:

    Low registers r0-r7: fully accessible
    High registers r8-r12: only accessible with MOV, ADD, CMP
    SP (Stack Pointer), LR (Link Register) and PC (Program Counter): limited accessibility; certain instructions have implicit access to these
    CPSR: only indirect access
    SPSR: no access

  • Thumb Accessible Registers

    [Figure: Thumb-accessible registers. Shaded registers have restricted access.]

  • Branches (1/2)

    Thumb defines three PC-relative branch instructions, each with a different offset range. The range depends on the number of bits available for the offset.

    Conditional branches: B<cond> label
      8-bit offset: range of -128 to 127 instructions (+/-256 bytes)
      The only conditional Thumb instructions

    Format (1) B<cond>:
      bits 15-12   bits 11-8   bits 7-0
      1 1 0 1      cond        8-bit offset

  • Branches (2/2)

    Unconditional branches: B label
      11-bit offset: range of -1024 to 1023 instructions (+/-2K bytes)

    Format (2) B:
      bits 15-11   bits 10-0
      1 1 1 0 0    11-bit offset

    Long branches with link: BL subroutine
      Implemented as a pair of instructions
      22-bit offset: range of -2097152 to 2097151 instructions (+/-4M bytes)

    Format (3) BL:
      bits 15-12   bit 11   bits 10-0
      1 1 1 1      H        11-bit offset

    Format (3a) BLX:
      bits 15-11   bits 10-1       bit 0
      1 1 1 0 1    10-bit offset   0
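The offset ranges quoted for the three branch forms follow directly from the width of the signed offset field. A minimal Python sketch of that arithmetic is given here; the function name `branch_range` is hypothetical, and the sketch assumes every Thumb instruction is 2 bytes long.

```python
def branch_range(offset_bits: int, unit_bytes: int = 2):
    """Return ((min_instr, max_instr), (min_bytes, max_bytes)) for a signed
    two's-complement offset field counted in Thumb instructions."""
    lo = -(1 << (offset_bits - 1))
    hi = (1 << (offset_bits - 1)) - 1
    return (lo, hi), (lo * unit_bytes, hi * unit_bytes)


# Offset widths taken from the slides: B<cond> 8 bits, B 11 bits, BL 22 bits.
for name, bits in [("B<cond>", 8), ("B", 11), ("BL", 22)]:
    (lo, hi), (lo_b, hi_b) = branch_range(bits)
    print(f"{name:8s} {bits:2d}-bit offset: {lo}..{hi} instructions, "
          f"{lo_b}..{hi_b} bytes")
```

The byte figures on the slides (+/-256, +/-2K, +/-4M) are the rounded forms of these ranges.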

  • Data Processing Instruction

    Subset of the ARM data processing instructions

    Separate shift instructions (e.g. LSL, ASR, LSR, ROR)
      LSL Rd,Rs,#Imm5   ;Rd := Rs LSL #Imm5
      ASR Rd,Rs         ;Rd := Rd ASR Rs

    Two operands for data processing instructions
      Act on low registers
      BIC Rd,Rs         ;Rd := Rd AND NOT Rs
      ADD Rd,#Imm8      ;Rd := Rd + #Imm8

    Also three-operand forms of add, subtract and shifts
      ADD Rd,Rs,#Imm3   ;Rd := Rs + #Imm3

    Condition codes are always set by low register operations

  • Load or Store Register

    Two pre-indexed addressing modes
      Base register + offset register
      Base register + 5-bit offset, where the offset is scaled by
        4 for word accesses (range of 0-124 bytes / 0-31 words)
          STR Rd,[Rb,#Imm7]
        2 for halfword accesses (range of 0-62 bytes / 0-31 halfwords)
          LDRH Rd,[Rb,#Imm6]
        1 for byte accesses (range of 0-31 bytes)
          LDRB Rd,[Rb,#Imm5]

    Special forms
      Load with PC as base with 1K byte immediate offset (word aligned)
        Used for loading a value from a literal pool
      Load and store with SP as base with 1K byte immediate offset (word aligned)
        Used for accessing local variables on the stack
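The scaling of the 5-bit offset field can be illustrated with a short Python sketch. The helper names `encode_imm5` and `max_byte_offset` are hypothetical; the sketch only models the offset arithmetic, not the full instruction encoding. Because the byte offset must be a multiple of the access size, the largest encodable offsets are 124 bytes for words, 62 bytes for halfwords and 31 bytes for bytes.

```python
def encode_imm5(byte_offset: int, access_size: int) -> int:
    """Return the 5-bit field value for a byte offset.

    access_size is 4 (word), 2 (halfword) or 1 (byte)."""
    if access_size not in (1, 2, 4):
        raise ValueError("access size must be 1, 2 or 4 bytes")
    if byte_offset % access_size != 0:
        raise ValueError("offset must be a multiple of the access size")
    field = byte_offset // access_size
    if not 0 <= field <= 31:
        raise ValueError("offset does not fit in 5 bits")
    return field


def max_byte_offset(access_size: int) -> int:
    """Largest byte offset reachable with a 5-bit scaled field."""
    return 31 * access_size


print(encode_imm5(124, 4), encode_imm5(62, 2), encode_imm5(31, 1))
```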

  • Block Data Transfers

    Memory copy, incrementing base pointer after transfer
      STMIA Rb!, {Low Reg list}
      LDMIA Rb!, {Low Reg list}

    Full descending stack operations
      PUSH {Low Reg list}
      PUSH {Low Reg list, LR}
      POP {Low Reg list}
      POP {Low Reg list, PC}

    The optional addition of LR/PC to the register list provides support for subroutine entry/exit

  • Thumb Instruction Entry and Exit

    T bit, bit 5 of the CPSR
      If T = 1, the processor interprets the instruction stream as 16-bit Thumb instructions
      If T = 0, the processor interprets it as standard ARM instructions

    Thumb entry
      ARM cores start up, after reset, executing ARM instructions
      Executing a Branch and Exchange instruction (BX)
        Sets the T bit if the bottom bit of the specified register was set
        Switches the PC to the address given in the remainder of the register

    Thumb exit
      Executing a Thumb BX instruction
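The state change performed by BX can be sketched as a small Python model. This is only an illustration of the rule stated above (bit 0 of the register selects the state, the remaining bits give the new PC); the function name `bx` and the example addresses are hypothetical.

```python
def bx(register_value: int):
    """Model of BX Rn: return (t_bit, new_pc).

    Bit 0 of the register becomes the T bit; the PC takes the
    remaining bits of the 32-bit register value."""
    t_bit = register_value & 1
    new_pc = register_value & 0xFFFFFFFE
    return t_bit, new_pc


print(bx(0x8001))  # bit 0 set: enter Thumb state
print(bx(0x8000))  # bit 0 clear: enter (or stay in) ARM state
```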

  • Miscellaneous

    Thumb SWI instruction format
      Same effect as ARM, but SWI number limited to 0-255
      Syntax: SWI <SWI number>

      bits 15-8          bits 7-0
      1 1 0 1 1 1 1 1    SWI number

    Indirect access to CPSR and no access to SPSR, so no MRS or MSR instructions

    No coprocessor instruction space
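The Thumb SWI format shown above (a fixed 11011111 pattern in bits 15-8 and the SWI number in bits 7-0) can be sketched as an encoder and decoder in Python. The function names are hypothetical and the sketch covers only this one instruction format.

```python
SWI_PREFIX = 0b11011111  # bits 15-8 of the Thumb SWI format


def encode_thumb_swi(number: int) -> int:
    """Return the 16-bit Thumb encoding of SWI <number>."""
    if not 0 <= number <= 255:
        raise ValueError("Thumb SWI number is limited to 0-255")
    return (SWI_PREFIX << 8) | number


def decode_thumb_swi(halfword: int) -> int:
    """Return the SWI number held in a 16-bit Thumb SWI instruction."""
    if (halfword >> 8) != SWI_PREFIX:
        raise ValueError("not a Thumb SWI instruction")
    return halfword & 0xFF


print(hex(encode_thumb_swi(0x12)))
```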

  • ARM Thumb-2 core technology

    New instruction set for the ARM architecture

    Enhanced levels of performance, energy efficiency, and code density for a wide range of embedded applications

  • Thumb Instruction Set (1/3)

  • Thumb Instruction Set (2/3)

  • Thumb Instruction Set (3/3)

  • Thumb Instruction Format

  • ARM/Thumb interworking

  • The Need for Interworking

    The code density of Thumb and its performance from narrow memory make it ideal for the bulk of C code in many systems. However, there is still a need to change between ARM and Thumb state within most applications:

    ARM code provides better performance from wide memory
      Therefore ideal for speed-critical parts of an application

    Some functions can only be performed with ARM instructions, e.g.
      Access to CPSR (to enable/disable interrupts and to change mode)
      Access to coprocessors

    Exception handling
      ARM state is automatically entered for exception handling, but the system specification may require usage of Thumb code for the main handler

    Simple standalone Thumb programs will also need an ARM assembler header to change state and call the Thumb routine

  • ARM/Thumb Interworking

    Interworking can be carried out using the Branch Exchange instruction
      BX Rn    ;Thumb state Branch Exchange
      BX Rn    ;ARM state Branch Exchange

    Can also be used as an absolute branch without a state change

  • Example

            ;start off in ARM state
            CODE32
            ADR r0,Into_Thumb+1   ;generate branch target address and set bit 0,
                                  ;hence arrive in Thumb state
            BX  r0                ;branch exchange to Thumb

            CODE16                ;assemble subsequent code as Thumb
    Into_Thumb
            ADR r5,Back_to_ARM    ;generate branch target to word-aligned address,
                                  ;hence bit 0 is cleared
            BX  r5                ;branch exchange to ARM

            CODE32                ;assemble subsequent code as ARM
    Back_to_ARM

  • ARM organization

  • 3-Stage Pipeline ARM Organization

    Register bank
      2 read ports and 1 write port, which can access any register
      1 additional read port and 1 additional write port for r15 (PC)
    Barrel shifter
      Shifts or rotates the operand by any number of bits
    ALU
    Address register and incrementer
    Data registers
      Hold data passing to and from memory
    Instruction decoder and control

    [Figure: 3-stage pipeline ARM datapath, showing the address register and incrementer (A[31:0]), register bank with PC, multiplier, barrel shifter, ALU, data in and data out registers (D[31:0]), and instruction decode and control, connected by the A bus, B bus and ALU bus.]

  • 3-Stage Pipeline (1/2)

    Fetch
      The instruction is fetched from memory and placed in the instruction pipeline
    Decode
      The instruction is decoded and the datapath control signals are prepared for the next cycle
    Execute
      The register bank is read, an operand is shifted, and the ALU result is generated and written back into the destination register

  • 3-Stage Pipeline (2/2)

    At any time, 3 different instructions may occupy these stages, one per stage, so the hardware in each stage has to be capable of independent operation

    When the processor is executing data processing instructions, the latency is 3 cycles and the throughput is 1 instruction per cycle
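The latency and throughput figures can be checked with a small Python sketch of an ideal 3-stage pipeline. It assumes one single-cycle instruction enters the pipeline every cycle with no stalls; the names `pipeline_schedule` and `total_cycles` are hypothetical.

```python
STAGES = ["Fetch", "Decode", "Execute"]


def pipeline_schedule(n_instructions: int):
    """Cycle (1-based) in which each instruction occupies each stage,
    assuming an ideal pipeline: instruction i enters Fetch in cycle i + 1."""
    return [
        {stage: i + s + 1 for s, stage in enumerate(STAGES)}
        for i in range(n_instructions)
    ]


def total_cycles(n_instructions: int) -> int:
    """Cycles needed to complete n instructions: fill time plus one per instruction."""
    return n_instructions + len(STAGES) - 1


schedule = pipeline_schedule(3)
for i, row in enumerate(schedule):
    print(f"instr {i}: {row}")
print("10 instructions take", total_cycles(10), "cycles")
```

Each instruction takes 3 cycles from Fetch to the end of Execute (the latency), while one instruction completes in every cycle once the pipeline is full (the throughput).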

  • Data Processing Instruction

    [Figure: datapath activity for data processing instructions. (a) register - register operations, using Rn, Rm and Rd. (b) register - immediate operations, using Rn, Rd and the immediate field in bits [7:0] of the instruction.]

    All operations take place in a single clock cycle

  • Data Transfer Instructions

    A data transfer instruction computes a memory address in a manner similar to a data processing instruction

    A load instruction follows a similar pattern, except that the data from memory only gets as far as the data in register on the 2nd cycle, and a 3rd cycle is needed to transfer the data from there to the destination register

    [Figure: datapath activity for a data transfer instruction. (a) 1st cycle - compute address, using Rn and the offset field in bits [11:0]. (b) 2nd cycle - store data & auto-index, using Rn and Rd.]

  • Branch Instructions

    [Figure: datapath activity for a branch instruction. (a) 1st cycle - compute branch target, from the PC and the offset field in bits [23:0] shifted left by 2 (lsl #2). (b) 2nd cycle - save return address, in R14.]

    The third cycle, which is required to complete the pipeline refilling, is also used to make a small correction to the value stored in the link register, so that it points directly at the instruction which follows the branch

  • Multi-cycle Instruction

    Memory access (fetch, data transfer) in every cycle

    Datapath used in every cycle (execute, address calculation, data transfer)

    Decode logic generates the control signals for the datapath use in the next cycle (decode, address calculation)

  • Branch Pipeline Example

    Breaking the pipeline

    Note that the core is executing in the ARM state

  • 5-Stage Pipeline ARM Organization

    Tprog = Ninst * CPI / fclk
      Tprog: the time taken to execute a given program
      Ninst: the number of ARM instructions executed in the program => compiler dependent
      CPI: average number of clock cycles per instruction => hazards cause pipeline stalls
      fclk: clock frequency

    Separate instruction and data memories => 5-stage pipeline

    Used in ARM9TDMI
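The relation Tprog = Ninst * CPI / fclk can be tried out numerically with a short Python sketch. The function name `program_time` and the input values (one million instructions, a 100 MHz clock, CPI of 1.5 versus 1.2) are purely illustrative assumptions, not measurements of any ARM core.

```python
def program_time(n_inst: int, cpi: float, f_clk_hz: float) -> float:
    """Execution time in seconds: Tprog = Ninst * CPI / fclk."""
    return n_inst * cpi / f_clk_hz


# Hypothetical numbers, chosen only to show the effect of reducing CPI.
slow = program_time(1_000_000, 1.5, 100e6)
fast = program_time(1_000_000, 1.2, 100e6)
print(f"CPI 1.5: {slow * 1e3:.1f} ms, CPI 1.2: {fast * 1e3:.1f} ms")
```

With Ninst fixed by the compiler, the remaining levers are a lower CPI (fewer stalls) and a higher clock frequency, which is the motivation for the deeper pipeline.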

  • 5-Stage Pipeline Organization (1/2)

    Fetch
      The instruction is fetched from memory and placed in the instruction pipeline
    Decode
      The instruction is decoded and register operands are read from the register file. There are 3 operand read ports in the register file, so most ARM instructions can source all their operands in one cycle
    Execute
      An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU

    [Figure: 5-stage pipeline organization, showing the fetch (I-cache), instruction decode (register read), execute (shift, ALU, mul), buffer/data (D-cache) and write-back (register write) stages, with forwarding paths.]

  • 5-Stage Pipeline Organization (2/2)

    Buffer/Data
      Data memory is accessed if required. Otherwise the ALU result is simply buffered for one cycle
    Write back
      The results generated by the instruction are written back to the register file, including any data loaded from memory

    [Figure: 5-stage pipeline organization, showing the fetch, instruction decode, execute, buffer/data and write-back stages.]

  • Pipeline Hazards

    There are situations, called hazards, that prevent the next instruction in the instruction stream from being executed during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.

    There are three classes of hazards:
      Structural hazards: they arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
      Data hazards: they arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
      Control hazards: they arise from the pipelining of branches and other instructions that change the PC.

  • Structural Hazards

    When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.

    If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.

  • Example

    A machine has a single memory pipeline shared by data and instructions. As a result, when an instruction contains a data-memory reference (load), it will conflict with the instruction fetch for a later instruction (instr 3):

                 Clock cycle number
    instr        1    2    3    4    5    6    7    8
    load         IF   ID   EXE  MEM  WB
    Instr 1           IF   ID   EXE  MEM  WB
    Instr 2                IF   ID   EXE  MEM  WB
    Instr 3                     IF   ID   EXE  MEM  WB

  • Solution (1/2)

    To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.

                 Clock cycle number
    instr        1     2     3     4     5     6     7     8     9
    load         IF    ID    EXE   MEM   WB
    Instr 1            IF    ID    EXE   MEM   WB
    Instr 2                  IF    ID    EXE   MEM   WB
    Instr 3                        stall IF    ID    EXE   MEM   WB
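The stall in the table above can be reproduced with a small Python model of a 5-stage pipeline that has a single memory port. This is a simplified sketch under stated assumptions: only loads use the data memory, they do so in their MEM stage (3 cycles after IF), and a fetch simply waits while the port is busy. The function name `fetch_cycles` is hypothetical.

```python
def fetch_cycles(is_load):
    """Return the IF cycle (1-based) of each instruction.

    is_load[i] is True if instruction i accesses data memory in MEM.
    A fetch is delayed while an earlier load occupies the shared memory."""
    busy = set()   # cycles in which the memory serves a data access
    fetch = []
    cycle = 1
    for load in is_load:
        while cycle in busy:
            cycle += 1           # stall: memory taken by a load's MEM stage
        fetch.append(cycle)
        if load:
            busy.add(cycle + 3)  # MEM follows IF by three cycles
        cycle += 1
    return fetch


# load, Instr 1, Instr 2, Instr 3 as in the table
print(fetch_cycles([True, False, False, False]))
```

For the sequence in the table, the fourth instruction is fetched in cycle 5 instead of cycle 4, which is the one-cycle stall shown.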

  • Solution (2/2)

    Another solution is to use separate instruction and data memories

    The 5-stage ARM pipeline uses separate instruction and data memories (Harvard architecture), so it does not have this hazard

  • Data Hazards

    Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.

                        Clock cycle number
                        1      2      3      4      5      6      7      8      9
    ADD R1,R2,R3        IF     ID     EXE    MEM    WB
    SUB R4,R5,R1               IF     IDsub  EXE    MEM    WB
    AND R6,R1,R7                      IF     IDand  EXE    MEM    WB
    OR  R8,R1,R9                             IF     IDor   EXE    MEM    WB
    XOR R10,R11,R1                                  IF     IDxor  EXE    MEM    WB

  • Forwarding

    The problem with data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.

                        Clock cycle number
                        1      2      3      4      5      6      7
    ADD R1,R2,R3        IF     ID     EXE    MEM    WB
    SUB R4,R5,R1               IF     IDsub  EXE    MEM    WB
    AND R6,R1,R7                      IF     IDand  EXE    MEM    WB

  • Forward Data

                        Clock cycle number
                        1       2       3       4       5       6       7
    ADD R1,R2,R3        IF      ID      EXEadd  MEMadd  WB
    SUB R4,R5,R1                IF      ID      EXEsub  MEM     WB
    AND R6,R1,R7                        IF      ID      EXEand  MEM     WB

    The first forwarding is for the value of R1 from EXEadd to EXEsub. The second forwarding is also for the value of R1, from MEMadd to EXEand.

    This code can now be executed without stalls.

    Forwarding can be generalized to include passing the result directly to the functional unit that requires it
      A result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.

  • Forwarding Architecture

    Forwarding works as follows:
      The ALU result from the EXE/MEM register is always fed back to the ALU input latches.
      If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

    [Figure: 5-stage pipeline organization with the forwarding paths marked.]

  • Without Forward

                        Clock cycle number
                        1      2      3      4      5      6      7      8      9
    ADD R1,R2,R3        IF     ID     EXE    MEM    WB
    SUB R4,R5,R1               IF     stall  stall  IDsub  EXE    MEM    WB
    AND R6,R1,R7                      stall  stall  IF     IDand  EXE    MEM    WB

  • Data Forwarding

    Data dependency arises when an instruction needs to use the result of one of its predecessors before the result has returned to the register file => pipeline hazards

    Forwarding paths allow results to be passed between stages as soon as they are available

    The 5-stage pipeline requires each of the three source operands to be forwarded from any of the intermediate result registers

    Still one load stall
      LDR rN, []
      ADD r2,r1,rN    ;use rN immediately
      One stall
      Compiler rescheduling
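The effect of compiler rescheduling on the load stall can be sketched in Python. The model is deliberately minimal: it counts one stall whenever an instruction reads the destination of an LDR that immediately precedes it. The tuple representation, the function name `load_use_stalls`, and the register choices in the example are all hypothetical.

```python
def load_use_stalls(program):
    """Count load-use stalls in a list of (op, dest, sources) tuples."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LDR" and prev_dest in cur_sources:
            stalls += 1
    return stalls


# Hypothetical sequence: the ADD uses the loaded register immediately.
original = [
    ("LDR", "r3", ("r0",)),
    ("ADD", "r2", ("r1", "r3")),
    ("SUB", "r5", ("r6", "r7")),
]
# The independent SUB is moved between the load and its use.
rescheduled = [original[0], original[2], original[1]]

print(load_use_stalls(original), load_use_stalls(rescheduled))
```

Moving an independent instruction into the slot after the load removes the stall without changing the result, provided the moved instruction has no dependency on either neighbour.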

  • Stalls are required

                        1       2       3       4       5       6       7       8
    LDR R1,@(R2)        IF      ID      EXE     MEM     WB
    SUB R4,R1,R5                IF      ID      EXEsub  MEM     WB
    AND R6,R1,R7                        IF      ID      EXEand  MEM     WB
    OR  R8,R1,R9                                IF      ID      EXE     MEM     WB

    The load instruction has a delay or latency that cannot be eliminated by forwarding alone.

  • The Pipeline with one Stall

                        1       2       3       4       5       6       7       8       9
    LDR R1,@(R2)        IF      ID      EXE     MEM     WB
    SUB R4,R1,R5                IF      ID      stall   EXEsub  MEM     WB
    AND R6,R1,R7                        IF      stall   ID      EXE     MEM     WB
    OR  R8,R1,R9                                stall   IF      ID      EXE     MEM     WB

    The only necessary forwarding is done for R1 from MEM to EXEsub.
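The stall counts in the preceding pipeline tables (two stalls without forwarding, none with forwarding for an ALU result, one for a load) can be summarized in a small Python sketch. It assumes the 5-stage IF/ID/EXE/MEM/WB pipeline of the tables, with the register file written in the first half of WB and read in the second half of ID, as the table without forwarding implies. The function name `raw_stalls` is hypothetical.

```python
def raw_stalls(distance: int, producer_is_load: bool, forwarding: bool) -> int:
    """Stall cycles for a consumer `distance` instructions after its producer
    (distance 1 means the very next instruction)."""
    if distance < 1:
        raise ValueError("consumer must follow the producer")
    if not forwarding:
        # Consumer's ID must not come before the producer's WB.
        return max(0, 3 - distance)
    if producer_is_load:
        # Loaded data exists only after MEM, one cycle too late for the
        # EXE stage of the next instruction.
        return max(0, 2 - distance)
    return 0  # ALU results are forwarded straight to the next EXE stage


print(raw_stalls(1, False, False))  # ADD then SUB, no forwarding
print(raw_stalls(1, False, True))   # ADD then SUB, with forwarding
print(raw_stalls(1, True, True))    # LDR then SUB, with forwarding
```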

  • LDR Interlock

    In this example, it takes 7 clock cycles to execute 6 instructions, a CPI of about 1.2

    An LDR instruction immediately followed by a data operation using the same register causes an interlock

  • Optimal Pipelining

    In this example, it takes 6 clock cycles to execute 6 instructions, a CPI of 1

    The LDR instruction does not cause the pipeline to interlock

  • LDM Interlock (1/2)

    In this example, it takes 8 clock cycles to execute 5 instructions, a CPI of 1.6

    During the LDM there are parallel memory and write-back cycles

  • LDM Interlock (2/2)

    In this example, it takes 9 clock cycles to execute 5 instructions, a CPI of 1.8

    The SUB incurs a further cycle of interlock because it uses the highest specified register in the LDM instruction
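The CPI values quoted for the four interlock examples are simply cycles divided by instructions. A minimal Python check of that arithmetic follows; the function name `cpi` and the dictionary layout are hypothetical, while the cycle and instruction counts are the ones stated on the slides.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Average clock cycles per instruction."""
    return cycles / instructions


examples = {
    "LDR interlock": (7, 6),
    "Optimal pipelining": (6, 6),
    "LDM interlock (1/2)": (8, 5),
    "LDM interlock (2/2)": (9, 5),
}
for name, (cycles, instructions) in examples.items():
    print(f"{name}: CPI = {cpi(cycles, instructions):.2f}")
```

Note that 7 cycles for 6 instructions is about 1.17, which the slide rounds to 1.2.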

  • Control hazards (1/2)

    Control hazards can cause a greater performance loss for the ARM pipeline than data hazards.

    When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4.

    The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected, until the branch reaches the EXE stage.

    Branch                IF          ID          EXE         MEM         WB
    Branch successor                  IF (stall)  stall       IF          ID          EXE         MEM         WB
    Branch successor+1                                                    IF          ID          EXE         MEM         WB

  • Control hazards (2/2)

    The number of clock cycles in a branch stall can be reduced in two steps:
      Find out whether the branch is taken or not taken earlier in the pipeline
      Compute the taken PC (i.e., the address of the branch target) earlier

    We will discuss branch prediction schemes

  • Branch prediction

    A simple branch prediction scheme is to predict the branch as not taken, simply allowing the hardware to continue as if the branch were not executed.

    Care must be taken not to change the machine state until the branch outcome is definitely known.

  • Predict Not Taken

    The pipeline with this scheme implemented behaves as shown below:

    Untaken branch instr  IF    ID    EXE   MEM   WB
    Instr i+1                   IF    ID    EXE   MEM   WB
    Instr i+2                         IF    ID    EXE   MEM   WB

    Taken branch instr    IF    ID    EXE   MEM   WB
    Instr i+1                   IF    idle  idle  idle  idle
    Branch target                     IF    ID    EXE   MEM   WB
    Branch target+1                         IF    ID    EXE   MEM   WB

  • Predict Taken

    An alternative scheme is to predict the branch as taken.

    ARM employs a static branch prediction mechanism
      Conditional branches that branch backwards are predicted to be taken
      Conditional branches that branch forwards are predicted not to be taken
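The static rule described above (backward branches predicted taken, forward branches predicted not taken) can be written as a one-line predicate. The Python sketch below is only an illustration of the rule; the function name `predict_taken` and the example addresses are hypothetical.

```python
def predict_taken(branch_address: int, target_address: int) -> bool:
    """Static prediction: a conditional branch is predicted taken
    only if it branches backwards (target below the branch itself)."""
    return target_address < branch_address


# A loop-closing branch jumps backwards, so it is predicted taken.
print(predict_taken(0x1020, 0x1000))
# A forward skip over a block is predicted not taken.
print(predict_taken(0x1020, 0x1040))
```

The heuristic works because backward branches usually close loops, which are taken on every iteration except the last.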

  • Summary

    Instruction set
      32-bit ARM instructions
      16-bit Thumb instructions
    ARM/Thumb interworking
    ARM organization
      3-stage pipeline
        Fetch/Decode/Execute
      5-stage pipeline
        Fetch/Decode/Execute/Buffer/Write Back
      Pipeline hazards
        Structural hazard
        Data hazard
        Control hazard

  • References

    [1] http://twins.ee.nctu.edu.tw/courses/ip_core_02/index.html
    [2] ARM System-on-Chip Architecture, Second Edition, edited by S. Furber, Addison Wesley Longman: ISBN 0-201-67519-6.
    [3] Architecture Reference Manual, Second Edition, edited by D. Seal, Addison Wesley Longman: ISBN 0-201-73719-1.
    [4] www.arm.com
