ARM-final

DESCRIPTION

arm processor

  • 5/21/2018 ARM-final

    1/103

    1TM 139v10 The ARM Architecture

    ARM PROCESSOR,

    ESRTP,

    SEM-8,B.E.(ELECTRONICS)

    WITH BEST OF LUCK FROM:

    Prof. Vidya Gogate

    SAKEC, Chembur

    THE ARCHITECTURE FOR THE DIGITAL WORLD

    The ARM Architecture

    Agenda

    Introduction to ARM Ltd

    Programmer's Model

    Instruction Set

    System Design

    Development Tools

    ARM Ltd

    Founded in November 1990

    Spun out of Acorn Computers

    Designs the ARM range of RISC processor cores

    Licenses ARM core designs to semiconductor partners, who fabricate and sell to their customers.

    ARM does not fabricate silicon itself

    Also develops technologies to assist with the design-in of the ARM architecture

    Software tools, boards, debug hardware, application software, bus architectures, peripherals, etc.

    ARM Partnership Model

    ARM Powered Products

    Latest NEWS

    For 30 years, Intel basically had the market to itself and, as a result, its chips were priced much higher than ARM chips.

    But ARM's entry into Intel's territory is a real threat to Intel's dominance.

    In fact, Microsoft's first version of the Surface tablet/keyboard device uses ARM chips. The Intel-based versions will come out about three months later.

    http://techland.time.com/2012/06/18/surface-microsoft-made-windows-8-tablets-coming-this-year/

    Latest NEWS

    There has been an important new development in the processor world lately. The folks behind the ARM processor, the chip that powers most smart phones and tablets today, decided to scale up this processor technology to run at speeds that could be used in advanced tablets and, more importantly, laptops and even desktops.

    And ARM processors got a major boost when Microsoft made the decision to create an ARM-based version of Windows 8. For the first time, Microsoft broke away from what's been called the "Win-Tel" monopoly.

    Latest NEWS

    But Microsoft's new operating system for ARM, called Windows RT, has touched off a new battlefront in processor wars.

    This has forced Intel to try and make its x86 processors more energy efficient; the company hopes to have chips that are on par with ARM by mid-2013.

    Read more: http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/

    RISC Design Philosophy

    Instructions: RISC processors have a reduced number of instruction classes, which provide simple operations that can each execute in a single cycle. In contrast, in CISC processors the instructions are often of variable size and take many cycles to execute.

    Pipelines: The processing of instructions is broken down into smaller units that can be executed in parallel by pipelines. There is no need for an instruction to be executed by a mini-program called microcode, as on CISC processors.

    Registers: RISC machines have a large general-purpose register set. Any register can contain either data or an address. Registers act as the fast local memory store for all data processing operations. In contrast, CISC processors have dedicated registers for specific purposes.

    RISC Design Philosophy

    Load-store architecture: The processor operates on data held in registers. Separate load and store instructions transfer data between the register bank and external memory.

    Memory accesses are costly, so separating memory accesses from data processing provides an advantage because you can use data items held in the register bank multiple times without needing multiple memory accesses.

    In contrast, with a CISC design the data processing operations can act on memory directly.

    RISC Design Philosophy

    These design rules allow a RISC processor to be simpler, and thus the core can operate at higher clock frequencies.

    In contrast, traditional CISC processors are more complex and operate at lower clock frequencies.

    ARM Design Philosophy

    Portable embedded systems require some form of battery power. The ARM processor has been specifically designed to be small, to reduce power consumption and extend battery operation, which is essential for applications such as PDAs.

    Embedded systems have limited memory due to cost and/or physical size restrictions, so high code density is a useful feature of ARM for applications that have limited on-board memory. The ability to use low-cost memory devices produces substantial savings.

    For a single-chip solution, the smaller the area used by the embedded processor (reduced die size), the more space is available for specialized peripherals. This in turn reduces the cost of the design and manufacturing, since fewer discrete chips are required for the end product.

    ARM Design Philosophy

    ARM has incorporated hardware debug technology within the processor so that software engineers can view what is happening while the processor is executing code.

    With greater visibility, software engineers can resolve issues faster, which has a direct effect on the time to market and reduces overall development costs.

    Instruction set FEATURES

    Variable cycle execution for certain instructions: Load-store-multiple instructions vary in the number of execution cycles depending upon the number of registers being transferred. The transfer can occur on sequential memory addresses, which increases performance since sequential memory accesses are often faster than random accesses. Code density is also improved since multiple register transfers are common operations at the start and end of functions.

    Inline barrel shifter leading to more complex instructions: The inline barrel shifter is a hardware component that preprocesses one of the input registers before it is used by an instruction. This expands the capability of many instructions to improve core performance and code density.

    Instruction set FEATURES

    Thumb 16-bit instruction set: The 16-bit instructions improve code density by about 30% over 32-bit fixed-length instructions.

    Conditional execution: An instruction is only executed when a specific condition has been satisfied. This feature improves performance and code density by reducing branch instructions.

    Enhanced instructions: The enhanced digital signal processor (DSP) instructions were added to the standard ARM instruction set to support fast 16x16-bit multiplier operations and saturation. These instructions allow a faster-performing ARM processor in some cases to replace the traditional combinations of a processor plus a DSP.

    Nomenclature.

    Instruction set architecture (ISA) is upward compatible.

    ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}

    x: family
    y: memory management/protection unit
    z: cache
    T: Thumb 16-bit decoder
    D: JTAG debug
    M: fast multiplier
    I: Embedded ICE macro-cell
    E: enhanced instructions (assumes TDMI)
    J: Jazelle
    F: vector floating-point unit
    S: synthesizable version

    Nomenclature

    All ARM cores after the ARM7TDMI include the TDMI features even though they may not include those letters after the "ARM" label.

    The processor family is a group of processor implementations that share the same hardware characteristics.

    For example, the ARM7TDMI, ARM740T, and ARM720T all share the same family characteristics and belong to the ARM7 family.

    Nomenclature

    JTAG is described by IEEE 1149.1 Standard Test Access Port and boundary scan architecture.

    It is a serial protocol used by ARM to send and receive debug information between the processor core and test equipment.

    Embedded ICE macro-cell is the debug hardware built into the processor that allows breakpoints and watch-points to be set.

    Synthesizable means that the processor core is supplied as source code that can be compiled into a form easily used by EDA tools.

    Syllabus

    ARM processor fundamentals: introduction to ARM and THUMB instruction set

    Processor and memory organization: CPU bus configuration, ARM bus, memory devices, input/output devices

    Component interfacing, designing with microprocessor, development and debugging

    Design example

    Instruction set with enhanced DSP features with ARM core, mixed-mode programming as Thumb + ARM core

    Assembly programming concept; comparison of ARM7, ARM9 and ARM11 with new feature additions

    PIN DIAGRAM

    Architecture block diagram

    Hardware Fundamentals

    The ARM processor can be abstracted into eight components: ALU, barrel shifter, MAC, register file, instruction decoder, address register, incrementer, and sign extend.

    Data Sizes and Instruction Sets

    The ARM is a 32-bit architecture.

    When used in relation to the ARM:

    Byte means 8 bits

    Halfword means 16 bits (two bytes)

    Word means 32 bits (four bytes)

    Most ARMs implement two instruction sets

    32-bit ARM Instruction Set

    16-bit Thumb Instruction Set

    Jazelle cores can also execute Java bytecode

    Processor Modes

    The ARM has seven basic operating modes:

    User: unprivileged mode under which most tasks run

    FIQ: entered when a high priority (fast) interrupt is raised

    IRQ: entered when a low priority (normal) interrupt is raised

    Supervisor: entered on reset and when a Software Interrupt instruction is executed

    Abort: used to handle memory access violations

    Undef: used to handle undefined instructions

    System: privileged mode using the same registers as user mode

    The ARM Register Set

    [Figure: animated register diagram showing, for each of User, FIQ, IRQ, SVC, Undef and Abort mode, the "Current Visible Registers" and the "Banked out Registers". User mode sees r0-r12, r13 (sp), r14 (lr), r15 (pc) and cpsr. FIQ mode replaces r8-r12, r13 (sp) and r14 (lr) with its own banked copies and adds an spsr. IRQ, SVC, Undef and Abort modes each replace r13 (sp) and r14 (lr) with their own banked copies and add an spsr.]

    Register Organization Summary

    User:   r0-r12, r13 (sp), r14 (lr), r15 (pc), cpsr
    FIQ:    User mode r0-r7, r15, and cpsr; banked r8-r12, r13 (sp), r14 (lr), spsr
    IRQ:    User mode r0-r12, r15, and cpsr; banked r13 (sp), r14 (lr), spsr
    Undef:  User mode r0-r12, r15, and cpsr; banked r13 (sp), r14 (lr), spsr
    SVC:    User mode r0-r12, r15, and cpsr; banked r13 (sp), r14 (lr), spsr
    Abort:  User mode r0-r12, r15, and cpsr; banked r13 (sp), r14 (lr), spsr

    Thumb state low registers: r0-r7
    Thumb state high registers: r8-r15

    Note: System mode uses the User mode register set

    The Registers

    ARM has 37 registers, all of which are 32 bits long.
        1 dedicated program counter
        1 dedicated current program status register
        5 dedicated saved program status registers
        30 general purpose registers

    The current processor mode governs which of several banks is accessible. Each mode can access
        a particular set of r0-r12 registers
        a particular r13 (the stack pointer, sp) and r14 (the link register, lr)
        the program counter, r15 (pc)
        the current program status register, cpsr

    Privileged modes (except System) can also access
        a particular spsr (saved program status register)
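
    The register counts above can be checked with a small model of the banking rules. This is only a sketch under the banking scheme described on these slides; the function names and the `r13_irq`-style labels are made up for illustration and are not ARM terminology.

```python
# Sketch of register banking in the classic programmer's model.
# All names here (physical_register, count_registers, label format) are
# hypothetical and exist only for this illustration.
EXCEPTION_MODES = ["fiq", "irq", "svc", "undef", "abort"]


def physical_register(mode: str, n: int) -> str:
    """Return a label for the physical register behind r<n> in `mode`."""
    if not 0 <= n <= 15:
        raise ValueError("register number must be 0-15")
    if mode == "system":
        mode = "user"             # System mode uses the User mode register set
    if n == 15:
        return "r15"              # a single program counter shared by all modes
    if mode == "fiq" and 8 <= n <= 14:
        return f"r{n}_fiq"        # FIQ banks r8-r14
    if mode in ("irq", "svc", "undef", "abort") and n in (13, 14):
        return f"r{n}_{mode}"     # other exception modes bank sp and lr only
    return f"r{n}"                # shared with User mode


def count_registers() -> dict:
    """Count the distinct physical registers by category."""
    modes = ["user", "system"] + EXCEPTION_MODES
    general = {physical_register(m, n) for m in modes for n in range(15)}
    return {
        "general_purpose": len(general),
        "pc": 1,
        "cpsr": 1,
        "spsr": len(EXCEPTION_MODES),   # one spsr per exception mode
    }


counts = count_registers()
total = sum(counts.values())
```

    Under these assumptions `counts["general_purpose"]` comes out as 30 and `total` as 37, matching the breakdown on the slide.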

    Program Status Registers

    Condition code flags
        N = Negative result from ALU
        Z = Zero result from ALU
        C = ALU operation Carried out
        V = ALU operation oVerflowed

    Sticky Overflow flag - Q flag
        Architecture 5TE/J only
        Indicates if saturation has occurred

    J bit
        Architecture 5TEJ only
        J = 1: Processor in Jazelle state

    Interrupt Disable bits
        I = 1: Disables the IRQ.
        F = 1: Disables the FIQ.

    T Bit
        Architecture xT only
        T = 0: Processor in ARM state
        T = 1: Processor in Thumb state

    Mode bits
        Specify the processor mode

    Bit layout:

        Bits     31  30  29  28  27  26:25        24  23:8         7  6  5  4:0
        Field    N   Z   C   V   Q   (undefined)  J   (undefined)  I  F  T  mode

        Field masks: f = bits 31:24, s = bits 23:16, x = bits 15:8, c = bits 7:0
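
    As a rough illustration of the layout above, the following sketch unpacks a 32-bit PSR value into its fields. The function name and the returned dictionary are assumptions made for this example; the mode encodings are the standard ones for the seven modes listed earlier.

```python
# Sketch: decode a program status register value into named fields.
# decode_psr and the dictionary keys are hypothetical helper names.
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "Supervisor",
    0b10111: "Abort",
    0b11011: "Undef",
    0b11111: "System",
}


def decode_psr(value: int) -> dict:
    """Split a 32-bit CPSR/SPSR value into flags, control bits and mode."""
    def bit(n: int) -> int:
        return (value >> n) & 1

    if bit(24):
        state = "Jazelle"
    elif bit(5):
        state = "Thumb"
    else:
        state = "ARM"

    return {
        "N": bit(31), "Z": bit(30), "C": bit(29), "V": bit(28),
        "Q": bit(27), "J": bit(24),
        "I": bit(7), "F": bit(6), "T": bit(5),
        "mode": MODES.get(value & 0x1F, "reserved"),
        "state": state,
    }


example = decode_psr(0x600000D3)
```

    For the value `0x600000D3`, this gives Z and C set, both interrupts disabled, ARM state and Supervisor mode.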

    Program Counter (r15)

    When the processor is executing in ARM state:
        All instructions are 32 bits wide
        All instructions must be word aligned
        Therefore the pc value is stored in bits [31:2] with bits [1:0] undefined (as an instruction cannot be halfword or byte aligned).

    When the processor is executing in Thumb state:
        All instructions are 16 bits wide
        All instructions must be halfword aligned
        Therefore the pc value is stored in bits [31:1] with bit [0] undefined (as an instruction cannot be byte aligned).

    When the processor is executing in Jazelle state:
        All instructions are 8 bits wide
        Processor performs a word access to read 4 instructions at once

    Exception Handling

    When an exception occurs, the ARM:
        Copies CPSR into SPSR_<mode>
        Sets appropriate CPSR bits
            Change to ARM state
            Change to exception mode
            Disable interrupts (if appropriate)
        Stores the return address in LR_<mode>
        Sets PC to vector address

    To return, exception handler needs to:
        Restore CPSR from SPSR_<mode>
        Restore PC from LR_<mode>

    This can only be done in ARM state.

    Vector Table

        Address   Exception
        0x1C      FIQ
        0x18      IRQ
        0x14      (Reserved)
        0x10      Data Abort
        0x0C      Prefetch Abort
        0x08      Software Interrupt
        0x04      Undefined Instruction
        0x00      Reset

    Vector table can be at 0xFFFF0000 on ARM720T and on ARM9/10 family devices
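
    The entry and return steps listed above can be traced with a toy model. This is a simplified sketch, not a description of any particular core: the `Core` class, the function names and the `high_vectors` flag are hypothetical, `return_address` is assumed to be the already-adjusted address the handler should return to, and reset is left out because no meaningful return state exists for it.

```python
# Sketch of exception entry and return, following the steps on the slide.
from dataclasses import dataclass, field

# exception name -> (vector offset, mode entered)
VECTORS = {
    "undefined": (0x04, "undef"),
    "swi": (0x08, "svc"),
    "prefetch_abort": (0x0C, "abort"),
    "data_abort": (0x10, "abort"),
    "irq": (0x18, "irq"),
    "fiq": (0x1C, "fiq"),
}
MODE_BITS = {"fiq": 0b10001, "irq": 0b10010, "svc": 0b10011,
             "abort": 0b10111, "undef": 0b11011}
I_BIT, F_BIT, T_BIT = 1 << 7, 1 << 6, 1 << 5


@dataclass
class Core:
    cpsr: int = 0b10000                        # User mode, ARM state
    pc: int = 0
    spsr: dict = field(default_factory=dict)   # banked SPSR_<mode>
    lr: dict = field(default_factory=dict)     # banked LR_<mode>


def take_exception(core: Core, kind: str, return_address: int,
                   high_vectors: bool = False) -> str:
    offset, mode = VECTORS[kind]
    core.spsr[mode] = core.cpsr                # copy CPSR into SPSR_<mode>
    cpsr = core.cpsr & ~(T_BIT | 0x1F)         # ARM state, clear mode bits
    cpsr |= MODE_BITS[mode] | I_BIT            # exception mode, IRQ disabled
    if kind == "fiq":
        cpsr |= F_BIT                          # FIQ entry also disables FIQ
    core.cpsr = cpsr
    core.lr[mode] = return_address             # store return address in LR_<mode>
    core.pc = (0xFFFF0000 if high_vectors else 0) + offset
    return mode


def return_from_exception(core: Core, mode: str) -> None:
    core.cpsr = core.spsr[mode]                # restore CPSR from SPSR_<mode>
    core.pc = core.lr[mode]                    # restore PC from LR_<mode>


core = Core(cpsr=0x60000030)                   # User mode, Thumb state, Z and C set
entered = take_exception(core, "irq", return_address=0x8004)
```

    After the call, `core.pc` is `0x18`, the T bit is clear and the I bit is set; `return_from_exception(core, entered)` puts the original CPSR and the return address back.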

    Instruction set

    ARM has three instruction sets: ARM, Thumb, and Jazelle.

    The register file contains 37 registers, but only 17 or 18 registers are accessible at any point in time; the rest are banked according to processor mode.

    The current processor mode is stored in the CPSR. It holds the current status of the processor core as well as interrupt masks, condition flags, and state.

    The state determines which instruction set is being executed.

    Conditional Execution and Flags

    ARM instructions can be made to execute conditionally by postfixing them with the appropriate condition code field.

    This improves code density and performance by reducing the number of forward branch instructions.

        CMP   r3,#0                 CMP   r3,#0
        BEQ   skip                  ADDNE r0,r1,r2
        ADD   r0,r1,r2
    skip

    By default, data processing instructions do not affect the condition code flags, but the flags can be optionally set by using "S". CMP does not need "S".

    loop
        SUBS  r1,r1,#1              ; decrement r1 and set flags
        BNE   loop                  ; if Z flag clear then branch

    Condition Codes

    The possible condition codes are listed below. Note AL is the default and does not need to be specified.

        Suffix   Description               Flags tested
        EQ       Equal                     Z=1
        NE       Not equal                 Z=0
        CS/HS    Unsigned higher or same   C=1
        CC/LO    Unsigned lower            C=0
        MI       Minus                     N=1
        PL       Positive or Zero          N=0
        VS       Overflow                  V=1
        VC       No overflow               V=0
        HI       Unsigned higher           C=1 & Z=0
        LS       Unsigned lower or same    C=0 or Z=1
        GE       Greater or equal          N=V
        LT       Less than                 N!=V
        GT       Greater than              Z=0 & N=V
        LE       Less than or equal        Z=1 or N!=V
        AL       Always
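
    To make the "flags tested" column concrete, here is a small sketch that evaluates a condition suffix against the N, Z, C and V flags, plus a helper that computes the flags a CMP would produce. Both function names are invented for this example.

```python
# Sketch: evaluate ARM condition codes from the NZCV flags.
# condition_passed and cmp_flags are hypothetical helper names.

def condition_passed(cond: str, n: int, z: int, c: int, v: int) -> bool:
    """Return True if condition suffix `cond` passes for the given flags."""
    tests = {
        "EQ": z == 1, "NE": z == 0,
        "CS": c == 1, "HS": c == 1,
        "CC": c == 0, "LO": c == 0,
        "MI": n == 1, "PL": n == 0,
        "VS": v == 1, "VC": v == 0,
        "HI": c == 1 and z == 0,
        "LS": c == 0 or z == 1,
        "GE": n == v, "LT": n != v,
        "GT": z == 0 and n == v,
        "LE": z == 1 or n != v,
        "AL": True,
    }
    return tests[cond.upper()]


def cmp_flags(a: int, b: int) -> tuple:
    """Flags (N, Z, C, V) set by CMP a, b, i.e. by computing a - b in 32 bits."""
    a &= 0xFFFFFFFF
    b &= 0xFFFFFFFF
    result = (a - b) & 0xFFFFFFFF
    n = result >> 31
    z = int(result == 0)
    c = int(a >= b)                            # carry set means no borrow
    v = ((a ^ b) & (a ^ result)) >> 31         # signed overflow on subtract
    return n, z, c, v


flags = cmp_flags(3, 5)
```

    With `flags = cmp_flags(3, 5)`, both `condition_passed("LT", *flags)` and `condition_passed("LO", *flags)` hold, since 3 is below 5 in signed and unsigned terms alike.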

    Examples of conditional execution

    Use a sequence of several conditional instructions

        if (a==0) func(1);

        CMP   r0,#0
        MOVEQ r0,#1
        BLEQ  func

    Set the flags, then use various condition codes

        if (a==0) x=0;
        if (a>0)  x=1;

        CMP   r0,#0
        MOVEQ r1,#0
        MOVGT r1,#1

    Use conditional compare instructions

        if (a==4 || a==10) x=0;

        CMP   r0,#4
        CMPNE r0,#10
        MOVEQ r1,#0

    Branch instructions

    Branch:            B{<cond>} label
    Branch with Link:  BL{<cond>} subroutine_label

        Bits     31:28   27:25   24   23:0
        Field    Cond    1 0 1   L    Offset

        Cond: condition field
        L: link bit (0 = Branch, 1 = Branch with link)

    The processor core shifts the offset field left by 2 positions, sign-extends it and adds it to the PC
        +/-32 Mbyte range
        How to perform longer branches?
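
    The offset arithmetic can be sketched as follows. The helper names are hypothetical; the sketch also assumes the usual ARM-state convention that the PC reads as the address of the branch instruction plus 8.

```python
# Sketch: compute the target of an ARM B/BL instruction.
# sign_extend and decode_branch are hypothetical helper names.

def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def decode_branch(instr: int, address: int) -> tuple:
    """Return (link, target) for the B/BL word `instr` located at `address`."""
    if (instr >> 25) & 0b111 != 0b101:
        raise ValueError("not a B/BL instruction")
    link = bool((instr >> 24) & 1)
    # 24-bit word offset: sign-extend, then shift left by 2 to get bytes.
    offset = sign_extend(instr & 0xFFFFFF, 24) << 2
    # Assumption: in ARM state the PC reads as the instruction address + 8.
    target = (address + 8 + offset) & 0xFFFFFFFF
    return link, target


# Largest forward and backward byte offsets that fit in the 24-bit field.
MAX_FORWARD = ((1 << 23) - 1) << 2
MAX_BACKWARD = -(1 << 23) << 2
```

    `MAX_FORWARD` and `MAX_BACKWARD` work out to just under +32 Mbyte and exactly -32 Mbyte, which is where the quoted range comes from.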

    Data processing Instructions

    Consist of:
        Arithmetic:     ADD ADC SUB SBC RSB RSC
        Logical:        AND ORR EOR BIC
        Comparisons:    CMP CMN TST TEQ
        Data movement:  MOV MVN

    These instructions only work on registers, NOT memory.

    Syntax:
        <Operation>{<cond>}{S} Rd, Rn, Operand2

        Comparisons set flags only - they do not specify Rd
        Data movement does not specify Rn

    Second operand is sent to the ALU via barrel shifter.

    The Barrel Shifter

    LSL: Logical Left Shift
        Multiplication by a power of 2
        [Figure: CF <- Destination <- 0]

    LSR: Logical Shift Right
        Division by a power of 2
        [Figure: ...0 -> Destination -> CF]

    ASR: Arithmetic Right Shift
        Division by a power of 2, preserving the sign bit
        [Figure: Destination -> CF, with the sign bit preserved]

    ROR: Rotate Right
        Bit rotate with wrap around from LSB to MSB
        [Figure: Destination -> CF, with the LSB wrapping around to the MSB]

    RRX: Rotate Right Extended
        Single bit rotate with wrap around from CF to MSB
        [Figure: CF -> Destination -> CF]
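
    The five shift types can be modelled on 32-bit values as below. This is an illustrative sketch only: the function names are invented, each returns a `(result, carry_out)` pair, and shift amounts are restricted to 1-31 to sidestep the special cases for 0 and 32.

```python
# Sketch of the barrel shifter operations on 32-bit values.
# Each function returns (result, carry_out). Names are hypothetical.
MASK = 0xFFFFFFFF


def _check(amount: int) -> None:
    if not 1 <= amount <= 31:
        raise ValueError("this sketch only handles shift amounts 1-31")


def lsl(value: int, amount: int) -> tuple:
    _check(amount)
    carry = (value >> (32 - amount)) & 1       # last bit shifted out at the top
    return (value << amount) & MASK, carry


def lsr(value: int, amount: int) -> tuple:
    _check(amount)
    carry = (value >> (amount - 1)) & 1        # last bit shifted out at the bottom
    return (value & MASK) >> amount, carry


def asr(value: int, amount: int) -> tuple:
    _check(amount)
    carry = (value >> (amount - 1)) & 1
    signed = value - (1 << 32) if value & 0x80000000 else value
    return (signed >> amount) & MASK, carry    # sign bit is replicated


def ror(value: int, amount: int) -> tuple:
    _check(amount)
    result = ((value >> amount) | (value << (32 - amount))) & MASK
    return result, result >> 31                # carry is the new MSB


def rrx(value: int, carry_in: int) -> tuple:
    result = ((carry_in & 1) << 31) | ((value & MASK) >> 1)
    return result, value & 1                   # old LSB moves into the carry
```

    For instance, `ror(0x000000FF, 8)` moves the low byte to the top of the word, and `asr(0x80000000, 4)` keeps the sign bit set while dividing by 16.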

    Using the Barrel Shifter: The Second Operand

    Register, optionally with shift operation
        Shift value can be either:
            5-bit unsigned integer
            Specified in bottom byte of another register
        Used for multiplication by constant

    Immediate value
        8-bit number, with a range of 0-255
            Rotated right through even number of positions
        Allows increased range of 32-bit constants to be loaded directly into registers

    [Figure: Operand1 feeds the ALU directly; Operand2 passes through the barrel shifter before reaching the ALU, which produces the Result.]

    Immediate constants (1)

    No ARM instruction can contain a 32 bit immediate constant
        All ARM instructions are fixed as 32 bits long

    The data processing instruction format has 12 bits available for operand2

        Bits     11:8   7:0
        Field    rot    immed_8

        immed_8 is rotated right (ROR) in the shifter by rot x 2 positions

    4 bit rotate value (0-15) is multiplied by two to give range 0-30 in steps of 2

    Rule to remember is "8-bits shifted by an even number of bit positions".

    Quick Quiz:
        0xe3a004ff
        MOV r0, #???
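
    The rotate rule can be tried out with a short sketch that decodes a 12-bit operand2 immediate and searches for an encoding of a given constant. The function names are made up for this example. Several encodings can exist for the same value, so the pair found here is not necessarily the one an assembler would pick.

```python
# Sketch: decode and encode the 12-bit immediate form of operand2.
# ror32, decode_immediate and encode_immediate are hypothetical names.

def ror32(value: int, amount: int) -> int:
    """Rotate a 32-bit value right by `amount` bits."""
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def decode_immediate(operand2: int) -> int:
    """Expand the rot (bits 11:8) and immed_8 (bits 7:0) fields to 32 bits."""
    rot = (operand2 >> 8) & 0xF
    imm8 = operand2 & 0xFF
    return ror32(imm8, 2 * rot)


def encode_immediate(value: int):
    """Return some (rot, immed_8) pair encoding value, or None if none exists."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        # Rotating left by 2*rot undoes the ROR applied when decoding.
        imm8 = ror32(value, 32 - 2 * rot)
        if imm8 <= 0xFF:
            return rot, imm8
    return None
```

    As a usage example, the low 12 bits of the quiz instruction can be passed to `decode_immediate`, and `encode_immediate(0x55555555)` returns `None` because the set bits do not fit in a rotated 8-bit window.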

    Immediate constants (2)

    Examples:

        ror #0     range 0-0x000000ff   step 0x00000001
        ror #8     range 0-0xff000000   step 0x01000000
        ror #30    range 0-0x000003fc   step 0x00000004

    The assembler converts immediate values to the rotate form:
        MOV r0,#4096          ; uses 0x40 ror 26
        ADD r1,r2,#0xFF0000   ; uses 0xFF ror 16

    The bitwise complements can also be formed using MVN:
        MOV r0,#0xFFFFFFFF    ; assembles to MVN r0,#0

    Values that cannot be generated in this way will cause an error.

    Loading 32 bit constants

    To allow larger constants to be loaded, the assembler offers a pseudo-instruction:
        LDR rd, =const

    This will either:
        Produce a MOV or MVN instruction to generate the value (if possible).
      or
        Generate a LDR instruction with a PC-relative address to read the constant from a literal pool (constant data area embedded in the code).

    For example
        LDR r0,=0xFF          =>  MOV r0,#0xFF
        LDR r0,=0x55555555    =>  LDR r0,[PC,#Imm12]
                                  DCD 0x55555555

    This is the recommended way of loading constants into a register
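
    The decision the assembler makes for `LDR rd, =const` can be approximated like this. It is a sketch of the choice described above, not the behaviour of any specific assembler; the function names and the textual output format (including the `#offset` placeholder) are assumptions for illustration.

```python
# Sketch: choose between MOV, MVN and a literal-pool load for LDR rd, =const.
# is_encodable and expand_ldr_pseudo are hypothetical names.

def ror32(value: int, amount: int) -> int:
    amount %= 32
    return ((value >> amount) | (value << (32 - amount))) & 0xFFFFFFFF


def is_encodable(value: int) -> bool:
    """True if value is an 8-bit number rotated right by an even amount."""
    value &= 0xFFFFFFFF
    return any(ror32(value, 32 - 2 * rot) <= 0xFF for rot in range(16))


def expand_ldr_pseudo(rd: str, const: int) -> str:
    """Return the instruction text the pseudo-instruction could expand to."""
    const &= 0xFFFFFFFF
    if is_encodable(const):
        return f"MOV {rd},#0x{const:X}"
    inverted = ~const & 0xFFFFFFFF
    if is_encodable(inverted):
        return f"MVN {rd},#0x{inverted:X}"     # load the bitwise complement
    # Otherwise fall back to a PC-relative load from a literal pool entry.
    return f"LDR {rd},[PC,#offset] ; literal pool: DCD 0x{const:08X}"
```

    For example, `expand_ldr_pseudo("r0", 0xFF)` yields a `MOV`, while `0x55555555` falls through to the literal-pool form.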

    Multiply

    Syntax:

        MUL{<cond>}{S} Rd, Rm, Rs                    Rd = Rm * Rs
        MLA{<cond>}{S} Rd, Rm, Rs, Rn                Rd = (Rm * Rs) + Rn
        [U|S]MULL{<cond>}{S} RdLo, RdHi, Rm, Rs      RdHi,RdLo := Rm*Rs
        [U|S]MLAL{<cond>}{S} RdLo, RdHi, Rm, Rs      RdHi,RdLo := (Rm*Rs)+RdHi,RdLo

    Cycle time
        Basic MUL instruction
            2-5 cycles on ARM7TDMI
            1-3 cycles on StrongARM/XScale
            2 cycles on ARM9E/ARM102xE
        +1 cycle for ARM9TDMI (over ARM7TDMI)
        +1 cycle for accumulate (not on 9E, though result delay is one cycle longer)
        +1 cycle for "long"

    Above are general rules - refer to the TRM for the core you are using for the exact details
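
    The result widths of these instructions can be illustrated with a small arithmetic model. It covers only the values produced (no flags or timing), and the Python function names simply mirror the mnemonics for readability.

```python
# Sketch: arithmetic performed by the multiply instructions (values only).
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x: int) -> int:
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mul(rm: int, rs: int) -> int:
    return (rm * rs) & MASK32                  # Rd = low 32 bits of Rm * Rs


def mla(rm: int, rs: int, rn: int) -> int:
    return (rm * rs + rn) & MASK32             # Rd = (Rm * Rs) + Rn


def umull(rm: int, rs: int) -> tuple:
    product = (rm & MASK32) * (rs & MASK32)
    return product & MASK32, product >> 32     # (RdLo, RdHi)


def smull(rm: int, rs: int) -> tuple:
    product = (to_signed32(rm) * to_signed32(rs)) & MASK64
    return product & MASK32, product >> 32     # (RdLo, RdHi)


def umlal(rdlo: int, rdhi: int, rm: int, rs: int) -> tuple:
    acc = ((rdhi & MASK32) << 32) | (rdlo & MASK32)
    acc = (acc + (rm & MASK32) * (rs & MASK32)) & MASK64
    return acc & MASK32, acc >> 32             # (RdLo, RdHi)
```

    The difference between the unsigned and signed long forms shows up with an operand such as `0xFFFFFFFF`, which `umull` treats as 4294967295 and `smull` treats as -1.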

    Single register data transfer

        LDR      STR      Word
        LDRB     STRB     Byte
        LDRH     STRH     Halfword
        LDRSB             Signed byte load
        LDRSH             Signed halfword load

    Memory system must support all access sizes

    Syntax:
        LDR{<cond>}{<size>} Rd, <address>
        STR{<cond>}{<size>} Rd, <address>

    e.g. LDREQB

    Address accessed

    Address accessed by LDR/STR is specified by a base register plus an offset

    For word and unsigned byte accesses, offset can be
        An unsigned 12-bit immediate value (i.e. 0-4095 bytes).
            LDR r0,[r1,#8]
        A register, optionally shifted by an immediate value
            LDR r0,[r1,r2]
            LDR r0,[r1,r2,LSL#2]

    This can be either added or subtracted from the base register:
        LDR r0,[r1,#-8]
        LDR r0,[r1,-r2]
        LDR r0,[r1,-r2,LSL#2]

    For halfword and signed halfword / byte, offset can be:
        An unsigned 8-bit immediate value (i.e. 0-255 bytes).
        A register (unshifted).

    Choice of pre-indexed or post-indexed addressing

    Pre or Post Indexed Addressing?

    Pre-indexed: STR r0,[r1,#12]
        [Figure: base register r1 = 0x200, offset 12. The source register for STR, r0 = 0x5, is stored at address 0x20c. r1 is unchanged.]
        Auto-update form: STR r0,[r1,#12]!

    Post-indexed: STR r0,[r1],#12
        [Figure: original base register r1 = 0x200. The source register for STR, r0 = 0x5, is stored at address 0x200. The base register is then updated to r1 = 0x20c (offset 12).]
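
    The three STR forms can be compared side by side with a tiny model that uses the same numbers as the figure (r0 = 0x5, r1 = 0x200, offset 12). The function name, the `form` strings and the dictionary-based memory are all assumptions of this sketch.

```python
# Sketch: pre-indexed, auto-update and post-indexed forms of STR.
# store_word and the "form" names are hypothetical.

def store_word(memory: dict, regs: dict, rd: str, rn: str,
               offset: int, form: str) -> int:
    """Store regs[rd] using base register rn; return the address written."""
    base = regs[rn]
    value = regs[rd]
    if form == "pre":                  # STR rd,[rn,#offset]
        address = base + offset
    elif form == "pre_writeback":      # STR rd,[rn,#offset]!
        address = base + offset
        regs[rn] = address             # base register is updated
    elif form == "post":               # STR rd,[rn],#offset
        address = base                 # access uses the original base
        regs[rn] = base + offset       # base updated after the access
    else:
        raise ValueError(f"unknown addressing form: {form}")
    memory[address] = value
    return address


memory: dict = {}
regs = {"r0": 0x5, "r1": 0x200}
written = store_word(memory, regs, "r0", "r1", 12, "post")
```

    After the post-indexed store shown here, the value sits at `0x200` and `r1` has moved on to `0x20c`; the plain pre-indexed form would instead write to `0x20c` and leave `r1` alone.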

    LDM / STM operation

    Syntax:

        <LDM|STM>{<cond>}<addressing_mode> Rb{!}, <register list>

    4 addressing modes:
        LDMIA / STMIA    increment after
        LDMIB / STMIB    increment before
        LDMDA / STMDA    decrement after
        LDMDB / STMDB    decrement before

        LDMxx r10, {r0,r1,r4}
        STMxx r10, {r0,r1,r4}

    [Figure: positions of r0, r1 and r4 in memory relative to the base register (Rb) r10, with address increasing up the page, for each of IA, IB, DA and DB.]
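
    The address pattern of the four modes can be worked out with the sketch below. It assumes the usual rule that the lowest-numbered register goes to the lowest address; the function names and the use of plain integers for register numbers are choices made for this example.

```python
# Sketch: addresses touched by LDM/STM in each addressing mode.
# block_addresses and stm are hypothetical names.

def block_addresses(mode: str, base: int, count: int) -> tuple:
    """Return (addresses in ascending order, base value after writeback)."""
    size = 4 * count
    if mode == "IA":                   # increment after
        start, new_base = base, base + size
    elif mode == "IB":                 # increment before
        start, new_base = base + 4, base + size
    elif mode == "DA":                 # decrement after
        start, new_base = base - size + 4, base - size
    elif mode == "DB":                 # decrement before
        start, new_base = base - size, base - size
    else:
        raise ValueError(f"unknown addressing mode: {mode}")
    return [start + 4 * i for i in range(count)], new_base


def stm(mode: str, memory: dict, regs: dict, rb: int,
        reg_list: list, writeback: bool = False) -> None:
    """Store the listed registers; lowest register number at lowest address."""
    ordered = sorted(reg_list)
    addresses, new_base = block_addresses(mode, regs[rb], len(ordered))
    for reg, address in zip(ordered, addresses):
        memory[address] = regs[reg]
    if writeback:                      # the '!' suffix on the base register
        regs[rb] = new_base


memory: dict = {}
regs = {0: 0xA, 1: 0xB, 4: 0xC, 10: 0x1000}
stm("IA", memory, regs, 10, [0, 1, 4], writeback=True)
```

    This mirrors `STMIA r10!, {r0,r1,r4}` with r10 starting at `0x1000`: the three words land at `0x1000`, `0x1004` and `0x1008`, and r10 ends at `0x100C`.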

    Software Interrupt (SWI)

    Causes an exception trap to the SWI hardware vector

    The SWI handler can examine the SWI number to decide what operation has been requested.

    By using the SWI mechanism, an operating system can implement a set of privileged operations which applications running in user mode can request.

    Syntax:
        SWI{<cond>} <SWI number>

        Bits     31:28   27:24     23:0
        Field    Cond    1 1 1 1   SWI number (ignored by processor)
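
    Since the processor ignores the SWI number, the handler has to recover it from the instruction word itself. The sketch below shows that extraction and a simple dispatch; the service table and its entries are invented purely for illustration.

```python
# Sketch: extract the SWI number from an ARM SWI instruction and dispatch.
# decode_swi, dispatch_swi and SERVICES are hypothetical.

def decode_swi(instr: int) -> tuple:
    """Return (cond, swi_number) for a 32-bit ARM SWI instruction word."""
    if ((instr >> 24) & 0xF) != 0xF:
        raise ValueError("not a SWI instruction")
    cond = (instr >> 28) & 0xF
    number = instr & 0x00FFFFFF        # bits 23:0, ignored by the processor
    return cond, number


# Made-up operating system services keyed by SWI number.
SERVICES = {
    0x00: lambda: "write character",
    0x11: lambda: "exit",
}


def dispatch_swi(instr: int) -> str:
    _, number = decode_swi(instr)
    handler = SERVICES.get(number)
    if handler is None:
        return f"unknown SWI 0x{number:X}"
    return handler()


result = dispatch_swi(0xEF000011)      # SWI 0x11 with the AL condition
```

    In a real handler the instruction word would be read from memory at the return address minus 4 (in ARM state); that step is outside this sketch.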

    PSR Transfer Instructions

    MRS and MSR allow contents of CPSR / SPSR to be transferred to / from a general purpose register.

    Syntax:
        MRS{<cond>} Rd,<psr>               ; Rd = <psr>
        MSR{<cond>} <psr[_fields]>,Rm      ; <psr[_fields]> = Rm

    where
        <psr> = CPSR or SPSR
        [_fields] = any combination of 'fsxc'

    Also an immediate form
        MSR{<cond>} <psr_fields>,#Immediate

    In User Mode, all bits can be read but only the condition flags (_f) can be written.

    PSR fields: f = bits 31:24 (N Z C V Q, J), s = bits 23:16, x = bits 15:8, c = bits 7:0 (I F T mode)
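
    The effect of the field suffixes can be sketched as a masked write. The function name and the `privileged` parameter are assumptions made for this illustration; the four byte masks follow the f/s/x/c layout given above.

```python
# Sketch: how the _fields suffix of MSR limits which PSR bits are written.
# msr_write and its parameters are hypothetical.
FIELD_MASKS = {
    "f": 0xFF000000,   # flags field, bits 31:24
    "s": 0x00FF0000,   # status field, bits 23:16
    "x": 0x0000FF00,   # extension field, bits 15:8
    "c": 0x000000FF,   # control field, bits 7:0
}


def msr_write(psr: int, value: int, fields: str,
              privileged: bool = True) -> int:
    """Return the new PSR after writing `value` through the selected fields."""
    mask = 0
    for name in fields:
        mask |= FIELD_MASKS[name]
    if not privileged:
        mask &= FIELD_MASKS["f"]       # User mode may only write the flags
    return (psr & ~mask & 0xFFFFFFFF) | (value & mask)


# Change only the control byte (mode and interrupt masks), keep the flags.
new_cpsr = msr_write(0x600000D3, 0x0000001F, "c")
```

    Here `new_cpsr` keeps the Z and C flags from the old value while the control byte becomes `0x1F`; the same write with `privileged=False` would leave the register unchanged.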

    ARM Branches and Subroutines

    B <label>

    PC relative. ±32 Mbyte range.

    BL <subroutine>

    Stores return address in LR

    Returning implemented by restoring the PC from LR

    For non-leaf functions, LR will have to be stacked

    Caller:
        :
        BL func1
        :

    func1 (non-leaf):
        STMFD sp!,{regs,lr}
        :
        BL func2
        :
        LDMFD sp!,{regs,pc}

    func2 (leaf):
        :
        MOV pc, lr
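The 32 Mbyte range follows from the encoding: an ARM-state branch holds a signed 24-bit word offset, which is shifted left by two bits and added to the PC (which reads as the branch address plus 8). A minimal sketch of that calculation, with an assumed function name and example addresses:

```python
def branch_target(instruction_address, offset24):
    """Target of an ARM-state B/BL given its 24-bit offset field."""
    if offset24 & 0x800000:          # sign-extend the 24-bit field
        offset24 -= 1 << 24
    return instruction_address + 8 + (offset24 << 2)


print(hex(branch_target(0x8000, 0x000000)))  # offset 0: two instructions ahead
print(hex(branch_target(0x8000, 0xFFFFFE)))  # offset -2 words: branch to itself
print(((0x7FFFFF << 2) + 4) // (1024 * 1024), "Mbyte forward limit")
```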


    Thumb

    Thumb is a 16-bit instruction set

    Optimised for code density from C code (~65% of ARM code size)

    Improved performance from narrow memory

    Subset of the functionality of the ARM instruction set

    Core has additional execution state - Thumb

    Switch between ARM and Thumb using BX instruction

    [Figure: the 32-bit ARM instruction ADDS r2,r2,#1 (bits 31-0) shown against the 16-bit Thumb instruction ADD r2,#1 (bits 15-0)]

    For most instructions generated by compiler:

    Conditional execution is not used

    Source and destination registers identical

    Only Low registers used

    Constants are of limited size

    Inline barrel shifter not used


    Agenda

    Introduction

    Programmers Model

    Instruction Sets

    System Design

    Development Tools


    Example ARM-based System

    [Figure: example system with an ARM core connected to 16-bit RAM, 8-bit ROM, 32-bit RAM and I/O peripherals, plus an interrupt controller driving the nFIQ and nIRQ inputs]


    ARM Based microcontroller


    A Basic ARM MEMORY SYSTEM.


    AMBA

    [Figure: AMBA block diagram with a system bus (AHB or ASB) and a peripheral bus (APB) joined by a bridge; blocks shown: ARM, on-chip RAM, TIC, arbiter, decoder, external bus interface with external ROM and external RAM, timer, interrupt controller, remap/pause, reset]

    AMBA: Advanced Microcontroller Bus Architecture

    ADK: Complete AMBA Design Kit

    ACT: AMBA Compliance Testbench

    PrimeCell: ARM's AMBA compliant peripherals


    System Design-Hardware

    An embedded system includes the following hardware components:

    ARM processors are found embedded in chips.

    Programmers access peripherals through memory-mapped registers.

    There is a special type of peripheral called a controller, which embedded systems use to configure higher-level functions such as memory and interrupts.

    The AMBA on-chip bus is used to connect the processor and peripherals together.


    System Design-Software

    An embedded system also includes the following software components:

    Initialization code configures the hardware to a known state.

    Once configured, operating systems can be loaded and executed.

    Operating systems provide a common programming environment for the use of hardware resources and infrastructure.

    Device drivers provide a standard interface to peripherals.

    An application program performs the task-specific duties of an embedded system.


    Agenda

    Introduction

    Programmers Model

    Instruction Sets

    System Design

    Development Tools


    The RealView Product Families

    Compilation Tools

    ARM Developer Suite (ADS): Compilers (C/C++ ARM & Thumb), Linker & Utilities

    RealView Compilation Tools (RVCT)

    Debug Tools

    AXD (part of ADS)

    Trace Debug Tools

    Multi-ICE

    Multi-Trace

    RealView Debugger (RVD)

    RealView ICE (RVI)

    RealView Trace (RVT)

    Platforms

    ARMulator (part of ADS)

    Integrator Family

    RealView ARMulator ISS (RVISS)


    ARM Debug Architecture

    [Figure: a debugger (+ optional trace tools) connected over Ethernet to the JTAG port and trace port of a target containing the ARM core, EmbeddedICE logic, ETM and TAP controller]

    EmbeddedICE Logic

    Provides breakpoints and processor/system access

    JTAG interface (ICE)

    Converts debugger commands to JTAG signals

    Embedded Trace Macrocell (ETM)

    Compresses real-time instruction and data access trace

    Contains ICE features (trigger & filter logic)

    Trace port analyzer (TPA)

    Captures trace in a deep buffer


    Thumb instruction set

    The Thumb instruction set encodes a subset of the 32-bit ARM instructions into a 16-bit instruction set space, so it has higher code density: about 30% less memory.

    Thumb has higher performance than ARM on a processor with a 16-bit data bus, but lower performance than ARM on a 32-bit data bus, so use Thumb for memory-constrained systems.


    Thumb Instruction set



    Thumb register usage


    Code Density


    Thumb instruction decoding


    Thumb Instructions limitations

    Only the branch relative instruction can be conditionally executed.

    The limited space available in 16 bits causes the barrel shift operations ASR, LSL, LSR, and ROR to be separate instructions in the Thumb ISA.

    There is no direct access to the CPSR or SPSR, so there are no MSR- and MRS-equivalent Thumb instructions.

    To alter the CPSR or SPSR, you must switch into ARM state to use MSR and MRS.

    Similarly, there are no coprocessor instructions in Thumb state.

    You need to be in ARM state to access the coprocessor for configuring cache and memory management.


    ARM Thumb interworking

    Interworking is the method of linking ARM and Thumb code together for both assembly and C/C++.

    It handles the transition between the two states. Extra code, called a veneer, is sometimes needed to carry out the transition.

    ATPCS defines the ARM and Thumb procedure call standards.

    To call a Thumb routine from an ARM routine, the core has to change state. The current state is held in the T bit of the CPSR.

    The BX and BLX branch instructions cause a switch between ARM and Thumb state while branching to a routine.

    MIX Mode Programming: branch instructions

    There are two versions of the BX and BLX instructions: an ARM instruction and a Thumb equivalent.

    The ARM BX instruction enters Thumb state only if bit 0 of the address in Rm is set to binary 1; otherwise it enters ARM state.

    The Thumb BX instruction does the same.

    Syntax: BX Rm

    BLX Rm | label

    Unlike the ARM version, the Thumb BX instruction cannot be conditionally executed.

    The conditional branch instruction is the only conditionally executed instruction in Thumb state.

    B: branch

    BL: branch with link, lr = (instruction address after the BL) + 1
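The role of bit 0 can be shown with a small Python model: BX uses bit 0 of the register to pick the state and clears it to form the branch address, and a Thumb BL sets bit 0 in lr so that a later BX lr returns to Thumb state. The function names and addresses are assumptions for illustration.

```python
def bx(rm):
    """Model BX Rm: return (state_entered, address_branched_to)."""
    state = "Thumb" if rm & 1 else "ARM"
    return state, rm & 0xFFFFFFFE   # bit 0 selects the state, it is not part of the address


def thumb_bl_link(address_after_bl):
    """Link value written by a Thumb BL: return address with bit 0 set."""
    return address_after_bl | 1


print(bx(0x8001))                        # enters Thumb state at 0x8000
print(bx(0x9000))                        # stays in / enters ARM state at 0x9000
print(bx(thumb_bl_link(0x8004)))         # BX lr after a Thumb BL returns to Thumb
```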


    MIX Mode Programming

    The Thumb data processing instructions are a subset of the ARM

    data processing instructions.

    Most Thumb data processing instructions operate on low registers and update the cpsr. The exceptions are

    MOV Rd,Rn
    ADD Rd,Rm
    CMP Rn,Rm
    ADD sp, #immediate
    SUB sp, #immediate
    ADD Rd,sp,#immediate
    ADD Rd,pc,#immediate

    which can operate on the higher registers r8 to r14 and the pc.

    These instructions, except for CMP, do not update the condition flags in the cpsr when using the higher registers.

    The CMP instruction, however, always updates the cpsr.

    Single Register Load Store Instructions

    The Thumb instruction set supports loading and storing registers with LDR and STR.

    These instructions use two pre-indexed addressing modes: offset by register and offset by immediate.

    Load/store register: [Rn, Rm]
    Base register + offset: [Rn, #immediate]
    Relative: [pc|sp, #immediate]

    The offset by register uses a base register Rn plus the register offset Rm.

    The offset by immediate uses the same base register Rn plus a 5-bit immediate scaled according to the data size.

    The 5-bit offset encoded in the instruction is multiplied by one for byte accesses, two for 16-bit accesses, and four for 32-bit accesses.
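The scaling rule for the 5-bit immediate can be written out directly. The sketch below is a Python model with assumed names (`SCALE`, `thumb_effective_address`) and an assumed base address; it shows why the reachable offset grows with the access size.

```python
SCALE = {"byte": 1, "halfword": 2, "word": 4}


def thumb_effective_address(base, imm5, size):
    """Address for [Rn, #immediate] given the encoded 5-bit field."""
    if not 0 <= imm5 <= 31:
        raise ValueError("the encoded immediate is only 5 bits wide")
    return base + imm5 * SCALE[size]


for size in ("byte", "halfword", "word"):
    reach = thumb_effective_address(0, 31, size)
    print(f"{size}: maximum offset {reach} bytes")
```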


    Multiple Register Load-Store Instructions

    The Thumb versions of the load-store multiple instructions are reduced forms of the ARM load-store multiple instructions.

    They only support the increment after (IA) addressing mode.

    Syntax: <LDM|STM>IA Rn!, {low register list}

    LDMIA load multiple registers

    {Rd}*N <- mem32[Rn + 4*N], Rn = Rn + 4*N

    Here N is the number of registers in the list of registers.

    These instructions always update the base register Rn after execution.

    The base register and list of registers are limited to the low registers r0 to r7.

    MIX Mode Programming: stack instructions

    The Thumb stack operations are different from the equivalent ARM instructions because they use the more traditional POP and PUSH concept.

    Syntax: POP {low_register_list{, pc}}

    PUSH {low_register_list{, lr}}

    POP pop registers from the stack: Rd*N <- mem32[sp + 4*N], sp = sp + 4*N


    MIX Mode Programming

    There is no stack pointer in the instruction because the stack pointer is fixed as register r13 in Thumb operations, and sp is automatically updated.

    The list of registers is limited to the low registers r0 to r7.

    The PUSH register list also can include the link register lr.

    Similarly, the POP register list can include the pc.

    This provides support for subroutine entry and exit.

    The stack instructions only support full descending stack operations.
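A full descending stack means sp points at the last item pushed and moves down as items are added. The Python model below is a sketch of that behaviour. The class name, the starting sp of 0x8000, and the register values are assumptions. Values are listed in ascending register order, so the lowest register lands at the lowest address.

```python
class ThumbStack:
    """Toy model of PUSH/POP on a full descending stack."""

    def __init__(self, sp):
        self.sp = sp
        self.memory = {}

    def push(self, values):
        """PUSH {list}: sp = sp - 4*N, then store in ascending address order."""
        self.sp -= 4 * len(values)
        for i, value in enumerate(values):
            self.memory[self.sp + 4 * i] = value

    def pop(self, count):
        """POP {list}: load from ascending addresses, then sp = sp + 4*N."""
        values = [self.memory[self.sp + 4 * i] for i in range(count)]
        self.sp += 4 * count
        return values


stack = ThumbStack(0x8000)
stack.push([1, 2, 0x1235])      # e.g. PUSH {r4, r5, lr} on subroutine entry
print(hex(stack.sp))
print(stack.pop(3))             # e.g. POP {r4, r5, pc} on subroutine exit
print(hex(stack.sp))
```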

    MIX Mode Programming: software interrupt

    Similar to the ARM equivalent, the Thumb software interrupt (SWI)

    instruction causes a software interrupt exception.

    If any interrupt or exception flag is raised in Thumb state, the

    processor automatically reverts back to ARM state to handle the

    exception.

    Syntax: SWI immediate

    The Thumb SWI instruction has the same effect and nearly the same

    syntax as the ARM equivalent.

    It differs in that the SWI number is limited to the range 0 to 255 and it

    is not conditionally executed.

    ARM7 Family


    One significant variation in the ARM7 family is the ARM7TDMI-S, which is synthesizable.

    The ARM720T includes an MMU, making it capable of handling the Linux and Microsoft embedded platform operating systems.

    The processor also includes a unified 8K cache.

    The vector table can be relocated to a higher address by setting a coprocessor 15 register.

    Another variation is the ARM7EJ-S processor, also synthesizable. It provides both Java acceleration and the enhanced instructions, but without any memory protection.

    The ARM7EJ-S is quite different since it includes a five-stage pipeline and executes ARMv5TEJ instructions.

    ARM PROCESSOR VARIANTS


    ARM7, ARM9, ARM11

    The frequency (MHz) and power consumption (watts) of ARM7, ARM9, ARM10 and ARM11 cores depend directly on the type and geometry of the manufacturing process.

    An ARM processor is an implementation of a specific instruction set

    architecture (ISA).

    The ISA has been continuously improved from the first ARM processor design.

    Processors are grouped into implementation families (ARM7, ARM9, ARM10,

    and ARM11) with similar characteristics.

    ARM9 Family


    The ARM9 family was announced in 1997. With its five-stage pipeline, it can run at higher clock frequencies than the ARM7.

    The extra stages improve the overall performance of the processor.

    The memory system has been redesigned to follow the Harvard architecture, which separates the data D and instruction I buses.

    The first processor in the ARM9 family was the ARM920T, which includes a separate D + I cache and an MMU for virtual memory support.

    The ARM922T is a variation on the ARM920T but with half the D + I cache size.


    The ARM940T includes a smaller D + I cache and an MPU designed for applications that do not require a platform operating system.

    Both ARM920T and ARM940T execute the architecture v4T instructions.

    The next processors are based on the ARM9E-S core, a synthesizable version of the ARM9 core with the E extensions.

    There are two variations: the ARM946E-S and the ARM966E-S. Both execute architecture v5TE instructions.

    They also support the optional embedded trace macrocell (ETM), which allows a developer to trace instruction and data execution in real time on the processor.

    This is important when debugging applications with time-critical segments.


    The ARM946E-S includes TCM, cache, and an MPU. The sizes of the TCM and caches are configurable.

    This processor is designed for use in embedded control applications that require deterministic real-time response.

    In contrast, the ARM966E-S does not have the MPU and cache extensions but does have configurable TCMs.

    The latest core in the ARM9 product line is the ARM926EJ-S synthesizable processor core, announced in 2000. It is the first ARM processor core to include the Jazelle technology, which accelerates Java byte-code execution.

    It is designed for use in small portable Java-enabled devices such as 3G phones and personal digital assistants (PDAs).

    It features an MMU, configurable TCMs, and D + I caches with zero or nonzero wait state memories.

    ARM10 Family


    The ARM10, announced in 1999, was designed for performance.

    It extends the ARM9 pipeline to six stages.

    It also supports an optional vector floating-point (VFP) unit, which adds a seventh stage to the ARM10 pipeline.

    The VFP significantly increases floating-point performance and is compliant with the IEEE 754-1985 floating-point standard.


    The ARM1020E is the first processor to use an ARM10E core. Like the ARM9E, it includes the enhanced E instructions.

    It has separate 32K D + I caches, an optional vector floating-point unit, and an MMU.

    The ARM1020E also has a dual 64-bit bus interface for increased performance.

    The ARM1026EJ-S is very similar to the ARM926EJ-S but with both MPU and MMU.

    This processor has the performance of the ARM10 with the flexibility of an ARM926EJ-S.

    ARM11 Family


    The ARM1136J-S, announced in 2003, was designed for high performance and power-efficient applications.

    It was the first processor implementation to execute architecture ARMv6 instructions.

    It incorporates an eight-stage pipeline with separate load-store and arithmetic pipelines.

    Included in the ARMv6 instructions are single instruction multiple data (SIMD) extensions for media processing, specifically designed to increase video processing performance.

    The ARM1136JF-S is an ARM1136J-S with the addition of the vector floating-point unit for fast floating-point operations.

    Enhanced DSP features


    Processing digitized signals requires high memory bandwidths and fast

    multiply accumulate operations.

    A single-core design can reduce cost and power consumption over a

    two-core solution.

    DSP applications are typically multiply and load-store intensive.

    A basic operation is a multiply accumulate: multiplying two 16-bit signed numbers and accumulating onto a 32-bit signed accumulator.

    The ARMv5TE extensions available in the ARM9E and later cores provide efficient multiply accumulate operations.

    With careful coding, the ARM9E processor will perform decently on the DSP parts of an application while outperforming a DSP on the control parts of the application.

    Generations suitable for DSP applications

    DSP Algorithms Characteristics


    Due to their high data bandwidth and performance requirements,

    we have to code DSP algorithms in hand-written assembly.

    We need fine control of register allocation and instruction

    scheduling to achieve the best performance.

    Filtering is probably the most commonly used signal processing

    operation. It can be used to remove noise, to analyze signals, or in

    signal compression.

    Another very common algorithm is the Discrete Fourier Transform

    (DFT), which converts a signal from a time representation to a

    frequency representation or vice versa.

    How to represent a signal on the ARM

    Use a floating-point representation for prototyping algorithms. Do not use floating point in applications where speed is critical. Most ARM implementations do not include hardware floating-point support.

    Use a fixed-point representation for DSP applications where speed is critical with moderate dynamic range. The ARM cores provide good support for 8-, 16- and 32-bit fixed-point DSP.

    For applications requiring speed and high dynamic range, use a block-floating or logarithmic representation.

    The key idea is to use block algorithms that calculate several results at once, and thus require less memory bandwidth, increase performance and decrease power consumption compared with calculating single results.


    Figure shows a sine wave signal digitized at the sampling points 0, 1, 2,

    3, and so on.

    Dynamic Range & Accuracy

    There are two things to worry about when choosing a representation of x[t]:

    1. The dynamic range of the signal: the maximum fluctuation in the signal, defined by Equation (A).

    For a signed signal we are interested in the maximum absolute value M possible. For this example, let's take M = 1 volt.

    $M = \max |x[t]|$ over all $t = 0, 1, 2, 3, \ldots$ (A)

    2. The accuracy required in the representation, sometimes given as a proportion of the maximum range.

    For example, an accuracy of 100 parts per million means that each x[t] needs to be represented within an error of

    $E = M \times 0.0001 = 0.0001$ volts
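The pair (M, E) translates directly into a word-length requirement. As a rough sketch, the number of fractional bits k needed so that one quantization step M × 2^-k does not exceed E can be computed as below. The function name is an assumption, and the bound is conservative because it treats the error as a full step, as with truncation.

```python
import math


def fractional_bits_needed(max_abs, max_error):
    """Smallest k with max_abs * 2**-k <= max_error."""
    return math.ceil(math.log2(max_abs / max_error))


M = 1.0      # volts, maximum absolute value from Equation (A)
E = 0.0001   # volts, 100 parts per million of M
k = fractional_bits_needed(M, E)
print(k, "fractional bits, step size", M * 2.0 ** -k)
```

With a sign bit added, this example fits comfortably in a 16-bit signed value.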

    Suitable Representation


    We could use a floating-point representation for x[t ].

    1) This would certainly meet our dynamic range and accuracy constraints, and

    2) it would also be easy to manipulate using the C type float.

    However, most ARM cores do not support floating point in

    hardware, and so a floating-point representation would be very slow.

    A better choice for fast code is a fixed-point representation.

    A fixed-point representation uses an integer to represent a fractional

    value by scaling the fraction.
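Scaling a fraction into an integer can be illustrated with a 16-bit format that has 15 fractional bits (often called Q15). The choice of Q15, the helper names, and the saturation on conversion are assumptions made for this sketch; the slides do not fix a particular format.

```python
Q = 15  # number of fractional bits (assumed Q15 format)


def to_fixed(x):
    """Scale a value in [-1, 1) to a 16-bit integer, clamping at the limits."""
    return max(-32768, min(32767, round(x * (1 << Q))))


def to_float(n):
    """Undo the scaling to recover the represented fraction."""
    return n / (1 << Q)


def fixed_mul(a, b):
    """Multiply two Q15 values; the 2*Q-bit product is shifted back to Q15."""
    return (a * b) >> Q


half = to_fixed(0.5)
quarter = to_fixed(0.25)
print(half, quarter, fixed_mul(half, quarter), to_float(fixed_mul(half, quarter)))
```

Only integer multiplies and shifts are used, which is what makes the representation fast on cores without floating-point hardware.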

    Error Vs Accuracy


    A common error is to think that floating point is more accurate than fixed point. This is false!

    For the same number of bits, a fixed-point representation gives greater accuracy. The floating-point representation gives higher dynamic range at the expense of lower absolute accuracy.

    For example, if you use a 32-bit integer to hold a fixed-point value scaled to full range, then the maximum error in a representation is $2^{-32}$. However, single-precision 32-bit floating-point values give a relative error of $2^{-24}$.

    The single-precision floating-point mantissa is 24 bits. The leading 1 of the mantissa is not stored, so 23 bits of storage are actually used. For values near the maximum, the fixed-point representation is $2^{32-24} = 256$ times more accurate!

    The 8-bit floating-point exponent is of little use when you are interested in maximum error rather than relative accuracy.
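The factor of 256 can be checked numerically by comparing the gap between adjacent representable values near full scale. The sketch below uses Python's standard `struct` module to step to the next single-precision value; the helper name and the probe value 0.75 are assumptions.

```python
import struct


def float32_spacing(x):
    """Gap between x (rounded to single precision) and the next float32 above it."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    here = struct.unpack("<f", struct.pack("<I", bits))[0]
    above = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return above - here


FIXED_SPACING = 2.0 ** -32            # 32-bit fixed point scaled to a full range of 1
ratio = float32_spacing(0.75) / FIXED_SPACING
print(float32_spacing(0.75), FIXED_SPACING, ratio)
```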

    Better representation


    To summarize, a fixed-point representation is best when there is a clear bound to the strength of the signal and when maximum error is important.

    When there is no clear bound and you require a large dynamic range, then floating point is better.

    You can also use the other representations, which give more dynamic range than fixed point while still being more efficient to implement than floating point.

    General rules on writing DSP algorithms for the ARM

    ARM does not provide operations that saturate automatically. Design the DSP algorithm so that saturation is not required because saturation will cost extra cycles.

    ARM supports extended-precision 32-bit multiplied by 32-bit to 64-bit operations very well. Use extended-precision arithmetic or additional scaling rather than saturation.

    The ARM core is not a dedicated DSP. There is no single instruction that issues a multiply accumulate and data fetch in parallel. However, by reusing loaded data you can achieve a respectable DSP performance.

    Design the DSP algorithm to minimize loads and stores. Once you load a data item, then perform as many operations that use the datum as possible. You can often do this by calculating several output results at once.

    Another way of increasing reuse is to concatenate several operations. For example, you could perform a dot product and signal scale at the same time, while only loading the data once.
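The last rule can be sketched in Python as one loop that reads each sample once and uses it for both operations. This only illustrates the data-reuse idea, not tuned ARM code; the function name and the integer gain are assumptions.

```python
def dot_and_scale(x, c, gain):
    """Compute sum(x[i] * c[i]) and the scaled signal in a single pass."""
    acc = 0
    scaled = []
    for xi, ci in zip(x, c):      # xi is "loaded" once per iteration
        acc += xi * ci            # use 1: dot product
        scaled.append(xi * gain)  # use 2: signal scale
    return acc, scaled


print(dot_and_scale([1, 2, 3], [4, 5, 6], 2))
```

Running the two operations as separate loops would give the same results but read every element of x twice.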

    Guidelines for writing DSP code.


    From ARM9 onwards, ARM implementations use a multistage execute pipeline for loads and multiplies, which introduces potential processor interlocks.

    If you load a value and then use it in either of the following two instructions, the processor may stall for a number of cycles waiting for the loaded value to arrive.

    Similarly, if you use the result of a multiply in the following instruction, this may cause stall cycles. It is particularly important to schedule code to avoid these stalls.

    Write ARM assembly to avoid processor interlocks. The results of load and multiply instructions are often not available to the next instruction without adding stall cycles. Sometimes the results will not be available for several cycles.

    There are 14 registers available for general use on the ARM, r0 to r12 and r14. Design the DSP algorithm so that the inner loop will require 14 registers or fewer.

    An example- a DOT Product


    A dot-product is one of the simplest DSP operations and highlights the

    difference among different ARM implementations.

    A dot-product combines N samples from two signals x(t) and c(t) to produce a correlation value a:

    $a = \sum_{i=0}^{N-1} c_i x_i$

    The C interface to the dot-product function is

    int dot_product(sample *x, coefficient *c, unsigned int N);

    where

    sample is the type to hold a 16-bit audio sample, usually a short

    coefficient is the type to hold a 16-bit coefficient, usually a short

    x[i] and c[i] are two arrays of length N (the data and coefficients)

    the function returns the accumulated 32-bit integer dot product a
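A reference model is useful for checking optimized assembly against. The Python sketch below mirrors the interface described above, with the array length taken from the lists rather than passed as N. The helper `wrap32` is an assumption that imitates a 32-bit signed accumulator wrapping on overflow, which is how a plain 32-bit MLA behaves.

```python
def wrap32(value):
    """Reduce an integer to the signed 32-bit range, wrapping on overflow."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def dot_product(x, c):
    """Accumulate x[i] * c[i] into a 32-bit signed accumulator."""
    if len(x) != len(c):
        raise ValueError("x and c must have the same length")
    acc = 0
    for xi, ci in zip(x, c):
        acc = wrap32(acc + xi * ci)
    return acc


print(dot_product([1, -2, 3], [4, 5, -6]))
print(dot_product([32767] * 4, [32767] * 4))  # large inputs overflow 32 bits and wrap
```

The second call shows why the scaling of samples and coefficients has to be chosen so that the accumulator cannot overflow.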

    DSP rating of ARM7TDMI


    This example shows a 16-bit dot-product optimized for the ARM7TDMI. Each MLA takes a worst case of four cycles.

    We store the 16-bit input samples in 32-bit words so that we can use the LDM instruction to load them efficiently.

    This code assumes that the number of samples N is a multiple of five. Therefore we can use a five-word load multiple to increase data bandwidth.

    The cost per load is 7/5 = 1.4 cycles compared to 3 cycles per load if we had used LDR or LDRSH.

    The inner loop requires a worst case of 7 + 7 + 5 × 4 + 1 + 3 = 38 cycles to process each block of 5 products from the sum.

    This gives the ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tap for a 16-bit dot-product.
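The cycle budget can be reproduced with a few lines of arithmetic. The per-instruction costs below are the worst-case figures quoted above (7 cycles per five-word LDM, 4 per MLA, 1 for SUBS, 3 for the taken branch); the function and parameter names are assumptions.

```python
def cycles_per_tap(taps=5, ldm_cycles=7, mla_cycles=4, subs_cycles=1, branch_cycles=3):
    """Return (cycles per loop iteration, cycles per tap) for the inner loop."""
    loop = 2 * ldm_cycles + taps * mla_cycles + subs_cycles + branch_cycles
    return loop, loop / taps


loop_cycles, rating = cycles_per_tap()
print(loop_cycles, "cycles per iteration,", rating, "cycles per tap")
print(7 / 5, "cycles per loaded word with LDM, versus 3 with LDR or LDRSH")
```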

    Assembly code for DOT Product


    x    RN 0    ; input array x[]
    c    RN 1    ; input array c[]
    N    RN 2    ; number of samples (a multiple of 5)
    acc  RN 3    ; accumulator
    x_0  RN 4    ; elements from array x[]
    x_1  RN 5
    x_2  RN 6
    x_3  RN 7
    x_4  RN 8
    c_0  RN 9    ; elements from array c[]
    c_1  RN 10
    c_2  RN 11
    c_3  RN 12
    c_4  RN 14

            ; int dot_16by16_arm7m(int *x, int *c, unsigned N)
    dot_16by16_arm7m
            STMFD   sp!, {r4-r11, lr}       ; save work registers and return address
            MOV     acc, #0
    loop_7m                                 ; accumulate 5 products
            LDMIA   x!, {x_0, x_1, x_2, x_3, x_4}
            LDMIA   c!, {c_0, c_1, c_2, c_3, c_4}
            MLA     acc, x_0, c_0, acc
            MLA     acc, x_1, c_1, acc
            MLA     acc, x_2, c_2, acc
            MLA     acc, x_3, c_3, acc
            MLA     acc, x_4, c_4, acc
            SUBS    N, N, #5
            BGT     loop_7m
            MOV     r0, acc                 ; return the accumulated sum
            LDMFD   sp!, {r4-r11, pc}       ; restore registers and return

    DSP Rating -16bit DOT Product


    ARM7TDMI: the inner loop requires 38 cycles to process 5 taps, a DSP rating of 38/5 = 7.6 cycles per tap.

    ARM9TDMI: the inner loop requires 28 cycles to process 4 taps, giving 28/4 = 7 cycles per tap.

    StrongARM: the inner loop uses 19 cycles to process 4 taps, giving a rating of 19/4 = 4.75 cycles per tap.

    ARM9E: the inner loop requires 20 cycles to accumulate 8 products, a rating of 20/8 = 2.5 cycles per tap.

    ARM10E: the inner loop requires 25 cycles to process 10 samples, or 2.5 cycles per tap.

    Intel XScale: the inner loop requires 14 cycles to accumulate 8 products, a rating of 1.75 cycles per tap.

    Performance improvement in DSP


    The block filter algorithm gives a much better performance per tap if you are calculating multiple products.


    THANK YOU