ARM-final

5/21/2018 ARM-final

1/103

1TM 139v10 The ARM Architecture

ARM PROCESSOR,

ESRTP,

SEM-8,B.E.(ELECTRONICS)

WITH BEST OF LUCK FROM:

Prof. Vidya Gogate

SAKEC, Chembur

5/21/2018 ARM-final

2/103

2TMT H E A R C H I T E C T U R E F O R T H E D I G I T A L W O R L D

The ARM Architecture

5/21/2018 ARM-final

3/103


Agenda

Introduction to ARM Ltd

Programmers Model

Instruction Set

System Design

Development Tools

5/21/2018 ARM-final

4/103


ARM Ltd

Founded in November 1990

Spun out of Acorn Computers

Designs the ARM range of RISC processor

cores

Licenses ARM core designs to semiconductor

partners who fabricate and sell to their

customers.

ARM does not fabricate silicon itself

Also develop technologies to assist with thedesign-in of the ARM architecture

Software tools, boards, debug hardware,

application software, bus architectures,

peripherals etc

5/21/2018 ARM-final

5/1035TM 539v10 The ARM Architecture

ARM Partnership Model

5/21/2018 ARM-final


ARM Powered Products

5/21/2018 ARM-final


Latest NEWS

For 30 years, Intel basically had the market to itself and as a result, its

chips were priced much higher than ARM chips.

But ARMs entry into Intels territory is a real threat to Intels dominance.

In fact, Microsofts first version of the Surface tablet/keyboard device

uses ARM chips. The Intel-based versions will come out about three

month later.
http://techland.time.com/2012/06/18/surface-microsoft-made-windows-8-tablets-coming-this-year/http://techland.time.com/2012/06/18/surface-microsoft-made-windows-8-tablets-coming-this-year/

5/21/2018 ARM-final


Latest NEWS

There has been an important new development in the processor world

lately. The folks behind the ARM processor the chip that powers most

smart phones and tablets today decided to scale up this processor

technology to run at speeds that could be used in advanced tablets andmore importantly, laptops and even desktops.

And ARM processors got a major boost when Microsoft made the

decision to create an ARM-based version of Windows 8. For the first time,

Microsoft broke away from whats been called the Win-Tel monopoly.

5/21/2018 ARM-final


Latest NEWS

But Microsofts new operating system for ARM, called Windows RT, has

touched off a new battlefront in processor wars.

This has forced Intel to try and make its x86 processors more energyefficient;

the company hopes to have chips that are on par with ARM by mid-2013.

Read more: http://techland.time.com/2012/07/16/arm-vs-intel-how-the-

processor-wars-will-benefit-consumers-most/#ixzz23XHpDxkn
http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/

5/21/2018 ARM-final


RISC Design Philosophy

InstructionsRISC processors have a reduced number of instruction

classes which provide simple operations that can each execute in a

single cycle.

In contrast, in CISC processors the instructions are often of variable

size and take many cycles to execute.

PipelinesThe processing of instructions is broken down into smallerunits that can be executed in parallel by pipelines.

There is no need for an instruction to be executed by a mini-program

called microcode as on CISC processors.

RegistersRISC machines have a large general-purpose register set.Any register can contain either data or an address. Registers act as the

fast local memory store for all data processing operations.

In contrast, CISC processors have dedicated registers for specific

purposes.

5/21/2018 ARM-final



Load-store architectureThe processor operates on

data held in registers. Separate load and store instructions

transfer data between the register bank and external

memory.

Memory accesses are costly, so separating memory

accesses from data processing provides an advantage

because you can use data items held in the register bank

multiple times without needing multiple memory accesses.

In contrast, with a CISC design the data processing

operations can act on memory directly.

5/21/2018 ARM-final



These design rules allow a RISC processor to be simpler,

and thus the core can operate at higher clock

frequencies.

In contrast, traditional CISC processors are more complexand operate at lower clock frequencies.

5/21/2018 ARM-final


ARM Design Philosophy

Portable embedded systems require some form of battery power. The

ARM processor has been specifically designed to be small to reduce

power consumption and extend battery operationessential for

applications such as PDAs.

Since embedded systems have limited memory due to cost and/orphysical size restrictions; High code density is useful feature of ARM

for applications that have limited on-board memory. The ability to use

low-cost memory devices produces substantial savings.

For a single-chip solution, the smaller the area used by the embedded

processor,(reduced die size) the more available space for specializedperipherals. This in turn reduces the cost of the design and

manufacturing since fewer discrete chips are required for the end

product.

5/21/2018 ARM-final


ARM Design Philosophy

ARM has incorporated hardware debug technology

within the processor so that software engineers can view

what is happening while the processor is executing code.

With greater visibility, software engineers can resolve

issues faster, which has a direct effect on the time to

market and reduces overall development costs.

5/21/2018 ARM-final


Instruction set FEATURES

Variable cycle execution for certain instructions-

load-store-multiple instructions vary in the number of execution cycles

depending upon the number of registers being transferred.

The transfer can occur on sequential memory addresses, which

increases performance since sequential memory accesses are often

faster than random accesses.

Code density is also improved since multiple register transfers are

common operations at the start and end of functions.

Inline barrel shifter leading to more complex instructions

The inline barrel shifter is a hardware component that preprocesses

one of the input registers before it is used by an instruction.

This expands the capability of many instructions to improve core

performance and code density.

5/21/2018 ARM-final


Instruction set FEATURES

Thumb 16-bit instruction setThe 16-bit instructions improvecode density by about 30% over 32-bit fixed-length instructions.

Conditional execution- An instruction is only executed when a

specific condition has been satisfied. This feature improvesperformance and code density by reducing branch instructions.

Enhanced instructions - The enhanced digital signal processor

(DSP) instructions were added to the standard ARM instruction set tosupport fast 1616-bit multiplier operations and saturation. These

instructions allow a faster-performing ARM processor in some cases to

replace the traditional combinations of a processor plus a DSP.

5/21/2018 ARM-final


Nomenclature.

Instruction set architecture (ISA) is upward compatible.

ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}

xfamily

ymemory management/protection unit

zcache

TThumb 16-bit decoder

DJTAG debug

Mfast multiplier

IEmbedded ICE macro-cell

Eenhanced instructions (assumes TDMI)

JJazelle

Fvector floating-point unit

Ssynthesizible version

5/21/2018 ARM-final


Nomenclature

All ARM cores after the ARM7TDMI include the TDMIfeatures even though they may not include those letters

after the ARM label.

The processor family is a group of processor

implementations that share the same hardware

characteristics.

For example, the ARM7TDMI, ARM740T, and ARM720T all

share the same family characteristics and belong to the

ARM7 family.

5/21/2018 ARM-final


Nomenclature

JTAG is described by IEEE 1149.1 Standard Test Access Portand boundary scan architecture.

It is a serial protocol used by ARM to send and receive

debug information between the processor core and test

equipment.

Embedded ICE macro-cell is the debug hardware built into

the processor that allows breakpoints and watch-points to

be set.

Synthesizable means that the processor core is supplied as

source code that can be compiled into a form easily used

by EDA tools.

5/21/2018 ARM-final


Syllabus

ARM processor fundamentalsintroduction to ARM and THUMBinstruction set--

processor and memory organizationCPU Bus configuration

ARM BusMemory devicesInput/output devices

Component interfacingdesigning with microprocessordevelopment and debugging

Design Example

Instruction set with enhanced DSP features with ARM core, mix

mode programming as Thumb + ARM core,

Assembly programming concept, compare with ARM7, ARM9,

ARM11 with new features additions

5/21/2018 ARM-final


PIN DIAGRAM

5/21/2018 ARM-final


Architecture block diagram

5/21/2018 ARM-final


Hardware Fundamentals

The ARM processor can be abstracted

into eight components

ALU, barrel shifter, MAC, register file,

instruction decoder, address register,

incrementer, and sign extend.

5/21/2018 ARM-final

24/103


Data Sizes and Instruction Sets

The ARM is a 32-bit architecture.

When used in relation to the ARM:

Bytemeans 8 bits

Halfwordmeans 16 bits (two bytes) Wordmeans 32 bits (four bytes)

Most ARMs implement two instruction sets

32-bit ARM Instruction Set

16-bit Thumb Instruction Set

Jazelle cores can also execute Java bytecode

5/21/2018 ARM-final

25/103


Processor Modes

The ARM has seven basic operating modes:

User: unprivileged mode under which most tasks run

FIQ: entered when a high priority (fast) interrupt is raised

IRQ: entered when a low priority (normal) interrupt is raised

Supervisor: entered on reset and when a Software Interrupt

instruction is executed

Abort: used to handle memory access violations

Undef: used to handle undefined instructions

System: privileged mode using the same registers as user mode

5/21/2018 ARM-final

26/103


r0

r1

r2

r3

r4

r5

r6

r7

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

r15 (pc)

cpsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr

FIQ IRQ SVC Undef Abort

User Moder0

r1

r2

r3

r4

r5

r6

r7

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

r15 (pc)

cpsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr

Current Visible Registers

Banked out Registers

FIQ IRQ SVC Undef Abort

r0

r1

r2

r3

r4

r5

r6

r7

r15 (pc)

cpsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr



User IRQ SVC Undef Abort

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

FIQ ModeIRQ Moder0

r1

r2

r3

r4

r5

r6

r7

r8

r9

r10

r11

r12

r15 (pc)

cpsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr



User FIQ SVC Undef Abort

r13 (sp)

r14 (lr)

Undef Moder0

r1

r2

r3

r4

r5

r6

r7

r8

r9

r10

r11

r12

r15 (pc)

cpsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr



User FIQ IRQ SVC Abort

r13 (sp)

r14 (lr)

SVC Moder0

r1

r2

r3

r4

r5

r6

r7

r8

r9

r10

r11

r12

r15 (pc)

cpsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr



User FIQ IRQ Undef Abort

r13 (sp)

r14 (lr)

Abort Moder0

r1

r2

r3

r4

r5

r6

r7

r8

r9

r10

r11

r12

r15 (pc)

cpsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r13 (sp)

r14 (lr)

spsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr



User FIQ IRQ SVC Undef

r13 (sp)

r14 (lr)

The ARM Register Set

5/21/2018 ARM-final

27/103


Register Organization Summary

Usermoder0-r7,r15,andcpsr

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)

spsr

FIQ

r8

r9

r10

r11

r12

r13 (sp)

r14 (lr)r15 (pc)

cpsr

r0r1

r2

r3

r4

r5

r6

r7

User

r13 (sp)

r14 (lr)

spsr

IRQ

Usermoder0-r12,r15,

andcpsr

r13 (sp)

r14 (lr)

spsr

Undef

Usermoder0-r12,r15,

andcpsr

r13 (sp)

r14 (lr)

spsr

SVC

Usermoder0-r12,r15,

andcpsr

r13 (sp)

r14 (lr)

spsr

Abort

Usermoder0-r12,r15,

andcpsr

Thumb state

Low registers

Thumb state

High registers

Note: System mode uses the User mode register set

5/21/2018 ARM-final

28/103


The Registers

ARM has 37 registers all of which are 32-bits long. 1 dedicated program counter

1 dedicated current program status register

5 dedicated saved program status registers

30 general purpose registers

The current processor mode governs which of several banks is

accessible. Each mode can access

a particular set of r0-r12registers

a particular r13(the stack pointer, sp) and r14(the link register,lr)

the program counter,r15 (pc)

the current program status register, cpsr

Privileged modes (except System) can also access

a particular spsr(saved program status register)

5/21/2018 ARM-final

29/103


Program Status Registers

Condition code flags

N =Negative result from ALU

Z = Zero result from ALU

C = ALU operation Carried out

V = ALU operation oVerflowed

Sticky Overflow flag - Q flag

Architecture 5TE/J only

Indicates if saturation has occurred

J bit

Architecture 5TEJ only

J = 1: Processor in Jazelle state

Interrupt Disable bits.

I = 1: Disables the IRQ.

F = 1: Disables the FIQ.

T Bit

Architecture xT only

T = 0: Processor in ARM state

T = 1: Processor in Thumb state

Mode bits Specify the processor mode

2731

N Z C V Q28 67

I F T mode1623 815 5 4 024

f s x c

U n d e f i n e dJ

5/21/2018 ARM-final

30/103


When the processor is executing in ARM state:

All instructions are 32 bits wide

All instructions must be word aligned

Therefore the pcvalue is stored in bits [31:2] with bits [1:0] undefined (as

instruction cannot be halfword or byte aligned).

When the processor is executing in Thumb state:


All instructions must be halfword aligned

Therefore the pcvalue is stored in bits [31:1] with bit [0] undefined (asinstruction cannot be byte aligned).

When the processor is executing in Jazelle state:


Processor performs a word access to read 4 instructions at once

Program Counter (r15)

5/21/2018 ARM-final

31/103


Vector Table

Exception Handling

When an exception occurs, the ARM: Copies CPSR into SPSR_

Sets appropriate CPSR bits

Change to ARM state

Change to exception mode

Disable interrupts (if appropriate)

Stores the return address in LR_

Sets PC to vector address

To return, exception handler needs to:

Restore CPSR from SPSR_

Restore PC from LR_

This can only be done in ARM state.

Vector table can be at

0xFFFF0000on ARM720Tand on ARM9/10 family devices

FIQIRQ

(Reserved)Data Abort

Prefetch Abort

Software Interrupt

Undefined Instruction

Reset

0x1C0x18

0x140x10

0x0C

0x08

0x04

0x00

5/21/2018 ARM-final

32/103


Instruction set

ARM has three instruction setsARM, Thumb, and Jazelle.

The register file contains 37 registers, but only 17 or 18 registers

are accessible at any point in time; the rest are banked

according to processor mode.

The current processor mode is stored in the CPSR

It holds the current status of the processor core as well as

interrupt masks, condition flags, and state.

The state determines which instruction set is being executed.

5/21/2018 ARM-final

33/103


ARM instructions can be made to execute conditionally by postfixing

them with the appropriate condition code field.

This improves code density andperformance by reducing the number of

forward branch instructions.CMP r3,#0 CMP r3,#0BEQ skip ADDNE r0,r1,r2ADD r0,r1,r2

skip

By default, data processing instructions do not affect the condition code

flags but the flags can be optionally set by using S. CMP does notneed S.

loopSUBS r1,r1,#1BNE loop

if Z flag clear then branch

decrement r1 and set flags

Conditional Execution and Flags

5/21/2018 ARM-final

34/103


Condition Codes

Not equal

Unsigned higher or sameUnsigned lower

Minus

Equal

Overflow

No overflow

Unsigned higher

Unsigned lower or same

Positive or Zero

Less than

Greater than

Less than or equal

Always

Greater or equal

EQNE

CS/HSCC/LO

PLVS

HI

LSGELTGTLEAL

MI

VC

Suffix Description

Z=0

C=1C=0

Z=1

Flags tested

N=1

N=0

V=1

V=0

C=1 & Z=0

C=0 or Z=1N=V

N!=V

Z=0 & N=V

Z=1 or N=!V

The possible condition codes are listed below: Note AL is the default and does not need to be specified

Examples of conditional

5/21/2018 ARM-final

35/103


Examples of conditional

execution

Use a sequence of several conditional instructions

if (a==0) func(1);CMP r0,#0MOVEQ r0,#1BLEQ func

Set the flags, then use various condition codes

if (a==0) x=0;if (a>0) x=1;

CMP r0,#0MOVEQ r1,#0MOVGT r1,#1

Use conditional compare instructionsif (a==4 || a==10) x=0;

CMP r0,#4CMPNE r0,#10MOVEQ r1,#0

5/21/2018 ARM-final

36/103


Branch : B{} label

Branch with Link : BL{} subroutine_label

The processor core shifts the offset field left by 2 positions, sign-extends

it and adds it to the PC

32 Mbyte range

How to perform longer branches?

2831 24 0

Cond 1 0 1 L Offset

Condition field

Link bit 0 = Branch1 = Branch with link

232527

Branch instructions

5/21/2018 ARM-final

37/103


Data processing Instructions

Consist of : Arithmetic: ADD ADC SUB SBC RSB RSC Logical: AND ORR EOR BIC Comparisons: CMP CMN TST TEQ Data movement: MOV MVN

These instructions only work on registers, NOT memory.

Syntax:

{}{S} Rd, Rn, Operand2

Comparisons set flags only - they do not specify Rd

Data movement does not specify Rn

Second operand is sent to the ALU via barrel shifter.

5/21/2018 ARM-final

38/103


The Barrel Shifter

DestinationCF 0 Destination CF

LSL : Logical Left Shift ASR: Arithmetic Right Shift

Multiplication by a power of 2 Division by a power of 2,preserving the sign bit

Destination CF...0 Destination CF

LSR : Logical Shift Right ROR: Rotate Right

Division by a power of 2 Bit rotate with wrap aroundfrom LSB to MSB

Destination

RRX: Rotate Right Extended

Single bit rotate with wrap aroundfrom CF to MSB

CF

Using the Barrel Shifter:

5/21/2018 ARM-final

39/103


Register, optionally with shift operation Shift value can be either be:

5 bit unsigned integer

Specified in bottom byte of another

register.

Used for multiplication by constant

Immediate value

8 bit number, with a range of 0-255.

Rotated right through even number of

positions

Allows increased range of 32-bitconstants to be loaded directly into

registers

Result

Operand1

BarrelShifter

Operand2

ALU

Using the Barrel Shifter:

The Second Operand

5/21/2018 ARM-final

40/103


No ARM instruction can contain a 32 bit immediate constant

All ARM instructions are fixed as 32 bits long

The data processing instruction format has 12 bits available for operand2

4 bit rotate value (0-15) is multiplied by two to give range 0-30 in steps of 2

Rule to remember is 8-bits shifted by an even number of bit positions.

0711 8

immed_8

Shifter

ROR

rot

x2

Quick Quiz:

0xe3a004ffMOV r0, #???

Immediate constants (1)

5/21/2018 ARM-final

41/103


Examples:

The assembler converts immediate values to the rotate form: MOV r0,#4096 ; uses 0x40 ror 26 ADD r1,r2,#0xFF0000 ; uses 0xFF ror 16

The bitwise complements can also be formed using MVN: MOV r0, #0xFFFFFFFF ; assembles to MVN r0,#0

Values that cannot be generated in this way will cause an error.

031

ror #0

range 0-0xff000000 step 0x01000000ror #8

range 0-0x000000ff step 0x00000001

range 0-0x000003fc step 0x00000004ror #30

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Immediate constants (2)

5/21/2018 ARM-final

42/103


To allow larger constants to be loaded, the assembler offers a pseudo-instruction: LDR rd, =const

This will either: Produce aMOVorMVNinstruction to generate the value (if possible).

or Generate a LDRinstruction with a PC-relative address to read the constant

from a literal pool(Constant data area embedded in the code).

For example LDR r0,=0xFF => MOV r0,#0xFF LDR r0,=0x55555555 => LDR r0,[PC,#Imm12]

DCD 0x55555555

This is the recommended way of loading constants into a register

Loading 32 bit constants

5/21/2018 ARM-final

43/103


Multiply

Syntax:

MUL{}{S} Rd, Rm, Rs Rd = Rm * Rs MLA{}{S} Rd,Rm,Rs,Rn Rd = (Rm * Rs) + Rn

[U|S]MULL{}{S} RdLo, RdHi, Rm, Rs RdHi,RdLo := Rm*Rs

[U|S]MLAL{}{S} RdLo, RdHi, Rm, Rs RdHi,RdLo := (Rm*Rs)+RdHi,RdLo

Cycle time

Basic MUL instruction

2-5 cycles on ARM7TDMI

1-3 cycles on StrongARM/XScale

2 cycles on ARM9E/ARM102xE

+1 cycle for ARM9TDMI (over ARM7TDMI)

+1 cycle for accumulate (not on 9E though result delay is one cycle longer)

+1 cycle for long

Above are general rules - refer to the TRM for the core you are using

for the exact details

S f

5/21/2018 ARM-final

44/103


Single register data transfer

LDR STR WordLDRB STRB Byte

LDRH STRH HalfwordLDRSB Signed byte load

LDRSH Signed halfword load

Memory system must support all access sizes

Syntax:

LDR

{}{} Rd,

STR{}{} Rd,

e.g. LDREQB

Add d

5/21/2018 ARM-final

45/103


Address accessed

Address accessed by LDR/STR is specified by a base register plus an

offset

For word and unsigned byte accesses, offset can be An unsigned 12-bit immediate value (ie 0 - 4095 bytes).

LDR r0,[r1,#8]

A register, optionally shifted by an immediate valueLDR r0,[r1,r2]LDR r0,[r1,r2,LSL#2]

This can be either added or subtracted from the base register:LDR r0,[r1,#-8]LDR r0,[r1,-r2]LDR r0,[r1,-r2,LSL#2]

For halfword and signed halfword / byte, offset can be: An unsigned 8 bit immediate value (ie 0-255 bytes).

A register (unshifted).

Choice of pre-indexedor post- indexedaddressing

P P t I d d Add i ?

5/21/2018 ARM-final

46/103


0x5

0x5

r1

0x200Base

Register 0x200

r00x5

SourceRegisterfor STR

Offset12 0x20c

r1

0x200

OriginalBase

Register0x200

r0

0x5Source

Registerfor STR

Offset

12 0x20c

r1

0x20c

UpdatedBase

Register

Auto-update form:STR r0,[r1,#12]!

Pre or Post Indexed Addressing?

Pre-indexed: STR r0,[r1,#12]

Post-indexed: STR r0,[r1],#12

LDM / STM ti

5/21/2018 ARM-final

47/103


LDM / STM operation

Syntax:

{} Rb{!},

4 addressing modes:

LDMIA/ STMIA increment after

LDMIB/ STMIB increment before

LDMDA/ STMDA decrement after

LDMDB/ STMDB decrement beforeIA

r1 Increasing

Address

r4

r0

r1

r4

r0

r1

r4

r0 r1

r4

r0

r10

IB DA DB

LDMxx r10, {r0,r1,r4}STMxx r10, {r0,r1,r4}

Base Register (Rb)

S ft I t t (SWI)

5/21/2018 ARM-final

48/103


Software Interrupt (SWI)

Causes an exception trap to the SWI hardware vector The SWI handler can examine the SWI number to decide what operation

has been requested.

By using the SWI mechanism, an operating system can implement a set

of privileged operations which applications running in user mode can

request. Syntax:

SWI{}

2831 2427 0

Cond 1 1 1 1 SWI number (ignored by processor)

23

Condition Field

PSR T f I t ti

5/21/2018 ARM-final

49/103


PSR Transfer Instructions

MRS and MSR allow contents of CPSR / SPSR to be transferred to / from

a general purpose register.

Syntax:

MRS{} Rd, ; Rd = MSR{} ,Rm ; = Rm

where = CPSR or SPSR

[_fields] = any combination of fsxc

Also an immediate form MSR{} ,#Immediate

In User Mode, all bits can be read but only the condition flags (_f) can be

written.

2731

N Z C V Q28 67

I F T mode1623 815 5 4 024

f s x c

U n d e f i n e dJ

ARM B h d S b ti

5/21/2018 ARM-final

50/103


ARM Branches and Subroutines

B

PC relative. 32 Mbyte range.

BL

Stores return address in LR

Returning implemented by restoring the PC from LR

For non-leaf functions, LR will have to be stacked

STMFDsp!,{regs,lr}

:

BL func2

:

LDMFDsp!,{regs,pc}

func1 func2

:

:

BL func1

:

:

:

:

:

::

MOV pc, lr

Th b

5/21/2018 ARM-final

51/103


Thumb

Thumb is a 16-bit instruction set

Optimised for code density from C code (~65% of ARM code size)

Improved performance from narrow memory

Subset of the functionality of the ARM instruction set

Core has additional execution state - Thumb

Switch between ARM and Thumb using BXinstruction

015

31 0

ADDS r2,r2,#1

ADD r2,#1

32-bit ARM Instruction

16-bit Thumb Instruction

For most instructions generated by compiler:

Conditional execution is not used

Source and destination registers identical

Only Low registers used Constants are of limited size

Inline barrel shifter not used

A d

5/21/2018 ARM-final

52/103


Agenda

Introduction

Programmers Model

Instruction Sets

System Design

Development Tools

E l ARM b d S t

5/21/2018 ARM-final

53/103


Example ARM-based System

16 bit RAM

8 bit ROM

32 bit RAM

ARMCore

I/OPeripherals

Interrupt

ControllernFIQnIRQ

ARM Based microcontroller

5/21/2018 ARM-final

54/103


ARM Based microcontroller

A Basic ARM MEMORY SYSTEM

5/21/2018 ARM-final

55/103


A Basic ARM MEMORY SYSTEM.

AMBA

5/21/2018 ARM-final

56/103


AMBA

Bridge

Timer

On-chipRAM

ARM

InterruptController

Remap/Pause

TIC

Arbiter

Bus InterfaceExternalROM

ExternalRAM

Reset

System Bus Peripheral Bus

AMBA

Advanced Microcontroller Bus

Architecture

ADK

Complete AMBA Design Kit

ACT

AMBA Compliance Testbench

PrimeCell

ARMs AMBA compliant peripherals

AHB or ASB APB

ExternalBus

Interface

Decoder

System Design Hardware

5/21/2018 ARM-final

57/103


System Design-Hardware

An embedded system includes the following hardwarecomponents:

ARMprocessors are found embedded in chips.

Programmers accessperipherals through memory-mappedregisters.

There is a special type of peripheral called a controller,

which embedded systems use to configure higher-level

functions such as memory and interrupts. The AMBA on-chip bus is used to connect the processor

and peripherals together.

System Design Software

5/21/2018 ARM-final

58/103


System Design-Software

An embedded system also includes the following softwarecomponents:

Init ial ization co deconfigures the hardware to a known state.

Once configured, operat ing sys tems can be loaded and executed.

Operating systems provide a common programming environment for

the use of hardware resources and infrastructure.

Device d r iversprovide a standard interface to peripherals.

An appl icat ion Program performs the task-specific duties of anembedded system.

Agenda

5/21/2018 ARM-final

59/103


Agenda

Introduction

Programmers Model

Instruction Sets

System Design

Development Tools

The RealView Product Families

5/21/2018 ARM-final

60/103


The RealView Product Families

Debug ToolsAXD (part of ADS)

Trace Debug Tools

Multi-ICE

Multi-Trace

PlatformsARMulator (part of ADS)

Integrator Family

Compilation ToolsARM Developer Suite (ADS)

Compilers (C/C++ ARM & Thumb),

Linker & Utilities

RealView Compilation Tools (RVCT) RealView Debugger (RVD)

RealView ICE (RVI)

RealView Trace (RVT)

RealView ARMulator ISS (RVISS)

ARM Debug Architecture

5/21/2018 ARM-final

61/103


ARM Debug Architecture

ARM

core

ETM

TAP

controller

Trace PortJTAG port

Ethernet

Debugger (+ optional

trace tools)

EmbeddedICE Logic

Provides breakpoints and processor/system

access

JTAG interface (ICE)

Converts debugger commands to JTAG

signals

Embedded trace Macrocell (ETM) Compresses real-time instruction and data

access trace

Contains ICE features (trigger & filter logic)

Trace port analyzer (TPA)

Captures trace in a deep buffer

EmbeddedICE

Logic

5/21/2018 ARM-final

62/103

Thumb instruction set

5/21/2018 ARM-final

63/103


Thumb instruction set

Thumb instruction set encodes a subset of the 32-bit ARM

instructions into a 16-bit instruction set space. So it has

higher code density: 30% less memory

Since Thumb has higher performance than ARM on a

processor with a 16-bit data bus, but

lower performance than ARM on a 32-bit data bus, use Thumb for memory-constrained systems.

Thumb Instruction set

5/21/2018 ARM-final

64/103



5/21/2018 ARM-final

65/103



Thumb register usage

5/21/2018 ARM-final

66/103


Thumb register usage

Code Density

5/21/2018 ARM-final

67/103


Code Density

Thumb instruction decoding

5/21/2018 ARM-final

68/103


Thumb instruction decoding

Thumb Instructions limitations

5/21/2018 ARM-final

69/103


Thumb Instruction s limitations

Only the branch relative instruction can be conditionally executed.

The limited space available in 16 bits causes the barrel shift

operationsASR, LSL, LSR, and ROR to be separate instructions

in the Thumb ISA.

there is no direct access to the CPSR or SPSR. So there are no

MSR- and MRS-equivalent Thumb instructions.

To alter the CPSR or SPSR, you must switch into ARM state to use

MSR and MRS.

Similarly, there are no coprocessor instructions in Thumb state.

You need to be in ARM state to access the coprocessor for

configuring cache and memory management.

ARM Thumb internetworking

5/21/2018 ARM-final

70/103


ARM Thumb internetworking

The method of linking ARM and Thumb code together for both

assembly and C/C++.

It handles the transition between the two states. Extra code, called a

veneer, is sometimes needed to carry out the transition.

ATPCS defines the ARM and Thumb procedure call standards.

To call a Thumb routine from an ARM routine, the core has to change

state of T bit of the CPSR.

The BX and BLX branch instructions cause a switch between ARM and

Thumb state while branching to a routine

MIX Mode Programming

5/21/2018 ARM-final

71/103


branch instructions

There are two versions of the BX or BLX instructions: an ARM

instruction and a Thumb equivalent.

The ARM BX instruction enters Thumb state only if bit 0 of the address

in Rn is set to binary 1; otherwise it enters ARM state.

The Thumb BX instruction does the same.

Syntax: BX Rm BLX Rm | label

Unlike the ARM version, the Thumb BX instruction cannot be

conditionally executed.

The conditional branch instruction is the only conditionally executed

instruction in Thumb state.

B branch BL branch with link

l r =(instruction address after the BL) + 1


5/21/2018 ARM-final

72/103



The Thumb data processing instructions are a subset of the ARM

data processing instructions.

Most Thumb data processing instructions operate on low registers

and update the cpsr. The exceptions are

MOV Rd,Rn ADD Rd,Rm CMP Rn,Rm

ADD sp, #immediate SUB sp, #immediate

ADD Rd,sp,#immediate ADD Rd,pc,#immediate

which can operate on the higher registers r8r14 and the pc.

These instructions, except for CMP, do not update the condition flags

in the cpsr when using the higher registers.

The CMP instruction, however, always updates the cpsr.

Single Register Load StoreI t ti

5/21/2018 ARM-final

73/103


Instructions

The Thumb instruction set supports load and storing registers, or LDR and

STR.

These instructions use two pre-indexed addressing modes: offset by register

and offset by immediate.

Load/store register [Rn, Rm]

Base register + offset [Rn, #immediate]

Relative [pc|sp, #immediate]

The offset by register uses a base register Rn + the register offset Rm.

The second uses the same base register Rn + a 5-bit immediate or a valuedependent on the data size.

The 5-bit offset encoded in the instruction is multiplied by one for byte

accesses, two for 16-bit accesses, and four for 32-bit accesses.

Multiple Register Load-Store I

5/21/2018 ARM-final

74/103


Multiple Register Load Store I

The Thumb versions of the load-store multiple instructions are reduced

forms of the ARM load-store multiple instructions. They only support the increment after (IA) addressing mode.

Syntax : IA Rn!, {low Register list}

LDMIA load multiple registers

{Rd}*N mem32[Rn + 4 N], Rn = Rn + 4 N

Here N is the number of registers in the list of registers. these instructions always update the base register Rn after execution.

The base register and list of registers are limited to the low registers r0

to r7.

MIX Mode Programmingt k i t ti

5/21/2018 ARM-final

75/103


stack instructions The Thumb stack operations are different from the equivalent ARM

instructions because they use the more traditional POP and PUSH

concept.

Syntax: POP {low_register_list{, pc}}

PUSH {low_register_list{, lr}}

POP pop registers from the stacks RdN mem32[sp+4 N], sp =

sp4 N


5/21/2018 ARM-final

76/103



No stack pointer in the instruction because the stack pointer

is fixed as register r13 in Thumb operations and sp isautomatically updated.

The list of registers is limited to the low registers r0 to r7.

The PUSH register list also can include the link register lr.

similarly the POP register list can include the pc.

This provides support for subroutine entry and exit.

The stack instructions only support full descending stackoperations.

MIX Mode Programmingft i t t

5/21/2018 ARM-final

77/103


software interrupt.

Similar to the ARM equivalent, the Thumb software interrupt (SWI)

instruction causes a software interrupt exception.

If any interrupt or exception flag is raised in Thumb state, the

processor automatically reverts back to ARM state to handle the

exception.

Syntax: SWI immediate

The Thumb SWI instruction has the same effect and nearly the same

syntax as the ARM equivalent.

It differs in that the SWI number is limited to the range 0 to 255 and it

is not conditionally executed.

ARM7 Family

5/21/2018 ARM-final

78/103


y

One significant variation in the ARM7 family is the ARM7TDMI-S.synthesizable.

ARM720Tincludes an MMU being capable of handling the Linux and

Microsoft embedded platform operating systems.

The processor also includes a unified 8K cache. The vector table canbe relocated to a higher address by setting a coprocessor 15 register.

Another variation is the ARM7EJ-S processor, also synthesizable,

provides both Java acceleration and the enhanced instructions but

without any memory protection ARM7EJ-S is quite differentsince it includes a five-stage pipeline

and executes ARMv5TEJ instructions.

ARM PROCESSOR VARIANTS

5/21/2018 ARM-final

79/103


ARM7,ARM9,ARM11

5/21/2018 ARM-final

80/103


, ,

ARM7, ARM9, ARM10 and ARM11 cores are directly dependent upon

the type and geometry of the manufacturing process, which has a direct effect

on the frequency (MHz) and powerconsumption (watts).

An ARM processor is an implementation of a specific instruction set

architecture (ISA).

The ISA has been continuously improved from the first ARM processor design.

Processors are grouped into implementation families (ARM7, ARM9, ARM10,

and ARM11) with similar characteristics.

ARM9 Family

5/21/2018 ARM-final

81/103


y

The ARM9family was announced in 1997,with five-stage

pipeline, can run at higher clock frequencies than the ARM7.

The extra stages improve the overall performance of the

processor.

The memory system has been redesigned to follow the Harvard

architecture, which separates the data D and instruct ion I

buses.

The first processor in the ARM9 family was the ARM920T, which

includes a separate D + I cache and an MMU for virtual memory

support.

ARM922T is a variation on the ARM920T but with half the D +I

cache size.

ARM9 Family

5/21/2018 ARM-final

82/103


y

The ARM940Tincludes a smaller D +I cache and an MPU designed

for applications that do not require a platform operating system.

Both ARM920T and ARM940T execute the architecture v4T

instructions.

The next processors are based on the ARM9E-S core, a synthesizable

version of the ARM9 core with the E extensions.

There are two variations: the ARM946E-Sand the ARM966E-S. Both

execute architecture v5TEinstructions.

They also support the optional embedded trace macro -cel l (ETM),

which allows a developer to trace instruction and data execution in realtime on the processor.

This is important when debugging applications with time-critical

segments.

ARM9 Family

5/21/2018 ARM-final

83/103


y

The ARM946E-Sincludes TCM, cache, and an MPU. The sizes of the TCM

and caches are configurable.

This processor is designed for use in embedded control applications that

require deterministic real-time response.

In contrast, the ARM966Edoes not have the MPU and cache extensions but

does have configurable TCMs. The latest core in the ARM9 product line is the ARM926EJ-S synthesizable

processor core, announced in 2000 the first ARM processor core to include

the Jazelletechnology, which accelerates Java byte-code execution.

It is designed for use in small portable Java-enabled devices such as 3G

phones and personal digital assistants (PDAs).

It features an MMU, configurable TCMs, and D +I caches with zero or

nonzero wait state memories.

ARM10 Family

5/21/2018 ARM-final

84/103


y

The ARM10, announced in 1999, was designed for

performance.

It extends the ARM9 pipeline to six stages.

It also supports an optional vector floating-point (VFP) unit,which

adds a seventh stage to the ARM10 pipeline. The VFP

significantly increases

floating-point performance and is compliant with the IEEE

754.1985 floating-point standard.

ARM10 Family

5/21/2018 ARM-final

85/103


y

The ARM1020Eis the first processor to use an ARM10Ecore.

Like the ARM9E,

It includes the enhanced E instructions. It has separate 32K D

+ I caches, opt ion al vector floating-point unit, and an MMU.

The ARM1020E also has a dual 64-bit bus interface for

increased performance.

ARM1026EJ-S is very similar to the ARM926EJ-S but with both

MPU and MMU. This processor has the performance of the ARM10 with the

flexibility of an ARM926EJ-S.

ARM11 Family

5/21/2018 ARM-final

86/103


y

The ARM1136J-S, announced in 2003, was designed for highperformance and power-efficient applications.

It was the first processor implementation to execute architecture

ARMv6 instructions.

It incorporates an eight-stage pipeline with separate load-storeand arithmetic pipelines.

Included in the ARMv6 instructions are single instruction multiple

data (SIMD) extensions for media processing, specifically

designed to increase video processing performance.

The ARM1136JF-Sis an ARM1136J-S with the addition of the

vector floating-point unit for fast floating-point operations.

Enhanced DSP features

5/21/2018 ARM-final

87/103


Processing digitized signals requires high memory bandwidths and fast

multiply accumulate operations.

A single-core design can reduce cost and power consumption over a

two-core solution.

DSP applications are typically multiply and load-store intensive.

A basic operation is a multiply accumulate multiplying two 16-bitsigned numbers and accumulating onto a 32-bit signed

accumulator.

The ARMv5TE extensions available in the ARM9E and later cores

provide efficient multiply accumulate operations.

With careful coding, the ARM9E processor will perform decently on

the DSP parts of an application while

outperforming a DSPon the control parts of the application.

Generations suitable for DSPapplications

5/21/2018 ARM-final

88/103


applications.

DSP Algorithms Characteristics

5/21/2018 ARM-final

89/103


Due to their high data bandwidth and performance requirements,

we have to code DSP algorithms in hand-written assembly.

We need fine control of register allocation and instruction

scheduling to achieve the best performance.

Filtering is probably the most commonly used signal processing

operation. It can be used to remove noise, to analyze signals, or in

signal compression.

Another very common algorithm is the Discrete Fourier Transform

(DFT), which converts a signal from a time representation to a

frequency representation or vice versa.

How to represent a signal on theARM

5/21/2018 ARM-final

90/103


ARM

Use a floating-point representation for prototyping algorithms. Do notuse floating point in applications where speed is critical. Most ARM

implementations do not include hardware floating-point support.

Use a fixed-point representation for DSP applications where speed

is critical with moderate dynamic range. The ARM cores provide good

support for 8-, 16- and 32-bit fixed-point DSP.

For applications requiring speed and high dynamic range, use a block-

floating or logarithmic representation.

The key idea is to use block algorithms that calculate several results

at once, and thus require less memory bandwidth, increaseperformance and decrease power consumption compared with

calculating single results.

5/21/2018 ARM-final

91/103


Figure shows a sine wave signal digitized at the sampling points 0, 1, 2,

3, and so on.

Dynamic Range &Accuracy

5/21/2018 ARM-final

92/103


There are two things to worry about when choosing a representation ofx[t

]:

1. The dynamic range of the signalthe maximum fluctuation in the

signal defined by Equation-A.

For a signed signal we are interested in the maximum absolute value M

possible. For this example, letstake M = 1 volt.

M = max|x[t ]| over all t = 0, 1, 2, 3 . . (A)

2. The accuracy required in the representation- sometimes given as

a proportion of the maximum range.

For example, an accuracy of 100 parts per million means that each x[t ]

needs to be represented within an error of

E = M 0. 0001 = 0. 0001 volts

Suitable Representation

5/21/2018 ARM-final

93/103


We could use a floating-point representation for x[t ].

1)This would certainly meet our dynamic range and accuracy

constraints, and

2) it would also be easy to manipulate using the C type float.

However, most ARM cores do not support floating point in

hardware, and so a floating-point representation would be very slow.

A better choice for fast code is a fixed-point representation.

A fixed-point representation uses an integer to represent a fractional

value by scaling the fraction.

Error Vs Accuracy

5/21/2018 ARM-final

94/103


A common error is to think that floating point is more accurate than fixed point. Thisis false!

For the same number of bits, a fixed-point representation gives greater accuracy.

The floating-point representation gives higher dynamic range at the expense of

lower absolute accuracy.

For example, if you use a 32-bit integer to hold a fixed-point value scaled to full

range, then the maximum error in a representation is 232. However, single-

precision 32-bit floating-point values give a relative error of 224.

The single-precision floating-point mantissa is 24 bits. The leading 1 of the mantissa

is not stored, so 23 bits of storage are actually used. For values near the maximum,

the fixed-point representation is 23224 = 256 times more accurate!

The 8-bit floating-point exponent is of little use when you are interested in maximum

error rather than relative accuracy.

Better representation

5/21/2018 ARM-final

95/103


To summarize, a fixed-point representation is best when there is aclear bound to the strength of the signal and when maximum error is

important.

When there is no clear bound and you require a large dynamic range,then floating point is better.

You can also use the other representations, which give more dynamic

range than fixed point while still being more efficient to implement than

floating point.

General rules on writing DSPalgorithms for the ARM

5/21/2018 ARM-final

96/103


algorithms for the ARM.

ARM does not provide operations that saturate automatically. Design the DSP

algorithm so that saturation is not required because saturation will cost extracycles.

ARM supports extended-precision 32-bit multiplied by 32-bit to 64-bit

operations very well. Use extended-precision arithmetic or additional scaling

rather than saturation.

The ARM core is not a dedicated DSP. There is no single instruction thatissues a multiply accumulate and data fetch in parallel. However, by reusing

loaded data you can achieve a respectable DSP performance.

Design the DSP algorithm to minimize loads and stores. Once you load a data

item, then perform as many operations that use the datum as possible. You

can often do this by calculating several output results at once. Another way of increasing reuse is to concatenate several operations. For

example, you could perform a dot product and signal scale at the same time,

while only loading the data once.

Guidelines for writing DSP code.

5/21/2018 ARM-final

97/103


FromARM9onwards,ARMimplementations use a multistage execute pipeline

for loads and multiplies, which introduces potential processor interlocks. If you load a value and then use it in either of the following two instructions, the

processor may stall for a number of cycles waiting for the loaded value to

arrive.

Similarly if you use the result of a multiply in the following instruction, this may

cause stall cycles. It is particularly important to schedule code to avoid thesestalls.

Write ARM assembly to avoid processor interlocks. The results of load and

multiply instructions are often not available to the next instruction without

adding stall cycles.

Sometimes the results will not be available for several cycles

There are 14 registers available for general use on the ARM, r0 to r12 and r14.

Design the DSP algorithm so that the inner loop will require 14 registers or

fewer.

An example- a DOT Product

5/21/2018 ARM-final

98/103


A dot-product is one of the simplest DSP operations and highlights the

difference among different ARM implementations.

A dot-product combines N samples from two signals x(t) and c(t) to

produce a correlation value a: a= Ci * Xi

The C interface to the dot-product function is

int dot_product(sample *x, coefficient *c, unsigned int N); where

sample is the type to hold a 16-bit audio sample, usually a short

coefficient is the type to hold a 16-bit coefficient, usually a short

x[i] and c[i] are two arrays of length N (the data and coefficients)

the function returns the accumulated 32-bit integer dot product a

DSP rating of ARM7TDMI

5/21/2018 ARM-final

99/103


This example shows a 16-bit dot-product optimized for the ARM7TDMI.Each MLA takes a worst case of four cycles.

We store the 16-bit input samples in 32-bit words so that we can use the

LDM instruction to load them efficiently.

This code assumes that the number of samples N is a multiple of five.

Therefore we can use a five-word load multiple to increase data bandwidth.

The cost per load is 7/4 = 1.4 cycles compared to 3 cycles per load if we had

used LDR or LDRSH.

The inner loop requires a worst case of 7 + 7 + 5 4 + 1 + 3 = 38 cycles to

process each block of 5 products from the sum.

This gives the ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tapfor a

16-bit dot-product.

Assembly code for DOT Product

5/21/2018 ARM-final

100/103


x RN 0 ; input array x[]

c RN 1 ; input array c[]

N RN 2 ; number of samples (a multiple

of 5)

acc RN 3 ; accumulator

x_0 RN 4 ; elements from array x[]

x_1 RN 5

x_2 RN 6

x_3 RN 7

x_4 RN 8

c_0 RN 9 ; elements from array c[]

c_1 RN 10

c_2 RN 11

c_3 RN 12

c_4 RN 14

; int dot_16by16_arm7m(int *x, int *c, unsigned N)

dot_16by16_arm7m

STMFD sp!, {r4-r11, lr}

MOV acc, #0

loop_7m ; accumulate 5 products

LDMIA x!, {x_0, x_1, x_2, x_3, x_4}

LDMIA c!, {c_0, c_1, c_2, c_3, c_4}

MLA acc, x_0, c_0, acc





SUBS N, N, #5

BGT loop_7m

MOV r0, acc

LDMFD sp!, {r4-r11, pc}

DSP Rating -16bit DOT Product

5/21/2018 ARM-final

101/103


ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tap

ARM9TDMI-The inner loop requires 28 cycles per tap, giving 28/4 = 7

cycles per tap.

STRONGARM-The inner loop uses 19 cycles to process 4 taps, giving

a rating of 19/4 = 4.75 cycles per tap.

ARM9E-The inner loop requires 20 cycles to accumulate 8 products, a

rating of 20/8 = 2.5 cycles per tap.

ARM10E-The inner loop requires 25 cycles to process 10 samples, or

2.5 cycles per tap.

Intel- XSCALE-The inner loop requires 14 cycles to accumulate 8

products, a rating of 1.75 cycles per tap.

Performance improvement in DSP

5/21/2018 ARM-final

102/103


The block

filter

algorithm

gives a

much betterperformance

per tap if you

are

calculating

multipleproducts.

5/21/2018 ARM-final

103/103

THANK YOU

Documents

ARM-final