Upload
rahul-gala
View
5
Download
0
Tags:
Embed Size (px)
DESCRIPTION
arm processor
Citation preview
5/21/2018 ARM-final
1/103
1TM 139v10 The ARM Architecture
ARM PROCESSOR,
ESRTP,
SEM-8,B.E.(ELECTRONICS)
WITH BEST OF LUCK FROM:
Prof. Vidya Gogate
SAKEC, Chembur
5/21/2018 ARM-final
2/103
2TMT H E A R C H I T E C T U R E F O R T H E D I G I T A L W O R L D
The ARM Architecture
5/21/2018 ARM-final
3/103
3TM 339v10 The ARM Architecture
Agenda
Introduction to ARM Ltd
Programmers Model
Instruction Set
System Design
Development Tools
5/21/2018 ARM-final
4/103
4TM 439v10 The ARM Architecture
ARM Ltd
Founded in November 1990
Spun out of Acorn Computers
Designs the ARM range of RISC processor
cores
Licenses ARM core designs to semiconductor
partners who fabricate and sell to their
customers.
ARM does not fabricate silicon itself
Also develop technologies to assist with thedesign-in of the ARM architecture
Software tools, boards, debug hardware,
application software, bus architectures,
peripherals etc
5/21/2018 ARM-final
5/1035TM 539v10 The ARM Architecture
ARM Partnership Model
5/21/2018 ARM-final
6/1036TM 639v10 The ARM Architecture
ARM Powered Products
5/21/2018 ARM-final
7/1037TM 739v10 The ARM Architecture
Latest NEWS
For 30 years, Intel basically had the market to itself and as a result, its
chips were priced much higher than ARM chips.
But ARMs entry into Intels territory is a real threat to Intels dominance.
In fact, Microsofts first version of the Surface tablet/keyboard device
uses ARM chips. The Intel-based versions will come out about three
month later.
http://techland.time.com/2012/06/18/surface-microsoft-made-windows-8-tablets-coming-this-year/http://techland.time.com/2012/06/18/surface-microsoft-made-windows-8-tablets-coming-this-year/5/21/2018 ARM-final
8/1038TM 839v10 The ARM Architecture
Latest NEWS
There has been an important new development in the processor world
lately. The folks behind the ARM processor the chip that powers most
smart phones and tablets today decided to scale up this processor
technology to run at speeds that could be used in advanced tablets andmore importantly, laptops and even desktops.
And ARM processors got a major boost when Microsoft made the
decision to create an ARM-based version of Windows 8. For the first time,
Microsoft broke away from whats been called the Win-Tel monopoly.
5/21/2018 ARM-final
9/1039TM 939v10 The ARM Architecture
Latest NEWS
But Microsofts new operating system for ARM, called Windows RT, has
touched off a new battlefront in processor wars.
This has forced Intel to try and make its x86 processors more energyefficient;
the company hopes to have chips that are on par with ARM by mid-2013.
Read more: http://techland.time.com/2012/07/16/arm-vs-intel-how-the-
processor-wars-will-benefit-consumers-most/#ixzz23XHpDxkn
http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/http://techland.time.com/2012/07/16/arm-vs-intel-how-the-processor-wars-will-benefit-consumers-most/5/21/2018 ARM-final
10/10310TM 1039v10 The ARM Architecture
RISC Design Philosophy
InstructionsRISC processors have a reduced number of instruction
classes which provide simple operations that can each execute in a
single cycle.
In contrast, in CISC processors the instructions are often of variable
size and take many cycles to execute.
PipelinesThe processing of instructions is broken down into smallerunits that can be executed in parallel by pipelines.
There is no need for an instruction to be executed by a mini-program
called microcode as on CISC processors.
RegistersRISC machines have a large general-purpose register set.Any register can contain either data or an address. Registers act as the
fast local memory store for all data processing operations.
In contrast, CISC processors have dedicated registers for specific
purposes.
5/21/2018 ARM-final
11/10311TM 1139v10 The ARM Architecture
RISC Design Philosophy
Load-store architectureThe processor operates on
data held in registers. Separate load and store instructions
transfer data between the register bank and external
memory.
Memory accesses are costly, so separating memory
accesses from data processing provides an advantage
because you can use data items held in the register bank
multiple times without needing multiple memory accesses.
In contrast, with a CISC design the data processing
operations can act on memory directly.
5/21/2018 ARM-final
12/10312TM 1239v10 The ARM Architecture
RISC Design Philosophy
These design rules allow a RISC processor to be simpler,
and thus the core can operate at higher clock
frequencies.
In contrast, traditional CISC processors are more complexand operate at lower clock frequencies.
5/21/2018 ARM-final
13/10313TM 1339v10 The ARM Architecture
ARM Design Philosophy
Portable embedded systems require some form of battery power. The
ARM processor has been specifically designed to be small to reduce
power consumption and extend battery operationessential for
applications such as PDAs.
Since embedded systems have limited memory due to cost and/orphysical size restrictions; High code density is useful feature of ARM
for applications that have limited on-board memory. The ability to use
low-cost memory devices produces substantial savings.
For a single-chip solution, the smaller the area used by the embedded
processor,(reduced die size) the more available space for specializedperipherals. This in turn reduces the cost of the design and
manufacturing since fewer discrete chips are required for the end
product.
5/21/2018 ARM-final
14/10314TM 1439v10 The ARM Architecture
ARM Design Philosophy
ARM has incorporated hardware debug technology
within the processor so that software engineers can view
what is happening while the processor is executing code.
With greater visibility, software engineers can resolve
issues faster, which has a direct effect on the time to
market and reduces overall development costs.
5/21/2018 ARM-final
15/10315TM 1539v10 The ARM Architecture
Instruction set FEATURES
Variable cycle execution for certain instructions-
load-store-multiple instructions vary in the number of execution cycles
depending upon the number of registers being transferred.
The transfer can occur on sequential memory addresses, which
increases performance since sequential memory accesses are often
faster than random accesses.
Code density is also improved since multiple register transfers are
common operations at the start and end of functions.
Inline barrel shifter leading to more complex instructions
The inline barrel shifter is a hardware component that preprocesses
one of the input registers before it is used by an instruction.
This expands the capability of many instructions to improve core
performance and code density.
5/21/2018 ARM-final
16/10316TM 1639v10 The ARM Architecture
Instruction set FEATURES
Thumb 16-bit instruction setThe 16-bit instructions improvecode density by about 30% over 32-bit fixed-length instructions.
Conditional execution- An instruction is only executed when a
specific condition has been satisfied. This feature improvesperformance and code density by reducing branch instructions.
Enhanced instructions - The enhanced digital signal processor
(DSP) instructions were added to the standard ARM instruction set tosupport fast 1616-bit multiplier operations and saturation. These
instructions allow a faster-performing ARM processor in some cases to
replace the traditional combinations of a processor plus a DSP.
5/21/2018 ARM-final
17/10317TM 1739v10 The ARM Architecture
Nomenclature.
Instruction set architecture (ISA) is upward compatible.
ARM{x}{y}{z}{T}{D}{M}{I}{E}{J}{F}{-S}
xfamily
ymemory management/protection unit
zcache
TThumb 16-bit decoder
DJTAG debug
Mfast multiplier
IEmbedded ICE macro-cell
Eenhanced instructions (assumes TDMI)
JJazelle
Fvector floating-point unit
Ssynthesizible version
5/21/2018 ARM-final
18/10318TM 1839v10 The ARM Architecture
Nomenclature
All ARM cores after the ARM7TDMI include the TDMIfeatures even though they may not include those letters
after the ARM label.
The processor family is a group of processor
implementations that share the same hardware
characteristics.
For example, the ARM7TDMI, ARM740T, and ARM720T all
share the same family characteristics and belong to the
ARM7 family.
5/21/2018 ARM-final
19/10319TM 1939v10 The ARM Architecture
Nomenclature
JTAG is described by IEEE 1149.1 Standard Test Access Portand boundary scan architecture.
It is a serial protocol used by ARM to send and receive
debug information between the processor core and test
equipment.
Embedded ICE macro-cell is the debug hardware built into
the processor that allows breakpoints and watch-points to
be set.
Synthesizable means that the processor core is supplied as
source code that can be compiled into a form easily used
by EDA tools.
5/21/2018 ARM-final
20/10320TM 2039v10 The ARM Architecture
Syllabus
ARM processor fundamentalsintroduction to ARM and THUMBinstruction set--
processor and memory organizationCPU Bus configuration
ARM BusMemory devicesInput/output devices
Component interfacingdesigning with microprocessordevelopment and debugging
Design Example
Instruction set with enhanced DSP features with ARM core, mix
mode programming as Thumb + ARM core,
Assembly programming concept, compare with ARM7, ARM9,
ARM11 with new features additions
5/21/2018 ARM-final
21/10321TM 2139v10 The ARM Architecture
PIN DIAGRAM
5/21/2018 ARM-final
22/10322TM 2239v10 The ARM Architecture
Architecture block diagram
5/21/2018 ARM-final
23/10323TM 2339v10 The ARM Architecture
Hardware Fundamentals
The ARM processor can be abstracted
into eight components
ALU, barrel shifter, MAC, register file,
instruction decoder, address register,
incrementer, and sign extend.
5/21/2018 ARM-final
24/103
24TM 2439v10 The ARM Architecture
Data Sizes and Instruction Sets
The ARM is a 32-bit architecture.
When used in relation to the ARM:
Bytemeans 8 bits
Halfwordmeans 16 bits (two bytes) Wordmeans 32 bits (four bytes)
Most ARMs implement two instruction sets
32-bit ARM Instruction Set
16-bit Thumb Instruction Set
Jazelle cores can also execute Java bytecode
5/21/2018 ARM-final
25/103
25TM 2539v10 The ARM Architecture
Processor Modes
The ARM has seven basic operating modes:
User: unprivileged mode under which most tasks run
FIQ: entered when a high priority (fast) interrupt is raised
IRQ: entered when a low priority (normal) interrupt is raised
Supervisor: entered on reset and when a Software Interrupt
instruction is executed
Abort: used to handle memory access violations
Undef: used to handle undefined instructions
System: privileged mode using the same registers as user mode
5/21/2018 ARM-final
26/103
26TM 2639v10 The ARM Architecture
r0
r1
r2
r3
r4
r5
r6
r7
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
r15 (pc)
cpsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
FIQ IRQ SVC Undef Abort
User Moder0
r1
r2
r3
r4
r5
r6
r7
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
r15 (pc)
cpsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
Current Visible Registers
Banked out Registers
FIQ IRQ SVC Undef Abort
r0
r1
r2
r3
r4
r5
r6
r7
r15 (pc)
cpsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
Current Visible Registers
Banked out Registers
User IRQ SVC Undef Abort
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
FIQ ModeIRQ Moder0
r1
r2
r3
r4
r5
r6
r7
r8
r9
r10
r11
r12
r15 (pc)
cpsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
Current Visible Registers
Banked out Registers
User FIQ SVC Undef Abort
r13 (sp)
r14 (lr)
Undef Moder0
r1
r2
r3
r4
r5
r6
r7
r8
r9
r10
r11
r12
r15 (pc)
cpsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Abort
r13 (sp)
r14 (lr)
SVC Moder0
r1
r2
r3
r4
r5
r6
r7
r8
r9
r10
r11
r12
r15 (pc)
cpsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
Current Visible Registers
Banked out Registers
User FIQ IRQ Undef Abort
r13 (sp)
r14 (lr)
Abort Moder0
r1
r2
r3
r4
r5
r6
r7
r8
r9
r10
r11
r12
r15 (pc)
cpsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r13 (sp)
r14 (lr)
spsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
Current Visible Registers
Banked out Registers
User FIQ IRQ SVC Undef
r13 (sp)
r14 (lr)
The ARM Register Set
5/21/2018 ARM-final
27/103
27TM 2739v10 The ARM Architecture
Register Organization Summary
Usermoder0-r7,r15,andcpsr
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)
spsr
FIQ
r8
r9
r10
r11
r12
r13 (sp)
r14 (lr)r15 (pc)
cpsr
r0r1
r2
r3
r4
r5
r6
r7
User
r13 (sp)
r14 (lr)
spsr
IRQ
Usermoder0-r12,r15,
andcpsr
r13 (sp)
r14 (lr)
spsr
Undef
Usermoder0-r12,r15,
andcpsr
r13 (sp)
r14 (lr)
spsr
SVC
Usermoder0-r12,r15,
andcpsr
r13 (sp)
r14 (lr)
spsr
Abort
Usermoder0-r12,r15,
andcpsr
Thumb state
Low registers
Thumb state
High registers
Note: System mode uses the User mode register set
5/21/2018 ARM-final
28/103
28TM 2839v10 The ARM Architecture
The Registers
ARM has 37 registers all of which are 32-bits long. 1 dedicated program counter
1 dedicated current program status register
5 dedicated saved program status registers
30 general purpose registers
The current processor mode governs which of several banks is
accessible. Each mode can access
a particular set of r0-r12registers
a particular r13(the stack pointer, sp) and r14(the link register,lr)
the program counter,r15 (pc)
the current program status register, cpsr
Privileged modes (except System) can also access
a particular spsr(saved program status register)
5/21/2018 ARM-final
29/103
29TM 2939v10 The ARM Architecture
Program Status Registers
Condition code flags
N =Negative result from ALU
Z = Zero result from ALU
C = ALU operation Carried out
V = ALU operation oVerflowed
Sticky Overflow flag - Q flag
Architecture 5TE/J only
Indicates if saturation has occurred
J bit
Architecture 5TEJ only
J = 1: Processor in Jazelle state
Interrupt Disable bits.
I = 1: Disables the IRQ.
F = 1: Disables the FIQ.
T Bit
Architecture xT only
T = 0: Processor in ARM state
T = 1: Processor in Thumb state
Mode bits Specify the processor mode
2731
N Z C V Q28 67
I F T mode1623 815 5 4 024
f s x c
U n d e f i n e dJ
5/21/2018 ARM-final
30/103
30TM 3039v10 The ARM Architecture
When the processor is executing in ARM state:
All instructions are 32 bits wide
All instructions must be word aligned
Therefore the pcvalue is stored in bits [31:2] with bits [1:0] undefined (as
instruction cannot be halfword or byte aligned).
When the processor is executing in Thumb state:
All instructions are 16 bits wide
All instructions must be halfword aligned
Therefore the pcvalue is stored in bits [31:1] with bit [0] undefined (asinstruction cannot be byte aligned).
When the processor is executing in Jazelle state:
All instructions are 8 bits wide
Processor performs a word access to read 4 instructions at once
Program Counter (r15)
5/21/2018 ARM-final
31/103
31TM 3139v10 The ARM Architecture
Vector Table
Exception Handling
When an exception occurs, the ARM: Copies CPSR into SPSR_
Sets appropriate CPSR bits
Change to ARM state
Change to exception mode
Disable interrupts (if appropriate)
Stores the return address in LR_
Sets PC to vector address
To return, exception handler needs to:
Restore CPSR from SPSR_
Restore PC from LR_
This can only be done in ARM state.
Vector table can be at
0xFFFF0000on ARM720Tand on ARM9/10 family devices
FIQIRQ
(Reserved)Data Abort
Prefetch Abort
Software Interrupt
Undefined Instruction
Reset
0x1C0x18
0x140x10
0x0C
0x08
0x04
0x00
5/21/2018 ARM-final
32/103
32TM 3239v10 The ARM Architecture
Instruction set
ARM has three instruction setsARM, Thumb, and Jazelle.
The register file contains 37 registers, but only 17 or 18 registers
are accessible at any point in time; the rest are banked
according to processor mode.
The current processor mode is stored in the CPSR
It holds the current status of the processor core as well as
interrupt masks, condition flags, and state.
The state determines which instruction set is being executed.
5/21/2018 ARM-final
33/103
33TM 3339v10 The ARM Architecture
ARM instructions can be made to execute conditionally by postfixing
them with the appropriate condition code field.
This improves code density andperformance by reducing the number of
forward branch instructions.CMP r3,#0 CMP r3,#0BEQ skip ADDNE r0,r1,r2ADD r0,r1,r2
skip
By default, data processing instructions do not affect the condition code
flags but the flags can be optionally set by using S. CMP does notneed S.
loopSUBS r1,r1,#1BNE loop
if Z flag clear then branch
decrement r1 and set flags
Conditional Execution and Flags
5/21/2018 ARM-final
34/103
34TM 3439v10 The ARM Architecture
Condition Codes
Not equal
Unsigned higher or sameUnsigned lower
Minus
Equal
Overflow
No overflow
Unsigned higher
Unsigned lower or same
Positive or Zero
Less than
Greater than
Less than or equal
Always
Greater or equal
EQNE
CS/HSCC/LO
PLVS
HI
LSGELTGTLEAL
MI
VC
Suffix Description
Z=0
C=1C=0
Z=1
Flags tested
N=1
N=0
V=1
V=0
C=1 & Z=0
C=0 or Z=1N=V
N!=V
Z=0 & N=V
Z=1 or N=!V
The possible condition codes are listed below: Note AL is the default and does not need to be specified
Examples of conditional
5/21/2018 ARM-final
35/103
35TM 3539v10 The ARM Architecture
Examples of conditional
execution
Use a sequence of several conditional instructions
if (a==0) func(1);CMP r0,#0MOVEQ r0,#1BLEQ func
Set the flags, then use various condition codes
if (a==0) x=0;if (a>0) x=1;
CMP r0,#0MOVEQ r1,#0MOVGT r1,#1
Use conditional compare instructionsif (a==4 || a==10) x=0;
CMP r0,#4CMPNE r0,#10MOVEQ r1,#0
5/21/2018 ARM-final
36/103
36TM 3639v10 The ARM Architecture
Branch : B{} label
Branch with Link : BL{} subroutine_label
The processor core shifts the offset field left by 2 positions, sign-extends
it and adds it to the PC
32 Mbyte range
How to perform longer branches?
2831 24 0
Cond 1 0 1 L Offset
Condition field
Link bit 0 = Branch1 = Branch with link
232527
Branch instructions
5/21/2018 ARM-final
37/103
37TM 3739v10 The ARM Architecture
Data processing Instructions
Consist of : Arithmetic: ADD ADC SUB SBC RSB RSC Logical: AND ORR EOR BIC Comparisons: CMP CMN TST TEQ Data movement: MOV MVN
These instructions only work on registers, NOT memory.
Syntax:
{}{S} Rd, Rn, Operand2
Comparisons set flags only - they do not specify Rd
Data movement does not specify Rn
Second operand is sent to the ALU via barrel shifter.
5/21/2018 ARM-final
38/103
38TM 3839v10 The ARM Architecture
The Barrel Shifter
DestinationCF 0 Destination CF
LSL : Logical Left Shift ASR: Arithmetic Right Shift
Multiplication by a power of 2 Division by a power of 2,preserving the sign bit
Destination CF...0 Destination CF
LSR : Logical Shift Right ROR: Rotate Right
Division by a power of 2 Bit rotate with wrap aroundfrom LSB to MSB
Destination
RRX: Rotate Right Extended
Single bit rotate with wrap aroundfrom CF to MSB
CF
Using the Barrel Shifter:
5/21/2018 ARM-final
39/103
39TM 3939v10 The ARM Architecture
Register, optionally with shift operation Shift value can be either be:
5 bit unsigned integer
Specified in bottom byte of another
register.
Used for multiplication by constant
Immediate value
8 bit number, with a range of 0-255.
Rotated right through even number of
positions
Allows increased range of 32-bitconstants to be loaded directly into
registers
Result
Operand1
BarrelShifter
Operand2
ALU
Using the Barrel Shifter:
The Second Operand
5/21/2018 ARM-final
40/103
40TM 4039v10 The ARM Architecture
No ARM instruction can contain a 32 bit immediate constant
All ARM instructions are fixed as 32 bits long
The data processing instruction format has 12 bits available for operand2
4 bit rotate value (0-15) is multiplied by two to give range 0-30 in steps of 2
Rule to remember is 8-bits shifted by an even number of bit positions.
0711 8
immed_8
Shifter
ROR
rot
x2
Quick Quiz:
0xe3a004ffMOV r0, #???
Immediate constants (1)
5/21/2018 ARM-final
41/103
41TM 4139v10 The ARM Architecture
Examples:
The assembler converts immediate values to the rotate form: MOV r0,#4096 ; uses 0x40 ror 26 ADD r1,r2,#0xFF0000 ; uses 0xFF ror 16
The bitwise complements can also be formed using MVN: MOV r0, #0xFFFFFFFF ; assembles to MVN r0,#0
Values that cannot be generated in this way will cause an error.
031
ror #0
range 0-0xff000000 step 0x01000000ror #8
range 0-0x000000ff step 0x00000001
range 0-0x000003fc step 0x00000004ror #30
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Immediate constants (2)
5/21/2018 ARM-final
42/103
42TM 4239v10 The ARM Architecture
To allow larger constants to be loaded, the assembler offers a pseudo-instruction: LDR rd, =const
This will either: Produce aMOVorMVNinstruction to generate the value (if possible).
or Generate a LDRinstruction with a PC-relative address to read the constant
from a literal pool(Constant data area embedded in the code).
For example LDR r0,=0xFF => MOV r0,#0xFF LDR r0,=0x55555555 => LDR r0,[PC,#Imm12]
DCD 0x55555555
This is the recommended way of loading constants into a register
Loading 32 bit constants
5/21/2018 ARM-final
43/103
43TM 4339v10 The ARM Architecture
Multiply
Syntax:
MUL{}{S} Rd, Rm, Rs Rd = Rm * Rs MLA{}{S} Rd,Rm,Rs,Rn Rd = (Rm * Rs) + Rn
[U|S]MULL{}{S} RdLo, RdHi, Rm, Rs RdHi,RdLo := Rm*Rs
[U|S]MLAL{}{S} RdLo, RdHi, Rm, Rs RdHi,RdLo := (Rm*Rs)+RdHi,RdLo
Cycle time
Basic MUL instruction
2-5 cycles on ARM7TDMI
1-3 cycles on StrongARM/XScale
2 cycles on ARM9E/ARM102xE
+1 cycle for ARM9TDMI (over ARM7TDMI)
+1 cycle for accumulate (not on 9E though result delay is one cycle longer)
+1 cycle for long
Above are general rules - refer to the TRM for the core you are using
for the exact details
S f
5/21/2018 ARM-final
44/103
44TM 4439v10 The ARM Architecture
Single register data transfer
LDR STR WordLDRB STRB Byte
LDRH STRH HalfwordLDRSB Signed byte load
LDRSH Signed halfword load
Memory system must support all access sizes
Syntax:
LDR
{}{} Rd,
STR{}{} Rd,
e.g. LDREQB
Add d
5/21/2018 ARM-final
45/103
45TM 4539v10 The ARM Architecture
Address accessed
Address accessed by LDR/STR is specified by a base register plus an
offset
For word and unsigned byte accesses, offset can be An unsigned 12-bit immediate value (ie 0 - 4095 bytes).
LDR r0,[r1,#8]
A register, optionally shifted by an immediate valueLDR r0,[r1,r2]LDR r0,[r1,r2,LSL#2]
This can be either added or subtracted from the base register:LDR r0,[r1,#-8]LDR r0,[r1,-r2]LDR r0,[r1,-r2,LSL#2]
For halfword and signed halfword / byte, offset can be: An unsigned 8 bit immediate value (ie 0-255 bytes).
A register (unshifted).
Choice of pre-indexedor post- indexedaddressing
P P t I d d Add i ?
5/21/2018 ARM-final
46/103
46TM 4639v10 The ARM Architecture
0x5
0x5
r1
0x200Base
Register 0x200
r00x5
SourceRegisterfor STR
Offset12 0x20c
r1
0x200
OriginalBase
Register0x200
r0
0x5Source
Registerfor STR
Offset
12 0x20c
r1
0x20c
UpdatedBase
Register
Auto-update form:STR r0,[r1,#12]!
Pre or Post Indexed Addressing?
Pre-indexed: STR r0,[r1,#12]
Post-indexed: STR r0,[r1],#12
LDM / STM ti
5/21/2018 ARM-final
47/103
47TM 4739v10 The ARM Architecture
LDM / STM operation
Syntax:
{} Rb{!},
4 addressing modes:
LDMIA/ STMIA increment after
LDMIB/ STMIB increment before
LDMDA/ STMDA decrement after
LDMDB/ STMDB decrement beforeIA
r1 Increasing
Address
r4
r0
r1
r4
r0
r1
r4
r0 r1
r4
r0
r10
IB DA DB
LDMxx r10, {r0,r1,r4}STMxx r10, {r0,r1,r4}
Base Register (Rb)
S ft I t t (SWI)
5/21/2018 ARM-final
48/103
48TM 4839v10 The ARM Architecture
Software Interrupt (SWI)
Causes an exception trap to the SWI hardware vector The SWI handler can examine the SWI number to decide what operation
has been requested.
By using the SWI mechanism, an operating system can implement a set
of privileged operations which applications running in user mode can
request. Syntax:
SWI{}
2831 2427 0
Cond 1 1 1 1 SWI number (ignored by processor)
23
Condition Field
PSR T f I t ti
5/21/2018 ARM-final
49/103
49TM 4939v10 The ARM Architecture
PSR Transfer Instructions
MRS and MSR allow contents of CPSR / SPSR to be transferred to / from
a general purpose register.
Syntax:
MRS{} Rd, ; Rd = MSR{} ,Rm ; = Rm
where = CPSR or SPSR
[_fields] = any combination of fsxc
Also an immediate form MSR{} ,#Immediate
In User Mode, all bits can be read but only the condition flags (_f) can be
written.
2731
N Z C V Q28 67
I F T mode1623 815 5 4 024
f s x c
U n d e f i n e dJ
ARM B h d S b ti
5/21/2018 ARM-final
50/103
50TM 5039v10 The ARM Architecture
ARM Branches and Subroutines
B
PC relative. 32 Mbyte range.
BL
Stores return address in LR
Returning implemented by restoring the PC from LR
For non-leaf functions, LR will have to be stacked
STMFDsp!,{regs,lr}
:
BL func2
:
LDMFDsp!,{regs,pc}
func1 func2
:
:
BL func1
:
:
:
:
:
::
MOV pc, lr
Th b
5/21/2018 ARM-final
51/103
51TM 5139v10 The ARM Architecture
Thumb
Thumb is a 16-bit instruction set
Optimised for code density from C code (~65% of ARM code size)
Improved performance from narrow memory
Subset of the functionality of the ARM instruction set
Core has additional execution state - Thumb
Switch between ARM and Thumb using BXinstruction
015
31 0
ADDS r2,r2,#1
ADD r2,#1
32-bit ARM Instruction
16-bit Thumb Instruction
For most instructions generated by compiler:
Conditional execution is not used
Source and destination registers identical
Only Low registers used Constants are of limited size
Inline barrel shifter not used
A d
5/21/2018 ARM-final
52/103
52TM 5239v10 The ARM Architecture
Agenda
Introduction
Programmers Model
Instruction Sets
System Design
Development Tools
E l ARM b d S t
5/21/2018 ARM-final
53/103
53TM 5339v10 The ARM Architecture
Example ARM-based System
16 bit RAM
8 bit ROM
32 bit RAM
ARMCore
I/OPeripherals
Interrupt
ControllernFIQnIRQ
ARM Based microcontroller
5/21/2018 ARM-final
54/103
54TM 5439v10 The ARM Architecture
ARM Based microcontroller
A Basic ARM MEMORY SYSTEM
5/21/2018 ARM-final
55/103
55TM 5539v10 The ARM Architecture
A Basic ARM MEMORY SYSTEM.
AMBA
5/21/2018 ARM-final
56/103
56TM 5639v10 The ARM Architecture
AMBA
Bridge
Timer
On-chipRAM
ARM
InterruptController
Remap/Pause
TIC
Arbiter
Bus InterfaceExternalROM
ExternalRAM
Reset
System Bus Peripheral Bus
AMBA
Advanced Microcontroller Bus
Architecture
ADK
Complete AMBA Design Kit
ACT
AMBA Compliance Testbench
PrimeCell
ARMs AMBA compliant peripherals
AHB or ASB APB
ExternalBus
Interface
Decoder
System Design Hardware
5/21/2018 ARM-final
57/103
57TM 5739v10 The ARM Architecture
System Design-Hardware
An embedded system includes the following hardwarecomponents:
ARMprocessors are found embedded in chips.
Programmers accessperipherals through memory-mappedregisters.
There is a special type of peripheral called a controller,
which embedded systems use to configure higher-level
functions such as memory and interrupts. The AMBA on-chip bus is used to connect the processor
and peripherals together.
System Design Software
5/21/2018 ARM-final
58/103
58TM 5839v10 The ARM Architecture
System Design-Software
An embedded system also includes the following softwarecomponents:
Init ial ization co deconfigures the hardware to a known state.
Once configured, operat ing sys tems can be loaded and executed.
Operating systems provide a common programming environment for
the use of hardware resources and infrastructure.
Device d r iversprovide a standard interface to peripherals.
An appl icat ion Program performs the task-specific duties of anembedded system.
Agenda
5/21/2018 ARM-final
59/103
59TM 5939v10 The ARM Architecture
Agenda
Introduction
Programmers Model
Instruction Sets
System Design
Development Tools
The RealView Product Families
5/21/2018 ARM-final
60/103
60TM 6039v10 The ARM Architecture
The RealView Product Families
Debug ToolsAXD (part of ADS)
Trace Debug Tools
Multi-ICE
Multi-Trace
PlatformsARMulator (part of ADS)
Integrator Family
Compilation ToolsARM Developer Suite (ADS)
Compilers (C/C++ ARM & Thumb),
Linker & Utilities
RealView Compilation Tools (RVCT) RealView Debugger (RVD)
RealView ICE (RVI)
RealView Trace (RVT)
RealView ARMulator ISS (RVISS)
ARM Debug Architecture
5/21/2018 ARM-final
61/103
61TM 6139v10 The ARM Architecture
ARM Debug Architecture
ARM
core
ETM
TAP
controller
Trace PortJTAG port
Ethernet
Debugger (+ optional
trace tools)
EmbeddedICE Logic
Provides breakpoints and processor/system
access
JTAG interface (ICE)
Converts debugger commands to JTAG
signals
Embedded trace Macrocell (ETM) Compresses real-time instruction and data
access trace
Contains ICE features (trigger & filter logic)
Trace port analyzer (TPA)
Captures trace in a deep buffer
EmbeddedICE
Logic
5/21/2018 ARM-final
62/103
Thumb instruction set
5/21/2018 ARM-final
63/103
63TM 6339v10 The ARM Architecture
Thumb instruction set
Thumb instruction set encodes a subset of the 32-bit ARM
instructions into a 16-bit instruction set space. So it has
higher code density: 30% less memory
Since Thumb has higher performance than ARM on a
processor with a 16-bit data bus, but
lower performance than ARM on a 32-bit data bus, use Thumb for memory-constrained systems.
Thumb Instruction set
5/21/2018 ARM-final
64/103
64TM 6439v10 The ARM Architecture
Thumb Instruction set
5/21/2018 ARM-final
65/103
65TM 6539v10 The ARM Architecture
Thumb Instruction set
Thumb register usage
5/21/2018 ARM-final
66/103
66TM 6639v10 The ARM Architecture
Thumb register usage
Code Density
5/21/2018 ARM-final
67/103
67TM 6739v10 The ARM Architecture
Code Density
Thumb instruction decoding
5/21/2018 ARM-final
68/103
68TM 6839v10 The ARM Architecture
Thumb instruction decoding
Thumb Instructions limitations
5/21/2018 ARM-final
69/103
69TM 6939v10 The ARM Architecture
Thumb Instruction s limitations
Only the branch relative instruction can be conditionally executed.
The limited space available in 16 bits causes the barrel shift
operationsASR, LSL, LSR, and ROR to be separate instructions
in the Thumb ISA.
there is no direct access to the CPSR or SPSR. So there are no
MSR- and MRS-equivalent Thumb instructions.
To alter the CPSR or SPSR, you must switch into ARM state to use
MSR and MRS.
Similarly, there are no coprocessor instructions in Thumb state.
You need to be in ARM state to access the coprocessor for
configuring cache and memory management.
ARM Thumb internetworking
5/21/2018 ARM-final
70/103
70TM 7039v10 The ARM Architecture
ARM Thumb internetworking
The method of linking ARM and Thumb code together for both
assembly and C/C++.
It handles the transition between the two states. Extra code, called a
veneer, is sometimes needed to carry out the transition.
ATPCS defines the ARM and Thumb procedure call standards.
To call a Thumb routine from an ARM routine, the core has to change
state of T bit of the CPSR.
The BX and BLX branch instructions cause a switch between ARM and
Thumb state while branching to a routine
MIX Mode Programming
5/21/2018 ARM-final
71/103
71TM 7139v10 The ARM Architecture
branch instructions
There are two versions of the BX or BLX instructions: an ARM
instruction and a Thumb equivalent.
The ARM BX instruction enters Thumb state only if bit 0 of the address
in Rn is set to binary 1; otherwise it enters ARM state.
The Thumb BX instruction does the same.
Syntax: BX Rm BLX Rm | label
Unlike the ARM version, the Thumb BX instruction cannot be
conditionally executed.
The conditional branch instruction is the only conditionally executed
instruction in Thumb state.
B branch BL branch with link
l r =(instruction address after the BL) + 1
MIX Mode Programming
5/21/2018 ARM-final
72/103
72TM 7239v10 The ARM Architecture
MIX Mode Programming
The Thumb data processing instructions are a subset of the ARM
data processing instructions.
Most Thumb data processing instructions operate on low registers
and update the cpsr. The exceptions are
MOV Rd,Rn ADD Rd,Rm CMP Rn,Rm
ADD sp, #immediate SUB sp, #immediate
ADD Rd,sp,#immediate ADD Rd,pc,#immediate
which can operate on the higher registers r8r14 and the pc.
These instructions, except for CMP, do not update the condition flags
in the cpsr when using the higher registers.
The CMP instruction, however, always updates the cpsr.
Single Register Load StoreI t ti
5/21/2018 ARM-final
73/103
73TM 7339v10 The ARM Architecture
Instructions
The Thumb instruction set supports load and storing registers, or LDR and
STR.
These instructions use two pre-indexed addressing modes: offset by register
and offset by immediate.
Load/store register [Rn, Rm]
Base register + offset [Rn, #immediate]
Relative [pc|sp, #immediate]
The offset by register uses a base register Rn + the register offset Rm.
The second uses the same base register Rn + a 5-bit immediate or a valuedependent on the data size.
The 5-bit offset encoded in the instruction is multiplied by one for byte
accesses, two for 16-bit accesses, and four for 32-bit accesses.
Multiple Register Load-Store I
5/21/2018 ARM-final
74/103
74TM 7439v10 The ARM Architecture
Multiple Register Load Store I
The Thumb versions of the load-store multiple instructions are reduced
forms of the ARM load-store multiple instructions. They only support the increment after (IA) addressing mode.
Syntax : IA Rn!, {low Register list}
LDMIA load multiple registers
{Rd}*N mem32[Rn + 4 N], Rn = Rn + 4 N
Here N is the number of registers in the list of registers. these instructions always update the base register Rn after execution.
The base register and list of registers are limited to the low registers r0
to r7.
MIX Mode Programmingt k i t ti
5/21/2018 ARM-final
75/103
75TM 7539v10 The ARM Architecture
stack instructions The Thumb stack operations are different from the equivalent ARM
instructions because they use the more traditional POP and PUSH
concept.
Syntax: POP {low_register_list{, pc}}
PUSH {low_register_list{, lr}}
POP pop registers from the stacks RdN mem32[sp+4 N], sp =
sp4 N
MIX Mode Programming
5/21/2018 ARM-final
76/103
76TM 7639v10 The ARM Architecture
MIX Mode Programming
No stack pointer in the instruction because the stack pointer
is fixed as register r13 in Thumb operations and sp isautomatically updated.
The list of registers is limited to the low registers r0 to r7.
The PUSH register list also can include the link register lr.
similarly the POP register list can include the pc.
This provides support for subroutine entry and exit.
The stack instructions only support full descending stackoperations.
MIX Mode Programmingft i t t
5/21/2018 ARM-final
77/103
77TM 7739v10 The ARM Architecture
software interrupt.
Similar to the ARM equivalent, the Thumb software interrupt (SWI)
instruction causes a software interrupt exception.
If any interrupt or exception flag is raised in Thumb state, the
processor automatically reverts back to ARM state to handle the
exception.
Syntax: SWI immediate
The Thumb SWI instruction has the same effect and nearly the same
syntax as the ARM equivalent.
It differs in that the SWI number is limited to the range 0 to 255 and it
is not conditionally executed.
ARM7 Family
5/21/2018 ARM-final
78/103
78TM 7839v10 The ARM Architecture
y
One significant variation in the ARM7 family is the ARM7TDMI-S.synthesizable.
ARM720Tincludes an MMU being capable of handling the Linux and
Microsoft embedded platform operating systems.
The processor also includes a unified 8K cache. The vector table canbe relocated to a higher address by setting a coprocessor 15 register.
Another variation is the ARM7EJ-S processor, also synthesizable,
provides both Java acceleration and the enhanced instructions but
without any memory protection ARM7EJ-S is quite differentsince it includes a five-stage pipeline
and executes ARMv5TEJ instructions.
ARM PROCESSOR VARIANTS
5/21/2018 ARM-final
79/103
79TM 7939v10 The ARM Architecture
ARM7,ARM9,ARM11
5/21/2018 ARM-final
80/103
80TM 8039v10 The ARM Architecture
, ,
ARM7, ARM9, ARM10 and ARM11 cores are directly dependent upon
the type and geometry of the manufacturing process, which has a direct effect
on the frequency (MHz) and powerconsumption (watts).
An ARM processor is an implementation of a specific instruction set
architecture (ISA).
The ISA has been continuously improved from the first ARM processor design.
Processors are grouped into implementation families (ARM7, ARM9, ARM10,
and ARM11) with similar characteristics.
ARM9 Family
5/21/2018 ARM-final
81/103
81TM 8139v10 The ARM Architecture
y
The ARM9family was announced in 1997,with five-stage
pipeline, can run at higher clock frequencies than the ARM7.
The extra stages improve the overall performance of the
processor.
The memory system has been redesigned to follow the Harvard
architecture, which separates the data D and instruct ion I
buses.
The first processor in the ARM9 family was the ARM920T, which
includes a separate D + I cache and an MMU for virtual memory
support.
ARM922T is a variation on the ARM920T but with half the D +I
cache size.
ARM9 Family
5/21/2018 ARM-final
82/103
82TM 8239v10 The ARM Architecture
y
The ARM940Tincludes a smaller D +I cache and an MPU designed
for applications that do not require a platform operating system.
Both ARM920T and ARM940T execute the architecture v4T
instructions.
The next processors are based on the ARM9E-S core, a synthesizable
version of the ARM9 core with the E extensions.
There are two variations: the ARM946E-Sand the ARM966E-S. Both
execute architecture v5TEinstructions.
They also support the optional embedded trace macro -cel l (ETM),
which allows a developer to trace instruction and data execution in realtime on the processor.
This is important when debugging applications with time-critical
segments.
ARM9 Family
5/21/2018 ARM-final
83/103
83TM 8339v10 The ARM Architecture
y
The ARM946E-Sincludes TCM, cache, and an MPU. The sizes of the TCM
and caches are configurable.
This processor is designed for use in embedded control applications that
require deterministic real-time response.
In contrast, the ARM966Edoes not have the MPU and cache extensions but
does have configurable TCMs. The latest core in the ARM9 product line is the ARM926EJ-S synthesizable
processor core, announced in 2000 the first ARM processor core to include
the Jazelletechnology, which accelerates Java byte-code execution.
It is designed for use in small portable Java-enabled devices such as 3G
phones and personal digital assistants (PDAs).
It features an MMU, configurable TCMs, and D +I caches with zero or
nonzero wait state memories.
ARM10 Family
5/21/2018 ARM-final
84/103
84TM 8439v10 The ARM Architecture
y
The ARM10, announced in 1999, was designed for
performance.
It extends the ARM9 pipeline to six stages.
It also supports an optional vector floating-point (VFP) unit,which
adds a seventh stage to the ARM10 pipeline. The VFP
significantly increases
floating-point performance and is compliant with the IEEE
754.1985 floating-point standard.
ARM10 Family
5/21/2018 ARM-final
85/103
85TM 8539v10 The ARM Architecture
y
The ARM1020Eis the first processor to use an ARM10Ecore.
Like the ARM9E,
It includes the enhanced E instructions. It has separate 32K D
+ I caches, opt ion al vector floating-point unit, and an MMU.
The ARM1020E also has a dual 64-bit bus interface for
increased performance.
ARM1026EJ-S is very similar to the ARM926EJ-S but with both
MPU and MMU. This processor has the performance of the ARM10 with the
flexibility of an ARM926EJ-S.
ARM11 Family
5/21/2018 ARM-final
86/103
86TM 8639v10 The ARM Architecture
y
The ARM1136J-S, announced in 2003, was designed for highperformance and power-efficient applications.
It was the first processor implementation to execute architecture
ARMv6 instructions.
It incorporates an eight-stage pipeline with separate load-storeand arithmetic pipelines.
Included in the ARMv6 instructions are single instruction multiple
data (SIMD) extensions for media processing, specifically
designed to increase video processing performance.
The ARM1136JF-Sis an ARM1136J-S with the addition of the
vector floating-point unit for fast floating-point operations.
Enhanced DSP features
5/21/2018 ARM-final
87/103
87TM 8739v10 The ARM Architecture
Processing digitized signals requires high memory bandwidths and fast
multiply accumulate operations.
A single-core design can reduce cost and power consumption over a
two-core solution.
DSP applications are typically multiply and load-store intensive.
A basic operation is a multiply accumulate multiplying two 16-bitsigned numbers and accumulating onto a 32-bit signed
accumulator.
The ARMv5TE extensions available in the ARM9E and later cores
provide efficient multiply accumulate operations.
With careful coding, the ARM9E processor will perform decently on
the DSP parts of an application while
outperforming a DSPon the control parts of the application.
Generations suitable for DSPapplications
5/21/2018 ARM-final
88/103
88TM 8839v10 The ARM Architecture
applications.
DSP Algorithms Characteristics
5/21/2018 ARM-final
89/103
89TM 8939v10 The ARM Architecture
Due to their high data bandwidth and performance requirements,
we have to code DSP algorithms in hand-written assembly.
We need fine control of register allocation and instruction
scheduling to achieve the best performance.
Filtering is probably the most commonly used signal processing
operation. It can be used to remove noise, to analyze signals, or in
signal compression.
Another very common algorithm is the Discrete Fourier Transform
(DFT), which converts a signal from a time representation to a
frequency representation or vice versa.
How to represent a signal on theARM
5/21/2018 ARM-final
90/103
90TM 9039v10 The ARM Architecture
ARM
Use a floating-point representation for prototyping algorithms. Do notuse floating point in applications where speed is critical. Most ARM
implementations do not include hardware floating-point support.
Use a fixed-point representation for DSP applications where speed
is critical with moderate dynamic range. The ARM cores provide good
support for 8-, 16- and 32-bit fixed-point DSP.
For applications requiring speed and high dynamic range, use a block-
floating or logarithmic representation.
The key idea is to use block algorithms that calculate several results
at once, and thus require less memory bandwidth, increaseperformance and decrease power consumption compared with
calculating single results.
5/21/2018 ARM-final
91/103
91TM 9139v10 The ARM Architecture
Figure shows a sine wave signal digitized at the sampling points 0, 1, 2,
3, and so on.
Dynamic Range &Accuracy
5/21/2018 ARM-final
92/103
92TM 9239v10 The ARM Architecture
There are two things to worry about when choosing a representation ofx[t
]:
1. The dynamic range of the signalthe maximum fluctuation in the
signal defined by Equation-A.
For a signed signal we are interested in the maximum absolute value M
possible. For this example, letstake M = 1 volt.
M = max|x[t ]| over all t = 0, 1, 2, 3 . . (A)
2. The accuracy required in the representation- sometimes given as
a proportion of the maximum range.
For example, an accuracy of 100 parts per million means that each x[t ]
needs to be represented within an error of
E = M 0. 0001 = 0. 0001 volts
Suitable Representation
5/21/2018 ARM-final
93/103
93TM 9339v10 The ARM Architecture
We could use a floating-point representation for x[t ].
1)This would certainly meet our dynamic range and accuracy
constraints, and
2) it would also be easy to manipulate using the C type float.
However, most ARM cores do not support floating point in
hardware, and so a floating-point representation would be very slow.
A better choice for fast code is a fixed-point representation.
A fixed-point representation uses an integer to represent a fractional
value by scaling the fraction.
Error Vs Accuracy
5/21/2018 ARM-final
94/103
94TM 9439v10 The ARM Architecture
A common error is to think that floating point is more accurate than fixed point. Thisis false!
For the same number of bits, a fixed-point representation gives greater accuracy.
The floating-point representation gives higher dynamic range at the expense of
lower absolute accuracy.
For example, if you use a 32-bit integer to hold a fixed-point value scaled to full
range, then the maximum error in a representation is 232. However, single-
precision 32-bit floating-point values give a relative error of 224.
The single-precision floating-point mantissa is 24 bits. The leading 1 of the mantissa
is not stored, so 23 bits of storage are actually used. For values near the maximum,
the fixed-point representation is 23224 = 256 times more accurate!
The 8-bit floating-point exponent is of little use when you are interested in maximum
error rather than relative accuracy.
Better representation
5/21/2018 ARM-final
95/103
95TM 9539v10 The ARM Architecture
To summarize, a fixed-point representation is best when there is aclear bound to the strength of the signal and when maximum error is
important.
When there is no clear bound and you require a large dynamic range,then floating point is better.
You can also use the other representations, which give more dynamic
range than fixed point while still being more efficient to implement than
floating point.
General rules on writing DSPalgorithms for the ARM
5/21/2018 ARM-final
96/103
96TM 9639v10 The ARM Architecture
algorithms for the ARM.
ARM does not provide operations that saturate automatically. Design the DSP
algorithm so that saturation is not required because saturation will cost extracycles.
ARM supports extended-precision 32-bit multiplied by 32-bit to 64-bit
operations very well. Use extended-precision arithmetic or additional scaling
rather than saturation.
The ARM core is not a dedicated DSP. There is no single instruction thatissues a multiply accumulate and data fetch in parallel. However, by reusing
loaded data you can achieve a respectable DSP performance.
Design the DSP algorithm to minimize loads and stores. Once you load a data
item, then perform as many operations that use the datum as possible. You
can often do this by calculating several output results at once. Another way of increasing reuse is to concatenate several operations. For
example, you could perform a dot product and signal scale at the same time,
while only loading the data once.
Guidelines for writing DSP code.
5/21/2018 ARM-final
97/103
97TM 9739v10 The ARM Architecture
FromARM9onwards,ARMimplementations use a multistage execute pipeline
for loads and multiplies, which introduces potential processor interlocks. If you load a value and then use it in either of the following two instructions, the
processor may stall for a number of cycles waiting for the loaded value to
arrive.
Similarly if you use the result of a multiply in the following instruction, this may
cause stall cycles. It is particularly important to schedule code to avoid thesestalls.
Write ARM assembly to avoid processor interlocks. The results of load and
multiply instructions are often not available to the next instruction without
adding stall cycles.
Sometimes the results will not be available for several cycles
There are 14 registers available for general use on the ARM, r0 to r12 and r14.
Design the DSP algorithm so that the inner loop will require 14 registers or
fewer.
An example- a DOT Product
5/21/2018 ARM-final
98/103
98TM 9839v10 The ARM Architecture
A dot-product is one of the simplest DSP operations and highlights the
difference among different ARM implementations.
A dot-product combines N samples from two signals x(t) and c(t) to
produce a correlation value a: a= Ci * Xi
The C interface to the dot-product function is
int dot_product(sample *x, coefficient *c, unsigned int N); where
sample is the type to hold a 16-bit audio sample, usually a short
coefficient is the type to hold a 16-bit coefficient, usually a short
x[i] and c[i] are two arrays of length N (the data and coefficients)
the function returns the accumulated 32-bit integer dot product a
DSP rating of ARM7TDMI
5/21/2018 ARM-final
99/103
99TM 9939v10 The ARM Architecture
This example shows a 16-bit dot-product optimized for the ARM7TDMI.Each MLA takes a worst case of four cycles.
We store the 16-bit input samples in 32-bit words so that we can use the
LDM instruction to load them efficiently.
This code assumes that the number of samples N is a multiple of five.
Therefore we can use a five-word load multiple to increase data bandwidth.
The cost per load is 7/4 = 1.4 cycles compared to 3 cycles per load if we had
used LDR or LDRSH.
The inner loop requires a worst case of 7 + 7 + 5 4 + 1 + 3 = 38 cycles to
process each block of 5 products from the sum.
This gives the ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tapfor a
16-bit dot-product.
Assembly code for DOT Product
5/21/2018 ARM-final
100/103
100TM 10039v10 The ARM Architecture
x RN 0 ; input array x[]
c RN 1 ; input array c[]
N RN 2 ; number of samples (a multiple
of 5)
acc RN 3 ; accumulator
x_0 RN 4 ; elements from array x[]
x_1 RN 5
x_2 RN 6
x_3 RN 7
x_4 RN 8
c_0 RN 9 ; elements from array c[]
c_1 RN 10
c_2 RN 11
c_3 RN 12
c_4 RN 14
; int dot_16by16_arm7m(int *x, int *c, unsigned N)
dot_16by16_arm7m
STMFD sp!, {r4-r11, lr}
MOV acc, #0
loop_7m ; accumulate 5 products
LDMIA x!, {x_0, x_1, x_2, x_3, x_4}
LDMIA c!, {c_0, c_1, c_2, c_3, c_4}
MLA acc, x_0, c_0, acc
MLA acc, x_1, c_1, acc
MLA acc, x_2, c_2, acc
MLA acc, x_3, c_3, acc
MLA acc, x_4, c_4, acc
SUBS N, N, #5
BGT loop_7m
MOV r0, acc
LDMFD sp!, {r4-r11, pc}
DSP Rating -16bit DOT Product
5/21/2018 ARM-final
101/103
101TM 10139v10 The ARM Architecture
ARM7TDMI a DSP rating of 38/5 = 7.6 cycles per tap
ARM9TDMI-The inner loop requires 28 cycles per tap, giving 28/4 = 7
cycles per tap.
STRONGARM-The inner loop uses 19 cycles to process 4 taps, giving
a rating of 19/4 = 4.75 cycles per tap.
ARM9E-The inner loop requires 20 cycles to accumulate 8 products, a
rating of 20/8 = 2.5 cycles per tap.
ARM10E-The inner loop requires 25 cycles to process 10 samples, or
2.5 cycles per tap.
Intel- XSCALE-The inner loop requires 14 cycles to accumulate 8
products, a rating of 1.75 cycles per tap.
Performance improvement in DSP
5/21/2018 ARM-final
102/103
102TM 10239v10 The ARM Architecture
The block
filter
algorithm
gives a
much betterperformance
per tap if you
are
calculating
multipleproducts.
5/21/2018 ARM-final
103/103
THANK YOU