
Arm processors' architecture


Dr.Y.Narasimha Murthy Ph.D [email protected]


ARM Processors - Architecture

INTRODUCTION:

The ARM processor was originally developed at Acorn Computers Limited of Cambridge, England, between 1983 and 1985. It was the first RISC microprocessor developed for commercial use and has some significant differences from subsequent RISC architectures. In 1990 ARM Limited was established as a separate company specifically to widen the exploitation of ARM technology, and it has since established itself as a market leader for low-power and cost-sensitive embedded applications.

The basic reason behind the origin of the ARM processor was that the 16-bit CISC microprocessors available in 1983 were slower than standard memory parts. They also had instructions that took many clock cycles to complete (in some cases, many hundreds of clock cycles), resulting in very long interrupt latencies. Because of these limitations in the commercial microprocessor offerings, the design of a proprietary microprocessor was considered, and hence the ARM chip emerged.

In fact, ARM does not manufacture microprocessors. It is an IP (intellectual property) company that designs processor cores and licenses them to other companies for fabrication; for example, ARM microprocessors are manufactured by Intel, Texas Instruments, Samsung and many other fab companies.

The ARM processor is supported by a toolkit which includes an instruction set emulator for

hardware modeling and software testing and benchmarking, an assembler, C and C++ compilers,

a linker and a symbolic debugger.

So ARM is not a fab company; it only grants licenses to companies that want to manufacture ARM-based CPUs or System-on-Chip products. The two main types of license offered by ARM are the implementation license and the architecture license. The implementation license provides the complete information required to design and manufacture integrated circuits containing an ARM processor core. Under this license ARM delivers two types of core: the soft core and the hard core. A hard core is optimized for a specific manufacturing process, while a soft core can be used in any process but is less optimized.

The architecture license enables the licensee to develop its own processors compliant with the ARM ISA.

Unique features of ARM Processors.


Because of certain unique features, ARM has today become one of the most popular embedded architectures.

(i). ARM cores are simple compared to most other general-purpose processors, i.e. they can be manufactured using comparatively few transistors, leaving plenty of space on the chip for application-specific macro cells.

(ii). A typical ARM chip can contain several peripheral controllers, a digital signal processor, and some amount of on-chip memory along with an ARM core. Both the ARM ISA and the pipeline design are aimed at minimizing energy consumption, which is a critical requirement in mobile embedded systems.

(iii). The ARM architecture is highly modular, i.e. the only mandatory component of an ARM processor is the integer pipeline. All other components, including caches, MMU, floating point and other co-processors, are optional, which gives a lot of flexibility in building application-specific ARM-based processors.

Also, while being small and low-power, ARM processors provide high performance for embedded applications. For example, the PXA255 XScale processor running at 400 MHz provides performance comparable to a Pentium II at 300 MHz while using 50 times less energy.

ARM is basically a RISC architecture that incorporated a number of features from the Berkeley RISC design, while rejecting a number of others.

The RISC features used were:

• A load-store architecture.
• Fixed-length 32-bit instructions: every instruction is exactly 32 bits long.
• All arithmetic and logic instructions operate on operands held in processor registers (3-address instructions: two source operand registers and one destination register, all independently specified. Ex: ADD r0, r1, r2).

The RISC features that were rejected are register windows, delayed branches and single-cycle execution of all instructions.

The main problem with register windows is the large chip area occupied by the large number of registers. This feature was therefore rejected on cost grounds.

The problem with delayed branches is that they remove the atomicity of individual instructions. They work well on single-issue pipelined processors, but they do not scale well to super-scalar implementations and can interact badly with branch prediction mechanisms.

Although the ARM executes most data processing instructions in a single clock cycle, many

other instructions take multiple clock cycles. Instead of single-cycle execution of all instructions,

the ARM was designed to use the minimum number of cycles required for memory accesses.

Where this was greater than one, the extra cycles were used, where possible, to do something

useful, such as support auto-indexing addressing modes. This reduces the total number of ARM

instructions required to perform any sequence of operations, improving performance and code

density.

ARM's architecture is compatible with all four major platform operating systems: Symbian OS, Palm OS, Windows CE, and Linux.

Special ISA (Instruction Set Architecture) features of ARM

ARM has certain interesting features which are not found in other processors.

(i). Conditional execution of instructions: All instructions are conditionally executed, i.e. an instruction is executed only if the current values of the condition code flags satisfy the condition specified in the instruction.

Ex: ADDNE r1, r2, r3 adds the registers r2 and r3 and keeps the result in the register r1, but only if the NE (not equal) condition holds, i.e. the Z flag is clear, typically because a preceding CMP found its operands unequal. If the condition is not satisfied, the instruction acts as a NOP.

This feature was chosen because it could maintain high performance while reducing hardware complexity, since it avoids introducing pipeline bubbles and compensates for the lack of a branch predictor.

The condition is specified in a 4-bit field of the instruction and is tested against the N (Negative), Z (Zero), C (Carry) and V (oVerflow) flags in the Current Program Status Register (CPSR).
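To make the flag test concrete, here is a minimal Python sketch of how a condition suffix could be checked against the N, Z, C and V flags. It is only a model for study: the names CONDITIONS and condition_passed are made up for this illustration, and the table follows the standard meaning of the ARM condition mnemonics.

```python
# Flag test performed by each ARM condition mnemonic.
# Every entry receives the N, Z, C and V flags as booleans.
CONDITIONS = {
    "EQ": lambda n, z, c, v: z,                  # equal
    "NE": lambda n, z, c, v: not z,              # not equal
    "CS": lambda n, z, c, v: c,                  # carry set
    "CC": lambda n, z, c, v: not c,              # carry clear
    "MI": lambda n, z, c, v: n,                  # negative
    "PL": lambda n, z, c, v: not n,              # positive or zero
    "VS": lambda n, z, c, v: v,                  # overflow
    "VC": lambda n, z, c, v: not v,              # no overflow
    "HI": lambda n, z, c, v: c and not z,        # unsigned higher
    "LS": lambda n, z, c, v: (not c) or z,       # unsigned lower or same
    "GE": lambda n, z, c, v: n == v,             # signed >=
    "LT": lambda n, z, c, v: n != v,             # signed <
    "GT": lambda n, z, c, v: (not z) and n == v, # signed >
    "LE": lambda n, z, c, v: z or n != v,        # signed <=
    "AL": lambda n, z, c, v: True,               # always
}


def condition_passed(cond, n, z, c, v):
    """Return True if an instruction carrying suffix `cond` would execute."""
    return bool(CONDITIONS[cond](n, z, c, v))


# ADDNE executes only while the Z flag is clear.
print(condition_passed("NE", n=False, z=False, c=True, v=False))
```

An instruction whose condition fails simply does nothing, which is why the text above describes it as acting like a NOP.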

For example, let us write a program based on conditional execution of ARM instructions. The loop below repeatedly subtracts the smaller of r0 and r1 from the larger until both are equal (the greatest common divisor by subtraction).

Ex:
Loop:   CMP   r0, r1        ; compare r0 with r1 and set the flags
        SUBGT r0, r0, r1    ; if r0 > r1, r0 = r0 - r1
        SUBLT r1, r1, r0    ; if r0 < r1, r1 = r1 - r0
        BNE   Loop          ; repeat until r0 = r1

The program has only 4 instructions. Let us now consider the same program written without conditional execution.

Loop:   CMP   r0, r1        ; compare r0 with r1 and set the flags
        BEQ   End           ; finished when r0 = r1
        BLT   Less          ; branch if r0 < r1
        SUB   r0, r0, r1    ; r0 > r1: r0 = r0 - r1
        B     Loop
Less:   SUB   r1, r1, r0    ; r0 < r1: r1 = r1 - r0
        B     Loop
End:

From the above example it is clear that there are 7 instructions in total (including 2 conditional branches and 2 unconditional branches). So the implementation using conditional execution generates shorter code and increases execution speed. An instruction has its normal effect only if the status flags satisfy the condition specified in the instruction; otherwise it acts as a NOP.
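The behaviour of the four-instruction loop can be checked with a small Python model. This is a sketch under assumptions: the function name gcd_conditional is hypothetical, the flags set by CMP are modelled with ordinary Python comparisons, and both inputs are assumed to be positive integers (with a zero input the loop would never terminate).

```python
def gcd_conditional(r0, r1):
    """Model of the CMP / SUBGT / SUBLT / BNE loop for positive integers."""
    while True:
        # CMP r0, r1: the flags are set once per pass. SUBGT and SUBLT do
        # not update them, so both tests use the values seen by CMP.
        greater, less, not_equal = r0 > r1, r0 < r1, r0 != r1
        if greater:          # SUBGT r0, r0, r1
            r0 -= r1
        if less:             # SUBLT r1, r1, r0
            r1 -= r0
        if not not_equal:    # BNE Loop falls through when r0 == r1
            return r0


print(gcd_conditional(12, 18))
```

At most one of the two subtractions takes effect on each pass; the other behaves as a NOP, which is the point of the example.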

Another unusual architectural feature of ARM is that shift instructions are not provided explicitly. However, an immediate value or one of the register operands in arithmetic, logic and move instructions can be shifted by a prescribed amount before being used in the operation. Consider the following ARM instruction with r1 = 3 and r2 = 5:

ADD r0, r1, r2, LSL #3 ; r0 = r1 + (8 x r2), which is r0 = 3 + (8 x 5) = 43

Consider a MOV instruction with r1 = 168 and r2 = 3:

MOV r0, r1, LSR r2 ; shift the binary value of 168 three places to the right

168 = 0000 0000 0000 0000 0000 0000 1010 1000
Shifted 3 places, it becomes 0000 0000 0000 0000 0000 0000 0001 0101
r0 := 21

For positive numbers, LSR #3 is the same as integer division by 2^3 (8).

This feature is used to implement shift operations implicitly.
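The two worked examples can be reproduced with a few lines of Python. This is only an illustrative model of the shifted second operand: the helper names lsl, lsr and add_shifted are invented here, registers are treated as 32-bit unsigned values, and shift amounts are assumed to lie between 0 and 31.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def lsl(value, amount):
    """Logical shift left, discarding bits shifted out of 32 bits."""
    return (value << amount) & MASK32


def lsr(value, amount):
    """Logical shift right, filling with zeros from the left."""
    return (value & MASK32) >> amount


def add_shifted(rn, rm, shift, amount):
    """Model of ADD rd, rn, rm, <shift> #amount."""
    return (rn + shift(rm, amount)) & MASK32


print(add_shifted(3, 5, lsl, 3))  # ADD r0, r1, r2, LSL #3 with r1 = 3, r2 = 5
print(lsr(168, 3))                # MOV r0, r1, LSR r2 with r1 = 168, r2 = 3
```

The shift is applied to the operand on its way to the ALU, so the shift and the addition count as one instruction.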

Although there are several multiply instructions for use in signal processing applications, there are no hardware divide instructions in the instruction set described here. Division must be implemented in software.
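One common way to divide in software is the shift-and-subtract (restoring division) method, sketched below in Python. This is an assumption about how such a routine might look, not the routine used by any particular ARM compiler or library; the name udiv32 is hypothetical and the operands are taken to be unsigned 32-bit values.

```python
def udiv32(dividend, divisor):
    """Unsigned 32-bit division by shift-and-subtract.

    Returns (quotient, remainder). One quotient bit is produced per step,
    using only shifts, compares and subtractions.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    for bit in range(31, -1, -1):
        # Bring down the next bit of the dividend.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
    return quotient, remainder


print(udiv32(168, 8))
```

When the divisor is a power of two, a single LSR does the same job for unsigned values, as in the 168 / 8 example above.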

ARM was one of the first architectures to implement load-store multiple instructions. These can

transfer multiple registers between memory and processor in a single instruction.

The ARM processor includes an inline barrel shifter to pre-process one of the input registers. This barrel shifter helps in executing arithmetic instructions such as multiplication and multiply-accumulate.

The simplicity of the architecture reduces the overhead on each instruction, allowing the clock cycle to be shortened.

ARM7TDMI-S Processor: The ARM7TDMI-S processor is a member of the ARM family of general-purpose 32-bit microprocessors. The ARM family offers high performance for very low power consumption and gate count. The ARM7TDMI-S processor has a Von Neumann architecture, with a single 32-bit data bus carrying both instructions and data. Only load, store and swap instructions can access data from memory. The ARM7TDMI-S processor uses a three-stage pipeline to increase the speed of the flow of instructions to the processor. This enables several operations to take place simultaneously, and the processing and memory systems to operate continuously. In the three-stage pipeline each instruction passes through three stages: fetch, decode and execute.

The three-stage pipelined architecture of the ARM7 processor is shown in the above figure.

ARM7TDMI-S stands for:

T: Thumb mode (16-bit instruction support).
D: on-chip Debug support, enabling the processor to halt in response to a debug request.
M: enhanced Multiplier, which yields a full 64-bit result with high performance.
I: EmbeddedICE hardware (in-circuit emulator). The EmbeddedICE macro cell consists of on-chip logic to support debug operations.
S: Synthesizable.

[Here let me tell you the meaning of synthesizable: In the early days ARM cores were designed as hard macros, i.e. physical designs at the transistor layout level, and the fab companies took this fixed physical block and placed it into their chip designs. But due to the complexity involved, demand increased for a more flexible and configurable solution, so ARM decided to deliver processor designs as a behavioral description at the "register transfer level" (RTL), written in a hardware description language (HDL), typically Verilog HDL. The process of converting this behavioral description into a physical network of logic gates is called "synthesis", and several major EDA companies sell automated synthesis tools for this purpose. A processor design distributed to licensees as an RTL description (such as the ARM7TDMI-S) is therefore described as "synthesizable".]

ADDITIONAL FEATURES OF ARM PROCESSORS


ARM processors are based on a RISC architecture, which has made possible small implementations and very low power consumption. Implementation size, performance and very low power consumption remain the key features in the development of ARM devices.

The typical RISC architectural features of ARM are:

(i). A large uniform register file, all of which can be used for most purposes.

(ii). A load/store architecture, where data-processing operations operate only on register contents, not directly on memory contents. Only load/store instructions access memory.
For ex: LDR r0, [r1] ; STR r0, [r1] ; LDREQB r0, [r1] (conditional)

(iii). 3-address instructions (two source operand registers and the result register, all independently specified).

(iv). Simple addressing modes, with all load/store addresses being determined from register contents and instruction fields only, together with uniform and fixed-length instruction fields to simplify instruction decode.

(v). The ability to perform a general shift operation and a general ALU operation (using a hardware barrel shifter) in a single instruction that executes in a single clock cycle.

(vi). Auto-increment and auto-decrement addressing modes to optimize program loops.

(vii). Load and Store Multiple instructions to maximize data throughput.

(viii). Conditional execution of almost all instructions to maximize execution throughput.

(ix). A very dense 16-bit compressed representation of the instruction set in the Thumb architecture.

ARM architecture is compatible with all four major operating systems, i.e. Symbian OS, Palm OS, Windows and Android OS.

There are three basic instruction sets for ARM:

A 32-bit ARM instruction set,
A 16-bit Thumb instruction set, and
The 8-bit Java bytecode used in Jazelle state.

[This is supported by ARM9 processors and above. To enter this state, either the J bit in the CPSR must be set or the branch instruction BXJ must be executed. This helps to increase the execution speed of Java ME (Java Micro Edition) games and applications. As the Java applications run in hardware (rather than in software), more speed is achieved. This lightweight version of Java runs on devices with limited memory and/or processing power, such as cellular phones, PDAs, TV set-top boxes, smart cards etc.

Even though Jazelle adds a lot of functionality to the existing ARM core, only about 20,000 additional gates are needed, a value that is almost insignificant for a typical ARM CPU macro cell product, which also includes the cache required to support the operating system.]

The Thumb instruction set is a subset of the most commonly used 32-bit ARM instructions. Thumb instructions operate with the standard ARM register configuration, enabling excellent interoperability between ARM and Thumb states. Thumb code is typically about 65% of the size of the equivalent ARM code and can provide 160% of the performance of ARM code when running from a 16-bit memory system. Thumb mode is therefore used in embedded systems where memory resources are limited.

**Additional Explanation: Comparison of Thumb and ARM instructions.

Here I consider the ARM assembly code and the Thumb code for the same function, given below.

ARM Assembly

        .abs                    ; return the absolute value of the integer parameter
iabs    CMP   r0, #0
        RSBLT r0, r0, #0        ; if r0 is less than zero, set r0 to 0 - r0
        MOV   pc, lr            ; return from a linked branch

Thumb Assembly

        .abs
iabs    CMP   r0, #0
        BGE   return            ; if r0 >= 0, skip the negation
        NEG   r0, r0            ; r0 = 0 - r0
return  MOV   pc, lr            ; return from a linked branch

Let us now compare the code density of the two versions.

Code     Instructions   Size (Bytes)   Normalized
ARM      3              12             1.0
Thumb    4              8              0.67

So the Thumb code is about 33% smaller than the ARM code for the same function.

**In the above ARM code, the last line can also be written MOV r15, r14 instead of MOV pc, lr. Why? Guess the reason!
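The figures in the table follow directly from the instruction widths (4 bytes per ARM instruction, 2 bytes per Thumb instruction). A short Python check, with names chosen only for this illustration:

```python
ARM_INSTRUCTION_BYTES = 4    # ARM instructions are 32 bits wide
THUMB_INSTRUCTION_BYTES = 2  # Thumb instructions are 16 bits wide


def code_size(instruction_count, bytes_per_instruction):
    """Size in bytes of a routine made of fixed-width instructions."""
    return instruction_count * bytes_per_instruction


arm_size = code_size(3, ARM_INSTRUCTION_BYTES)
thumb_size = code_size(4, THUMB_INSTRUCTION_BYTES)
normalized = round(thumb_size / arm_size, 2)  # Thumb size relative to ARM

print(arm_size, thumb_size, normalized)
```

Note that Thumb needs one more instruction here, yet the routine is still smaller because each instruction is half the width.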

ARCHITECTURE OF ARM PROCESSORS:

The ARM7 processor is based on the Von Neumann model, with a single bus for both data and instructions (the ARM9 is based on the Harvard model). Though this decreases the performance of the ARM7, the loss is overcome by pipelining. ARM uses the Advanced Microcontroller Bus Architecture (AMBA). AMBA includes two kinds of bus: a system bus, which is either the Advanced High-performance Bus (AHB) or the Advanced System Bus (ASB), and the Advanced Peripheral Bus (APB).

The ARM processor consists of:

One Arithmetic Logic Unit (32-bit)
One Booth multiplier (32-bit)
One barrel shifter
One control unit
A register file of 37 registers, each of 32 bits.

The barrel shifter is used for fast shift operations and can perform the necessary pre-processing of a register value before it enters the ALU. This makes it easy to calculate a wider range of expressions and addresses.

In addition, the ARM core contains a 32-bit program status register; some special registers such as the instruction register, the memory data read and write registers and the memory address register; a priority encoder, which is used in the load-multiple and store-multiple instructions to indicate which register in the register file is to be loaded or stored; and multiplexers.


ARM Registers: ARM has a total of 37 registers, of which 31 are general-purpose 32-bit registers and six are status registers. But not all of these registers are visible at once. The processor state and operating mode decide which registers are available to the programmer. At any time, only 16 of the 31 general-purpose registers are available to the user. The remaining 15 registers are used to speed up exception processing. There are two kinds of program status register: the CPSR and the SPSR (the current and saved program status registers, respectively).

In ARM state the registers r0 to r13 are orthogonal—any instruction that you can apply to r0 you

can equally well apply to any of the other registers.

The main bank of 16 registers is used by all unprivileged code. These are the User mode

registers. User mode is different from all other modes as it is unprivileged. In addition to this

register bank, there is also one 32-bit Current Program status Register (CPSR)


Of these 16 registers, r13 acts as the stack pointer, r14 acts as the link register and r15 acts as the program counter.

Register r13 is the SP (Stack Pointer) register, and it is used to store the address of the stack top.

R13 is used by the PUSH and POP instructions in T variants, and by the SRS and RFE

instructions from ARMv6.

Register r14 is the Link Register (LR). This register holds the address of the next instruction

after a Branch and Link (BL or BLX) instruction, which is the instruction used to make a

subroutine call. It is also used for return address information on entry to exception modes. At all

other times, r14 can be used as a general-purpose register.

**You may wonder why this link register was added to the ARM architecture and what its advantage is. In CISC (Intel) processors, when an interrupt occurs the return address is always stored on the stack. So, after servicing the interrupt, the processor has to access the stack, which normally takes more time than accessing a register. If the return address is stored in a link register instead, it can be retrieved in less time. This is the advantage of the link register.

Register r15 is the Program Counter (PC). It can be used in most instructions as a pointer to the

instruction which is two instructions after the instruction being executed .

**The PC in ARM has a specialty. In normal CISC (Intel) processors the PC normally stores the

address of next instruction to be executed. But the ARM PC contains the address of the instruction

that is being fetched (not the one being executed) .

The remaining 13 registers have no special hardware purpose.
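The remark about the PC pointing two instructions ahead can be expressed as a small calculation. The sketch below assumes the three-stage pipeline described in these notes, where instructions are 4 bytes wide in ARM state and 2 bytes wide in Thumb state; the function name pc_read_value is made up for illustration.

```python
def pc_read_value(instruction_address, thumb=False):
    """Value seen when an executing instruction reads r15 (the PC).

    The instruction being fetched is two instructions ahead of the one
    being executed, so the offset is twice the instruction width.
    """
    width = 2 if thumb else 4
    return instruction_address + 2 * width


print(hex(pc_read_value(0x8000)))              # ARM state
print(hex(pc_read_value(0x8000, thumb=True)))  # Thumb state
```

This offset is the reason hand-written PC-relative address calculations in ARM assembly carry an adjustment of 8 bytes.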

CPSR: The ARM core uses the CPSR to monitor and control internal operations. The CPSR is a dedicated 32-bit register and resides in the register file. It is divided into four fields, each 8 bits wide: flags, status, extension and control. The extension and status fields are reserved for future processors such as ARMv5 and ARMv7. The control field contains the processor mode, state and interrupt mask bits. The flags field contains the condition flags. The 32-bit CPSR register is shown below.

M4  M3  M2  M1  M0   Mode
0   0   0   0   0    User 26 mode
0   0   0   0   1    FIQ 26 mode
0   0   0   1   0    IRQ 26 mode
0   0   0   1   1    SVC 26 mode
1   0   0   0   0    User mode
1   0   0   0   1    FIQ mode
1   0   0   1   0    IRQ mode
1   0   0   1   1    SVC mode
1   0   1   1   1    ABT mode
1   1   0   1   1    UND mode
1   1   1   1   1    System mode

F  FIQ disable bit: 1 = FIQ interrupts disabled, 0 = FIQ interrupts enabled.

I  IRQ disable bit: 1 = IRQ interrupts disabled, 0 = IRQ interrupts enabled.

N  Negative or less-than flag: 1 = result negative or less than in last operation, 0 = result positive or greater than.

T  Thumb state flag: 1 = processor operating in Thumb state, 0 = processor operating in ARM state.
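As a study aid, the fields above can be pulled out of a 32-bit CPSR value with a few shifts and masks. The sketch assumes the usual bit positions (N, Z, C, V in bits 31 to 28, I in bit 7, F in bit 6, T in bit 5 and the mode in bits 4 to 0); the names MODES and decode_cpsr are invented here, and the 26-bit modes from the table are left out for brevity.

```python
# 32-bit mode encodings taken from the mode table (M4..M0).
MODES = {
    0b10000: "User",
    0b10001: "FIQ",
    0b10010: "IRQ",
    0b10011: "SVC",
    0b10111: "ABT",
    0b11011: "UND",
    0b11111: "System",
}


def decode_cpsr(cpsr):
    """Split a CPSR value into condition flags, mask bits, state and mode."""
    return {
        "N": bool((cpsr >> 31) & 1),
        "Z": bool((cpsr >> 30) & 1),
        "C": bool((cpsr >> 29) & 1),
        "V": bool((cpsr >> 28) & 1),
        "I": bool((cpsr >> 7) & 1),   # 1 = IRQ disabled
        "F": bool((cpsr >> 6) & 1),   # 1 = FIQ disabled
        "T": bool((cpsr >> 5) & 1),   # 1 = Thumb state
        "mode": MODES.get(cpsr & 0x1F, "unknown"),
    }


# Z and C set, both interrupts masked, ARM state, supervisor mode.
print(decode_cpsr(0x600000D3))
```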

The CPSR in Higher versions

The Q flag is set in E variants of ARMv5 and above to indicate overflow and/or saturation in instructions intended to assist DSP operations.

The GE[3:0] flags, in ARMv6, control the Greater than or Equal behavior in SIMD instructions. For half-word instructions, if bits 3:2 are set, the upper half word is used, and if bits 1:0 are set, the lower half word is used. Similarly, for byte operations, if bit 3 is set, the top byte is used; if bit 0 is set, the bottom byte is used; and bits 2 and 1 select the two middle bytes in the same fashion.

E is a flag in ARMv6 that controls the endianness for data handling. With increasing system-on-chip (SoC) integration, a single chip is more likely to contain little-endian OS environments and interfaces (such as USB and PCI) together with big-endian data (TCP/IP packets, MPEG streams). With ARMv6, support for mixed-endian systems has been improved. As a result, handling data in mixed-endian systems under ARMv6 is far more efficient.

ARM added the J bit to the CPSR. Together with the T bit, the J bit records whether the processor is in Java, ARM or Thumb state:

When J=0, T=0 the processor is in ARM state.
When J=0, T=1 the processor is in Thumb state.
When J=1, T=0 the processor is in Java state.
The combination J=1, T=1 is illegal.

It is possible to enter Java state simply by writing the J bit of the CPSR, but this is not recommended. Instead, use the Branch Exchange to Java (BXJ) instruction. It works just like calling a subroutine.

This single instruction saves three program steps, because BXJ performs three operations. First, it checks the condition. If the condition is true, it loads a new value into the PC. Then it sets the Java state and takes the branch.
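The four J/T combinations can be summarised in a tiny Python helper. It assumes that J is bit 24 and T is bit 5 of the CPSR, and the name execution_state is hypothetical.

```python
def execution_state(cpsr):
    """Map the J (bit 24) and T (bit 5) bits of the CPSR to a state name."""
    j = (cpsr >> 24) & 1
    t = (cpsr >> 5) & 1
    if j and t:
        return "illegal"
    if j:
        return "Java"
    if t:
        return "Thumb"
    return "ARM"


print(execution_state(0x01000010))  # J = 1, T = 0
```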

CPSR in Cortex Processors

Bits marked Do Not Modify (DNM) must not be modified by software.

The IT execution state bits

IT[7:5] encodes the base condition code for the current IT block, if any. It contains b000 when no

IT block is active.

IT[4:0] encodes the number of instructions that are to be conditionally executed, and whether the

condition for each is the base condition code or the inverse of the base condition code. It contains

b00000 when no IT block is active.

SPSR Register: The SPSR is used to store the current value of the CPSR when an exception occurs so that it can be restored after handling the exception. Each exception handling mode can

access its own SPSR. User mode and System mode do not have an SPSR because they are not exception handling modes.

Processor Modes: There are seven processor modes: six privileged modes (abort, fast interrupt request, interrupt request, supervisor, system and undefined) and one unprivileged mode called user mode.

i. The processor enters abort mode when there is a failed attempt to access memory.

ii. Fast interrupt request mode and iii. interrupt request mode correspond to the two interrupt levels available on the ARM processor.

iv. Supervisor mode is the mode that the processor is in after reset and is generally the mode that an operating system kernel operates in.

v. System mode is a special version of user mode that allows full read-write access to the CPSR.

vi. Undefined mode is used when the processor encounters an instruction that is undefined or not supported by the implementation.

vii. User mode is used for programs and applications.

The T bit decides the processor state: either 16-bit Thumb state or 32-bit ARM state. When the T bit is 1, the processor is in Thumb state and executes Thumb instructions; when T = 0, the processor is in ARM state and executes ARM instructions. To change states, the core executes a specialized branch instruction (BX).

**So the processor mode can be changed by a program that writes directly to the CPSR (the processor has to be in a privileged mode) or by hardware when the core responds to an exception or interrupt.

Banked Registers: Out of the 37 registers, 20 are hidden from a program at different times. These registers are called banked registers. They are available only when the processor is in a particular mode; for example, abort mode has the banked registers r13_abt, r14_abt and spsr_abt. Banked registers of a particular mode are denoted by an underscore followed by the mode mnemonic (_mode).

Any banked register is unique to its particular mode and is actually a different physical register, even though the register name an instruction uses to write to it is the same regardless of mode. For example, if you wanted to write to r13 in whatever mode, you would use r13 = some_value, and not actually specify the unique name r13_fiq, r13_svc, etc. So if you wrote a value into r13 while in User mode and then switched to FIQ mode, the value in the User mode r13 would not be available. While in FIQ mode, you could write to register r13 again without affecting the value you wrote in User mode.
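The banking scheme can be modelled in Python as a mapping from (mode, register number) to a physical register name. This is a sketch for understanding only: the class RegisterFile, the table BANKED and the lower-case mode names are assumptions of this example, with FIQ banking r8 to r14 and the other exception modes banking r13 and r14.

```python
# Registers that each exception mode replaces with its own private copy.
BANKED = {
    "fiq": set(range(8, 15)),  # r8-r14
    "irq": {13, 14},
    "svc": {13, 14},
    "abt": {13, 14},
    "und": {13, 14},
}
ALL_MODES = ["usr", "sys", "fiq", "irq", "svc", "abt", "und"]


def physical_name(mode, reg):
    """Name of the physical register selected by rN in the given mode."""
    if reg in BANKED.get(mode, ()):
        return f"r{reg}_{mode}"
    return f"r{reg}"


class RegisterFile:
    def __init__(self):
        self.storage = {}

    def write(self, mode, reg, value):
        self.storage[physical_name(mode, reg)] = value & 0xFFFFFFFF

    def read(self, mode, reg):
        return self.storage.get(physical_name(mode, reg), 0)


# Count the physical registers: general-purpose, one SPSR per exception
# mode, plus the single CPSR.
general_purpose = {physical_name(m, r) for m in ALL_MODES for r in range(16)}
total_registers = len(general_purpose) + len(BANKED) + 1

rf = RegisterFile()
rf.write("usr", 13, 0x1000)   # User mode stack pointer
rf.write("fiq", 13, 0x2000)   # FIQ mode has its own r13
print(hex(rf.read("usr", 13)), hex(rf.read("fiq", 13)), total_registers)
```

The same instruction text (a write to r13) lands in different storage depending on the mode, which is exactly the behaviour described above.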

There are two interrupt request levels available on the ARM processor core- interrupt request

(IRQ) and Fast Interrupt request (FIQ).

At the CPU level, the ARM FIQ signal is technically very similar to the x86 non-maskable

interrupt (NMI), but its role within the system architecture has different historical roots. ARM


FIQs were, as the name suggests, designed to rapidly service demanding peripherals or even to

allow software to replace hardware (for example in synchronous serial communication).

**An interesting point to understand here is how FIQ provides faster service. The answer is simple. From the banked registers it is clear that more registers (r8-r14) are banked in FIQ mode, so the handler need not use the stack to save any values and can use its own registers instead. As accessing registers is always faster than accessing the stack, FIQ provides faster service to interrupt requests.

The IRQ exception is a normal interrupt caused by a LOW level on the IRQ input. IRQ has a lower priority than FIQ, and is masked on entry to an FIQ sequence. It must be ensured that the IRQ input is held LOW until the processor acknowledges the interrupt request, either from the VIC (Vectored Interrupt Controller) interface or the software handler.

V, C, Z and N are the condition flags.

V (oVerflow): Set if the result causes a signed overflow. This flag is set whenever the result of a signed operation is too large, causing the high-order bit to overflow into the sign bit. Generally the carry flag is used to detect errors in unsigned arithmetic operations, while the overflow flag is used to detect errors in signed arithmetic operations.

C (Carry): Set when the result causes an unsigned carry.

Z (Zero): Set when the result of an arithmetic operation is zero; frequently used to indicate equality.

N (Negative): The sign bit of a signed binary result. This bit is set when bit 31 of the result is a binary 1. The binary representation of signed numbers uses D31 as the sign bit. If the D31 bit of the result is zero, then N=0 and the result is positive. If the D31 bit is one, then N=1 and the result is negative. The N and V flags are used for signed arithmetic operations.
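The difference between carry and overflow is easiest to see on a 32-bit addition. The following Python sketch computes the four flags for an add that updates the flags; the function name adds_flags is made up for this illustration and the operands are treated as 32-bit values.

```python
MASK32 = 0xFFFFFFFF


def adds_flags(a, b):
    """Return (result, N, Z, C, V) for a 32-bit flag-setting addition."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    n = bool(result >> 31)      # bit 31 of the result
    z = result == 0
    c = full > MASK32           # carry out of bit 31 (unsigned overflow)
    # Signed overflow: both operands have the sign opposite to the result.
    v = bool(((a ^ result) & (b ^ result)) >> 31)
    return result, n, z, c, v


print(adds_flags(0x7FFFFFFF, 1))  # largest positive value plus one
print(adds_flags(0xFFFFFFFF, 1))  # unsigned wrap-around to zero
```

In the first call the signed result is wrong (V set, C clear); in the second the unsigned result is wrong (C set, V clear), matching the rule of thumb given above.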

Note: The biggest register difference involves the SP register. The Thumb state has unique stack mnemonics (PUSH, POP) that don't exist in the ARM state. These instructions assume the existence of a stack pointer, for which R13 is used. They translate into load and store instructions in the ARM state.


THUMB Mode Secrets: Thumb mode instructions are only 16 bits wide. So how does the ARM processor give both the code density advantage and 32-bit performance at the same time?

Let us understand this point in detail. In addition to the ARM instruction decoder, the ARM design has a special block (the Thumb instruction decompressor) which decompresses Thumb code into ARM code before it is executed. This can be seen in the following block diagram.

The ARM instructions arriving from the Fetch stage of the pipe line pass through the ARM

decoder, and activate major and minor opcode bit control signals.

Major opcode bits describe the type of instructions to execute while minor bits specify

instruction details such as the registers or operand specified.


In Thumb state, multiplexers direct Thumb instructions through the Thumb decompression logic. This effectively expands each Thumb instruction into its equivalent ARM instruction. Execution of the resulting ARM instruction then takes place as usual.

The major opcode of the Thumb instruction denotes the type of instruction; in the above example it is an arithmetic instruction. The minor opcode specifies the type of arithmetic operation, i.e. an ADD between a register and a constant. Since ARM instructions have 4-bit register fields while Thumb instructions use 3-bit register fields, each register number is extended with a leading zero.
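The expansion described above can be sketched in Python for one case. This is only an illustration, assuming the example refers to the Thumb "ADD Rd, #imm8" format, which corresponds to the ARM instruction "ADDS Rd, Rd, #imm8"; the function name expand_thumb_add_imm8 is hypothetical, and real hardware does this with combinational logic rather than software.

```python
def expand_thumb_add_imm8(thumb):
    """Expand a 16-bit Thumb 'ADD Rd, #imm8' into the 32-bit ARM
    encoding of 'ADDS Rd, Rd, #imm8'."""
    # Major/minor opcode: bits 15..11 must be 00110 for ADD Rd, #imm8.
    if (thumb >> 11) != 0b00110:
        raise ValueError("not a Thumb ADD Rd, #imm8 encoding")
    rd3 = (thumb >> 8) & 0x7      # 3-bit Thumb register field
    imm8 = thumb & 0xFF           # 8-bit immediate constant
    rd4 = rd3                     # zero-extended to a 4-bit ARM register field
    # 0xE2900000 = condition 'always', immediate operand, ADD, S bit set.
    return 0xE2900000 | (rd4 << 16) | (rd4 << 12) | imm8
```

For instance, the Thumb halfword 0x3001 (ADD r0, #1) expands to the ARM word 0xE2900001 (ADDS r0, r0, #1).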

PIPELINE: A pipeline is the mechanism used by a RISC processor to execute instructions at an increased speed. It speeds up execution by fetching the next instruction while other instructions are being decoded and executed. To process an instruction, the processor first fetches it, i.e. loads the instruction from memory; then decodes it, i.e. identifies the instruction to be executed; and finally executes it and writes the result back to a register.

The ARM7 processor has a three-stage pipeline, with the stages Fetch, Decode and Execute.

The ARM9 has a five-stage pipeline, the ARM10 has six stages and the ARM11 has eight stages.

The three-stage pipeline is explained below.

Fig: ARM 7 Core 3-Stage Pipe Lining

To explain pipelining, consider three instructions: Compare (CMP), Subtract (SUB) and Add (ADD). The ARM7 processor fetches the first instruction, CMP, in the first cycle. During the second cycle it decodes the CMP instruction and at the same time fetches the SUB instruction. During the third cycle it executes the CMP instruction while decoding the SUB instruction and fetching the third instruction, ADD. Overlapping the stages in this way improves the speed of operation and is a form of parallel processing. This pipeline example is shown in the following diagram.
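The cycle-by-cycle overlap described above can be tabulated with a small Python sketch. It is an idealised model (one cycle per stage, no stalls or branches), and the function name pipeline_schedule is hypothetical.

```python
def pipeline_schedule(instructions, stages=("Fetch", "Decode", "Execute")):
    """Return, for each clock cycle, which instruction occupies each stage.

    Idealised model: every stage takes one cycle and nothing stalls.
    """
    n_cycles = len(instructions) + len(stages) - 1
    table = []
    for cycle in range(n_cycles):
        row = {}
        for depth, stage in enumerate(stages):
            # The instruction in this stage was fetched 'depth' cycles ago.
            idx = cycle - depth
            row[stage] = instructions[idx] if 0 <= idx < len(instructions) else None
        table.append(row)
    return table


if __name__ == "__main__":
    for number, row in enumerate(pipeline_schedule(["CMP", "SUB", "ADD"]), start=1):
        print(number, row)
```

In the third cycle the model shows ADD in Fetch, SUB in Decode and CMP in Execute, matching the description above. Three instructions finish in five cycles instead of the nine needed without overlap.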

As the pipeline length increases, the amount of work done at each stage is reduced, which allows the processor to attain a higher operating frequency. This in turn increases performance. One important feature of this pipeline is that executing a branch instruction, or branching by directly modifying the PC, causes the ARM core to flush its pipeline.

Exceptions, Interrupts, and the Vector Table

Exceptions are generated by internal and external sources to cause the ARM processor to handle an event, such as an externally generated interrupt or an attempt to execute an undefined instruction. The processor state just before handling the exception is normally preserved so that the original program can be resumed after the exception routine completes. More than one exception can arise at the same time. ARM exceptions may be considered in three groups:

1. Exceptions generated as the direct effect of executing an instruction. Software interrupts, undefined instructions (including coprocessor instructions where the requested coprocessor is absent) and prefetch aborts (instructions that are invalid due to a memory fault occurring during fetch) come under this group.

2. Exceptions generated as a side effect of an instruction. Data aborts (a memory fault during a load or store data access) are in this group.

3. Exceptions generated externally, unrelated to the instruction flow. Reset, IRQ and FIQ are in this group.

The ARM architecture supports seven types of exceptions.

i. Reset

ii. Undefined Instruction

iii. Software Interrupt (SWI)

iv. Pre-fetch Abort (instruction fetch memory fault)

v. Data Abort (data access memory fault)

vi. IRQ (normal interrupt request)

vii. FIQ (fast interrupt request)

When an exception occurs, the processor performs the following sequence of actions:

• It changes to the operating mode corresponding to the particular exception.

• It saves the address of the instruction following the exception entry instruction in r14 of the new mode.

• It saves the old value of the CPSR in the SPSR of the new mode.

• It disables IRQs by setting bit 7 of the CPSR and, if the exception is a fast interrupt, disables further fast interrupts by setting bit 6 of the CPSR.

• It forces the PC to begin executing at the relevant vector address.

Exception / Interrupt      Name     Address       High Address
Reset                      RESET    0x00000000    0xFFFF0000
Undefined Instruction      UNDEF    0x00000004    0xFFFF0004
Software Interrupt         SWI      0x00000008    0xFFFF0008
Pre-fetch Abort            PABT     0x0000000C    0xFFFF000C
Data Abort                 DABT     0x00000010    0xFFFF0010
Interrupt Request          IRQ      0x00000018    0xFFFF0018
Fast Interrupt Request     FIQ      0x0000001C    0xFFFF001C

The exception vector table shown above gives the address at which execution begins when each exception or interrupt occurs. Each vector table entry contains a form of branch instruction pointing to the start of the corresponding handler routine.
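To make the idea of a vector entry concrete, the sketch below computes the 32-bit encoding of an ARM "B" instruction placed at a vector address. It assumes a plain unconditional branch is used (a vector may equally hold another form, such as a load into the PC) and relies on the ARM rule that the PC reads as the instruction address plus 8. The function name encode_branch and the handler addresses are hypothetical.

```python
def encode_branch(vector_addr, handler_addr):
    """Encode an unconditional ARM 'B handler' placed at vector_addr."""
    # In ARM state the PC reads as the current instruction address + 8.
    offset = handler_addr - (vector_addr + 8)
    if offset % 4 != 0:
        raise ValueError("branch target must be word aligned")
    words = offset >> 2  # signed word offset (may be negative)
    if not -(1 << 23) <= words < (1 << 23):
        raise ValueError("handler is out of range for a B instruction")
    # 0xEA = condition 'always' + branch opcode; low 24 bits hold the offset.
    return 0xEA000000 | (words & 0x00FFFFFF)
```

With a hypothetical IRQ handler at 0x00001000, the word stored at the IRQ vector 0x00000018 would be 0xEA0003F8.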

Note that the address 0x00000014 is missing from the above table. This location was used on earlier ARM processors, which operated within a 26-bit address space, to trap load or store addresses that fell outside the address space. These traps were referred to as 'address exceptions'. Since 32-bit ARMs do not generate addresses that fall outside their 32-bit address space, address exceptions have no role in the current architecture and the vector address at 0x00000014 is unused.

Some ARM processors can also place the vector table at more than one memory location, which is why the table lists two addresses (Address and High Address). Which one is used depends on the type and configuration of the ARM processor.
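The exception entry sequence and the vector table can be combined into a small Python model. This is a simplified sketch, not a description of any particular core: the mode numbers are the standard CPSR mode encodings (not given in the text above), Reset is left out, the return address is passed in by the caller rather than derived per exception type, and the name enter_exception is hypothetical.

```python
I_BIT = 1 << 7      # IRQ disable bit in the CPSR
F_BIT = 1 << 6      # FIQ disable bit in the CPSR
T_BIT = 1 << 5      # Thumb state bit in the CPSR
MODE_MASK = 0x1F    # mode field, CPSR bits 4..0

# name -> (CPSR mode bits, offset of the vector within the table)
EXCEPTIONS = {
    "UNDEF": (0x1B, 0x04),
    "SWI":   (0x13, 0x08),
    "PABT":  (0x17, 0x0C),
    "DABT":  (0x17, 0x10),
    "IRQ":   (0x12, 0x18),
    "FIQ":   (0x11, 0x1C),
}


def enter_exception(name, cpsr, return_addr, high_vectors=False):
    """Return the processor state after taking the named exception."""
    mode, offset = EXCEPTIONS[name]
    spsr = cpsr                                    # old CPSR saved in SPSR_<mode>
    new_cpsr = (cpsr & ~(MODE_MASK | T_BIT)) | mode  # new mode, ARM state
    new_cpsr |= I_BIT                              # IRQs are always disabled
    if name == "FIQ":
        new_cpsr |= F_BIT                          # FIQ also disables further FIQs
    base = 0xFFFF0000 if high_vectors else 0x00000000
    return {"cpsr": new_cpsr, "spsr": spsr, "lr": return_addr, "pc": base + offset}
```

For example, an IRQ taken in User mode (mode bits 0x10) leaves the core in IRQ mode with bit 7 set and the PC at 0x00000018, or at 0xFFFF0018 when high vectors are selected.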

The reset vector is the location of the first instruction executed by the processor when power is applied. This instruction branches to the initialization code.

The undefined instruction vector is used when the processor cannot decode an instruction.

The software interrupt vector is called when a SWI instruction is executed. The SWI instruction is frequently used as the mechanism to invoke an operating system routine.

The pre-fetch abort vector is taken when the processor attempts to fetch an instruction from an address without the correct access permissions. The actual abort occurs in the decode stage.

The data abort vector is similar to a pre-fetch abort but is raised when an instruction attempts to access data memory without the correct access permissions.

The interrupt request vector is used by external hardware to interrupt the normal execution flow of the processor. It can only be raised if IRQs are not masked in the CPSR.

The Thumb programmer's model

After reset, ARM cores start executing ARM instructions. The normal way they switch to executing Thumb instructions is by executing a Branch and Exchange (BX) instruction.

The Thumb instruction set is a subset of the ARM instruction set, and its instructions operate on a restricted view of the ARM registers, i.e. not all the registers are available in Thumb mode. Only registers r0-r7 (the low registers) and the special function registers (r13-r15) are available in Thumb mode.

r13 is used as the stack pointer (SP).

r14 is used as the link register (LR).

r15 is the program counter (PC).

The CPSR condition code flags are set by arithmetic and logical operations and control conditional branching.

Salient Features of THUMB

Most Thumb instructions are executed unconditionally.

Many Thumb data processing instructions use a 2-address format (the destination register is the same as one of the source registers).

Thumb instruction formats are less regular than ARM instruction formats, as a result of the dense encoding.

Exceptions generated during Thumb execution switch to ARM execution before executing the

exception handler.

The state of the T bit is preserved in the SPSR, and the LR of the exception mode is set so that

the normal return instruction performs correctly, regardless of whether the exception occurred

during ARM or Thumb execution.

The higher registers r8 to r12 are only accessible with MOV, ADD, or CMP instructions.

CMP and all the data processing instructions that operate on low registers update the condition

flags in the CPSR.

Also, there are no MSR and MRS equivalent Thumb instructions. To alter the CPSR or SPSR,

one must switch into ARM state to use MSR and MRS. Similarly, there are no coprocessor

instructions in Thumb state.

From ARMv4T to ARMv7-A there are two instruction sets: ARM and Thumb.

They are both "32-bit" in the sense that they operate on up-to-32-bit-wide data in 32-bit-wide

registers with 32-bit addresses.

In fact, where they overlap they represent the exact same instructions - it is only the instruction

encoding which differs, and the CPU effectively just has two different decode front-ends to its

pipeline which it can switch between. Thumb-2 encompassed not just adding more instructions

to Thumb (mostly with 4-byte encodings) to bring it almost to parity with ARM, but also

extending the execution state to allow for conditional execution of most Thumb instructions, and

finally introducing a whole new assembly syntax (UAL, "Unified Assembly Language") which

replaced the previous separate ARM and Thumb syntaxes and allowed writing code once and

assembling it to either instruction set without modification.

The Cortex-M architectures implement only the Thumb instruction set. ARMv7-M (Cortex-M3/M4/M7) supports most of "Thumb-2 Technology", including conditional execution and encodings for VFP instructions, whereas ARMv6-M (Cortex-M0/M0+) only uses Thumb-2 in the form of a handful of 4-byte system instructions.

ARM-Thumb transfer instructions:

(i). BX Rm ; Thumb version branch exchange

pc = Rm & 0xfffffffe, T = Rm[0]

(ii). BLX Rm ; Thumb version branch exchange with link

pc = Rm & 0xfffffffe, T = Rm[0]

lr = address of next instruction after BLX + 1

Example 1: ARM code

    CODE32                      ; word aligned
    LDR   r0, =thumbCode+1      ; address(thumbCode) = 0x00009000, so r0 = 0x00009001
    BLX   r0                    ; branch to Thumb code & mode

Example 2: Thumb code

    CODE16                      ; halfword aligned
thumbCode
    ADD   r1, #1
    BX    lr                    ; branch to ARM code & mode
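The register-transfer rules given for BX and BLX can be checked with a short Python sketch. It models only the effect on pc, the T bit and lr as stated above; the function names bx and blx_thumb are hypothetical.

```python
def bx(rm):
    """Model BX Rm: return (pc, T) after the branch."""
    pc = rm & 0xFFFFFFFE   # bit 0 is stripped from the branch target
    t = rm & 1             # bit 0 selects the state: 1 = Thumb, 0 = ARM
    return pc, t


def blx_thumb(rm, next_instr_addr):
    """Model the Thumb version of BLX Rm: return (pc, T, lr)."""
    pc, t = bx(rm)
    # Bit 0 of lr is set so that a later 'BX lr' returns to Thumb state.
    lr = next_instr_addr + 1
    return pc, t, lr
```

With r0 = 0x00009001 as in Example 1, bx(0x00009001) gives pc = 0x00009000 and T = 1, so execution continues at thumbCode in Thumb state.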

Co-Processor Interface: The ARM7 supports up to 16 logical co-processors. This concept was introduced mainly to improve the performance of the ARM processor. Each co-processor can have up to 16 private registers of any size, not limited to 32 bits.

Co-processors use a load/store architecture.

The ARM7TDMI co-processor interface is based on “Bus Watching”.

The co-processor is attached to the bus on which the ARM instruction stream flows into the ARM, and the co-processor copies the instructions into an internal pipeline that is similar to the ARM instruction pipeline.

There are three handshake signals between the ARM and the co-processor, exchanged before instructions are executed:

(i). CPI (from ARM to all co-processors): Co-processor instruction. Indicates that the ARM has identified a co-processor instruction and wishes to execute it.

(ii). CPA (from co-processor to ARM): Co-processor absent. Tells the ARM that there is no co-processor present that is able to execute the current instruction.

(iii). CPB (from co-processor to ARM): Co-processor busy. Tells the ARM that the co-processor cannot begin executing the instruction yet.

The timing is such that the ARM and the co-processor must each generate their respective signals autonomously.
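How the three handshake signals combine can be sketched as a simple decision function. This is only a logical model: each signal is treated as a boolean that is True when asserted (real pin polarities are ignored), and the "wait" outcome for a busy co-processor is an assumption about typical behaviour rather than something stated above. The function name coprocessor_handshake is hypothetical.

```python
def coprocessor_handshake(cpi, cpa, cpb):
    """Decide what the ARM does with the current instruction.

    cpi: ARM has identified a co-processor instruction
    cpa: no co-processor can execute it (absent)
    cpb: the co-processor cannot start yet (busy)
    """
    if not cpi:
        return "not a co-processor instruction"
    if cpa:
        # Nobody claimed the instruction: treated as undefined.
        return "undefined instruction trap"
    if cpb:
        # Assumed behaviour: the ARM waits until the co-processor is ready.
        return "wait"
    return "execute"
```

The absent case matches the earlier statement that a co-processor instruction with no matching co-processor raises the undefined instruction exception.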

ARM Processor Families

Various ARM processors are available in the market for different applications. They are grouped into families based on the core they use: the ARM7, ARM9, ARM10 and ARM11 cores. The numbers 7, 9, 10 and 11 indicate different core designs, and an ascending number indicates an increase in performance and sophistication. Though the ARM8 was introduced in 1996, it is no longer available in the market. The following table gives a brief comparison of their performance and available resources.

The ARM7 core has a Von Neumann–style architecture, where both data and instructions use the

same bus. The core has a three-stage pipeline and executes the architecture ARMv4T instruction

set. The ARM7TDMI was introduced in 1995 by ARM. It is currently a very popular core and is

used in many 32-bit embedded processors.

The ARM9 family was released in 1997. It has a five-stage pipeline architecture, so the ARM9 processor can run at higher clock frequencies than the ARM7 family. The extra stages improve the overall performance of the processor. The memory system has been redesigned to follow the Harvard architecture, with separate data and instruction buses. The first processor in the ARM9 family was the ARM920T, which includes separate D + I caches and an MMU. This processor can be used by operating systems requiring virtual memory support. The ARM922T is a variation on the ARM920T but with half the D + I cache size.

The latest core in the ARM9 product line is the ARM926EJ-S synthesizable processor core,

announced in 2000. It is designed for use in small portable Java-enabled devices such as 3G

phones and personal digital assistants (PDAs).

The ARM10 was released in 1999. It extends the ARM9 pipeline to six stages. It also supports an optional vector floating-point (VFP) unit, which adds a seventh stage to the ARM10 pipeline. The VFP significantly increases floating-point performance and is compliant with the IEEE 754-1985 floating-point standard.

The ARM1136J-S is the ARM11 processor released in 2003, designed for high-performance and power-efficient applications. The ARM1136J-S was the first processor implementation to execute architecture ARMv6 instructions. It incorporates an eight-stage pipeline with separate load-store and arithmetic pipelines.

In 2004, ARM introduced its new Cortex family of processors.

The Cortex processor family is subdivided into three profiles: Cortex-A, Cortex-M and Cortex-R. Each profile is optimized for a different segment of embedded systems applications. A denotes Application, M denotes Microcontroller and R denotes Real Time.

The Cortex-A profile has been designed as a high-end application processor. Cortex-A processors are capable of running feature-rich operating systems such as Windows RT and Linux. The key applications for Cortex-A are consumer electronics such as smartphones, tablet computers and set-top boxes.

Unlike earlier ARM CPUs, the Cortex-M processor family is designed specifically for use within a small microcontroller.

The Cortex-M processor comes in five variants: Cortex-M0, Cortex-M0+, Cortex-M1, Cortex-M3 and Cortex-M4. The Cortex-M0 and Cortex-M0+ are the smallest processors in the family. This helps manufacturers to design low-cost, low-power devices that can replace existing 8-bit microcontrollers while still offering 32-bit performance.

The Cortex-M1 has many of the same features as the Cortex-M0 but has been designed as a “soft core” to run inside a Field Programmable Gate Array (FPGA) device.

The highest-performing member of the Cortex-M family is the Cortex-M4. It has all the features of the Cortex-M3, adds support for digital signal processing (DSP) and includes hardware floating-point support for single-precision calculations.

The third Cortex profile is Cortex-R. This is the real-time profile that delivers a high-

performance processor which is the heart of an application specific device.

Very often a Cortex-R processor forms part of a “system-on-chip” design that is focused on a

specific task such as hard disk drive (HDD) control, automotive engine management, and

medical devices. The Arm Cortex-R real-time processors offer high-performance computing

solutions for embedded systems where reliability, high availability, fault tolerance and/or

deterministic real-time responses are needed.

Cortex-R processors are used in products where performance requirements and timing deadlines

must always be met.

In addition, Cortex-R processors are used in electronic systems which must be functionally safe

to avoid hazardous situations, for example, in medical applications or autonomous systems.

ARM recently (2017) unveiled its next-generation CPU cores, the Cortex-A75 and Cortex-A55, which are the first processors to support the company’s new DynamIQ multi-core technology.

This set of new processors provides the brainpower for mobile devices to cope with advanced artificial intelligence (AI), virtual reality (VR) and mixed reality (MR) technologies.

The Cortex-A75 is the successor to ARM’s high-performance A73 and A72, while the new Cortex-A55 is a more power-efficient replacement for the popular Cortex-A53.

The Cortex-A75 is the new flagship-tier mobile processor design, with a claimed 22 percent improvement in performance over the incumbent A73.

It is joined by the new Cortex-A55, which has the highest power efficiency of any mid-range CPU ARM has ever designed, and the Mali-G72 graphics processor, which also comes with a 25 percent improvement in efficiency relative to its predecessor, the G71.

A brief comparison of different ARM families is presented below:

***********