
ARM Processor Architecture (I)


Speaker: Lung-Hao Chang 張龍豪
Advisor: Prof. Andy Wu 吳安宇

Graduate Institute of Electronics Engineering,

National Taiwan University

Modified from National Chiao-Tung University IP Core Design course

SOC Consortium Course Material, ARM Platform Design, 09/21/2003

Outline

Thumb instruction set
ARM/Thumb interworking
ARM organization
Summary


Thumb instruction set


Thumb-ARM Difference

Thumb instruction set is a subset of the ARM instruction set and the instructions operate on a restricted view of the ARM registers

Most Thumb instructions are executed unconditionally (All ARM instructions are executed conditionally)

Many Thumb data processing instructions use a 2-address format, i.e. the destination register is the same as one of the source registers (ARM data processing instructions, with the exception of the 64-bit multiplies, use a 3-address format)

Thumb instruction formats are less regular than ARM instruction formats => dense encoding


Register Access in Thumb

Not all registers are directly accessible in Thumb
Low registers r0 – r7
– fully accessible
High registers r8 – r12
– only accessible with MOV, ADD, CMP
SP (Stack Pointer), LR (Link Register) & PC (Program Counter)
– limited accessibility; certain instructions have implicit access to these
CPSR
– only indirect access
SPSR
– no access


Thumb Accessible Registers

Shaded registers have restricted access


Branches

Thumb defines three PC-relative branch instructions, each of which has a different offset range
– Offset depends upon the number of available bits
Conditional Branches
– B<cond> label
– 8-bit offset: range of -128 to 127 instructions (+/-256 bytes)
– Only conditional Thumb instructions
Unconditional Branches
– B label
– 11-bit offset: range of -1024 to 1023 instructions (+/-2K bytes)
Long Branches with Link
– BL subroutine
– Implemented as a pair of instructions
– 22-bit offset: range of -2097152 to 2097151 instructions (+/-4M bytes)
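The ranges quoted for the three branch forms follow from the width of the signed offset field. The sketch below recomputes them, assuming the offset counts 2-byte Thumb instructions; the function name `thumb_branch_range` is made up for this illustration.

```python
def thumb_branch_range(offset_bits):
    """Return (min_instr, max_instr, reach_bytes) for a signed offset field.

    Assumption: the offset is a two's-complement count of 2-byte
    Thumb instructions.
    """
    half = 1 << (offset_bits - 1)
    min_instr = -half
    max_instr = half - 1
    reach_bytes = half * 2          # 2 bytes per Thumb instruction
    return min_instr, max_instr, reach_bytes


for name, bits in [("B<cond>", 8), ("B", 11), ("BL", 22)]:
    lo, hi, reach = thumb_branch_range(bits)
    print(f"{name}: {lo} to {hi} instructions (+/-{reach} bytes)")
```

The 22-bit case is the BL pair, where each of the two instructions is taken to carry half of the offset.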


Data Processing Instruction

Subset of the ARM data processing instructions
Separate shift instructions (e.g. LSL, ASR, LSR, ROR)
    LSL Rd,Rs,#Imm5   ;Rd:=Rs <shift> #Imm5
    ASR Rd,Rs         ;Rd:=Rd <shift> Rs
Two operands for data processing instructions
– Act on low registers
    BIC Rd,Rs         ;Rd:=Rd AND NOT Rs
    ADD Rd,#Imm8      ;Rd:=Rd+#Imm8
– Also three operand forms of add, subtract and shifts
    ADD Rd,Rs,#Imm3   ;Rd:=Rs+#Imm3
Condition code always set by low register operations


Load or Store Register

Two pre-indexed addressing modes
– Base register + offset register
– Base register + 5-bit offset, where offset scaled by
• 4 for word accesses (range of 0-124 bytes / 0-31 words)
– STR Rd,[Rb,#Imm7]
• 2 for halfword accesses (range of 0-62 bytes / 0-31 halfwords)
– LDRH Rd,[Rb,#Imm6]
• 1 for byte accesses (range of 0-31 bytes)
– LDRB Rd,[Rb,#Imm5]
Special forms
– Load with PC as base with 1K byte immediate offset (word aligned)
• Used for loading a value from a literal pool
– Load and store with SP as base with 1K byte immediate offset (word aligned)
• Used for accessing local variables on the stack
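The scaled 5-bit offset can be checked with a few lines of arithmetic. This is a minimal sketch, not an assembler: `ACCESS_SIZE`, `encode_imm5` and `max_byte_offset` are hypothetical names, and the only rule modelled is "byte offset divided by access size must fit in 5 bits".

```python
ACCESS_SIZE = {"word": 4, "halfword": 2, "byte": 1}  # bytes per access


def encode_imm5(byte_offset, access):
    """Return the 5-bit field value for a byte offset, or raise ValueError."""
    scale = ACCESS_SIZE[access]
    if byte_offset % scale != 0:
        raise ValueError("offset is not a multiple of the access size")
    field = byte_offset // scale
    if not 0 <= field <= 31:
        raise ValueError("scaled offset does not fit in 5 bits")
    return field


def max_byte_offset(access):
    """Largest byte offset reachable with a 5-bit scaled field."""
    return 31 * ACCESS_SIZE[access]


for access in ("word", "halfword", "byte"):
    print(access, "0 to", max_byte_offset(access), "bytes")
```

The three results reproduce the 0-124, 0-62 and 0-31 byte ranges listed above.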


Block Data Transfers

Memory copy, incrementing base pointer after transfer
– STMIA Rb!, {Low Reg list}
– LDMIA Rb!, {Low Reg list}

Full descending stack operations
– PUSH {Low Reg list}
– PUSH {Low Reg list, LR}
– POP {Low Reg list}
– POP {Low Reg list, PC}

The optional addition of the LR/PC provides support for subroutine entry/exit


Thumb Instruction Entry and Exit

T bit, bit 5 of CPSR
– If T = 1, the processor interprets the instruction stream as 16-bit Thumb instructions
– If T = 0, the processor interprets it as standard ARM instructions

Thumb Entry
– ARM cores start up, after reset, executing ARM instructions
– Executing a Branch and Exchange instruction (BX)
• Set the T bit if the bottom bit of the specified register was set
• Switch the PC to the address given in the remainder of the register

Thumb Exit
– Executing a Thumb BX instruction
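The effect of BX on the T bit and the PC can be modelled in a couple of lines. This is only a sketch of the rule stated above (bottom bit selects the state, the remainder is the target); the function name `bx` and the 32-bit mask are assumptions of the illustration.

```python
def bx(register_value):
    """Model BX Rn: return (new_pc, t_bit) for a 32-bit register value."""
    t_bit = register_value & 1               # bottom bit: 1 = Thumb, 0 = ARM
    new_pc = register_value & 0xFFFFFFFE     # remainder of the register is the target
    return new_pc, t_bit


print(bx(0x8001))   # target 0x8000, enter Thumb state
print(bx(0x8000))   # target 0x8000, enter (or stay in) ARM state
```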


Miscellaneous

Thumb SWI instruction format
– Same effect as ARM, but SWI number limited to 0-255
– Syntax:
• SWI <SWI number>
– Encoding: bits 15-8 = 1 1 0 1 1 1 1 1, bits 7-0 = SWI number

Indirect access to CPSR and no access to SPSR, so no MRS or MSR instructions

No coprocessor instruction space
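The Thumb SWI format above (a fixed upper byte of 1101 1111 and an 8-bit SWI number) can be packed as follows. This is a small sketch; `encode_thumb_swi` is a hypothetical helper, not part of any toolchain.

```python
def encode_thumb_swi(number):
    """Return the 16-bit Thumb SWI encoding for a SWI number in 0-255."""
    if not 0 <= number <= 255:
        raise ValueError("Thumb SWI number must be in the range 0-255")
    return (0b11011111 << 8) | number   # bits 15-8 fixed, bits 7-0 = number


print(hex(encode_thumb_swi(0x12)))
```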


ARM Thumb-2 core technology

New instruction set for the ARM architecture
Enhanced levels of performance, energy efficiency, and code density for a wide range of embedded applications


Thumb Instruction Set (1/3)


Thumb Instruction Set (2/3)


Thumb Instruction Set (3/3)


Thumb Instruction Format


ARM/Thumb interworking


The Need for Interworking

The code density of Thumb and its performance from narrow memory make it ideal for the bulk of C code in many systems. However, there is still a need to change between ARM and Thumb state within most applications:
– ARM code provides better performance from wide memory
• Therefore ideal for speed-critical parts of an application
– Some functions can only be performed with ARM instructions, e.g.
• Access to CPSR (to enable/disable interrupts & to change mode)
• Access to coprocessors
– Exception handling
• ARM state is automatically entered for exception handling, but system specification may require usage of Thumb code for main handler
– Simple standalone Thumb programs will also need an ARM assembler header to change state and call the Thumb routine


ARM/Thumb Interworking

Interworking can be carried out using the Branch Exchange instruction
– BX Rn            ;Thumb state Branch Exchange
– BX<condition> Rn ;ARM state Branch Exchange

Can also be used as an absolute branch without a state change


Example

            ;start off in ARM state
            CODE32
            ADR r0,Into_Thumb+1 ;generate branch target
                                ;address & set bit 0,
                                ;hence arrive in Thumb state
            BX  r0              ;branch exchange to Thumb
            …
            CODE16              ;assemble subsequent as Thumb
Into_Thumb  …
            ADR r5,Back_to_ARM  ;generate branch target to
                                ;word-aligned address,
                                ;hence bit 0 is cleared.
            BX  r5              ;branch exchange to ARM
            …
            CODE32              ;assemble subsequent as ARM
Back_to_ARM …


ARM organization


3-Stage Pipeline ARM Organization

Register Bank
– 2 read ports, 1 write port, access any register
– 1 additional read port, 1 additional write port for r15 (PC)
Barrel Shifter
– Shift or rotate the operand by any number of bits
ALU
Address register and incrementer
Data Registers
– Hold data passing to and from memory
Instruction Decoder and Control

[Figure: 3-stage pipeline ARM datapath. Blocks: address register, incrementer, register bank (with PC), multiply, barrel shifter, ALU, data in register, data out register, instruction decode & control. Buses: A[31:0], D[31:0], A bus, B bus, ALU bus.]


3-Stage Pipeline (1/2)

Fetch
– The instruction is fetched from memory and placed in the instruction pipeline
Decode
– The instruction is decoded and the datapath control signals prepared for the next cycle
Execute
– The register bank is read, an operand shifted, the ALU result generated and written back into the destination register


3-Stage Pipeline (2/2)

At any time slice, 3 different instructions may occupy each of these stages, so the hardware in each stage has to be capable of independent operations

When the processor is executing data processing instructions, the latency = 3 cycles and the throughput = 1 instruction/cycle
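The latency and throughput figures can be checked with the usual pipeline fill formula. A minimal sketch, assuming single-cycle instructions and no stalls; `pipeline_cycles` and `throughput` are names invented for this illustration.

```python
def pipeline_cycles(n_instructions, n_stages=3):
    """Cycles to complete n single-cycle instructions with no stalls."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each later one adds one cycle.
    return n_stages + n_instructions - 1


def throughput(n_instructions, n_stages=3):
    """Average instructions completed per cycle."""
    return n_instructions / pipeline_cycles(n_instructions, n_stages)


print(pipeline_cycles(1))        # latency of one instruction
print(throughput(1000))          # approaches 1 instruction/cycle
```

For a long run of data processing instructions the fill time becomes negligible, which is where the figure of 1 instruction/cycle comes from.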


Multi-cycle Instruction

Memory access (fetch, data transfer) in every cycle
Datapath used in every cycle (execute, address calculation, data transfer)
Decode logic generates the control signals for the datapath use in the next cycle (decode, address calculation)


Data Processing Instruction

All operations take place in a single clock cycle

[Figure: datapath activity for data processing instructions. (a) register - register operations; (b) register - immediate operations, with the immediate taken from instruction bits [7:0]]


Data Transfer Instructions

Computes a memory address similar to a data processing instruction
Load instructions follow a similar pattern except that the data from memory only gets as far as the ‘data in’ register on the 2nd cycle and a 3rd cycle is needed to transfer the data from there to the destination register

[Figure: datapath activity for a store instruction. (a) 1st cycle - compute address, using the offset from instruction bits [11:0]; (b) 2nd cycle - store data & auto-index]


Branch Instructions

The third cycle, which is required to complete the pipeline refilling, is also used to make a small correction to the value stored in the link register so that it points directly at the instruction which follows the branch

[Figure: datapath activity for a branch instruction. (a) 1st cycle - compute branch target, using the offset from instruction bits [23:0] shifted by lsl #2; (b) 2nd cycle - save return address in R14]


Branch Pipeline Example

Breaking the pipeline
Note that the core is executing in the ARM state


5-Stage Pipeline ARM Organization

Tprog = Ninst * CPI / fclk

– Tprog: the time taken to execute a given program

– Ninst: the number of ARM instructions executed in the program => compiler dependent

– CPI: average number of clock cycles per instruction => hazards cause pipeline stalls

– fclk: clock frequency

Separate instruction and data memories => 5 stage pipeline

Used in ARM9TDMI
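The relation Tprog = Ninst * CPI / fclk is easy to evaluate numerically. The sketch below uses made-up figures (one million instructions, 100 MHz) purely to show how a lower CPI shortens the run time; none of the numbers come from a real core.

```python
def program_time(n_inst, cpi, f_clk_hz):
    """Execution time in seconds: Tprog = Ninst * CPI / fclk."""
    return n_inst * cpi / f_clk_hz


# Hypothetical workload: 1,000,000 instructions on a 100 MHz clock.
slow = program_time(1_000_000, 1.9, 100e6)   # assumed CPI with many stalls
fast = program_time(1_000_000, 1.5, 100e6)   # assumed CPI with fewer stalls
print(f"{slow * 1e3:.1f} ms vs {fast * 1e3:.1f} ms")
```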


5-Stage Pipeline Organization (1/2)

Fetch
– The instruction is fetched from memory and placed in the instruction pipeline
Decode
– The instruction is decoded and register operands read from the register file. There are 3 operand read ports in the register file so most ARM instructions can source all their operands in one cycle
Execute
– An operand is shifted and the ALU result generated. If the instruction is a load or store, the memory address is computed in the ALU

[Figure: 5-stage pipeline organization. Stages: fetch (I-cache, next pc, +4), instruction decode (I decode, register read, immediate fields), execute (shift, mul, ALU, forwarding paths), buffer/data (D-cache, load/store address, byte repl., rot/sgn ex), write-back (register write)]


5-Stage Pipeline Organization (2/2)

Buffer/Data
– Data memory is accessed if required. Otherwise the ALU result is simply buffered for one cycle
Write back
– The results generated by the instruction are written back to the register file, including any data loaded from memory

[Figure: 5-stage pipeline organization, same diagram as on the previous slide]


Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the instruction stream from being executed during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining.

There are three classes of hazards:
– Structural Hazards: They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
– Data Hazards: They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
– Control Hazards: They arise from the pipelining of branches and other instructions that change the PC.


Structural Hazards

When a machine is pipelined, the overlapped execution of instructions requires pipelining of functional units and duplication of resources to allow all possible combinations of instructions in the pipeline.

If some combination of instructions cannot be accommodated because of a resource conflict, the machine is said to have a structural hazard.


Example

A machine shares a single memory pipeline for data and instructions. As a result, when an instruction contains a data-memory reference (load), it will conflict with the instruction fetch of a later instruction (Instr 3):

Clock cycle number
Instr     1    2    3    4    5    6    7    8
load      IF   ID   EX   MEM  WB
Instr 1        IF   ID   EX   MEM  WB
Instr 2             IF   ID   EX   MEM  WB
Instr 3                  IF   ID   EX   MEM  WB


Solution (1/2)

To resolve this, we stall the pipeline for one clock cycle when a data-memory access occurs. The effect of the stall is actually to occupy the resources for that instruction slot. The following table shows how the stalls are actually implemented.

Clock cycle number
Instr     1     2     3     4     5     6     7     8     9
load      IF    ID    EX    MEM   WB
Instr 1         IF    ID    EX    MEM   WB
Instr 2               IF    ID    EX    MEM   WB
Instr 3                     stall IF    ID    EX    MEM   WB
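The stall in the table can be reproduced with a toy model of a single memory port. This is a sketch under simplifying assumptions (only loads use the port in MEM, and MEM is three cycles after IF); `schedule_fetches` is a hypothetical helper, not a description of any real core.

```python
def schedule_fetches(is_load):
    """Return the IF cycle of each instruction when one memory port is shared.

    is_load[i] is True when instruction i accesses data memory in its MEM
    stage, taken here to be 3 cycles after its IF stage (IF, ID, EX, MEM, WB).
    """
    busy = set()            # cycles in which the port serves a data access
    fetch_cycles = []
    cycle = 1
    for load in is_load:
        while cycle in busy:        # stall: the port is taken by an earlier load
            cycle += 1
        fetch_cycles.append(cycle)
        if load:
            busy.add(cycle + 3)     # reserve the port for this load's MEM stage
        cycle += 1
    return fetch_cycles


# load, Instr 1, Instr 2, Instr 3 as in the table
print(schedule_fetches([True, False, False, False]))
```

With the load first, the fourth instruction cannot fetch in cycle 4 and is pushed to cycle 5, matching the table.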


Solution (2/2)

Another solution is to use separate instruction and data memories.

The 5-stage ARM pipeline (as used in ARM9TDMI) has separate instruction and data memories (Harvard architecture), so it does not have this hazard


Data Hazards

Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on the unpipelined machine.

Clock cycle number
                1      2      3      4      5      6      7      8      9
ADD R1,R2,R3    IF     ID     EX     MEM    WB
SUB R4,R5,R1           IF     IDsub  EX     MEM    WB
AND R6,R1,R7                  IF     IDand  EX     MEM    WB
OR R8,R1,R9                          IF     IDor   EX     MEM    WB
XOR R10,R1,R11                              IF     IDxor  EX     MEM    WB


Forwarding

The data hazards introduced by this sequence of instructions can be solved with a simple hardware technique called forwarding.

Clock cycle number
                1      2      3      4      5      6      7
ADD R1,R2,R3    IF     ID     EX     MEM    WB
SUB R4,R5,R1           IF     IDsub  EX     MEM    WB
AND R6,R1,R7                  IF     IDand  EX     MEM    WB


Forwarding Architecture

Forwarding works as follows:
– The ALU result from the EX/MEM register is always fed back to the ALU input latches.
– If the forwarding hardware detects that the previous ALU operation has written the register corresponding to the source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file.

[Figure: 5-stage pipeline organization with the forwarding paths highlighted]


Forward Data

The first forwarding passes the value of R1 from EXadd to EXsub. The second forwarding passes the value of R1 from MEMadd to EXand. This code can now be executed without stalls.

Forwarding can be generalized to include passing the result directly to the functional unit that requires it

A result is forwarded from the output of one unit to the input of another, rather than just from the result of a unit to the input of the same unit.

Clock cycle number
                1       2       3       4       5       6       7
ADD R1,R2,R3    IF      ID      EXadd   MEMadd  WB
SUB R4,R5,R1            IF      ID      EXsub   MEM     WB
AND R6,R1,R7                    IF      ID      EXand   MEM     WB


Without Forward

Clock cycle number
                1      2      3      4      5      6      7      8      9
ADD R1,R2,R3    IF     ID     EX     MEM    WB
SUB R4,R5,R1           IF     stall  stall  IDsub  EX     MEM    WB
AND R6,R1,R7                  stall  stall  IF     IDand  EX     MEM    WB
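The 9-cycle and 7-cycle totals for the ADD/SUB/AND sequence can be reproduced with a small timing model. This is a sketch for ALU instructions only; it assumes that, without forwarding, a source register can be read in ID during the same cycle its producer is in WB (as the table above shows). The function `writeback_cycles` is a name chosen for this illustration.

```python
def writeback_cycles(program, forwarding):
    """Return the WB cycle of each instruction in a 5-stage pipeline.

    program is a list of (dest, sources) tuples for ALU instructions.
    Stage order is IF, ID, EX, MEM, WB, so WB is 3 cycles after ID.
    """
    producer_id = {}      # register name -> ID cycle of its latest producer
    result = []
    prev_id = 1           # makes the first instruction decode in cycle 2
    for dest, sources in program:
        id_cycle = prev_id + 1
        if not forwarding:
            for src in sources:
                if src in producer_id:
                    # wait until the producer reaches WB (its ID cycle + 3)
                    id_cycle = max(id_cycle, producer_id[src] + 3)
        producer_id[dest] = id_cycle
        result.append(id_cycle + 3)
        prev_id = id_cycle
    return result


# ADD R1,R2,R3 / SUB R4,R5,R1 / AND R6,R1,R7
program = [("R1", ["R2", "R3"]), ("R4", ["R5", "R1"]), ("R6", ["R1", "R7"])]
print(writeback_cycles(program, forwarding=False))
print(writeback_cycles(program, forwarding=True))
```

The last entry of each result is the total cycle count: 9 without forwarding and 7 with it.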


Data Forwarding

Data dependency arises when an instruction needs to use the result of one of its predecessors before the result has returned to the register file => pipeline hazards

Forwarding paths allow results to be passed between stages as soon as they are available

5-stage pipeline requires each of the three source operands to be forwarded from any of the intermediate result registers

Still one load stall
    LDR rN, […]
    ADD r2,r1,rN ;use rN immediately
– One stall
– Compiler rescheduling


Stalls are required

                1      2      3      4      5      6      7      8
LDR R1,@(R2)    IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     ID     EXsub  MEM    WB
AND R6,R1,R7                  IF     ID     EXand  MEM    WB
OR R8,R1,R9                          IF     ID     EX     MEM    WB

The load instruction has a delay or latency that cannot be eliminated by forwarding alone.


The Pipeline with one Stall

                1      2      3      4      5      6      7      8      9
LDR R1,@(R2)    IF     ID     EX     MEM    WB
SUB R4,R1,R5           IF     ID     stall  EXsub  MEM    WB
AND R6,R1,R7                  IF     stall  ID     EX     MEM    WB
OR R8,R1,R9                          stall  IF     ID     EX     MEM    WB

The only necessary forwarding is done for R1 from MEM to EXsub.


LDR Interlock

[Figure: pipeline diagram over cycles 1-9 for the instruction sequence below]

ADD R1,R1,R2
SUB R3,R4,R1
LDR R4,[R7]
ORR R8,R3,R4
AND R6,R3,R1
EOR R3,R1,R2

F-Fetch D-Decode E-Execute I-Interlock M-Memory W-Writeback

In this example, it takes 7 clock cycles to execute 6 instructions, CPI of 1.2

An LDR instruction immediately followed by a data operation using the same register causes an interlock


Optimal Pipelining

[Figure: pipeline diagram over cycles 1-9 for the rescheduled instruction sequence below]

ADD R1,R1,R2
SUB R3,R4,R1
LDR R4,[R7]
AND R6,R3,R1
ORR R8,R3,R4
EOR R3,R1,R2

F-Fetch D-Decode E-Execute I-Interlock M-Memory W-Writeback

In this example, it takes 6 clock cycles to execute 6 instructions, CPI of 1

The LDR instruction does not cause the pipeline to interlock
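The 7-cycle and 6-cycle counts for the two orderings can be reproduced with a very small interlock model. This is a sketch, not the ARM9TDMI's actual interlock logic: it charges one cycle per instruction plus one extra cycle whenever the instruction directly after an LDR reads the loaded register. The name `execute_cycles` and the tuple layout are assumptions of the illustration.

```python
def execute_cycles(program):
    """Count execute cycles, adding one interlock cycle per load-use pair.

    program is a list of (op, dest, sources) tuples.
    """
    cycles = 0
    prev = None
    for op, dest, sources in program:
        cycles += 1
        if prev is not None and prev[0] == "LDR" and prev[1] in sources:
            cycles += 1             # loaded value is not ready yet: interlock
        prev = (op, dest, sources)
    return cycles


interlocked = [
    ("ADD", "R1", ["R1", "R2"]),
    ("SUB", "R3", ["R4", "R1"]),
    ("LDR", "R4", ["R7"]),
    ("ORR", "R8", ["R3", "R4"]),    # uses R4 right after the load
    ("AND", "R6", ["R3", "R1"]),
    ("EOR", "R3", ["R1", "R2"]),
]
# Rescheduled: the independent AND is moved between the LDR and the ORR.
rescheduled = interlocked[:3] + [interlocked[4], interlocked[3], interlocked[5]]

for name, prog in [("interlocked", interlocked), ("rescheduled", rescheduled)]:
    n = execute_cycles(prog)
    print(f"{name}: {n} cycles, CPI {n / len(prog):.2f}")
```

Moving one independent instruction into the slot after the load is the kind of compiler rescheduling that removes the interlock.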


LDM Interlock (1/2)

In this example, it takes 8 clock cycles to execute 5 instructions, CPI of 1.6

During the LDM there are parallel memory and write back cycles

[Figure: pipeline diagram over cycles 1-10 for the instruction sequence below]

LDMIA R13!,{R0-R3}
SUB R9,R7,R2
STR R4,[R9]
ORR R8,R4,R3
AND R6,R3,R1

F-Fetch D-Decode E-Execute I-Interlock M-Memory MW-Simultaneous Memory and Writeback W-Writeback


LDM Interlock (2/2)

In this example, it takes 9 clock cycles to execute 5 instructions, CPI of 1.8

The SUB incurs a further cycle of interlock due to it using the highest specified register in the LDM instruction

[Figure: pipeline diagram over cycles 1-10 for the instruction sequence below]

LDMIA R13!,{R0-R3}
SUB R9,R7,R3
STR R4,[R9]
ORR R8,R4,R3
AND R6,R3,R1

F-Fetch D-Decode E-Execute I-Interlock M-Memory MW-Simultaneous Memory and Writeback W-Writeback


Control hazards (1/2)

Branch              IF         ID         EXE        MEM        WB
Branch successor               IF (stall) stall      IF         ID         EXE        MEM        WB
Branch successor+1                                              IF         ID         EXE        MEM        WB

Control hazards can cause a greater performance loss for the ARM pipeline than data hazards.

When a branch is executed, it may or may not change the PC (program counter) to something other than its current value plus 4.

The simplest method of dealing with branches is to stall the pipeline as soon as the branch is detected until we reach the EX stage


Control hazards (2/2)

The number of clock cycles can be reduced by two steps
– Find out whether the branch is taken or not taken earlier in the pipeline
– Compute the taken PC (i.e., the address of the branch target) earlier

We will discuss branch prediction schemes


Branch prediction

One branch prediction scheme is to predict the branch as not taken, simply allowing the hardware to continue as if the branch were not executed.

Care must be taken not to change the machine state until the branch outcome is definitely known.


Predict Not Taken

The pipeline with this scheme implemented behaves as shown below:

Untaken Branch Instr  IF    ID    EXE   MEM   WB
Instr i+1                   IF    ID    EXE   MEM   WB
Instr i+2                         IF    ID    EXE   MEM   WB

Taken Branch Instr    IF    ID    EXE   MEM   WB
Instr i+1                   IF    idle  idle  idle  idle
Branch target                     IF    ID    EXE   MEM   WB
Branch target+1                         IF    ID    EXE   MEM   WB


Predict Taken

An alternative scheme is to predict the branch as taken.

ARM employs a static branch prediction mechanism
– Conditional branches that branch backwards are predicted to be taken
– Conditional branches that branch forwards are predicted not to be taken
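The static rule above needs only the branch address and its target. The sketch below states the rule and applies it to a hypothetical 10-iteration loop whose closing branch jumps backwards; the addresses and the function name `predict_taken` are invented for the illustration.

```python
def predict_taken(branch_address, target_address):
    """Static prediction: backward branches taken, forward branches not taken."""
    return target_address < branch_address


# Hypothetical loop: the branch at 0x120 jumps back to 0x100 nine times,
# then falls through on the tenth iteration.
outcomes = [True] * 9 + [False]
correct = sum(predict_taken(0x120, 0x100) == taken for taken in outcomes)
print(f"{correct} of {len(outcomes)} predictions correct")
```

Loops are the reason for the backward-taken choice: the closing branch of a loop is taken on every iteration except the last.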


Summary

Instruction set
– 32 bit ARM instruction
– 16 bit Thumb instruction
ARM/Thumb interworking
ARM organization
– 3-stage pipeline
• Fetch/Decode/Execute
– 5-stage pipeline
• Fetch/Decode/Execute/Buffer/Write Back
• Pipeline hazards
– Structural hazard
– Data hazard
– Control hazard
