Transitioning to Cortex-M3 based MCUs - RTC Grouprtcgroup.com/arm/2007/presentations/129 - Transitioning to Cortex... · Transitioning to Cortex-M3 based MCUs Paul Kimelman - CTO

Transitioning to Cortex-M3 based MCUs

Paul Kimelman - CTO
Luminary Micro, Inc

Confidential

Contents
  Introduction
  C/C++ with no assembly code needed?
  Performance and code size – what you should know
  Interrupts and style of application
  High integration to save BOM cost
  Use of special instructions – from C/C++


Introduction
  I assume you have heard Shyam’s presentation
  Focus on transition vs. porting
    Porting is mostly about making it work
    Transitioning covers:
      Change in peripheral interface
      Change in application approach
      Porting issues for performance, size, behavior
  Will cover issues coming from 8-bit/16-bit, ARM7, and ARM9
  Time to unlearn bad habits forced on you
  Do not be fooled by MHz or “MIPS”
  Application style to best fit the needs
  More integrated in HW and more done in SW
    Focus on code/data size, performance, BOM cost, power


C/C++ vs. Assembly
  Why do you normally end up having to write assembly?
    Vector table (needs “call” or “mov”, etc)
    Interrupt entry/exit stubs
      Keyword usually will not support priority nesting
    Compiled code too big and/or slow – parts of application must be hand coded
    Specialized features, not compiler friendly
    Initialization code – unless acceptable one in C runtime lib
  So, why not with Cortex-M3?
    Vector table is C array of pointers. 1st entry is stack pointer.
    All ISRs are normal C functions with no special keyword, even Reset
      Priority nesting supported in all cases (including faults, system handling)
    Instruction set is compiler friendly. Compilers can detect cases of special instructions
      e.g. REVerse and REV16: ((x & 0x00ff) << 8) | ((x & 0xff00) >> 8)
    Initialization is C function (ResetISR) with stack already set up by HW
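
The "C array of pointers" idea can be sketched on a host machine as below. This is only an illustration: NmiISR, the stack array, the union type and dispatch() are made-up names, and on a real part the linker script (not shown) would place the table at address 0 and the hardware would do the fetch and call.

```c
#include <stdint.h>

typedef void (*handler_t)(void);

/* One table slot: entry 0 holds the initial stack pointer, the rest hold
   addresses of ordinary C functions. */
typedef union {
    handler_t handler;
    void *stack_top;
} vector_entry_t;

static uint32_t stack[64];           /* stand-in for the real stack region */
static int reset_count, nmi_count;

static void ResetISR(void) { reset_count++; }  /* plain C, no keyword */
static void NmiISR(void)   { nmi_count++;   }  /* hypothetical name   */

static const vector_entry_t vector_table[] = {
    { .stack_top = stack + 64 },     /* 0: initial stack pointer */
    { .handler = ResetISR },         /* 1: reset                 */
    { .handler = NmiISR },           /* 2: NMI                   */
};

/* Host-side stand-in for the hardware's vector fetch and call. */
static void dispatch(unsigned vector)
{
    vector_table[vector].handler();
}
```

Because the handlers are normal functions, the same function can be listed in more than one slot.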


Coming from PIC
  Cortex-M3 uses standard C code
    No #fuses or #uses, set configuration as and when needed
    No #INT_xxx function tags – just normal C functions
      Just point to function from one or more vectors in C array
      ISRs can call other functions, have all registers available
        Stack may be common or separate one for all ISRs
      Hardware routes directly to each ISR – no software looking at flags
      Can change ISRs dynamically (vector table can be moved to SRAM or elsewhere in Flash, not just one “alternate” as in PIC24+)
      At least 8 priorities (vs. 2 on 8-bit, 7 on PIC24+), easily set/changed, with priority masking. Faults can be prioritized also
      NMI for safety use - cannot be masked off
  All GPIOs/Peripherals are direct writable/readable/configurable.
    ARM GPIOs allow up to 8 GPIOs to be accessed in one LDR/STR


Coming from 8051
  One unified address space, divided into ½GB regions
    Same instructions used for all locations, no speed penalty
    Code (Flash) from 0, SRAM (internal) from 0x20000000, Peripherals (internal) from 0x40000000
    External RAM/Peripherals in middle (but not likely used)
    System registers (interrupt controller, etc) from 0xE0000000
    Bit access for 1st 1MB of RAM and 1st 1MB of Peripherals
      Same model as 8051 (can access same location by bits & byte/half/word)
      RMW is atomic
      Does not need special instructions, so compiler friendly
        Any pointer or variable may be used
  System/peripheral registers (SFR) are memory mapped
    All accessible using normal C code
  Similar interrupt model: enables, priority, fixed assignments
    More than two levels of priority, no SW save of PSW/ACC/etc, vector pointers
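
The bit access above works through an alias region: every bit in the first 1MB of SRAM or peripherals has its own word address. A minimal sketch of the address arithmetic follows. The alias bases 0x22000000 and 0x42000000 come from ARM's Cortex-M3 memory map rather than from these slides, and bitband_alias() is a name made up here.

```c
#include <stdint.h>

#define SRAM_BASE          0x20000000u
#define SRAM_ALIAS_BASE    0x22000000u
#define PERIPH_BASE        0x40000000u
#define PERIPH_ALIAS_BASE  0x42000000u

/* Each bit of the first 1MB of a region maps to one 32-bit word in the
   alias region: 32 alias bytes per source byte, 4 alias bytes per bit. */
static uint32_t bitband_alias(uint32_t region_base, uint32_t alias_base,
                              uint32_t byte_addr, unsigned bit)
{
    uint32_t byte_offset = byte_addr - region_base;
    return alias_base + byte_offset * 32u + bit * 4u;
}
```

On target, writing 1 or 0 through a volatile pointer to the returned address sets or clears that single bit with an ordinary store, which is why any pointer or variable may be used.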


Coming from MSP430
  Cortex-M3 uses standard C/C++
    32-bit words vs. 16-bit
    Contiguous RAM, starts at ½GB
    Contiguous Flash for code and data/configuration, starts at 0
    Multiply (and Divide) is safe for interrupts
  Interrupts are also vectored
    User set priority vs. position in table
    Nesting is automatic (by priority) vs. GIE
    No special code or special work in C
  RISC oriented instruction set
    13 general purpose registers (3 more are for SP, LR, PC)
    Constants from MOV instruction
    No indirect addressing, but PC-relative address “literals” in Flash
    Most instructions are 1 cycle, not related to size
  ARM GPIOs are similar design
    But consistent (all follow same rules) and more pin control


Performance
  Do not be fooled by MHz or “MIPS”
    Only valid measure is amount of work done in a given period of time
    Faster MHz for many processors/MCUs barely runs faster
      Introduces code wait states (Flash and/or RAM)
      Stalls on peripherals – often big part of application
      Non-deterministic behavior
        Instruction “prediction”, branch caches, caches, etc
    MIPS is measure of instruction set style, not work
      E.g. 3-12 cycle hardware 32-bit/32-bit DIVide vs. 50+ cycles in software
        50+ cycles has higher MIPS!
      Less RISCy instructions get more work done, lower MIPS number
      Better code density from less RISCy instructions
    DMIPS mainly tells you about 3 things – memcpy(), strcmp(), and div
      Div and strcmp() are often “gamed” by compiler vendors
      More and more compilers “cheat” using auto-inlining, whole program opt
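
To put numbers on the divide example, here is a small sketch. The inputs are assumptions for illustration: a 50MHz clock, a software divide made of 50 single-cycle instructions, and a hardware divide that is one instruction taking the worst-case 12 cycles. Both helper names are made up.

```c
#include <stdint.h>

/* Time for a stretch of code, in nanoseconds. */
static uint64_t elapsed_ns(uint64_t clock_hz, uint64_t cycles)
{
    return cycles * 1000000000ull / clock_hz;
}

/* "MIPS" over the same stretch, in thousands of instructions per second
   so the arithmetic stays in integers. */
static uint64_t kilo_instr_per_sec(uint64_t clock_hz, uint64_t instrs,
                                   uint64_t cycles)
{
    return instrs * clock_hz / cycles / 1000u;
}
```

Under these assumptions the software divide takes 1000ns and scores 50 MIPS, while the hardware divide takes 240ns and scores about 4.2 MIPS: the same work finishes roughly four times sooner with the lower MIPS number.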


Peripherals and their bus interface
  Wait states on peripherals are a hidden cost
    Watch for slower peripheral buses – on any processor
    When peripheral bus is slower than core clock – wait states
      Impacts even a store when you have to write more than one in a row
      Impacts maximum toggle rate of GPIOs, ability to feed/drain data, etc
  You want a fast bus, regardless of peripheral rate
    Wait states mean the processor is stalled
      Affects what you can do, but also interrupt latency
    Ideal is to feed/drain the peripheral FIFO quickly, then have lots of time before you need to service the peripheral again
    Even more critical if you have to bit-bang through GPIOs


Performance – interrupt overhead
  Real measure: time from HW trigger to 1st line of real user code
    Longest instruction which stops interrupt adds to latency (tA)
      e.g. LDM of 8 elements on ARM7 holds off interrupts for 10 cycles if from 0 wait state memory, for 26 cycles if 2 wait state peripheral, etc
      Cortex-M3 uses interrupt-continue for LDM/STM and abandon for DIV/UMLAL/etc
    Pushing registers, messing with modes, etc (tB)
      Many example applications use direct entry, but that does not scale to multiple interrupts or multiple at same time (nesting based on priority)
        Often more than 20 cycles of difference in timing when allow nesting
      Cortex-M3 does in HW in 12 cycles (saves registers and loads pipeline).
    So, user code is now running – but be aware of function prologue on any.
    Popping registers and resetting interrupt controller (tD)
      Even when leaving one to enter another – pops all then pushes again
      Cortex-M3 “tail chains” – skips pop/push and just jumps to new ISR (skips tE)
    Higher priority interrupt held off by any of above
      This is the case you have to allow for. If no nesting, then add longest ISR! (tC)
      Cortex-M3: full priorities and nesting, pre-empt anytime, take over during transitions
[Figure: interrupt timeline divided into phases tA, tB, tC, tD, tE]
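
The two LDM figures quoted for tA fit a simple model, sketched below. The model is an assumption chosen because it reproduces both numbers: one bus access per register, each stretched by the wait states, plus two fixed cycles. ldm_holdoff_cycles() is a made-up name.

```c
/* Cycles for which an ARM7 LDM of n_regs registers holds off interrupts,
   under the assumed model: n * (1 + wait states) + 2. */
static unsigned ldm_holdoff_cycles(unsigned n_regs, unsigned wait_states)
{
    return n_regs * (1u + wait_states) + 2u;
}
```

With 8 registers this gives 10 cycles at 0 wait states and 26 cycles at 2 wait states, which is the latency a high priority interrupt has to absorb on a core that cannot interrupt the instruction.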


Interrupt jitter
  Interrupt jitter is variability of response to interrupt trigger (external or internal)
    Priority jitter is a given (higher priority interrupt should delay lower one).
    Jitter on high priority interrupt is a serious matter.
      Most common jitter cause is high priority interrupt being held off when in overhead for lower priority one (registers/mode-save).
      Even worse is case where processor does not allow nesting
        High priority interrupt delayed by length of lower priority ISR
[Figure: time to ISR on different invocations, plotted from the trigger along time axis t; the range of time before the ISR is serviced is the jitter]


Interrupt response jitter
  If you have two (or more) interrupts, what happens when they intersect?
    Gpio1 is higher priority than Gpio2. Gpio2 is fixed periodic.
    Both ISRs take the same time (for this example).
    Shows skew in start time for Gpio2 and Gpio1.
      Expect priority based jitter for Gpio2
      Issue is Gpio1 jitter (purple double line)
  1. Gpio2 completes before Gpio1 (no delay for either)
  2. Gpio1 comes when in Gpio2 (Gpio2 gets delayed if nesting)
  3. Gpio1 comes before Gpio2 (Gpio2 gets delayed)
  4. Gpio1 comes during overhead (depends on processor)
[Figure: the four cases relative to the Gpio2 trigger, each shown for CM3, other with nesting, and other with no nesting. Key: interrupt entry overhead, interrupt exit overhead, pre-empted]


Interrupt response jitter
  If you have two (or more) interrupts, what happens when they intersect?
    Gpio1 is higher priority than Gpio2. Gpio1 is fixed periodic (this time).
    Both ISRs take the same time (for this example).
    Shows skew in start time for Gpio2 and Gpio1.
      Expect priority based jitter for Gpio2
      Issue is jitter for Gpio1 (purple double line)
  1. Gpio1 completes before Gpio2 (no delay for either)
  2. Gpio2 comes when in Gpio1 (Gpio2 gets delayed - pri)
  3. Gpio2 comes before Gpio1 (Gpio2 gets pre-empted if can)
  4. Gpio2 comes earlier before Gpio1 (Gpio2 gets pre-empted if can)
[Figure: the four cases relative to the Gpio1 trigger, each shown for CM3, other with nesting, and other with no nesting]


Effect of Critical sections
  Critical sections tend to “pack” interrupts at the enable point.
    This is made worse when triggers are result of outputs (cycles)
      The input/output cycle moves against enable
      Over time, more do this, and inputs tend to land on each other
    Long latency instructions, on processors that block, do this too
  CM3 provides ways to mitigate most and avoid many
    When you need them, use priority masking (BASEPRI) not disable
      Don’t punish the ISRs that are not using the critical data!
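
How priority masking differs from a global disable can be modelled on a host as below. This is a behavioural sketch, not target code: basepri is an ordinary variable standing in for the register, the function names are made up, and plain level numbers are used where real hardware keeps the priority in the upper bits of an 8-bit field. BASEPRI_MAX is the conditional-write form of the same register.

```c
#include <stdbool.h>
#include <stdint.h>

static uint8_t basepri;  /* stand-in for BASEPRI; 0 means no masking */

/* Lower number = more urgent. An interrupt is masked when its priority
   number is greater than or equal to a non-zero BASEPRI. */
static bool can_activate(uint8_t irq_priority)
{
    return basepri == 0 || irq_priority < basepri;
}

/* BASEPRI_MAX behaviour: the write only lands if it raises the mask. */
static void write_basepri_max(uint8_t value)
{
    if (value != 0 && (basepri == 0 || value < basepri))
        basepri = value;
}

/* Set, critical section, restore. */
static uint8_t enter_critical(uint8_t mask_level)
{
    uint8_t saved = basepri;
    write_basepri_max(mask_level);
    return saved;
}

static void exit_critical(uint8_t saved)
{
    basepri = saved;
}
```

With the mask at 5, priorities 0 to 4 still activate, so ISRs that never touch the critical data see no added latency.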


Performance and size
  Code ported from 8-bit/16-bit may bloat on 32-bit
    Short/char locals can cause 40%+ increase in size and speed impact
      Use ints (unsigned, int, long, unsigned long) – they are optimal
      Can up-cast from smaller global/statics (e.g. extern short x; int lx = (int)x;)
    Do not take address of local, forces to stack – otherwise in register only
  How you access peripherals affects performance and size a lot
    Casted constants may be worst way! (e.g. *((unsigned*)0x40001008))
    Smaller number of larger functions more optimal (opposite of 8-bit)
    Back-to-back loads from peripheral are faster and smaller
    Avoiding back-to-back stores to peripheral is faster
  Use optimizer
    Many 8-bit/16-bit compilers have no real optimizer – very important on 32-bit
    Code size and performance are dramatically affected (often >30%)
    Check if compiler defaults to optimize for size or speed – not consistent
    Use volatile for peripheral pointers (#define or not) and peripheral objects
      Optimizer may get rid of code, reverse order, or otherwise “optimize”
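
One alternative to casted constants is a volatile register struct reached through a single base pointer. The sketch below assumes a made-up three-register layout and function name; on target the pointer would be a fixed peripheral base address, while here it is passed in so the code also runs on a host.

```c
#include <stdint.h>

/* Hypothetical register layout, for illustration only. */
typedef struct {
    volatile uint32_t data;
    volatile uint32_t status;
    volatile uint32_t control;
} periph_regs_t;

/* One base pointer serves every register, so the compiler can load the
   base once and use short offset addressing for each access. volatile
   keeps the optimizer from dropping or reordering the two reads. */
static uint32_t read_data_and_status(periph_regs_t *p)
{
    uint32_t data = p->data;        /* back-to-back loads */
    uint32_t status = p->status;
    return (status << 16) | (data & 0xffffu);
}
```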


Using locals smaller than register size

Locals of size short int (half word)

typedef short BASE;
BASE foo(BASE last, BASE x, BASE y) {
   0: f04f 0c00   mov.w  ip, #0         ; 0x0
   4: e004        b.n    10 <foo+0x10>
    BASE i;
    for (i = 0; i < last; i++)
        x += (y * x);
   6: fb02 1301   mla    r3, r2, r1, r1
   a: f10c 0c01   add.w  ip, ip, #1     ; 0x1
   e: b219        sxth   r1, r3
  10: fa0f f38c   sxth.w r3, ip
  14: 4283        cmp    r3, r0
  16: dbf6        blt.n  6 <foo+0x6>
  18: ebc2 0001   rsb    r0, r2, r1
  1c: b200        sxth   r0, r0
    return(x-y);
}
  1e: 4770        bx     lr

Locals of size int

typedef int BASE;
BASE foo(BASE last, BASE x, BASE y) {
   0: 2300        movs   r3, #0
   2: e002        b.n    a <foo+0xa>
    BASE i;
    for (i = 0; i < last; i++)
        x += (y * x);
   4: fb02 1101   mla    r1, r2, r1, r1
   8: 3301        adds   r3, #1
   a: 4283        cmp    r3, r0
   c: dbfa        blt.n  4 <foo+0x4>
   e: ebc2 0001   rsb    r0, r2, r1
    return(x-y);
}
  12: 4770        bx     lr

A short int local added 12 extra bytes to a function of 20 bytes. Worse, it has added 2 extra cycles to each iteration (a 5 cycle loop).

Note: ARM7/ARM9 using Thumb code is 28 bytes with int (but much slower) and 40 bytes with short int (so, +12). The extra 12 bytes for the short int in Thumb are due to using shift-left and then shift-right to sign or unsign extend, so 4 extra cycles per loop.
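
When the data really is 16-bit, the up-cast advice can be applied at the edges only. A minimal sketch follows, in which x_global and foo_from_short() are made-up names: the loop works on int locals, and narrowing happens once on the way out.

```c
short x_global = 3;   /* the small type is fine for storage */

/* Same loop as in the listings, with every local held as int. */
int foo_int(int last, int x, int y)
{
    int i;
    for (i = 0; i < last; i++)
        x += (y * x);
    return x - y;
}

/* Wrapper for callers that keep short data: up-cast once on the way in,
   narrow once on the way out. */
short foo_from_short(short last, short y)
{
    int lx = (int)x_global;
    return (short)foo_int(last, lx, y);
}
```

The result matches the all-short version only while the intermediate values fit in 16 bits, so this is worth checking against the data ranges of the code being ported.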


Application style
  Application design affects performance, size, power use
  Three most common types
    Pure interrupt
    Polling (PLC, DSP style, event/PID loop, etc)
    Polling/RTOS with ISRs
  Many people move to polling due to processor issues
    When 30% or more lost to interrupts, context switching, etc, what choice?
  Pure interrupt ideal for many smaller applications
  Polling/RTOS with ISRs gives excellent design options
    Communications in ISRs
    Time critical operations in ISRs
    The rest is easier to design and program


Application design – mixed example

  Main application runs as foreground (base level)
    Easy to write since no “factoring” – just normal application or RTOS based
    Can use PLC style state-machine poll loop safely: ISRs keep data available
  ISRs for Motor control are highest priority(ies)
    PWM, ADCs, Timer(s), Fault (may be highest), Temp sensor, etc
  ISRs for communications below that
    Ethernet, CAN, and/or serial
  May use other priorities as needed
  Very fast interrupt response time, true nested interrupts, priority masking, easy ISR setup all contribute to making an easy solution
  Application uses priority masking vs. interrupt-disable if needs critical region
[Figure: activity over time t at three levels: motor control ISRs (e.g. PWM, ADC), communication ISRs (e.g. ENET, CAN), main application (foreground)]


Avoiding interrupt latency on Cortex-M3
  I have critical data, don’t I just create latency with int disable?
  Three easy ways to avoid this
    BASEPRI and BASEPRI_MAX: set priority to mask, don’t disable
      If critical data used by priorities 5 to 7, set BASEPRI to 5
        Interrupts 0 to 4 can still activate as normal (e.g. motor control)
      BASEPRI_MAX will only change if makes higher priority mask
        No compare needed. Set, critical-section, restore w/BASEPRI
    Exclusives (LDREX/STREX for byte, half, word)
      Much better than test-and-set
      ISRs can set/clear data non-locking/non-blocking
        main loop and lower priority ISRs just try again – no block/lock
      E.g. RTOS queues between thread/ISR with no critical section
    Bit band forms atomic read-modify-write on SRAM and Peripherals
      Set population/claim/request bits
        E.g. Thread-wake population bit + PendSV
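
The "just try again" pattern behind the exclusives can be written in portable C11 atomics, as sketched below. pending_mask and both function names are made up, and the expectation that a Cortex-M3 toolchain lowers these calls to LDREX/STREX loops is an assumption about the compiler, not something the slides state.

```c
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint32_t pending_mask;   /* hypothetical shared request bits */

/* Set request bits without disabling interrupts. If an ISR changed the
   word between the load and the store, the exchange fails, old is
   refreshed with the current value, and the loop tries again. */
static void post_request(uint32_t bits)
{
    uint32_t old = atomic_load(&pending_mask);
    while (!atomic_compare_exchange_weak(&pending_mask, &old, old | bits)) {
        /* retry with the refreshed value in old */
    }
}

/* Claim and clear all pending bits in one step. */
static uint32_t take_requests(void)
{
    return atomic_exchange(&pending_mask, 0u);
}
```

Nothing is ever locked: the higher priority side always completes, and only the interrupted side repeats its short read-modify-write.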


Polling vs. interrupt
  Polling is poor use of processor (wastes time)
    Introduces jitter (based on loop size, load time, etc)
    Performance degrades quickly as more checks added
    Most common reason used is easier to understand
  If interrupt overhead is low, it is better use
    Some processors add so much overhead that polling is better
    Cortex-M3 offers low overhead and low latency
      With multiple priorities and low latency, easily understood behavior
  FIFOed communication peripherals offer best of both
    Amortize whatever interrupt overhead, but no extra spins polling
    If interrupt overhead is too high, then FIFO needed just to work at all


Poll loop – simple example

for (i = 0; i < loops; i++) {                    // poll loop
    if (HWREG(GPIO_IN_PORT_CLOCK)) {             // detect high, drive high
        HWREG(GPIO_SCOPE_PORT) = PIN_OUT_SCOPE;
        break;
    } else {                                     // detect low, drive low
        HWREG(GPIO_SCOPE_PORT) = 0;
    }
}
// now capture data…


Polling: read of input, write output
[Scope capture, C-M3 (50MHz): INPUT (CLOCK) and OUTPUT (SCOPE) traces; 190ns fastest, 255ns jitter]


Polling: ARM7 (same clock speed)
[Scope capture, ARM7 (60MHz): INPUT (CLOCK) and OUTPUT (SCOPE) traces; 230ns fastest, 500ns jitter]


Interrupt driven

CM3 – can nest and prioritize, etc

void GPIO_trigger_ISR(void) {                    // on falling
    // define locals for rest of routine here
    HWREG(GPIO_SCOPE_PORT) = PIN_OUT_SCOPE;      // drive high
    // now capture data, etc…
    HWREG(GPIO_SCOPE_PORT) = 0;                  // drive low
    return;                                      // done
}

ARM7 version – no nest: one at a time (fastest if no nest)

__irq __arm void GPIO_trigger_ISR(void) {        // on falling
    // __irq means pushes registers at start, pops at
    // end, and special return instruction
    // define locals for rest of routine here
    HWREG(GPIO_SCOPE_PORT) = PIN_OUT_SCOPE;      // drive high
    // now capture data, etc…
    VICVectAddr = 0;                             // clear VIC
    HWREG(GPIO_SCOPE_PORT) = 0;                  // drive low
    return;                                      // done
}


Interrupt: drive high on falling edge (+work)
[Scope capture, C-M3 (50MHz): FRAME, CLOCK, DATA and SCOPE (ISR) traces; 540ns (27 cyc), min ~640ns, NO JITTER]
  Cost of interrupt (12), prologue of function, GPIO address load, STR and propagation. Will be less for some functions.


Interrupt: ARM7 (same operation)
[Scope capture, ARM7 (60MHz): FRAME, CLOCK, DATA and SCOPE (ISR) traces; 820ns (39 cyc), min ~1340ns]
  Note: time will increase if the interrupt arrives during a long instruction. Nesting support adds >18 cycles (mode change)


RTOS
  Concerns about using an RTOS
    Efficiency of Task Switching
    Extra Memory Used
  Cortex-M3 has many RTOS-friendly features
    Faster/easier context switch - PendSV
      Separation of service call (SVC) and context switch
    Option of separate thread vs. interrupt/system stack
      User/privilege for those that need it (use SVC vs. call)
    Standard timer, SYSTICK
    Standard interrupt controller
    MPU for safety


PendSV for context switch
  PendSV is software triggered exception
    Pended, so executes when priority allows
    Can be set by scheduler, ISR, or system code
    Can be used with SVC or not (all privileged code)
  Set at low(est) priority in the system
    Ensures it is the last handler to run (tail chaining)
  On entry, half of interrupted thread is already saved
  Steps are simple:
    Save other half on old process stack
    Retrieve new process stack from TCB
    Switch process stack
    Load half of new process context from process stack
    Exception return (loads rest in HW)
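
The five steps can be walked through with a host-side model, sketched below. Everything here is a stand-in: regs plays the live R4-R11, psp plays the process stack pointer, tcb_t is a made-up minimal TCB, and the hardware-saved half (R0-R3, R12, LR, PC, xPSR) is assumed to be handled by exception entry and return, so it does not appear. A real handler would do this in a few instructions of assembly or intrinsics.

```c
#include <stdint.h>
#include <string.h>

#define SW_SAVED 8                   /* R4-R11: the half software saves */

typedef struct {
    uint32_t *sp;                    /* saved process stack pointer */
} tcb_t;                             /* hypothetical minimal TCB    */

static uint32_t regs[SW_SAVED];      /* stand-in for live R4-R11    */
static uint32_t *psp;                /* stand-in for the PSP        */

static void pendsv_switch(tcb_t *old_task, tcb_t *new_task)
{
    psp -= SW_SAVED;                 /* 1. save other half on old stack  */
    memcpy(psp, regs, sizeof regs);
    old_task->sp = psp;
    psp = new_task->sp;              /* 2-3. retrieve and switch stack   */
    memcpy(regs, psp, sizeof regs);  /* 4. load half of new context      */
    psp += SW_SAVED;
}                                    /* 5. exception return, done in HW  */
```

A task that has never run needs its stack pre-loaded with a frame of SW_SAVED words so that the first restore has something to pop.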


RTOS using PendSV (and maybe SVC)
[Figures: timelines of threads T1, T2, T3 moving between levels. Key for all figures: App, System, Kernel, Tail chain]
  Threads User, System Privileged:
    Thread uses SVC to make system request. SVCall uses PendSV to cause dispatch if request causes blocking/thread-change.
  Threads Privileged:
    Thread calls system for request, which then uses SVC when blocking/thread-change needed. SVCall uses PendSV to cause dispatch.
    Thread calls system for request, which then uses PendSV to cause dispatch.
  Interrupt comes in, makes system call, changes next thread:
    Interrupt calls system, side effect is rescheduling needed, so pends PendSV.
    PendSV was re-pended and so tail-chains to itself – causing a possible rescheduling.


FreeRTOS.org Context Switching

ARM7 (SWI)
  Context Save
    Save R0
    Get task SP in R0
    Save return address
    Restore R0
    Push all registers (task stack)
    Push SPSR
    Push nesting depth on stack
    Store new task SP in TCB
    (19 ARM instructions, many cycles, ints blocked in Push)
  Context Restore
    Get task SP from TCB
    Pop nesting depth
    Pop SPSR
    Pop all registers
    Pop return address
    Return (new task)
    (12 ARM instructions, …)

Cortex-M3 (PendSV)
  Context Save
    Get PSP in R0
    Push R4-R11 on task stack
    Push nesting depth
    Store new SP in TCB
    (11 Thumb-2 instructions, far fewer instructions, ints not blocked)
  Context Restore
    Get SP from TCB
    Pop nesting depth
    Pop R4-R11
    Load PSP with new task stack
    If non-zero nesting, mask ints
    Return (new task)
    (14 Thumb-2 instructions, 12 or 13 executed)


Example FreeRTOS.org timing

                                   Cortex-M3        ARM7
  Image size                       3504             4676 (thumb)
  Switches/second                  250K/sec         145K/sec
  Time per switch (thread+kernel)  4 µs/switch†     6.9 µs/switch

  Simple 2-task application that switches between them
  Cortex-M3 at 50MHz, ARM7 at 60MHz
  GCC for both, Thumb mode for ARM7 threads
  † Cortex-M3 is even faster with newer FreeRTOS version that has just been released.
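
Because the two parts run at different clocks, the switch rates are easier to compare in core cycles. A small sketch of that conversion follows; cycles_per_switch() is a made-up helper, and it truncates to whole cycles.

```c
#include <stdint.h>

/* Core clock cycles consumed by one context switch at a measured rate. */
static uint32_t cycles_per_switch(uint32_t core_hz, uint32_t switches_per_sec)
{
    return core_hz / switches_per_sec;
}
```

With the figures above this comes to 200 cycles per switch on Cortex-M3 at 50MHz and a little over 410 cycles on ARM7 at 60MHz.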


Many excellent RTOS ports available
  CMX Systems CMX-RTX and CMX-Tiny
  Express Logic ThreadX
  FreeRTOS.org FreeRTOS
  IAR PowerPac
  Interniche NicheTask
  Keil/ARM RTX
  Micrium μC/OS-II
  Pumpkin Salvo
  Segger embOS
  Others…


Lower total BOM cost
  Do more in SW
    For example, motor control
    Bit-bang vs. CPLD or FPGA
    High speed serial to accomplish more
    Use lower cost components when can offload work
  Higher end peripherals
    More supportable with Cortex-M3, so can do more
    Can service higher rates
      e.g. 100baseT, 1Mbps CAN, 1Msps ADC, 25MHz SPI, etc
  Safety (e.g. IEC 61508)
    Faults, MPU, lock-up, NMI, prioritized ISRs for deterministic response
  What was two or three 8-bit MCUs can be done in one
    Acts like virtual multi-processor (via ISRs)


Special instructions
  Thumb-2 and Cortex-M3 have many special instructions
    Many are directly used by compiler
      e.g. SDIV/UDIV, MUL/MLA/MLS, UMULL/SMULL/SMLAL/UMLAL, SBFX/UBFX, BFI/BFC, MOVT/MOVW, SXTH/UXTH/SXTB/UXTB
    Some compilers may detect some cases and use:
      e.g. REV/REV16/REVSH, CLZ
      Else, use “instruction intrinsics” (e.g. ntohs/htons inlined)
    Others available through “instruction intrinsics”
      e.g. USAT, SSAT, RBIT, WFI, WFE, SEV, MSR, MRS, CPS, etc
  System features available as memory mapped registers
    NVIC controls, setup, management
    Most system controls, systick, reset control, MPU, etc
    MPU optimized to allow STM/LDM to handle multiple regions at once
      Also allows sub-regions for better granularity
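
For the instructions a compiler may or may not detect, plain C versions of the operations look like the sketch below. The function names are made up, and whether a given compiler turns them into REV16, REV or RBIT depends on the toolchain and optimization level, so check the generated code before relying on it.

```c
#include <stdint.h>

/* The 16-bit byte swap from the C/C++ vs. Assembly slide (REV16). */
static uint16_t swap16(uint16_t x)
{
    return (uint16_t)(((x & 0x00ffu) << 8) | ((x & 0xff00u) >> 8));
}

/* 32-bit byte reverse, the operation REV performs. */
static uint32_t swap32(uint32_t x)
{
    return (x << 24) | ((x & 0x0000ff00u) << 8) |
           ((x >> 8) & 0x0000ff00u) | (x >> 24);
}

/* Portable bit reverse, the operation RBIT performs in one instruction. */
static uint32_t bit_reverse32(uint32_t x)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        r = (r << 1) | (x & 1u);
        x >>= 1;
    }
    return r;
}
```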


Sleep primitives
  Sleep vs. Deep-sleep – memory mapped register
    Deep sleep allows chip vendor more cycles to wake up
  Sleep-on-exit control
    When last ISR returns, sleep
    Idle thread – skips pop/push for no purpose
  WFI – wait for interrupt: sleep until an interrupt wakes the core
  WFE – wait for event: sleep until an event arrives
    Trip-latch – remembers previous set (SEV, or event)
    Wakes on interrupt pending if SEVONPEND
    Used for intelligent polling
      Makes for non-bus contending poll
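
The trip-latch is what makes WFE safe for polling: an event that arrives just before the WFE is not lost. A host-side model of that behaviour follows, with a made-up variable standing in for the one-bit event latch and made-up function names. On target the two operations are the SEV and WFE instructions.

```c
#include <stdbool.h>

static bool event_latch;    /* stand-in for the one-bit event latch */

/* SEV (or a hardware event) sets the latch. */
static void send_event(void)
{
    event_latch = true;
}

/* WFE: if an event was already latched, consume it and carry on;
   otherwise the core would sleep. Returns true when it would sleep. */
static bool wait_for_event_would_sleep(void)
{
    if (event_latch) {
        event_latch = false;
        return false;
    }
    return true;
}
```

A poll loop built on this checks its condition, executes WFE, and repeats, so the core sleeps between events and stays off the bus.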


Using SWV to get interrupt trace
  Accurate to the cycle (e.g. 20ns at 50MHz)
    Can see jitter, variability of execution time, periodicity, etc
  Allows seeing nesting behavior (pre-emption)
    Can also see related to sleep time and main thread time
  Can be intermixed with other traced info, to see real behavior
    For example, RTOS trace, watch-trace, host strings, etc


Using SWV for extreme accuracy profiling
  HW PC Sampling at speeds such as 48,828 samples/second
    CPI calculations add detailed information on mix of instructions and overhead


Concerned about Cortex-M3 maturity?
  Cortex-M3 has exceeded the high reliability and maturity standard set by previous cores by a wide margin
    The r1p0 core used in Stellaris Sandstorm parts, and the r1p1 core used in Stellaris Fury parts, have had no application-affecting bugs
      Additions/changes have been features and minor trace related fixes
    This stability and lack of errors have shown the high quality of the modern ARM validation and test model
    It has also shown the value of the support that Luminary and other lead partners have given ARM in ensuring the highest quality core
  Moving forward
    Shyam has covered
    Goal oriented: focus on end users
      ARM and its partners working together to get best benefit
      Ultra low power, specific performance, specialized areas


Conclusion
  You may move to all C/C++ and off-the-shelf code
    Assembly should be unnecessary – you can use intrinsics if needed
    If coming from 8-bit/16-bit, make sure you are using ints/unsigned
    Optimizer is important – size and/or performance (can mix/match)
  Do not be afraid to use interrupts
    Use priority masking vs. interrupt disable for critical sections
  Do not be afraid to use an RTOS if application suits
  Reduce BOM cost by reducing parts on board, reducing number of MCUs, and doing more in SW
  Cortex-M3 based MCUs exceed quality and reliability standards
