ARM Cortex® Processors

…driving the pace of multicore innovation

Chris Turner

ESSEI TecDay, October 13th 2015


Page 3

ARM Cortex® Processor Profiles

Three architecture variants profiled for the different application sectors

Cortex-R processors: actuation, fast control; fast response / real-time control; extended functional safety

Cortex-M processors: MCUs, IoT, sensors, motors; RTOS; DSP; smallest footprint / lowest power

Cortex-A processors: computation, robotics, computer vision; Linux®, QNX; higher performance

ARMv8-R

Page 4

Cortex Architecture Profiles

Cortex-A: ARMv7-A and v8-A

Cortex-R: ARMv7-R and v8-R

Cortex-M: ARMv6-M, v7-M and v8-M

Cortex-M sits at the lower power, smaller area end of the range; Cortex-A at the higher performance end

Safety support: Cortex-A ASIL B capable; Cortex-R ASIL D capable; Cortex-M ASIL D capable

Operating System: Cortex-A Linux/Rich OS; Cortex-R and Cortex-M RTOS only, with Linux/Rich OS as an ARMv8-R option

Instruction set: Cortex-A 32/64b ARM and Thumb ISA; Cortex-R 32b ARM and Thumb ISA; Cortex-M 32b Thumb ISA

Interrupts: Cortex-A SW managed interrupts; Cortex-R deterministic SW managed; Cortex-M HW managed interrupt

CPU Memory: Cortex-A multiple level cache; Cortex-R caches including TCMs; Cortex-M TCMs in Cortex-M7

Page 5

Processor Clusters

Cortex-A processor cores can operate in a coherent cluster up to MP4

Synthesis-time choice of cores per cluster

Each core runs a process thread, e.g. Linux/Android kernel has support built in

ARM’s Generic Interrupt Controller (GIC) distributes interrupts to the cores

OS re-programs distribution on-the-fly

Provision for inter-core interrupts

Automated data cache coherency

Snoop Control Unit (SCU) includes level 2 cache, tag RAM copies and Accelerator Coherency Port (ACP)

[Figure: cluster of four cores (Core 1 to Core 4), each with D$ and I$, attached to the SCU and L2$; GIC, ACP and ACE interfaces; AXI system bus]

Page 6

Cortex-A72: High-end MP4 Cluster

Highest performance core from ARM – applicable to enterprise infrastructure, automotive, mobile, consumer and beyond

Significant boost in power and area efficiency

Highly scalable across many different segments and price points

Different implementation possibilities enable optimal solutions for different markets

Wide range of core counts possible through advanced interconnect (CCN/CCI)

Advanced ARMv8-A feature set: 64 bit, ECC, AMBA 5 CHI, high performance FP and cryptography

Safety documentation package support for Automotive and Industrial markets

Page 7

Cortex-A CPU Portfolio

All can be configured as 1, 2, 3 or 4 MPCore clusters

High Performance ‘big’

Cortex-A72: ARMv8-A, 64-bit; highest single thread performance CPU

Cortex-A57: ARMv8-A, 64-bit; high single thread performance CPU

ARMv7-A: Cortex-A9, Cortex-A15, Cortex-A17; premium performance with mid-range area & power; high performance 32-bit CPU with enterprise class feature set

High Efficiency ‘LITTLE’

Cortex-A53: highest efficiency v8-A CPU, 64-bit

Cortex-A7: highest efficiency v7-A CPU

Cortex-A5: smallest & lowest power v7-A CPU

Page 8

ARM big.LITTLE Technology

Saving yet more energy by using the right core for the right task

Heterogeneous Computing

More than 40% higher User Experience*

45% to 65% CPU power savings**

Architecturally Identical Processors

High performance tuned ‘big’ cores

High efficiency tuned ‘LITTLE’ cores

Hardware Coherency

Automatically managed for all cache levels

Seamless & Automatic Task Allocation

Global Task Scheduling (big.LITTLE MP)

* Compared to LITTLE-only platforms; ** Compared to big-only platforms

† Average power across high-end (Epic Citadel) gaming and low-utilisation (Audio playback) workloads

[Figure: big cluster and LITTLE cluster, each with cores 1 to 4 and its own L2 cache, joined by a Cache Coherent Interconnect, with Interrupt Control]
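The task allocation idea on this slide can be sketched in a few lines. The slide does not say how Global Task Scheduling decides where a task runs, so the rule below (two load thresholds with hysteresis) and the names `UP_THRESHOLD`, `DOWN_THRESHOLD`, `place` and `trace` are assumptions made for illustration, not the scheduler's actual policy.

```python
UP_THRESHOLD = 0.70    # assumed: move a task to a big core above this load
DOWN_THRESHOLD = 0.30  # assumed: move it back to a LITTLE core below this load


def place(current_cluster, load):
    """Return the cluster ('big' or 'LITTLE') a task should run on next."""
    if current_cluster == "LITTLE" and load > UP_THRESHOLD:
        return "big"
    if current_cluster == "big" and load < DOWN_THRESHOLD:
        return "LITTLE"
    return current_cluster  # between thresholds: stay put (hysteresis)


def trace(loads, start="LITTLE"):
    """Follow one task through a sequence of measured loads (0.0 to 1.0)."""
    cluster = start
    history = []
    for load in loads:
        cluster = place(cluster, load)
        history.append(cluster)
    return history


if __name__ == "__main__":
    print(trace([0.1, 0.5, 0.8, 0.5, 0.2]))
```

The gap between the two thresholds keeps a task with a moderate, fluctuating load from bouncing between clusters on every sample.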

Page 9

Fine Grain Market Segmentation through different CPU combinations

[Figure: SoC configuration plotted against device tier. Device tiers: <$150, $200-$350, >$400. SoC configurations: single core, dual core, quad core, octacore, quadcore big.LITTLE, hexacore big.LITTLE, octacore big.LITTLE. The scale runs from lowest power & footprint to latest features, advanced spec]

big core – A15, A17, A57, A72

LITTLE core – A7, A53

Page 10

big.LITTLE employs AMBA Coherency Extensions

CoreLink CCI-400: First Generation big.LITTLE

All coherency snoops sent to all processors

CoreLink CCI-500: Next Generation System Coherency

Integrated Snoop Filter

Higher Efficiency and Performance

One central snoop vs many

Lower snoop latency

[Figure: both interconnects link a big cluster, a LITTLE cluster and IO coherent ports; the CCI-500 adds a snoop filter]
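The difference between broadcasting every snoop and consulting a central snoop filter can be shown with a toy model. This is a sketch under simplifying assumptions: the `Interconnect` class, the master names and the access pattern are hypothetical, and real interconnects track cache state in far more detail.

```python
from collections import defaultdict


class Interconnect:
    """Toy coherent interconnect joining several caching masters."""

    def __init__(self, masters, use_snoop_filter):
        self.masters = list(masters)
        self.use_snoop_filter = use_snoop_filter
        # Snoop filter directory: line address -> masters that may hold the line.
        self.directory = defaultdict(set)
        self.snoops_sent = 0

    def read(self, requester, line):
        """Coherent read: snoop the possible holders, then record the new holder."""
        if self.use_snoop_filter:
            # Filtered: only masters recorded as holding the line are snooped.
            targets = self.directory[line] - {requester}
        else:
            # Broadcast: every other master is snooped on every request.
            targets = set(self.masters) - {requester}
        self.snoops_sent += len(targets)
        self.directory[line].add(requester)
        return targets


def run(use_snoop_filter):
    """Replay a small access pattern and return the number of snoops sent."""
    ic = Interconnect(["big", "LITTLE", "io"], use_snoop_filter)
    # Each cluster works mostly on private lines; line 0x80 is shared.
    for line in (0x00, 0x40, 0x80):
        ic.read("big", line)
    for line in (0x100, 0x140, 0x80):
        ic.read("LITTLE", line)
    return ic.snoops_sent


if __name__ == "__main__":
    print("broadcast:", run(False), "filtered:", run(True))
```

With this access pattern the broadcast model sends a snoop to every other master on all six reads, while the filtered model snoops only once, for the shared line.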

Page 11

CoreLink™ CCN-512 Cache Coherent Network

Extensible Architecture for Heterogeneous Multi-core Solutions

Up to 4 cores per cluster

Up to 12 coherent clusters

Integrated L3 cache

Up to 24 I/O coherent interfaces for accelerators and I/O

Up to Quad channel DDR3/4 x72

Heterogeneous processors – CPU, GPU, DSP and accelerators

Virtualized Interrupts

Peripheral address space

[Figure: CCN-512 system diagram. Blocks shown: snoop filter and 1-32MB L3 cache; Cortex-A72 and Cortex-A53 clusters; further ports marked "Cortex CPU or CHI master"; GIC-500; four DMC-520 memory controllers, each x72 DDR4-3200; I/O virtualisation with CoreLink MMU-500; NIC-400 network interconnects; DSPs on ACE; PCIe, 10-40 GbE, DPI, Crypto, SATA, USB, Flash, SRAM, GPIO and AHB]

Page 12

Right-sized Processing Combination Examples

Wearables: Cortex-A + Cortex-M

Storage: Cortex-R + Cortex-M

IVI, ADAS: Cortex-A + Cortex-R

Mobile/Consumer: Cortex-A + Cortex-R + Cortex-M

Example cores shown: Cortex-A7, Cortex-A53, Cortex-A57, Cortex-A72, Cortex-R5, Cortex-R7, Cortex-M0, Cortex-M0+, Cortex-M4

Cortex-A processors combined in big.LITTLE clusters deliver high performance and save energy

Page 13

ARM Cortex-R and Cortex-M Processor Portfolio

Cortex-R: Cortex-R7, Cortex-R5, Cortex-R4

Cortex-R positioning: real-time standard; high performance 4G modem and storage; functional safety package

Cortex-M: Cortex-M7, Cortex-M4, Cortex-M3, Cortex-M0+, Cortex-M0

Cortex-M positioning: low power with maximum cost efficiency; highest energy efficiency; performance efficiency; mainstream control & DSP; maximum performance control & DSP

Page 14

What Makes a Real-Time Processor

Microarchitecture, typically Cortex-R5

Superscalar / dual issue execution throughput

Cache line buffers minimise stalling while waiting for L2 memory system

ECC and its RMW timing are mostly transparent

Low Interrupt Latency pipeline mode

Fast interrupt response abandons any pending and re-startable memory operations

Tightly Coupled Memory

Level-1 memory system for fast access to code and data, e.g. Interrupt Service Routines

Low Latency Peripheral Port

Introduced in Cortex-R5

Direct paths to LSU and store queue avoid delays in caches and AXI-Main

[Figure: Cortex-R5 block diagram. Blocks shown: pre-fetch, I queue, decode and issue, branch predictor, execution units, Load-Store Unit, store buffer, store queue, I cache and D cache with tag RAMs, dirty RAM, D eviction buffer, I line fill buffer, D line fill buffers, TCM interface and TCM ports, AXI interface and AXI-3 bus, AXI i/f LLPP, interrupt control with IRQ and FIQ; one region is marked "Deterministic"]

Page 15

Real-Time Memory Address Map

VMSA - PMSA

Cortex-A

For applications

Single flat memory

Virtual memory, MMU

Address translation

TLB acceleration

Page table manager

Memory protection

Program relocation

‘Open’ systems

Cortex-R

For real-time

Specialised memories, e.g. TCM, LLRAM, Main

Position-dependent code

Deterministic behaviour

Memory protection

‘Composed’ systems

[Figure: two memory maps. Cortex-A: OS and Applications 1 to n, each with App Data, behind L1, L2... caches. Cortex-R: RTOS, ISRs and Tasks 1 to n with their data, placed in TCM, LLRAM, L1 cache and L2 cache]

Page 16

Cortex-R5 Processor

Dependable and proven real-time performance

High performance: 1.68 DMIPS/MHz

8-stage dual-issue pipeline, pre-fetch, branch prediction

HDIV, SIMD, SP/DP FPU, ARMv7-R Thumb-2/ARM instructions

Deterministic response to hard deadlines

Low Interrupt Latency microarchitecture

Tightly Coupled Memory (TCM)

Reliable with fault detection and control features

MPU, ECC/Parity on L1 memories, Dual-Core Lock-Step

ECC and Parity also on AXI bus port interfaces

Support for safety-related applications

Cost Effective – synthesis configurable for optimum PPA

Low Latency Peripheral Port (LLPP)

Non-blocking access to I/O registers and GIC

Accelerator Coherency Port (ACP)

Performance boosting data cache maintenance

Single or dual core configuration

Page 17

Dual-core Cortex-R5

Support for safety-related applications

Two cores can be used either in lock-step or ‘performance’ mode

In performance mode both cores act as bus masters

In lock-step one core provides a redundant copy whilst a single instance of the cache and TCM RAMs is protected with ECC

I/O coherency but no inter-processor coherency

An external data source can write through the SCU and coherency is maintained simply by invalidating cache lines holding addresses being written

Such hardware automated cache maintenance is very beneficial in many real-time applications

[Figure: Core 1 and Core 2, each with D$ and I$, attached to the microSCU; GIC; ACP; AXI system bus]
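The invalidate-on-write scheme described on this slide can be modelled in a few lines. This is a sketch only: `Core`, `CoherencyPort`, `line_base` and the 32-byte `LINE_SIZE` are names and values assumed for illustration, and the core's cache is modelled as read-only so that no dirty data has to be handled.

```python
LINE_SIZE = 32  # assumed cache line size in bytes, for illustration only


def line_base(addr):
    """Address of the first byte of the cache line containing addr."""
    return addr - addr % LINE_SIZE


class Core:
    def __init__(self, memory):
        self.memory = memory
        self.dcache = {}  # line base address -> cached copy of the line

    def read(self, addr):
        base = line_base(addr)
        if base not in self.dcache:  # miss: fill the line from memory
            self.dcache[base] = bytes(self.memory[base:base + LINE_SIZE])
        return self.dcache[base][addr - base]


class CoherencyPort:
    """External master writing to memory through the coherency port."""

    def __init__(self, memory, cores):
        self.memory = memory
        self.cores = cores

    def write(self, addr, data):
        if not data:
            return
        self.memory[addr:addr + len(data)] = data
        last = line_base(addr + len(data) - 1)
        for base in range(line_base(addr), last + LINE_SIZE, LINE_SIZE):
            for core in self.cores:
                core.dcache.pop(base, None)  # invalidate any stale copy


if __name__ == "__main__":
    memory = bytearray(256)
    core = Core(memory)
    port = CoherencyPort(memory, [core])
    print(core.read(0x40))        # line is now cached
    port.write(0x40, b"\x2a")     # external write invalidates that line
    print(core.read(0x40))        # core re-fetches and sees the new value
```

Without the invalidation step the second read would return the stale cached byte, which is the software cache maintenance burden the hardware scheme removes.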

Page 18

Cortex-R5 Fault Detection & Control Features

[Figure: ECC and parity protection between the Cortex-R5 processor, the interconnect logic and the peripherals/memory. Data (and instructions) are shown as 64-bit and 32-bit chunks, each with ECC bits; address and control signals carry a parity bit. Blocks shown: ECC generators, ECC correctors, ECC detect/correct, parity generators, parity checkers, and ECC generate with RMW if <32b on the CPU I and D sides. Labels on the cache error flow: first read, correct, evict, invalidate, re-read, level 2 memory, note cache line to be avoided. Labels on the TCM error flow: first read, correct, corrected chunk, re-read]

Error Correcting Code, Cache & TCM

Single Error Correct – Double Error Detect

64-bit scheme is most efficient for I-side

32-bit scheme is best for D-side to minimize Read-Modify-Write cycles

RMWs are needed to re-calculate ECC when writing a quantity smaller than the memory chunk size

RMWs performed automatically with minimal, or even zero, performance impact
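A Single Error Correct, Double Error Detect code can be illustrated with an extended Hamming code. The slides do not state which code the Cortex-R5 uses, so the construction below is a textbook sketch rather than ARM's implementation; `check_bits`, `encode` and `decode` are names chosen for this example.

```python
def check_bits(data_bits):
    """ECC bits for SEC-DED: Hamming check bits plus one overall parity bit."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def encode(data, data_bits):
    """Return code bits: index 0 is overall parity, 1..n is the Hamming code."""
    r = check_bits(data_bits) - 1
    n = data_bits + r
    code = [0] * (n + 1)
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):  # not a power of two: a data position
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):
        p = 1 << i  # check bit p covers every position with bit i set
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    code[0] = sum(code[1:]) & 1
    return code


def decode(code):
    """Return (status, data); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    n = len(code) - 1
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos
    overall = sum(code) & 1
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome <= n:
        code[syndrome] ^= 1  # single error: the syndrome is its position
        status = "corrected"
    else:
        return "uncorrectable", None  # e.g. a double error: detected only
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return status, data


if __name__ == "__main__":
    print(check_bits(32), check_bits(64))
    word = encode(0xDEADBEEF, 32)
    word[5] ^= 1  # flip one bit
    print(decode(word))
```

In this construction a 32-bit chunk needs 7 ECC bits and a 64-bit chunk needs 8, so the 64-bit scheme stores fewer ECC bits per data bit. The 32-bit scheme, in turn, lets a 32-bit store replace a whole chunk without reading it first, which is the RMW trade-off the slide describes.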

Hard Errors in Cache and TCM

Hard errors cannot be corrected by writing back corrected data; they repeat when the memory is read again

‘Live-lock’ scenario when uncorrected instructions or data are continuously re-fetched

Cache line avoidance

Hard error cache

Dual Core Lock Step

Both spatial (also orientation) and temporal separation

Avoiding common cause failures, i.e. reduced probability of both CPUs seeing the same failure at the same time and still checking OK
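Temporal separation in a lock-step pair can be sketched as follows: the redundant copy receives each input a few cycles late, and the main CPU's outputs are delayed by the same amount before the checker compares them. The two-cycle `DELAY`, the `cpu_cycle` stand-in and the fault injection are assumptions made for this illustration.

```python
from collections import deque

DELAY = 2  # assumed separation in clock cycles, for illustration only


def cpu_cycle(state, inp):
    """Hypothetical stand-in for one CPU clock cycle: (next_state, output)."""
    next_state = (state * 31 + inp) & 0xFFFF
    return next_state, next_state ^ 0x5A5A


def run_lockstep(inputs, fault_cycle=None):
    """Return the cycle at which the checker flags a mismatch, else None."""
    main_state = copy_state = 0
    input_delay = deque([None] * DELAY)   # inputs on their way to the copy
    output_delay = deque([None] * DELAY)  # main outputs awaiting comparison
    # Extra idle cycles let the delayed copy check the last real outputs.
    for cycle, inp in enumerate(list(inputs) + [0] * DELAY):
        main_state, main_out = cpu_cycle(main_state, inp)
        if cycle == fault_cycle:
            main_out ^= 1  # inject a transient fault on the main CPU output
        input_delay.append(inp)
        output_delay.append(main_out)
        copy_inp = input_delay.popleft()
        expected = output_delay.popleft()
        if copy_inp is None:
            continue  # the copy has not started executing yet
        copy_state, copy_out = cpu_cycle(copy_state, copy_inp)
        if copy_out != expected:
            return cycle
    return None


if __name__ == "__main__":
    print(run_lockstep([1, 2, 3, 4, 5]))
    print(run_lockstep([1, 2, 3, 4, 5], fault_cycle=1))
```

Because the two CPUs are at different points in the program at any instant, a disturbance that hits both at once is unlikely to corrupt the same computation in both, so the checker still sees a mismatch.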

Bus ECC

ECC and Parity are generated, detected and corrected

Interconnect ‘veneered’ with same ECC/P functionality

ARMv7-R Architecture

Protected Memory System Architecture. Precise aborts

[Figure: dual core lock-step arrangement. Blocks shown: main CPU, CPU copy, delay elements, L1 memory, checker. Signals shown: inputs, outputs, fault]

Page 19

Research Project: TCLS ARM for Space

Collaborative project funded by European Commission H2020 Space program

Started in Feb 2015 with a two year duration

Project Objectives

Investigate feasibility of a fail-functional ARM CPU using the triple core lockstep (TCLS) principle

Target rad-tolerant space and safety-critical terrestrial applications

Assess the fail functional design using rad-tolerant STM 65nm technology

Concept

Three ARM CPUs execute in lockstep

Fail functional – Resynchronize upon divergence

Shared ICache, DCache and memory

http://www.tcls-arm-for-space.eu/
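The triple core lockstep concept (vote, then resynchronize the diverging core) can be sketched as below. The project's actual voting and resynchronization logic is not described in these slides, so `vote`, `cpu_cycle` and `run_tcls` are hypothetical, and copying the majority state into the faulty core stands in for whatever resynchronization procedure the real design uses.

```python
from collections import Counter


def vote(values):
    """Majority-vote three values; return (value, index of dissenter or None)."""
    value, count = Counter(values).most_common(1)[0]
    if count == 1:
        raise RuntimeError("no majority: all three cores disagree")
    dissenter = next((i for i, v in enumerate(values) if v != value), None)
    return value, dissenter


def cpu_cycle(state):
    """Hypothetical stand-in for one CPU clock cycle."""
    return (state * 5 + 1) & 0xFF


def run_tcls(cycles, upset=None):
    """Run three cores in lockstep; upset=(cycle, core) injects one bit flip.

    Returns (final state, number of resynchronizations).
    """
    states = [0, 0, 0]
    resyncs = 0
    for cycle in range(cycles):
        states = [cpu_cycle(s) for s in states]
        if upset is not None and upset[0] == cycle:
            states[upset[1]] ^= 0x10  # single-event upset in one core
        value, dissenter = vote(states)
        if dissenter is not None:
            states[dissenter] = value  # resynchronize the diverging core
            resyncs += 1
    return states[0], resyncs


if __name__ == "__main__":
    print(run_tcls(10))
    print(run_tcls(10, upset=(3, 1)))
```

Unlike a dual-core lock-step pair, which can only detect a mismatch, the third core lets the system identify which core diverged and keep running, which is what "fail functional" refers to.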

Page 20

The latest Cortex-M7 processor from ARM

High-performance processor with DSP capabilities

Six-stage superscalar pipeline with powerful DSP and SP/DP FP

Best-in-class core for high-end MCU or to replace MCU+DSP

Flexible powerful memory system

Tightly-Coupled Memories for real-time determinism

64-bit AXI AMBA4 memory interface with I and D cache

Next-gen MCU with more memories and peripherals

ARMv7-M architecture and CMSIS support

100% binary compatibility from Cortex-M4

Cortex-M family ease-of-use and very low interrupt latency

Reuse code and system design from existing products

Fault detection and control features

MPU, memory ECC (SEC-DED), on-line MBIST, DCLS

Page 21

Thank You

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. Any other marks featured may be trademarks of their respective owners.