1
ARM Cortex® Processors
…driving the pace of multicore innovation
Chris Turner
ESSEI TecDay, October 13th 2015
2
Markets for ARM Processor IP
IoT / Embedded
Mobile / Consumer / Wearables
Enterprise / Networking / Server
Automotive / Industrial
3
ARM Cortex® Processor Profiles
Three architecture variants profiled for the different application sectors
Cortex-R processors
Actuation, fast control
Fast response / Real-time control
Extended Functional Safety
Cortex-M processors
MCUs, IoT, sensors, motors
RTOS
DSP
Smallest footprint / lowest power
Cortex-A processors
Computation, robotics, computer-vision
Linux®, QNX
Higher performance
ARMv8-R
4
Cortex Architecture Profiles
Profile: Cortex-A | Cortex-R | Cortex-M
Architecture: ARMv7-A and ARMv8-A | ARMv7-R and ARMv8-R | ARMv6-M, ARMv7-M and ARMv8-M
Positioning: higher performance (Cortex-A) through to lower power, smaller area (Cortex-M)
Safety support: ASIL B capable | ASIL D capable | ASIL D capable
Operating System: Linux/Rich OS | RTOS only; ARMv8-R option | RTOS only
Instruction set: 32/64b ARM and Thumb ISA | 32b ARM and Thumb ISA | 32b Thumb ISA
Interrupts: SW managed interrupts | Deterministic SW managed | HW managed interrupts
CPU Memory: Multiple level cache | Caches including TCMs | TCMs in Cortex-M7
5
Processor Clusters
Cortex-A processor cores can operate in a coherent cluster up to MP4
Synthesis-time choice of cores per cluster
Each core runs a process thread
e.g. Linux/Android kernel has support built in
ARM’s Generic Interrupt Controller (GIC) distributes interrupts to the cores
OS re-programs distribution on-the-fly
Provision for inter-core interrupts
Automated data cache coherency
Snoop Control Unit (SCU) includes level 2 cache, tag RAM copies and Accelerator Coherency Port (ACP)
[Diagram: Core 1 to Core 4, each with D$ and I$, connected to the SCU with L2$ and to the GIC; ACE and ACP ports to the AXI system bus]
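The interrupt distribution described on this slide can be sketched as a small software model. This is only an illustration under stated assumptions, not the real GIC register interface: the class InterruptDistributor and its methods set_target, raise_irq and send_inter_core are hypothetical names, and routing unprogrammed interrupts to core 0 is an assumption.

```python
class InterruptDistributor:
    """Toy model of a distributor that routes each interrupt ID to one core."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.targets = {}                              # interrupt ID -> target core
        self.delivered = [[] for _ in range(num_cores)]

    def set_target(self, irq, core):
        """The OS may call this at any time to re-program distribution."""
        if not 0 <= core < self.num_cores:
            raise ValueError("no such core")
        self.targets[irq] = core

    def raise_irq(self, irq):
        """Deliver a peripheral interrupt to its currently programmed core."""
        core = self.targets.get(irq, 0)                # assumption: default is core 0
        self.delivered[core].append(irq)
        return core

    def send_inter_core(self, irq, to_core):
        """Inter-core interrupt: software on one core signals another core."""
        self.delivered[to_core].append(irq)
        return to_core


gic = InterruptDistributor(num_cores=4)
gic.set_target(irq=34, core=2)
first = gic.raise_irq(34)          # routed to core 2
gic.set_target(irq=34, core=3)     # OS re-programs distribution on the fly
second = gic.raise_irq(34)         # now routed to core 3
ipi = gic.send_inter_core(irq=1, to_core=0)
```

The point of the sketch is that routing is a table the OS can rewrite while the system runs, so the same interrupt source can follow a thread from core to core.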
6
Cortex-A72: High-end MP4 Cluster
Highest performance core from ARM – applicable to enterprise infrastructure, automotive, mobile, consumer and beyond
Significant boost in power and area efficiency
Highly scalable across many different segments and price points
Different implementation possibilities enable optimal solutions for different markets
Wide range of core counts possible through advanced interconnect (CCN/CCI)
Advanced ARMv8-A feature set: 64 bit, ECC, AMBA 5 CHI, high performance FP and cryptography
Safety documentation package support for Automotive and Industrial markets
7
Cortex-A CPU Portfolio
All can be configured as 1, 2, 3 or 4 MPCore clusters
High Performance ‘big’: Cortex-A9, Cortex-A15, Cortex-A17, Cortex-A57, Cortex-A72
ARMv7-A: Premium performance with mid-range area & power
ARMv7-A: High performance 32bit CPU with enterprise class feature set
Cortex-A57: ARMv8-A, 64bit, high single thread performance CPU
Cortex-A72: ARMv8-A, 64bit, highest single thread performance CPU
High Efficiency ‘LITTLE’: Cortex-A5, Cortex-A7, Cortex-A53
Smallest & lowest power v7-A CPU
Highest efficiency v7-A CPU
Highest efficiency v8-A CPU, 64bit
8
ARM big.LITTLE Technology
Saving yet more energy by using the right core for the right task
Heterogeneous Computing
More than 40% higher User Experience*
45% to 65% CPU power savings**
Architecturally Identical Processors
High performance tuned ‘big’ cores
High efficiency tuned ‘LITTLE’ cores
Hardware Coherency
Automatically managed for all cache levels
Seamless & Automatic Task Allocation
Global Task Scheduling (big.LITTLE MP)
* Compared to LITTLE-only platforms; ** Compared to big-only platforms
† Average power across high-end (Epic Citadel) gaming and low-utilisation (Audio playback) workloads
[Diagram: big Cluster (cores 1 to 4) and LITTLE Cluster (cores 1 to 4), each with its own L2 Cache, joined by a Cache Coherent Interconnect, with Interrupt Control]
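The idea of using the right core for the right task can be sketched as a simple load-based placement. This is a simplification under stated assumptions and not the scheduler shipped in any kernel: the function allocate, the 0.7 threshold and the task names with their loads are all hypothetical.

```python
def allocate(tasks, up_threshold=0.7):
    """Place each task on the big or LITTLE cluster from its tracked load (0.0 to 1.0)."""
    placement = {"big": [], "LITTLE": []}
    for name, load in tasks.items():
        # Assumption: a single threshold decides placement; a real scheduler
        # would also use hysteresis so tasks do not bounce between clusters.
        cluster = "big" if load >= up_threshold else "LITTLE"
        placement[cluster].append(name)
    return placement


# Hypothetical workload: two demanding tasks and two light ones.
tasks = {"game_render": 0.9, "audio_playback": 0.1, "ui_scroll": 0.75, "sync": 0.2}
placement = allocate(tasks)
```

Because the big and LITTLE cores are architecturally identical and hardware keeps the caches coherent, a task can be moved between clusters without any change to its code.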
9
Fine Grain Market Segmentation through different CPU combinations
Device Tier: <$150, $200-$350, >$400
SoC Configuration: Single core, Dual core, Quad core, Octacore, Quadcore big.LITTLE, Hexacore big.LITTLE, Octacore big.LITTLE
Range: from "Lowest Power & footprint" to "Latest features, advanced spec"
big core – A15, A17, A57, A72
LITTLE core – A7, A53
10
big.LITTLE employs AMBA Coherency Extensions
First Generation big.LITTLE: CoreLink CCI-400
All coherency snoops sent to all processors
Next Generation System Coherency: CoreLink CCI-500
Integrated Snoop Filter
Higher Efficiency and Performance
One central snoop vs many
Lower snoop latency
[Diagram: in both generations a big cluster, a LITTLE cluster and coherent IO attach to the interconnect; the CCI-500 adds a Snoop Filter]
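The difference between broadcasting snoops and using a snoop filter can be sketched by counting snoop messages. This is a toy model under stated assumptions, not the CCI-400 or CCI-500 protocol: the class Interconnect, the agent names and the access trace are hypothetical, and only reads are modelled.

```python
class Interconnect:
    """Counts snoops sent for a sequence of reads, with or without a snoop filter."""

    def __init__(self, agents, use_snoop_filter):
        self.agents = list(agents)
        self.use_snoop_filter = use_snoop_filter
        self.holders = {}        # line address -> set of agents caching that line
        self.snoops_sent = 0

    def read(self, requester, addr):
        others = [a for a in self.agents if a != requester]
        if self.use_snoop_filter:
            # One central lookup: snoop only the agents known to hold the line.
            cached_by = self.holders.get(addr, set())
            targets = [a for a in others if a in cached_by]
        else:
            # No filter: every other agent is snooped on every read.
            targets = others
        self.snoops_sent += len(targets)
        self.holders.setdefault(addr, set()).add(requester)


# Hypothetical trace of (requester, line address) pairs.
trace = [("big", 0x100), ("LITTLE", 0x200), ("big", 0x200), ("io", 0x300)]

broadcast = Interconnect(["big", "LITTLE", "io"], use_snoop_filter=False)
filtered = Interconnect(["big", "LITTLE", "io"], use_snoop_filter=True)
for requester, addr in trace:
    broadcast.read(requester, addr)
    filtered.read(requester, addr)
```

In this trace only the third read touches a line another agent already holds, so the filtered interconnect sends a single snoop while the broadcast one snoops both other agents on every read.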
11
Extensible Architecture for Heterogeneous Multi-core Solutions
CoreLink™ CCN-512 Cache Coherent Network
Up to 4 cores per cluster
Up to 12 coherent clusters
Integrated L3 cache
Up to 24 I/O coherent interfaces for accelerators and I/O
Up to Quad channel DDR3/4 x72
Heterogeneous processors – CPU, GPU, DSP and accelerators
Virtualized Interrupts
Peripheral address space
[Diagram: Cortex-A72 and Cortex-A53 clusters and other Cortex CPU or CHI masters attach to the CCN-512, which contains a Snoop Filter and 1-32MB L3 cache; four DMC-520 memory controllers (x72 DDR4-3200); GIC-500; I/O Virtualisation through CoreLink MMU-500; NIC-400 Network Interconnects; other blocks shown are Flash, USB, SRAM, GPIO, PCIe, SATA, DSP, DPI, Crypto, 10-40 GbE, ACE and AHB]
12
Right-sized Processing Combination Examples
Wearables: Cortex-A + Cortex-M
Storage: Cortex-R + Cortex-M
IVI, ADAS: Cortex-A + Cortex-R
Mobile/Consumer: Cortex-A + Cortex-R + Cortex-M
Example cores shown: Cortex-A7, Cortex-A53, Cortex-A57, Cortex-A72, Cortex-R5, Cortex-R7, Cortex-M0, Cortex-M0+, Cortex-M4
Cortex-A processors combined in big.LITTLE clusters deliver high performance and save energy
13
ARM Cortex-R and Cortex-M Processor Portfolio
Cortex-R
Cortex-R7: High performance 4G modem and storage
Cortex-R5: Functional safety package
Cortex-R4: Real-time standard
Cortex-M
Cortex-M7: Maximum Performance Control & DSP
Cortex-M4: Mainstream Control & DSP
Cortex-M3: Performance efficiency
Cortex-M0+: Highest energy efficiency
Cortex-M0: Low power with maximum cost efficiency
14
What Makes a Real-Time Processor Microarchitecture Typically Cortex-R5
Superscalar / dual issue execution throughput
Cache line buffers minimise stalling while waiting for L2 memory system
ECC and its RMW timing is mostly transparent
Low Interrupt Latency pipeline mode
Fast interrupt response abandons any pending and re-startable memory operations
Tightly Coupled Memory
Level-1 memory system for fast access to code and data, e.g. Interrupt Service Routines
Low Latency Peripheral Port
Introduced in Cortex-R5
Direct paths to LSU and store queue avoid delays in caches and AXI-Main
[Diagram: Cortex-R5 pipeline blocks: Pre-fetch, I queue, Branch predictor, Decode and issue, Execution units, Load-Store Unit, Store buffer, Store queue, I cache and D cache with Tag and Dirty RAM, I line fill buffer, D line fill buffers, D eviction buffer, Interrupt control (IRQ, FIQ), TCM interface to TCM ports, AXI interface to AXI-3 bus, AXI i/f for LLPP; part of the diagram is labelled "Deterministic"]
15
Real-Time Memory Address Map VMSA - PMSA
Cortex-A
For applications
Single flat memory
Virtual memory, MMU
Address translation
TLB acceleration
Page table manager
Memory protection
Program relocation
‘Open’ systems
Cortex-R
For real-time
Specialised memories, e.g. TCM, LLRAM, Main
Position-dependent code
Deterministic behaviour
Memory protection
‘Composed’ systems
[Diagram: Cortex-R memory map with ISRs, the RTOS and Tasks 1 - n and their data placed in TCM, LLRAM, L1 cache and L2 cache; Cortex-A memory map with the OS and Applications 1 to n and their data behind L1, L2... caches]
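Memory protection without address translation, as in the Cortex-R column above, can be sketched as a region lookup. This is a minimal model under stated assumptions: the Region type, the mpu_check function and the example addresses and sizes are hypothetical, the "highest-numbered matching region wins" rule is taken here as an assumption about PMSA-style region priority, and accesses that match no region are assumed to fault.

```python
from dataclasses import dataclass


@dataclass
class Region:
    base: int        # physical start address (no translation is applied)
    size: int        # region size in bytes
    writable: bool   # False means read-only


def mpu_check(regions, addr, write):
    """Return True if the access is permitted, False if it would abort."""
    # Assumption: later (higher-numbered) regions take priority on overlap.
    for region in reversed(regions):
        if region.base <= addr < region.base + region.size:
            return region.writable or not write
    return False     # assumption: no background region, so unmapped accesses fault


# Hypothetical map: read-only code in a TCM-like region, writable data elsewhere.
regions = [
    Region(base=0x0000_0000, size=0x1_0000, writable=False),
    Region(base=0x2000_0000, size=0x8000, writable=True),
]
code_read = mpu_check(regions, 0x0000_0100, write=False)
code_write = mpu_check(regions, 0x0000_0100, write=True)
data_write = mpu_check(regions, 0x2000_0010, write=True)
unmapped = mpu_check(regions, 0x9000_0000, write=False)
```

The lookup touches a fixed, small number of regions and never walks page tables, which is one reason this style of protection suits deterministic, position-dependent real-time code.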
16
Cortex-R5 Processor
Dependable and proven real-time performance
High performance 1.68 DMIPS/MHz
8-stage dual-issue pipeline, pre-fetch, branch prediction
HDIV, SIMD, SP/DP FPU, ARMv7-R Thumb2/ARM instructions
Deterministic response to hard deadlines
Low Interrupt Latency microarchitecture
Tightly Coupled Memory (TCM)
Reliable with fault detection and control features
MPU, ECC/Parity on L1 memories, Dual-Core Lock-Step
ECC and Parity also on AXI bus port interfaces
Support for safety-related applications
Cost Effective – synthesis configurable for optimum PPA
Low Latency Peripheral Port (LLPP)
Non-blocking access to I/O registers and GIC
Accelerator Coherency Port (ACP)
Performance boosting data cache maintenance
Single or dual core configuration
17
Dual-core Cortex-R5
Support for safety-related applications
Two cores can be used either in lock-step or ‘performance’ mode
In performance mode both cores act as bus masters
In lock-step one core provides a redundant copy whilst a single instance of the cache and TCM RAMs is protected with ECC
I/O coherency but no inter-processor coherency
An external data source can write through the SCU and coherency is maintained simply by invalidating cache lines holding addresses being written
Such hardware automated cache maintenance is very beneficial in many real-time applications
[Diagram: Core 1 and Core 2, each with D$ and I$, connected to the microSCU and the GIC; ACP and AXI system bus ports]
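The I/O coherency described on this slide, where an external write invalidates any cached copy of the written address, can be sketched as follows. This is a toy model under stated assumptions: CoreWithCache and acp_write are hypothetical names, memory is a plain dictionary, and the cache is tracked per address rather than per cache line.

```python
class CoreWithCache:
    """A core with a trivial data cache in front of shared memory."""

    def __init__(self, memory):
        self.memory = memory
        self.cache = {}          # address -> cached value

    def load(self, addr):
        if addr not in self.cache:               # miss: fill from memory
            self.cache[addr] = self.memory[addr]
        return self.cache[addr]


def acp_write(memory, cores, addr, value):
    """External master writes through the coherency port."""
    memory[addr] = value
    for core in cores:
        core.cache.pop(addr, None)               # invalidate any stale cached copy


memory = {0x40: 1}
core = CoreWithCache(memory)
before = core.load(0x40)                         # value is now cached
acp_write(memory, [core], 0x40, 7)               # e.g. a DMA engine delivers new data
after = core.load(0x40)                          # miss after invalidate, sees new data
```

Without the invalidate step the second load would return the stale cached value, and software would have to perform the cache maintenance itself before reading the buffer.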
18
Cortex-R5 Fault Detection & Control Features
[Diagram: L1 memories holding 64-bit and 32-bit chunks, each stored with its ECC bits; the CPU I and D sides use ECC detect/correct on reads and ECC generate on writes, with RMW if <32b]
[Diagram: bus protection between the Cortex-R5 Processor, Interconnect logic and Peripherals/Memory; ECC generators and correctors on Data (and Instructions), Parity generators and checkers on Address & Control]
[Diagram: cache error handling with labels First read, Correct, Invalidate, Re-read, Evict, Level 2 memory and "Note cache line to be avoided"; TCM error handling with labels First read, Correct, Re-read and Corrected chunk]
Error Correcting Code, Cache & TCM
Single Error Correct – Double Error Detect
64-bit scheme is most efficient for I-side
32-bit scheme is best for D-side to minimize Read-Modify-Write cycles
RMWs to re-calculate ECC when writing a quantity smaller than memory chunk size
RMWs performed automatically with minimal, or even zero, performance impact
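The SEC-DED scheme and the read-modify-write on a sub-chunk write can be sketched with a textbook extended Hamming code. This is an illustration under stated assumptions, not the encoding the Cortex-R5 actually uses: the bit layout and the function names check_bits_needed, encode, decode and write_byte are hypothetical.

```python
def check_bits_needed(data_bits):
    """Hamming bits r with 2**r >= data_bits + r + 1, plus one overall parity bit."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def encode(data, data_bits=32):
    r = check_bits_needed(data_bits) - 1
    n = data_bits + r
    code = [0] * (n + 1)                  # index 0 holds the overall parity bit
    bit = 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two: a data position
            code[pos] = (data >> bit) & 1
            bit += 1
    for i in range(r):                    # each check bit covers positions with bit i set
        p = 1 << i
        code[p] = sum(code[pos] for pos in range(1, n + 1) if pos & p) % 2
    code[0] = sum(code) % 2               # overall even parity enables double detection
    return code


def decode(code, data_bits=32):
    r = check_bits_needed(data_bits) - 1
    n = data_bits + r
    code = list(code)
    syndrome = 0
    for pos in range(1, n + 1):
        if code[pos]:
            syndrome ^= pos               # XOR of set positions is 0 for a valid word
    overall_odd = sum(code) % 2
    if syndrome and overall_odd and syndrome <= n:
        code[syndrome] ^= 1               # single error: the syndrome is its position
        status = "corrected"
    elif syndrome:
        return None, "uncorrectable"      # double error detected, cannot be corrected
    elif overall_odd:
        code[0] ^= 1                      # the overall parity bit itself flipped
        status = "corrected"
    else:
        status = "ok"
    data, bit = 0, 0
    for pos in range(1, n + 1):
        if pos & (pos - 1):
            data |= code[pos] << bit
            bit += 1
    return data, status


def write_byte(code, byte_index, value):
    """Sub-chunk write: read, modify, then re-calculate ECC over the whole chunk."""
    data, status = decode(code)
    if status == "uncorrectable":
        raise ValueError("uncorrectable chunk")
    shift = 8 * byte_index
    data = (data & ~(0xFF << shift)) | ((value & 0xFF) << shift)
    return encode(data)
```

With this construction a 32-bit chunk needs 7 check bits and a 64-bit chunk needs 8, which is the sense in which the 64-bit scheme is more storage-efficient; the 32-bit scheme in turn means a 32-bit data store replaces a whole chunk and needs no read-modify-write.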
Hard Errors in Cache and TCM
Hard errors cannot be corrected by writing back corrected data, so the error repeats when the memory is read again
‘Live-lock’ scenario when uncorrected instructions or data are continuously re-fetched
Cache line avoidance
Hard error cache
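Telling a soft error from a hard error, and then avoiding the affected cache line, can be sketched as follows. This is a toy model under stated assumptions: CacheLine and read_with_recovery are hypothetical names, a hard error is modelled as a stuck-at-1 bit, and the corrected argument stands in for the value that ECC correction would produce.

```python
class CacheLine:
    """One storage location; stuck_bit models a permanent (hard) fault."""

    def __init__(self, stuck_bit=None):
        self.value = 0
        self.stuck_bit = stuck_bit

    def write(self, value):
        self.value = value

    def read(self):
        if self.stuck_bit is None:
            return self.value
        return self.value | (1 << self.stuck_bit)   # the faulty bit always reads as 1


def read_with_recovery(index, lines, corrected, avoided):
    line = lines[index]
    if line.read() == corrected:
        return "ok"
    line.write(corrected)                 # write back the corrected data
    if line.read() == corrected:          # re-read: the error has gone
        return "soft_error_corrected"
    avoided.add(index)                    # error persists: stop using this line
    return "hard_error_line_avoided"


lines = [CacheLine(), CacheLine(), CacheLine(stuck_bit=3)]
lines[0].write(0x10)
lines[1].write(0x11)                      # transient corruption of the value 0x10
lines[2].write(0x10)                      # reads back as 0x18 because bit 3 is stuck
avoided = set()
results = [read_with_recovery(i, lines, 0x10, avoided) for i in range(3)]
```

Recording the line as avoided is what breaks the live-lock: without it, every re-fetch would hit the same faulty location and fail in the same way.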
Dual Core Lock Step
Both spatial (also orientation) and temporal separation
Avoiding common cause failures, i.e. reduced probability of both CPUs seeing the same failure at the same time and still checking OK
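The temporal separation in lock-step checking can be sketched as a comparator that delays the main CPU's outputs so they line up with those of a redundant copy running a few cycles behind. This is only a sketch under stated assumptions: the function lockstep_check, the two-cycle delay and the output traces are hypothetical.

```python
from collections import deque


def lockstep_check(main_outputs, copy_outputs, delay=2):
    """Return the cycle at which the checker flags a fault, or None if none."""
    pipeline = deque()                    # delay line on the main CPU outputs
    for cycle in range(len(main_outputs) + delay):
        if cycle < len(main_outputs):
            pipeline.append(main_outputs[cycle])
        if cycle >= delay:
            expected = pipeline.popleft()             # main output from `delay` cycles ago
            actual = copy_outputs[cycle - delay]      # copy runs `delay` cycles behind
            if expected != actual:
                return cycle
    return None


good = [1, 2, 3, 4]
no_fault = lockstep_check(good, good)
fault_cycle = lockstep_check(good, [1, 2, 9, 4])      # copy diverges on its third output
```

Because the two CPUs execute the same instruction at different times, a disturbance that hits both at the same instant affects different points in their execution and is therefore likely to show up as a mismatch.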
Bus ECC
ECC and Parity are generated, detected and corrected
Interconnect ‘veneered’ with same ECC/P functionality
ARMv7-R Architecture
Protected Memory System Architecture. Precise aborts
[Diagram: Dual Core Lock Step with Main CPU, CPU Copy, Delay blocks on Inputs and Outputs, L1 Memory, and a Checker with Fault output]
19
Research Project: TCLS ARM for Space
http://www.tcls-arm-for-space.eu/
Collaborative project funded by European Commission H2020 Space program
Started in Feb 2015 with a two-year duration
Project Objectives
Investigate feasibility of a fail-functional ARM CPU using the triple core lockstep (TCLS) principle
Target rad-tolerant space and safety-critical terrestrial applications
Assess the fail-functional design using rad-tolerant STM 65nm technology
Concept
Three ARM CPUs execute in lockstep
Fail functional – Resynchronize upon divergence
Shared ICache, DCache and memory
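The triple core lockstep concept can be sketched as a majority vote followed by resynchronisation of the diverging core. This is a toy model under stated assumptions, not the TCLS project's design: the functions vote and run, the state-update function and the injected bit flip are all hypothetical.

```python
from collections import Counter


def vote(outputs):
    """Return the majority value and the indices of cores that disagree with it."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three cores disagree")
    divergent = [i for i, out in enumerate(outputs) if out != value]
    return value, divergent


def run(states, next_state, cycles, fault_at=None):
    log = []
    for cycle in range(cycles):
        states = [next_state(s) for s in states]
        if cycle == fault_at:
            states[1] ^= 0x4              # assumption: an upset flips one bit in core 1
        value, divergent = vote(states)
        for i in divergent:
            states[i] = value             # resynchronise the diverging core
        log.append((value, divergent))
    return states, log


final_states, log = run([0, 0, 0], lambda s: (s + 1) & 0xFF, cycles=3, fault_at=1)
```

With two cores a mismatch can only be detected, whereas with three the majority identifies the correct value, so execution continues while the faulty core is brought back into step.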
20
The latest Cortex-M7 processor from ARM
High-performance processor with DSP capabilities
Six-stage superscalar pipeline with powerful DSP and SP/DP FP
Best-in-class core for high-end MCU or replace MCU+DSP
Flexible powerful memory system
Tightly-Coupled Memories for real-time determinism
64-bit AXI AMBA4 memory interface with I and D cache
Next-gen MCU with more memories and peripherals
ARMv7-M architecture and CMSIS support
100% binary compatibility from Cortex-M4
Cortex-M family ease-of-use and very low interrupt latency
Reuse code and system design from existing products
Fault detection and control features
MPU, memory ECC (SEC-DED), on-line MBIST, DCLS
21
Thank You
The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. Any other marks featured may be trademarks of their respective owners.