Upload
vuongbao
View
220
Download
4
Embed Size (px)
Citation preview
08/12/11
Real-Time Systems
Hardware
Hermann Härtig, Marcus Völp
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 2
Outline
● Hardware: Source of Unpredictability● Caches● Pipeline● Interrupt Latency● System Management Mode
● Special Purpose Hardware for “Embedded” Real-Time Systems
● Use unpredictable Hardware more predictably
● Real-Time Communication / Buses in separate Lecture
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 3
Sources of Unpredictability
● Memory Subsystem● TLB● Caches● Store Buffers
0100301003
store R1, 0x01003025
025025
0100301003 302302TLB:
302302 025025
PT WalkerPT Walkermiss
hit
Virtual address:
Physical address:
30203020
3020 2 5
L1 Cache:
Level 2 - CacheLevel 2 - Cache
Main MemoryMain Memory
0x10030250x1003025 ^R1^R1
Memory ControllerMemory Controller
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 4
Sources of Unpredictability
● Processor Pipeline (ARM 11)
DEDEIFIFIFIF ISSISS ALUALU
MACMAC MACMAC MACMAC
ADDADD DCDC DCDC
WBWB
WBWB
WBWB
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 5
Sources of Unpredictability
● Interrupt Latency
DeviceDevice
Interrupt ControllerInterrupt Controller
CPUCPU
incoming signal
interrupts enabled?
await instruction boundary
kernel entry
select handler code (index into interrupt vector table)
handle_irq: ... return from interrupt
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 6
Interrupt Response Time – RTLinux
Interrupt response time:Time from occurrence of interrupt to first instruction of RT Task
No parallel load (idle): 13 µs
High parallel load: 68 µs (Benchmark, Cache-Flodder)
Measured on Intel P4 1.6 GHz
0.0001
0.001
0.01
0.1
1
0 5 10 15 20 25 30 35 40 45 50
Frequ
ency
IRQ occurence rate (µs)
log.fl.josephina.rtlinux: a
0.0001
0.001
0.01
0.1
1
0 5 10 15 20 25 30 35 40 45 50
Frequ
ency
IRQ occurence rate (µs)
log.t3.josephina.rtlinux: lat_proc_sh_dynamic
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 7
Interrupt Response Time – L4RTL (+AS switch)
Interrupt response time:Time from occurrence of interrupt to first instruction of RT Task
No parallel load (idle): 43 µs
High parallel load: 85 µs (Benchmark, Cache-Flooder)
Measured on Intel P4 1.6 GHz
0.0001
0.001
0.01
0.1
1
0 5 10 15 20 25 30 35 40 45 50
Frequ
ency
IRQ occurence rate (µs)
log.fl.josephina.fiasco: a
0.0001
0.001
0.01
0.1
1
0 5 10 15 20 25 30 35 40 45 50
Frequ
ency
IRQ occurence rate (µs)
log.t3.josephina.fiasco: lat_proc_sh_dynamic
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 8
System Management Mode (SMM)
● PC platforms
● “sits” underneath operating system
● Invoked using non-maskable interrupt
● Used for platform specifcs, correction of design errors,
thermal management
● Can be switched off, but better do not try
● Adds unpredictable delays
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 9
Outline
● Hardware is Source of Unpredictability● Special Purpose Hardware for “Embedded” Real-Time
Systems● Low Latency Interrupt Mode● Peripheral Event Controller● Capture – Compare Units● Scratchpad Memories● Real-Time Clocks
● How to use unpredictable Hardware more predictably
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 10
Low Latency Interrupts ...
… are considered of primary importance ● Sampling of Data in a High Rate (Sensors, Video, ...)● Fast Response Times (Break Control, …)● Part of context switch time .....
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 11
Low Latency Interrupts ...
… are considered of primary importance ● Sampling of Data in a High Rate (Sensors, Video, ...)● Fast Response Times (Break Control, …)● Part of context switch time .....
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 12
Low-Latency Interrupts● Two Interrupt modi: ARM IRQ + FIQ
● FIQ interrupts IRQ handler● 5 registers immediately available for FIQ handler
no need to save register content
● ARM Low Interrupt Latency Confguration● Minimize worst case interrupt latency, recommendations:
– disable Hit-under-Miss in Cache– abandon pending restartable memory OPs
restart memory OP on return from interrupt– do not use multi-word load / store instructions– avoid accesses to slow memory
(device memory, strong ordering of memory accesses)
● Worst Case FIQ Latency (PL190 VIC) : – Interrupt synchronization 3 cycles– Worst case execution time of current instruction 7 cycles– Entry to frst instruction 2 cycles
SUM: 12 cycles
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 13
Low Latency Interrupts (2)
Peripheral Event Controller (PEC)● Memory ↔ io-register move ● triggered by signal
● Programmers Interface:– counter decrement on signal,
trigger CPU interrupt on “counter==0”– byte / word select select width of transfer (byte or word)– source / dest increase update source / destination address– source / dest address
● Semantics expressed as Signal-Handler:
signal () { if (counter--) *dest++ = *src; => minimal response time (~3 cycles) else => PEC can execute at external clock rate trigger_interrupt();
}
Example: SAB 80C166 (Siemens, 1990)
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 14
Timestamps / Event triggers
● Precise timestamps / trigger times with interrupt-based sampling is difficult
● system load influences jitter in interrupt response time● atomic sections in OS delay interrupts● jitter in signal delivery (from signal source to CPU interrupt pin)
– buses, interrupt controller
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 15
Capture + Compare Unit
● Problem● precise timestamp of event
(engine sensor triggered at cylinder position = t)● trigger events at precise points in time
(fre the engine at t + x)
● interrupt handlers' jitter too high to meet the points in time
Capture Compareevent interrupt
ClockClock
Capture RegisterCapture Register Compare RegisterCompare Register
ClockClock
handler
== event
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 16
Combination PEC, Capture-Compare
● Compare + PEC: highly efficient triggering of events
● Capture + PEC: highly efficient sampling
● no SW intervention
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 17
System on a Chip / Chip Multiprocessor
● Processor Chip contains entire system● CPU Cores / Peripheral Components / Memory available
as masks for weaver● Confgure system as required
● Examples:● Mobile Phones today:
General Purpose Core (e.g., ARM) + Baseband DSP Processor
● Game Console (Cell): General Purpose (PPC) + special purpose Graphics
● Sensor + 8-bit Controller + CAN bus
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 18
Scratch-pad Memory vs. Caches
● Synonyms: [Scratchpad / On-Chip / Tightly-Coupled] Memory
● Cache: ● transparent addressing scheme● cache controller flls / replaces cachelines in parallel to
CPU activity● difficult to predict worst-case execution time
L1
L2
Main Memory
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 19
Scratch-pad Memory vs. Caches (2)
● Synonyms: [Scratchpad / On-Chip / Tightly-Coupled] Memory
● Cache: ● transparent addressing scheme● cache controller flls / replaces cachelines in parallel to
CPU activity● difficult to predict worst-case execution time
L1
L2
Main Memory
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 20
Scratch-pad Memory vs. Caches (3)
● Synonyms: [Scratchpad / On-Chip / Tightly-Coupled] Memory
● Cache: ● transparent addressing scheme● cache controller flls / replaces cachelines in parallel to
CPU activity● difficult to predict worst-case execution time
L1
L2
Main Memory
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 21
Scratch-pad Memory vs. Caches (4)
● Synonyms: [Scratchpad / On-Chip / Tightly-Coupled] Memory
● Cache: ● transparent addressing scheme● cache controller flls / replaces cachelines in parallel to
CPU activity● difficult to predict worst-case execution time
L1
L2
Main Memory
CPUCPU
1 cycle ~10 cycles
~50 cycles
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 22
Scratch-pad Memory vs. Caches (5)
● Synonyms: [Scratchpad / On-Chip / Tightly-Coupled] Memory
● Cache: ● transparent addressing scheme● cache controller flls / replaces cachelines in parallel to
CPU activity● difficult to predict worst-case execution time
L1
L2
Main Memory
CPUCPU
1 cycle ~10 cycles
~50 cycles
Clean: ~ 60 cyclesDirty: ~ 170 cycles
(Old Pentium 3)
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 23
Scratch-pad Memory vs. Caches (6)
Set
= =
Tag Data
Way 1 Way 2
Cache lock:
● Lock complete cache way● Allocate to unlocked cache ways only
Cache line
Terminology:* set* tag* way* cache line
Address example needed ........ for explanation
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 24
Scratch-pad Memory vs. Caches (7)
● Scratchpad Memory:● memory is addressed directly
(device mapped to physical memory space)● 2 cycles latency / currently ~ 64 KB (more ~4 MB to
come)● must explicitly manage data allocation● Compiler-based approaches:
– static: determine code + data hot spots + allocate scratchpad memory for hot spots
– dynamic: copy data to / from scratchpad memory
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 25
Scratch-pad Memory vs. Caches (8)
● ARM Tightly Coupled Memory:
RAMRAM TCMTCM
TCMTCM
RAMRAM
● overlay normal RAM with TCM
● simplifed cache logic for TCM memory
● keeps track whether RAM or TCM holds current value
● on miss: copies data from underlying RAM to TCM
● Which are the differences ??
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 26
Real Time Clocks
● Multiple Clocks in SoC● Core Cycle Count 200 MHz fne grained
+ fast to access
– accuracy ; clock skew ; subject to power management (dvfs)● Audio Codec● PCI 33 MHz/ 66 MHz● Timer● RTC 32.768 kHz ; high precision ;
small clock skew
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 27
Outline
● Hardware is Source of Unpredictability● Special Purpose Hardware for “Embedded” Real-Time
Systems● Use unpredictable Hardware more predictable
● Cache Partitioning● Real-Time Communication / Buses in separate Lecture
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 28
(OS-Controlled) Cache Partitioning
● Goal● partition cache to minimize interference between
RT/RT or RT/NRT applications
● General method:● address regions map to cache sets
● Control allocation to address regions
● Approaches● Compiler/linker
● OS controlled– transparent to application
– no need to rely on cooperating applications
isolates malicious, erroneous applications
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 29
RT and Non-RT
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 30
Cache Partitioning
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 31
Cache Partitioning
v : vpn
r : rpn
virtual address
(vpn: virtual page number)
real address
MMUL1
Cache
L2 Cache
Memory
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 32
rpn
L2 Cache
rpn
tag co
nten
t
… cachelines
{ index
=
{
{
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 33
rpn
tag co
nten
t
{ index
={
{
tag co
nten
t
=
2 way associativity
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 34
1L Cache
2L Cache
Main
Memory
ppn off
000000
010000
020000
030000
040000
050000
060000
070000
100000
110000
120000
130000
140000
150000
160000
170000
001000
011000
021000
031000
041000
051000
061000
071000
101000
111000
121000
131000
141000
151000
161000
171000
002000
012000
022000
032000
042000
052000
062000
072000
102000
112000
122000
132000
142000
152000
162000
172000
003000
013000
023000
033000
043000
053000
063000
073000
103000
113000
123000
133000
143000
153000
163000
173000
004000
014000
024000
034000
044000
054000
064000
074000
104000
114000
124000
134000
144000
154000
164000
174000
005000
015000
025000
035000
045000
055000
065000
075000
105000
115000
125000
135000
145000
155000
165000
175000
006000
016000
026000
036000
046000
056000
066000
076000
106000
116000
126000
136000
146000
156000
166000
176000
007000
017000
027000
037000
047000
057000
067000
077000
107000
117000
127000
137000
147000
157000
167000
177000
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 35
1L Cache
2L Cache
Main
Memory
ppn off
000000
010000
020000
030000
040000
050000
060000
070000
100000
110000
120000
130000
140000
150000
160000
170000
001000
011000
021000
031000
041000
051000
061000
071000
101000
111000
121000
131000
141000
151000
161000
171000
002000
012000
022000
032000
042000
052000
062000
072000
102000
112000
122000
132000
142000
152000
162000
172000
003000
013000
023000
033000
043000
053000
063000
073000
103000
113000
123000
133000
143000
153000
163000
173000
004000
014000
024000
034000
044000
054000
064000
074000
104000
114000
124000
134000
144000
154000
164000
174000
005000
015000
025000
035000
045000
055000
065000
075000
105000
115000
125000
135000
145000
155000
165000
175000
006000
016000
026000
036000
046000
056000
066000
076000
106000
116000
126000
136000
146000
156000
166000
176000
007000
017000
027000
037000
047000
057000
067000
077000
107000
117000
127000
137000
147000
157000
167000
177000
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 36
1L Cache
2L Cache
Main
Memory
ppn off
000000
010000
020000
030000
040000
050000
060000
070000
100000
110000
120000
130000
140000
150000
160000
170000
001000
011000
021000
031000
041000
051000
061000
071000
101000
111000
121000
131000
141000
151000
161000
171000
002000
012000
022000
032000
042000
052000
062000
072000
102000
112000
122000
132000
142000
152000
162000
172000
003000
013000
023000
033000
043000
053000
063000
073000
103000
113000
123000
133000
143000
153000
163000
173000
004000
014000
024000
034000
044000
054000
064000
074000
104000
114000
124000
134000
144000
154000
164000
174000
005000
015000
025000
035000
045000
055000
065000
075000
105000
115000
125000
135000
145000
155000
165000
175000
006000
016000
026000
036000
046000
056000
066000
076000
106000
116000
126000
136000
146000
156000
166000
176000
007000
017000
027000
037000
047000
057000
067000
077000
107000
117000
127000
137000
147000
157000
167000
177000
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 37
v : vpn
r : rpn
virtual address
real address
MMU
L2 Cache
L1 Cache
Memoryr´: rpnhardware address
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 38
OS Controlled Cache Partitioning (2)
Cache Coloring
16 bit address (binary representation): 1001 0011 0010 0100
0x93
offset into 32 byte cachelineidxtag
+0 +32 +64 +96 +128 +160 +192 +224
Cach
e S
ize
Farben besser erklären
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 39
OS-Controlled Cache Partitioning
Address translation and caches
0 10000 10000000 0000 0011 0000 000000 0000 0011 0000 00 10 1000 01110 1000 011
1000 0110 10001000 0110 1000
0x30a0x30aphysical frame number
864864
8648640x400030x40003virtual address:
physical address:
virtual page number
MMUMMU
0000 0000 0011 0000 00100000 0000 0011 0000 0010offset
tag idx ofs
1010
1010
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 40
OS Controlled Cache Partitioning (4)
Cache Coloring
Address translation and caches
0 10000 10000000 0000 0011 0000 000000 0000 0011 0000 00 10 1000 01110 1000 011
1000 0110 10001000 0110 1000
0x30a0x30aphysical frame number
864864
8648640x400030x40003virtual address:
physical address:
virtual page number
MMUMMU
0000 0000 0011 0000 00100000 0000 0011 0000 0010offset
tag idx ofs L1 Cache:64KB, 4 Way, 32byte CL2 bits:
● subject to address translation● evaluated to determine cache set => assign different colors to different RT + non RT Apps. => allocate in the OS memory frames of respective color
1010
1010
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 41
OS-Controlled Cache Partitioning
Scenario● high priority real-time task (color = 00)● runs in frequent short intervals
● other tasks in the backgound (e.g., cache flodder)
Example: Filter
● Worst case:● without cache partitioning:
background tasks may evict all cachelines background tasks may load conflicting cachelines, which need
writeback
Flush FlushFlush
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 42
OS-Controlled Cache Partitioning
1 0 0 %
1 5 0 %
2 0 0 %
2 5 0 %
3 0 0 %
3 5 0 %
4 0 0 %
4 8 1 6 3 2 6 4 1 2 8 2 5 6 5 1 2 1 0 2 4 2 0 4 8 4 0 9 6
u n p a r t i t i o n e d
s n g l w r p a r t
p a r t i n t i o n e d
Filter
PentiumPro, 133 MHz
(sngl wr = single writer – only RT app may write to pages of color x ; others may read)
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 43
OS-Controlled Cache Partitioning
Example 2: low priority task; frequently interrupted e.g., matrix multiplication
64 x 64 Matrix Multiplication
643 x 5.51 cycles: 10.9 ms
Flush Flush Flush Flush
timeslice
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 44
1 0 0 %
2 0 0 %
3 0 0 %
4 0 0 %
5 0 0 %
6 0 0 %
7 0 0 %
8 0 0 %
9 0 0 %
1 0 0 0 %
1 1 0 0 %
1 2 0 0 %
1 3 0 0 %
1 4 0 0 %
1 5 0 0 %
1 6 0 0 %
1 7 0 0 %
5 0 1 0 0 1 5 0 2 0 0 2 5 0 3 0 0 3 5 0 4 0 0 4 5 0
u n p a r t i t i o n e d
s n g l w r p a r t
p a r t i t i o n e d
1 0 0 %
1 2 0 %
1 4 0 %
1 6 0 %
1 8 0 %
2 0 0 %
2 2 0 %
2 4 0 %
2 6 0 %
3 0 0 4 0 0 5 0 0 6 4 0 7 6 8 8 9 6 1 0 2 4
OS-Controlled Cache Partitioning
64 x 64 Matrix Multiplication
Pentium, 133 MHz
[µs]
[µs]
timeslice
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 45
100%
105%
110%
115%
120%
125%
130%
135%
300 400 500 640 768 896 1024
1 0 0 %
1 5 0 %
2 0 0 %
2 5 0 %
3 0 0 %
3 5 0 %
4 0 0 %
4 5 0 %
5 0 0 %
5 0 1 0 0 1 5 0 2 0 0 2 5 0 3 0 0 3 5 0 4 0 0 4 5 0
u n p a r t i t i o n e d
p a r t i t i o n e d
OS-Controlled Cache Partitioning
64 x 64 Matrix Multiplication
Pentium, 133 MHz
write through
[µs]timeslice [µs]
[µs]
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 46
OS-Controlled Cache Partitioning
Experiment 3:Combination: Matrix Multiplication + Filter
0
5 0
1 0 0
1 5 0
2 0 0
2 5 0
3 0 0
2 0 k H z 1 0 k H z 5 k H z 3 . 3 k H z 2 k H z
u n p a r t i t i o n e d
p a r t i o t i o n e d
Matrix Multi- plication64 x 64
MM + Filter
1 0 0 %
1 2 0 %
1 4 0 %
1 6 0 %
1 8 0 %
2 0 0 %
2 2 0 %
2 4 0 %
2 6 0 %
2 8 0 %
2 0 k H z 1 0 k H z 5 k H z 3 . 3 k H z 2 k H z
[ms]
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 47
OS-Controlled Cache Partitioning
Caveat: application transparency
some applications require knowledge / control of physical addressese.g., drivers need physical addresses for direct memory address transfers (DMA)
● modify driver● use recent hardware (e.g., Intel VT-d) with address translation
for DMA
Caveat:
caches (especially in SMP) have become much more
complicated
WS 2011/12 Real-Time Systems, Hardware / Hermann Härtig / Marcus Völp 48
Conclusions
● High performance CPUs are rarely used in embedded systems
● HW means to increase predictability
● SW tweaks to use unpredictable HW in a more predictable way
● References:● Liedtke, Härtig, Hohmuth (RTAS '97):
Operating system control cache predictability for real-time applications
● ARM Real View Emulation Board User Guide
● ARM 1176jzfs Manual
● SAB 80C166 Handbook