Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache
Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund
Hugues Cassé, Haluk Ozaktas and Christine Rochange Université Paul Sabatier, Toulouse
Outline
• Cache Coherence in Todays Architectures
• Cache Coherence and Static WCET-Analysis
• In Short: The On-Demand Coherent Cache
• Static WCET-Analysis with ODC²
• Evaluation
• Conclusion
2
Cache Coherence in Todays Architectures
• Caches are part of almost every multi/many-core architecture.
• Accesses on shared data induce inconsistence of data in caches.
• Coherent accesses are preserved by cache coherence protocols.
• Coherence protocols affect the timing of data accesses.
• Coherence protocols affect the state and content of the caches.
• Coherence protocols affect a static worst-case execution time analysis.
3
Inconsistent data
Core
Cache X = 0
Core
Cache X = 1
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Policy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
4
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
[1] Aeroflex Gaisler LEON4 Product Sheet
Quad Core LEON4 Block Diagram [1]
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Strategy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
5
Core 0
Cache
Core 1
Cache
Shared Memory Data X = 0
Data X = 0 Data X = 0
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Strategy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
6
Core 0
Cache
Core 1
Cache
Shared Memory Data X = 0
Data X = 1 Data X = 0
Write X = 1
Data X = 1 Data X = 1
Invalid X
Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Strategy
• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.
7
Core 0
Cache
Core 1
Cache
Shared Memory Data X = 1
Data X = 1
Read X = 1
Data X = 1
Cache Coherence and Static WCET-Analysis
8
• Estimation of the WCET using a static analysis tool
• Abstraction of the hardware architecture incl. caches
• Cache Analysis to calculate data access latencies
Overview of the OTAWA framework [2]
• Prediction of cache-hits and cache-misses • Abstraction of possible cache states (Must/May Analysis) • Modification of cache state by data accesses in program • Analysis of private cache in isolation from other cores
Cache Analysis
[2] Ballabriga, C. et al.: OTAWA: an Open Toolbox for Adaptive WCET Analysis. In: The 8th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems
Cache Analysis
Cache Coherence and Static WCET-Analysis
9
CORE
Cache
• External Invalidations turn the cache into a shared resource
• Target of invalidation is unknown -> Invalidation of any data must be assumed
• Time of invalidation is unknown -> Invalidation must be assumed at any possible time
• Prediction of cache-hits is nearly impossible
External invalidations prohibit a feasible WCET estimation.
Data X = 1
External Invalidation
Data X = 0
Example: Quad-Core LEON4-FT GR712RC
• Bus interconnect
• Private L1-Caches
• Shared L2-Cache
• Write-Through Policy
• Alternative: Pure software-based cache coherence using the Cache-Flush instruction.
10
Cache-Flush: Invalidation of the complete cache triggered by a special instruction.
Quad Core LEON4 Block Diagram [1]
[1] Aeroflex Gaisler LEON4 Product Sheet
On-Demand Coherent Cache (ODC²)
• Software/Hardware co-approach
• Handling of coherence on demand
• Different handling of accesses to private and shared data
• Access to shared data only in protected code regions
• Invalidation of shared data after use
• No coherence transactions
11
Core ODC Cache
Replacement information
Control flags (Valid, Modified)
Indication of possible carrier of shared data
Shared Bit Crtl Repl. info Data Tag
ODC² Block-Frame
On-Demand Coherent Cache (ODC²)
12
…
- Accesses on private data -
…
Mutex Lock / Barrier;
enter_shared_mode();
…
- Accesses on private/shared data -
…
exit_shared_mode();
Mutex Unlock / Barrier;
…
- Accesses on private data -
…
ODC² in Private-Mode
• No coherence-operations
• Functionality and timing equal to incoherent cache
ODC² in Shared-Mode
• Newly loaded cache-lines are marked as shared
• Checking of target address
• Write-Through Policy
ODC² in Restore-Procedure
• Invalidation of marked cache-lines
ODC² in Private-Mode
• No coherence-operations
• Functionality and timing equal to incoherent cache
Abstract Program ODC² Functionality
Cache Analysis
Static WCET-Analysis with ODC²
13
CORE
ODC Cache
• Cache remains private resource • Time and target of invalidation is known • Invalidation increases cache-miss rate • Invalidation improve efficiency • Influence on timing is bounded and
predictable • Additional Shared Analysis to identify
accesses to shared data • New classification of accesses in Must/May
Analysis: Always Shared, Never Shared, Sometimes Shared
Data X = 0
Evaluation
14
OTAWA evaluation framework
parMERASA real-time capable many-core architecture incl. ODC²
Parallel benchmarks
Program flow and synchro-nisation
Cache analysis enhanced with ODC² model
WCET analysis of ODC² and common alternatives using the OTAWA toolbox
System architecture:
• Tree-type interconnect
• PPC-based cores with private L1-Caches
• Shared memory
Parallel Benchmarks:
• Matrix Multiplication
• Fast Fourier Transformation
• Dijkstra Shortest Path
Evaluated Platforms:
• ODC²
• Uncached shared data
• Software Cache-Flush
• Magic Coherence
Evaluation
15
• WCET Analysis regarding parallel execution with 2-8 cores.
• Significantly reduced WCET compared to Uncached and Cache-Flush
• Reduced accesses to shared memory with ODC²
• Coherence overhead of ODC² depends on access pattern to shared memory
Matrix FFT Dijkstra
0
50000
100000
150000
200000
250000
300000
350000
Num
be
r o
f A
cce
sse
s
Matrix FFT Djikstra
0,00
0,50
1,00
1,50
2,00
2,50
3,00
3,50
Uncached
Flush
ODC²
Magic
No
rmalis
ed
WC
ET
Matrix FFT Dijkstra
0
50000
100000
150000
200000
250000
300000
Num
be
r o
f W
T A
cce
sse
s
Conclusion
16
• Existing hardware cache coherence techniques are not suitable for a static WCET analysis, software coherence techniques lack in performance.
• ODC² permits time-predictable hardware/software co-approach.
• Static timing analysis using ODC² allows significant reduction of WCET compared to common alternatives.
• Promising approach for hard real-time multicore systems.
Thank you for your attention
Any questions?