16
WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund Hugues Cassé, Haluk Ozaktas and Christine Rochange Université Paul Sabatier, Toulouse

WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

Page 1: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache

Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Hugues Cassé, Haluk Ozaktas and Christine Rochange Université Paul Sabatier, Toulouse

Page 2: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Outline

• Cache Coherence in Todays Architectures

• Cache Coherence and Static WCET-Analysis

• In Short: The On-Demand Coherent Cache

• Static WCET-Analysis with ODC²

• Evaluation

• Conclusion

2

Page 3: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Cache Coherence in Todays Architectures

• Caches are part of almost every multi/many-core architecture.

• Accesses on shared data induce inconsistence of data in caches.

• Coherent accesses are preserved by cache coherence protocols.

• Coherence protocols affect the timing of data accesses.

• Coherence protocols affect the state and content of the caches.

• Coherence protocols affect a static worst-case execution time analysis.

3

Inconsistent data

Core

Cache X = 0

Core

Cache X = 1

Page 4: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Example: Quad-Core LEON4-FT GR712RC

• Bus interconnect

• Private L1-Caches

• Shared L2-Cache

• Write-Through Policy

• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.

4

Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.

[1] Aeroflex Gaisler LEON4 Product Sheet

Quad Core LEON4 Block Diagram [1]

Page 5: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.

Example: Quad-Core LEON4-FT GR712RC

• Bus interconnect

• Private L1-Caches

• Shared L2-Cache

• Write-Through Strategy

• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.

5

Core 0

Cache

Core 1

Cache

Shared Memory Data X = 0

Data X = 0 Data X = 0

Page 6: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.

Example: Quad-Core LEON4-FT GR712RC

• Bus interconnect

• Private L1-Caches

• Shared L2-Cache

• Write-Through Strategy

• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.

6

Core 0

Cache

Core 1

Cache

Shared Memory Data X = 0

Data X = 1 Data X = 0

Write X = 1

Data X = 1 Data X = 1

Invalid X

Page 7: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Snooping: Any core listen to data accesses on the bus. Write-Invalidate: Invalidation of caches data by write accesses of other cores.

Example: Quad-Core LEON4-FT GR712RC

• Bus interconnect

• Private L1-Caches

• Shared L2-Cache

• Write-Through Strategy

• Cache Coherence is ensured using a Snooping-based Write-Invalidate protocol.

7

Core 0

Cache

Core 1

Cache

Shared Memory Data X = 1

Data X = 1

Read X = 1

Data X = 1

Page 8: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Cache Coherence and Static WCET-Analysis

8

• Estimation of the WCET using a static analysis tool

• Abstraction of the hardware architecture incl. caches

• Cache Analysis to calculate data access latencies

Overview of the OTAWA framework [2]

• Prediction of cache-hits and cache-misses • Abstraction of possible cache states (Must/May Analysis) • Modification of cache state by data accesses in program • Analysis of private cache in isolation from other cores

Cache Analysis

[2] Ballabriga, C. et al.: OTAWA: an Open Toolbox for Adaptive WCET Analysis. In: The 8th IFIP Workshop on Software Technologies for Future Embedded and Ubiquitous Systems

Page 9: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Cache Analysis

Cache Coherence and Static WCET-Analysis

9

CORE

Cache

• External Invalidations turn the cache into a shared resource

• Target of invalidation is unknown -> Invalidation of any data must be assumed

• Time of invalidation is unknown -> Invalidation must be assumed at any possible time

• Prediction of cache-hits is nearly impossible

External invalidations prohibit a feasible WCET estimation.

Data X = 1

External Invalidation

Data X = 0

Page 10: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Example: Quad-Core LEON4-FT GR712RC

• Bus interconnect

• Private L1-Caches

• Shared L2-Cache

• Write-Through Policy

• Alternative: Pure software-based cache coherence using the Cache-Flush instruction.

10

Cache-Flush: Invalidation of the complete cache triggered by a special instruction.

Quad Core LEON4 Block Diagram [1]

[1] Aeroflex Gaisler LEON4 Product Sheet

Page 11: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

On-Demand Coherent Cache (ODC²)

• Software/Hardware co-approach

• Handling of coherence on demand

• Different handling of accesses to private and shared data

• Access to shared data only in protected code regions

• Invalidation of shared data after use

• No coherence transactions

11

Core ODC Cache

Replacement information

Control flags (Valid, Modified)

Indication of possible carrier of shared data

Shared Bit Crtl Repl. info Data Tag

ODC² Block-Frame

Page 12: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

On-Demand Coherent Cache (ODC²)

12

- Accesses on private data -

Mutex Lock / Barrier;

enter_shared_mode();

- Accesses on private/shared data -

exit_shared_mode();

Mutex Unlock / Barrier;

- Accesses on private data -

ODC² in Private-Mode

• No coherence-operations

• Functionality and timing equal to incoherent cache

ODC² in Shared-Mode

• Newly loaded cache-lines are marked as shared

• Checking of target address

• Write-Through Policy

ODC² in Restore-Procedure

• Invalidation of marked cache-lines

ODC² in Private-Mode

• No coherence-operations

• Functionality and timing equal to incoherent cache

Abstract Program ODC² Functionality

Page 13: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Cache Analysis

Static WCET-Analysis with ODC²

13

CORE

ODC Cache

• Cache remains private resource • Time and target of invalidation is known • Invalidation increases cache-miss rate • Invalidation improve efficiency • Influence on timing is bounded and

predictable • Additional Shared Analysis to identify

accesses to shared data • New classification of accesses in Must/May

Analysis: Always Shared, Never Shared, Sometimes Shared

Data X = 0

Page 14: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Evaluation

14

OTAWA evaluation framework

parMERASA real-time capable many-core architecture incl. ODC²

Parallel benchmarks

Program flow and synchro-nisation

Cache analysis enhanced with ODC² model

WCET analysis of ODC² and common alternatives using the OTAWA toolbox

System architecture:

• Tree-type interconnect

• PPC-based cores with private L1-Caches

• Shared memory

Parallel Benchmarks:

• Matrix Multiplication

• Fast Fourier Transformation

• Dijkstra Shortest Path

Evaluated Platforms:

• ODC²

• Uncached shared data

• Software Cache-Flush

• Magic Coherence

Page 15: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Evaluation

15

• WCET Analysis regarding parallel execution with 2-8 cores.

• Significantly reduced WCET compared to Uncached and Cache-Flush

• Reduced accesses to shared memory with ODC²

• Coherence overhead of ODC² depends on access pattern to shared memory

Matrix FFT Dijkstra

0

50000

100000

150000

200000

250000

300000

350000

Num

be

r o

f A

cce

sse

s

Matrix FFT Djikstra

0,00

0,50

1,00

1,50

2,00

2,50

3,00

3,50

Uncached

Flush

ODC²

Magic

No

rmalis

ed

WC

ET

Matrix FFT Dijkstra

0

50000

100000

150000

200000

250000

300000

Num

be

r o

f W

T A

cce

sse

s

Page 16: WCET Analysis of Parallel Benchmarks using On …WCET Analysis of Parallel Benchmarks using On-Demand Coherent Cache Arthur Pyka, Lillian Tadros, Sascha Uhrig University of Dortmund

Conclusion

16

• Existing hardware cache coherence techniques are not suitable for a static WCET analysis, software coherence techniques lack in performance.

• ODC² permits time-predictable hardware/software co-approach.

• Static timing analysis using ODC² allows significant reduction of WCET compared to common alternatives.

• Promising approach for hard real-time multicore systems.

Thank you for your attention

Any questions?