Garo Bournoutian and Alex Orailoglu
Proceedings of the 45th ACM/IEEE Design Automation Conference (DAC'08), June 2008
Abstract
Today, embedded processors are expected to be able to run complex, algorithm-heavy applications that were originally designed and coded for general-purpose processors. As a result, traditional methods for addressing performance and determinism become inadequate.
This paper explores a new data cache design for use in modern high-performance embedded processors that will dynamically improve execution time, power efficiency, and determinism within the system. The simulation results show significant improvement in cache miss ratios and reduction in power consumption of approximately 30% and 15%, respectively.
What's the Problem
- Primary (L1) caches in embedded processors are direct-mapped for power efficiency
- However, direct-mapped caches are predisposed to thrashing
- Hence, a cache design is required that will:
  - Improve performance, power efficiency, and determinism
  - Minimize the area cost
Related Works
- Cache optimization techniques for embedded processors:
  - Reduce cache conflict and cache pollution: filter caches [6]
  - Provide extended associativity: retain data evicted from the cache in a small associative victim cache [2]; pseudo-associative caches, which place blocks in a second associated line [5]
  - Improve cache utilization: a dual data cache scheme that can distinguish spatial, temporal, and single-use memory references [3]; application-specific cache partitioning [4]
  - Increase power efficiency: shut down cache ways, adapting to the application [7]
- This paper: dynamically detect thrashing behavior and expand the selected sets of the data cache, performing the expandable cache lookup only when necessary
Motivating Example
- Illustrates why the select sets need to be expanded dynamically, and the insufficiency of the victim cache
- Example thrashing code: B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R
- (Figure: successive cache thrashing among Set-R, Set-S, and Set-Q)
Motivating Example (Cont.)
- Cache trace of the example thrashing code: B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R
- (Figure: with a 2-entry victim cache, E[i], F[i], and D[i] evict B[i], C[i], and A[i] from the main cache; the uncorrelated evicted data pollute the victim cache)
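The thrashing pattern above can be reproduced with a toy direct-mapped cache model. This is a sketch of my own, not code from the paper; the set count, line size, and array base addresses are assumed values chosen so that A/D, B/E, and C/F collide on the same sets:

```python
# Toy direct-mapped cache: every access in the loop conflict-misses,
# because each array's partner evicts its line before the next use.
NUM_SETS, LINE = 256, 32            # assumed geometry (matches the paper's 256-set L1)

def set_index(addr):
    return (addr // LINE) % NUM_SETS

# Base addresses one cache-span apart, so A/D, B/E, C/F share sets.
span = NUM_SETS * LINE
base = {'A': 0, 'B': LINE, 'C': 2 * LINE,
        'D': span, 'E': span + LINE, 'F': span + 2 * LINE}

cache = [None] * NUM_SETS           # one tag per set
misses = 0
for i in range(1024):               # e.g. loop body: A[i]=B[i]+C[i]; D[i]=E[i]+F[i]
    for name in ('B', 'C', 'A', 'E', 'F', 'D'):
        addr = base[name] + 4 * i   # 4-byte elements
        s, tag = set_index(addr), addr // span
        if cache[s] != tag:         # conflict miss: the paired array evicted us
            misses += 1
            cache[s] = tag

print(misses)  # all 6 * 1024 = 6144 accesses miss
```

Every access misses because within each iteration the second triple of arrays evicts the first triple's lines from the very sets they will need next iteration.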
The Dynamically Expandable L1 Cache Architecture
- Two main mechanisms:
  1. Circular recently-evicted-set list
  2. Expandable cache lookup
(1) Circular Recently-Evicted-Set List
- A small circular list keeps track of the indices of the most recently evicted sets
- Goal: detect a probable thrashing set
- Operation:
  - The circular list is looked up only on a cache miss
  - If the missed set is present in the list, conclude that the set is in a thrashing state and should be dynamically expanded: enable the expand bit for that set
- The circular list is accessed and updated only during a cache miss, so hit timing is not affected
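A minimal behavioral sketch of this mechanism (my own illustration, not the authors' hardware; the list depth and set count are assumed values):

```python
from collections import deque

LIST_DEPTH = 5                      # assumed depth (the direct-mapped config uses 5)
NUM_SETS = 256

evicted_list = deque(maxlen=LIST_DEPTH)   # circular: oldest entry is overwritten
expand_bit = [False] * NUM_SETS

def on_cache_miss(set_index):
    """Consulted only on a miss, so hit latency is untouched."""
    if set_index in evicted_list:   # evicted again recently -> probable thrashing
        expand_bit[set_index] = True
    evicted_list.append(set_index)  # record this eviction

# A set that misses twice in quick succession gets its expand bit enabled:
on_cache_miss(42)
on_cache_miss(42)
print(expand_bit[42])  # True
```

Using a bounded `deque` mirrors the hardware structure: a fixed number of entries, with new evictions overwriting the oldest ones.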
(2) Expandable Cache Lookup
- Goal: allow a set to re-lookup into a predefined secondary set, virtually doubling the associativity of the given set
- Operation:
  - The secondary set is determined by a fixed mapping function: flip the most significant bit of the set index
  - Besides the expand bit, each cache set has a toggle bit that decides whether the initial lookup targets the primary or the secondary set; it is enabled when a cache hit occurs on the secondary set and disabled when a hit occurs on the primary set
  - If the first lookup misses and the expand bit is 1, a second lookup is made in the predefined secondary set on the next cycle: if the data are found, it is a cache hit with a one-cycle penalty; if not, it is a full cache miss
- (Figure: the MSB-flip mapping pairs set indices "00" ↔ "10" and "01" ↔ "11"; the probable thrashing set is detected by the first mechanism)
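The lookup flow can be sketched as follows (assumed parameters, my own model rather than the authors' RTL): the fixed MSB-flip mapping for the secondary set, plus the per-set toggle bit that remembers which side hit last.

```python
NUM_SETS = 256
MSB = NUM_SETS >> 1                 # 128: flipping this bit pairs the sets

def secondary(set_index):
    return set_index ^ MSB          # e.g. "00..." <-> "10...", "01..." <-> "11..."

cache = {}                          # set index -> tag (direct-mapped model)
expand_bit = [False] * NUM_SETS
toggle_bit = [False] * NUM_SETS     # True: start the lookup at the secondary set

def lookup(set_index, tag):
    """Return ('hit' | 'miss', cycles taken)."""
    first = secondary(set_index) if toggle_bit[set_index] else set_index
    if cache.get(first) == tag:                   # hit on the first probe
        toggle_bit[set_index] = (first != set_index)
        return ('hit', 1)
    if not expand_bit[set_index]:                 # not expanded: ordinary miss
        return ('miss', 1)
    other = first ^ MSB                           # re-lookup on the next cycle
    if cache.get(other) == tag:
        toggle_bit[set_index] = (other != set_index)
        return ('hit', 2)                         # hit with one-cycle penalty
    return ('miss', 2)                            # full cache miss
```

For example, with set 3 expanded and its data resident in set 131, the first access hits with a one-cycle penalty; subsequent accesses hit in a single cycle because the toggle bit now directs the initial lookup to the secondary set.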
A Demonstrative Example
- Cache trace of the proposed cache architecture on the example thrashing code
- (Figure: on each miss, the circular list is updated with Set-S, Set-Q, and Set-R; once a missed set is found in the list, its expand bit is set to 1, so B[i], C[i], and A[i] survive in the expanded sets Set-S', Set-Q', and Set-R' while E[i], F[i], and D[i] occupy the primary sets)
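Putting the two mechanisms together on the thrashing code, a toy model (my own sketch; the set indices and list depth are assumed values) reaches the steady state the trace depicts: after one warm-up pass the expand bits for Set-S, Set-Q, and Set-R are enabled and every later access hits.

```python
from collections import deque

NUM_SETS, MSB = 256, 128
cache = {}                          # set index -> tag
expand = [False] * NUM_SETS
evicted = deque(maxlen=5)           # circular recently-evicted-set list
hits = misses = 0

def access(set_idx, tag):
    global hits, misses
    # hit in the primary set, or in the secondary set once expanded
    if cache.get(set_idx) == tag or (expand[set_idx] and cache.get(set_idx ^ MSB) == tag):
        hits += 1
        return
    misses += 1
    if set_idx in evicted:          # seen recently: probable thrashing set
        expand[set_idx] = True
    evicted.append(set_idx)
    if expand[set_idx] and set_idx in cache:
        cache[set_idx ^ MSB] = tag  # fill the secondary set, keep the rival line
    else:
        cache[set_idx] = tag

S, Q, R = 5, 9, 13                  # made-up indices for Set-S, Set-Q, Set-R
for i in range(100):                # each iteration touches B,C,A then E,F,D
    for set_idx, tag in [(S, 'B'), (Q, 'C'), (R, 'A'), (S, 'E'), (Q, 'F'), (R, 'D')]:
        access(set_idx, tag)

print(misses, hits)  # only the 6 warm-up misses; all 594 later accesses hit
```

Compare this with the plain direct-mapped model of the motivating example, where every one of the 600 accesses would miss.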
Experimental Setup
- SimpleScalar toolset [8] is used for performance evaluation
- Two baseline configurations:
  - 256-set, direct-mapped L1 data cache with a 32-byte line size
  - 256-set, 4-way set-associative L1 data cache with a 32-byte line size
- CACTI [10] is used to evaluate power efficiency, assuming an L1/L2 power ratio of 20
  - Accessing data in L2 costs 20 times the power of accessing it in L1
- Benchmarks: 7 representative programs from the SPEC CPU2000 suite [9]
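The L1/L2 power-ratio assumption translates directly into a simple accounting sketch (the miss rates below are illustrative numbers of my own, not the paper's measurements):

```python
# Energy per access in units of one L1 access, under the assumed
# 20x L1/L2 power ratio: hits cost 1 unit, misses cost 20 units.
L2_RATIO = 20

def relative_power(miss_rate):
    """Average energy per access, normalized to one L1 access."""
    return (1 - miss_rate) * 1 + miss_rate * L2_RATIO

baseline = relative_power(0.10)     # e.g. a 10% miss rate
improved = relative_power(0.07)     # ~30% fewer misses
print(round(1 - improved / baseline, 3))  # 0.197: roughly 20% power saved
```

This illustrates why even a modest miss-rate reduction yields a sizable power saving: each avoided miss replaces a 20-unit L2 access with a 1-unit L1 access.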
Performance Improvement: Direct-Mapped Cache
- Criterion: miss rate reduction
- Configuration: 8-entry victim cache, 5-entry recently-evicted-set list
- Miss rate improvement of the proposed implementation over the baseline: arithmetic mean of 30.75%
Performance Improvement: 4-Way Set-Associative Cache
- Criterion: miss rate reduction
- Configuration: 64-entry victim cache, 8-entry recently-evicted-set list
- Miss rate improvement of the proposed implementation over the baseline: arithmetic mean of 26.74%
- Significant miss rate reduction for both direct-mapped and set-associative caches

Power Improvement: Direct-Mapped Cache
- Power reduction of the proposed implementation: average of 15.73%
- The technique consistently provides power reduction across the benchmarks
Power Improvement: 4-Way Set-Associative Cache
- The power reduction across the benchmarks is smaller here, and some benchmarks are exceptions with higher power costs
- The average is still an improvement of 4.19%
Conclusions
- This paper proposed a dynamically expandable data cache architecture composed of two main mechanisms:
  - A circular recently-evicted-set list, to detect a probable thrashing set
  - An expandable cache lookup, to virtually increase the associativity of a given set
- Experimental results show that the proposed technique yields significant reductions in cache misses and power consumption for both direct-mapped and set-associative caches