Garo Bournoutian and Alex Orailoglu
Proceedings of the 45th ACM/IEEE Design Automation Conference (DAC'08), June 2008

Abstract

Today, embedded processors are expected to run complex, algorithm-heavy applications that were originally designed and coded for general-purpose processors. As a result, traditional methods for addressing performance and determinism become inadequate.

This paper explores a new data cache design for modern high-performance embedded processors that dynamically improves execution time, power efficiency, and determinism within the system. Simulation results show a significant improvement in cache miss ratios and a reduction in power consumption of approximately 30% and 15%, respectively.

What's the Problem

Primary (L1) caches in embedded processors are direct-mapped for power efficiency. However, direct-mapped caches are predisposed to thrashing.

Hence, we require a cache design that will:
- Improve performance, power efficiency, and determinism
- Minimize the area cost

Related Works

Cache optimization techniques for embedded processors:
- Filter caches [6]: reduce cache conflict and cache pollution
- Victim cache [2]: retain data evicted from the cache in a small associative victim cache, providing extended associativity
- Dual data cache [3]: a scheme that distinguishes spatial, temporal, and single-use memory references to improve cache utilization
- Application-specific cache partitioning [4]
- Pseudo-associative caches [5]: place blocks in a second associated line
- Adaptive way shutdown [7]: shut down cache ways to adapt to the application

This paper: dynamically detect thrashing behavior and expand the selected sets of the data cache
- Expandable cache lookup is performed only when necessary
- Increases power efficiency

Motivating Example

Illustrates why we need to expand the selected sets dynamically, and the insufficiency of the victim cache.

Example thrashing code: A and D map to Set-R, B and E map to Set-S, C and F map to Set-Q.

[Figure: successive cache thrashing across Set-R, Set-S, and Set-Q.]

Motivating Example (cont.)

Cache trace of the example thrashing code (B and E map to Set-S, C and F map to Set-Q, A and D map to Set-R):

[Figure: B[i] occupies Set-S, C[i] Set-Q, and A[i] Set-R in the main cache; E[i], F[i], and D[i] then evict them in turn, while the 2-entry victim cache fills with uncorrelated evicted data. The uncorrelated evictions pollute the victim cache.]
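The thrashing above can be reproduced with a toy simulator. The sketch below is a hypothetical illustration, not the paper's model: it works on line addresses only, with a FIFO victim cache and the class/variable names invented here. With three conflicting pairs of lines, a direct-mapped cache backed by a 2-entry victim cache misses on every access, because each evicted line is pushed out of the victim cache by uncorrelated evictions before it can be reused:

```python
from collections import deque

class DirectMappedCache:
    """Direct-mapped cache over line addresses, with an optional small
    FIFO victim cache (in the spirit of [2]). Hypothetical sketch."""
    def __init__(self, num_sets, victim_entries=0):
        self.num_sets = num_sets
        self.lines = {}                                  # set index -> tag
        self.victim = deque(maxlen=victim_entries) if victim_entries else None
        self.hits = self.misses = 0

    def access(self, line_addr):
        index, tag = line_addr % self.num_sets, line_addr // self.num_sets
        if self.lines.get(index) == tag:                 # hit in the main cache
            self.hits += 1
            return
        if self.victim is not None and (index, tag) in self.victim:
            self.victim.remove((index, tag))             # hit in the victim cache
            self.hits += 1
        else:
            self.misses += 1
        if index in self.lines and self.victim is not None:
            self.victim.append((index, self.lines[index]))  # retain the evictee
        self.lines[index] = tag

# Six cache lines: R1/R2 collide in one set, S1/S2 in another, Q1/Q2 in a third.
R1, S1, Q1, R2, S2, Q2 = 0, 1, 2, 4, 5, 6                # 4 sets -> tags 0 and 1
cache = DirectMappedCache(num_sets=4, victim_entries=2)
for _ in range(100):                                     # loop body reuses all six lines
    for line in (R1, S1, Q1, R2, S2, Q2):
        cache.access(line)
print(cache.hits, cache.misses)                          # 0 600: every access misses
```

Every line is reused each iteration, so a cache with sufficient associativity would hit steadily; here the victim cache would need at least as many entries as there are simultaneously thrashing lines to help.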

The Dynamically Expandable L1 Cache Architecture

Two mechanisms:
1. Circular recently-evicted-set list
2. Expandable cache lookup

(1) Circular Recently-Evicted-Set List

A small circular list keeps track of the indices of the most recently evicted sets.

Goal: detect a probable thrashing set.

Operation:
- Look up the circular list only on a cache miss.
- If the missed set is present in the list, conclude that the current set is in a thrashing state and should be dynamically expanded: enable the expand bit for that set.
- The circular list is accessed and updated only during a cache miss, so hit timing is not affected.
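The list's behavior can be sketched as follows. This is a minimal illustration with hypothetical names (`EvictedSetList`, `on_miss`); the list length is a design parameter, like the 5- and 8-entry configurations used in the experiments later:

```python
from collections import deque

class EvictedSetList:
    """Circular list of recently evicted set indices: on a miss, if the
    missed set is already in the list, it is flagged as probably thrashing."""
    def __init__(self, length):
        self.recent = deque(maxlen=length)   # circular: oldest entry is overwritten

    def on_miss(self, set_index):
        thrashing = set_index in self.recent
        self.recent.append(set_index)
        return thrashing                     # caller enables the expand bit if True

detector = EvictedSetList(length=5)
expand = [0] * 8                             # one expand bit per cache set
for s in [3, 5, 7, 3]:                       # miss sequence; set 3 misses again soon
    if detector.on_miss(s):
        expand[s] = 1
print(expand)                                # expand bit enabled only for set 3
```

Because the list is touched only on misses, it sits off the hit path, which is why the slide notes that hit timing is unaffected.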

(2) Expandable Cache Lookup

Goal: allow a set to re-lookup into a predefined secondary set, virtually doubling the associativity of a given set.

Operation:
- The secondary set is determined by a fixed mapping function: flip the most significant bit of the set index.
- Besides the expand bit, each cache set carries a toggle bit that selects whether lookup starts on the primary or the secondary set. It is enabled when a cache hit occurs on the secondary set and disabled when a cache hit occurs on the primary set.
- 1st lookup: cache miss with expand bit = 1 (the probable thrashing set was detected by the first mechanism).
- 2nd lookup in the predefined secondary set on the next cycle: if found, a cache hit with a one-cycle penalty; if not found, a full cache miss.

[Figure: four sets indexed "00", "01", "10", "11", paired under the flipped-MSB mapping.]
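A behavioral sketch of the two bits and the two-phase lookup follows. The names and structure are hypothetical; tags are held one per set as in a direct-mapped cache, and the miss-handling/fill path and the detector that sets the expand bit are omitted:

```python
NUM_SETS = 256                     # 8-bit set index, matching the paper's baseline
MSB = NUM_SETS >> 1                # mask for the most significant index bit

def secondary(set_index):
    """Fixed mapping function: flip the MSB of the set index."""
    return set_index ^ MSB

class ExpandableCache:
    def __init__(self):
        self.tags = [None] * NUM_SETS
        self.expand = [False] * NUM_SETS   # set by the thrash detector on a miss
        self.toggle = [False] * NUM_SETS   # True: start lookup at the secondary set

    def lookup(self, set_index, tag):
        """Returns ('hit' or 'miss', cycles spent looking up)."""
        first = secondary(set_index) if self.toggle[set_index] else set_index
        if self.tags[first] == tag:
            self.toggle[set_index] = (first != set_index)   # remember where we hit
            return "hit", 1
        if not self.expand[set_index]:
            return "miss", 1                # not expanded: ordinary miss
        second = set_index if self.toggle[set_index] else secondary(set_index)
        if self.tags[second] == tag:        # 2nd lookup on the next cycle
            self.toggle[set_index] = (second != set_index)
            return "hit", 2                 # hit with a one-cycle penalty
        return "miss", 2                    # full cache miss

c = ExpandableCache()
c.expand[5] = True                  # detector flagged set 5 as thrashing
c.tags[secondary(5)] = 0xAB         # conflicting line lives in the secondary set
print(c.lookup(5, 0xAB))            # ('hit', 2): found on the second probe...
print(c.lookup(5, 0xAB))            # ('hit', 1): ...then the toggle bit starts there
```

The toggle bit is what keeps the common case at one cycle: once accesses settle in the secondary set, the first probe already goes there.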

A Demonstrative Example

Cache trace of the proposed cache architecture:

[Figure: B[i] occupies Set-S, C[i] Set-Q, and A[i] Set-R in the main cache. The misses on E[i], F[i], and D[i] update the circular list with Set-S, Set-Q, and Set-R; the expand bit of each set is set to 1, and the conflicting data is placed in the secondary sets Set-S', Set-Q', and Set-R'.]

Experimental Setup

Use the SimpleScalar toolset [8] for performance evaluation, with two baseline configurations:
- 256-set, direct-mapped L1 data cache with a 32-byte line size
- 256-set, 4-way set-associative L1 data cache with a 32-byte line size

Use CACTI [10] to evaluate power efficiency, assuming an L2-to-L1 power ratio of 20 (accessing data in L2 costs 20 times the power of accessing it in L1).

Benchmarks: 7 representative programs of the SPEC CPU2000 suite [9].
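Under the stated L2-to-L1 ratio of 20, the power trade-off can be sketched numerically. The miss rates and second-probe rate below are invented illustration values, not the paper's measurements; the point is only that a modest miss rate reduction can outweigh the energy of the extra secondary-set probes:

```python
# Relative cache-energy model, assuming each L2 access costs 20x an L1 access
# (per the setup above) and each expanded second probe costs one extra
# L1-equivalent access. All rates here are made-up illustration values.
L2_RATIO = 20

def relative_energy(accesses, miss_rate, second_probe_rate=0.0):
    l1 = accesses * (1 + second_probe_rate)   # primary + expanded lookups
    l2 = accesses * miss_rate * L2_RATIO      # misses serviced by L2
    return l1 + l2

base = relative_energy(1_000_000, miss_rate=0.10)
ours = relative_energy(1_000_000, miss_rate=0.07, second_probe_rate=0.05)
print(1 - ours / base)                        # fractional power reduction, ~0.18 here
```

For these made-up numbers, cutting the miss rate from 10% to 7% saves far more L2 energy than the 5% extra probes cost, giving roughly an 18% reduction under this model.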

Performance Improvement - Direct-Mapped Cache

Criterion: miss rate reduction. The arithmetic mean of the miss rate improvement of the proposed implementation is 30.75%.

[Figure: improvement over baseline, comparing an 8-entry victim cache against the proposed design with a 5-entry recently-evicted-set list.]

Performance Improvement - 4-Way Set-Associative Cache

Criterion: miss rate reduction. The arithmetic mean of the miss rate improvement of the proposed implementation is 26.74%.

[Figure: improvement over baseline, comparing a 64-entry victim cache against the proposed design with an 8-entry recently-evicted-set list.]

Takeaway: significant miss rate reduction for both direct-mapped and set-associative caches.

Power Improvement - Direct-Mapped Cache

The average power reduction of the proposed implementation is 15.73%.

[Figure: power reduction across the benchmarks; the design consistently provides a power reduction.]

Power Improvement - 4-Way Set-Associative Cache

However, the power reduction is less uniform across the benchmarks, with some exceptions showing higher power costs; the average is still an improvement of 4.19%.

Conclusions

This paper proposed a dynamically expandable data cache architecture composed of two main mechanisms:
- Circular recently-evicted-set list: detects a probable thrashing set
- Expandable cache lookup: virtually increases the associativity of a given set

Experimental results show that the proposed technique yields a significant reduction in cache misses and power consumption, for both direct-mapped and set-associative caches.

Comments on This Paper

- The related works are not strongly connected.
- The power-improvement results are too coarse: the extra power consumption of the support circuitry is not shown.
- Results for different lengths of the circular list are not shown.