1
RDIP: RAS-Directed Instruction Prefetching Aasheesh Kolli, Thomas F. Wenisch (University of Michigan) and Ali Saidi (ARM) 1. The instruction fetch bottleneck Performance gains of up to 50% and 16% on average possible with an ideal I$. © 2013 Aasheesh Kolli Instruction cache performance continues to be a bottleneck for modern server workloads. 4. RDIP design and challenges 5. RAS signature generation 9. Simulation Methodology 10. Coverage and erroneous prefetches 11. Performance gem5 full system simulator Core parameters: Core: 2GHz OoO, 8-wide commit, 16-entry RAS I-Cache: 32KB/2-way/64B, 2 cycles D-Cache: 64KB/2-way/64B, 3 cycles L2: 2MB/8-way/64B, 24 cycles Workloads: gem5 gem5 running a spec benchmark (twolf) HD-teraread Hadoop: Big data search MapReduce job HD-wdcnt Hadoop: Word count ssj Tests Java performance in SPECpower MC-friendfeed Memcached: “Facebook”-like app MC-microblog Memcached: “Twitter”-like app This work is partially supported by the National Science Foundation under grant CCF-0815457. 0.8 1 1.2 1.4 1.6 gem5 HD-teraread HD-wdcnt ssj MC-frinedfeed MC-microblog GMEAN Relative Performance No Prefetcher (NoP) Ideal Constraints on I$ hit-latency and cycle time do not allow for increasing cache capacity Instruction prefetching 2. Why another instruction prefetcher? Next-2-line (N2L) Proactive Instruction Fetch (PIF) [Ferdman ‘11] Industry standard State-of-the-art academic proposal + Low overhead, implementability + Best performance - Modest performance gains - High overhead ( > 200kB), design complexity 3. Insights Design steps Challenges 1. Use RAS signatures to represent program contexts Accurate representation 2. Map I$ misses to signature in Miss Table Minimize storage 3. Prefetch on next occurrence of signature Timely prefetching Call: XOR contents of RAS after push onto RAS, append 0 Return: XOR contents of RAS before pop from RAS, append 1 Example: A:funcX{ B:funcY{ } …. C:funcY{ Dynamic Instructions RAS contents RAS signature A B A B A C A C A (A)0 (AB)0 (AB)1 (AC)0 (AC)1 Same function called from different locations has different signatures Large functions can be broken down into multiple contexts I$ misses correlate strongly to program context Program contexts are predictable RAS succinctly captures program context 6. Timely prefetching S 1 S 2 S 3 Execution Stream Signature Prefetch Requests Time Mapping I$ misses with corresponding signature late prefetches. Mapping I$ misses with preceding signature timely prefetches 7. RDIP hardware summary Push/Pop RAS Address Overflows Current Signature Previous Signature Lookup Miss Table L2$ Miss Buffer Update 8. RDIP practical design RAS size for signature generation 4 top entries Miss Table 4k entries (4-way associative) Entry size 16B (compaction technique [Ferdman ‘11]) Total storage overhead: 64kB 0 50 100 150 200 250 300 N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP gem5 HD-teraread HD-wdcnt ssj MC-friendfeed MC-microblog Percentage Erroneous Prefetches Misses Prefetch Hits Better Better An ideal prefetcher tries to maximize coverage (def: prefetchHits over prefetchHits+misses) and minimize erroneous prefetches. 0.8 1 1.2 1.4 1.6 gem5 HD-teraread HD-wdcnt ssj MC-friendfeed MC-microblog GMEAN Relative Performance N2L PIF RDIP Ideal Better Coverage: PIF ~ RDIP > N2L Erroneous prefetches: PIF > N2L > RDIP Our goal: Build a low overhead, high accuracy prefetcher Performance increase: N2L 5%, PIF 13%, RDIP 11.5%, Ideal 16% 12. Conclusion In this work, we show the correlation that exists between I$ misses and program contexts and develop a mechanism to use the RAS state to represent program contexts. Leveraging these insights we develop a prefetching mechanism, RDIP. RDIP performs comparably to the state-of- the-art prefetchers at one third storage costs and simpler design.

Aasheesh Kolli, Thomas F. Wenisch (University of … RAS-Directed Instruction Prefetching Aasheesh Kolli, Thomas F. Wenisch (University of Michigan) and Ali Saidi (ARM) 1. The instruction

  • Upload
    ngohanh

  • View
    220

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Aasheesh Kolli, Thomas F. Wenisch (University of … RAS-Directed Instruction Prefetching Aasheesh Kolli, Thomas F. Wenisch (University of Michigan) and Ali Saidi (ARM) 1. The instruction

RDIP: RAS-Directed Instruction Prefetching Aasheesh Kolli, Thomas F. Wenisch (University of Michigan) and Ali Saidi (ARM)

1. The instruction fetch bottleneck

Performance gains of up to 50% and 16% on average

possible with an ideal I$.

© 2013 Aasheesh Kolli

Instruction cache performance continues to be a bottleneck for modern server workloads.

4. RDIP design and challenges

5. RAS signature generation

9. Simulation Methodology

10. Coverage and erroneous prefetches 11. Performance

• gem5 full system simulator

• Core parameters:

• Core: 2GHz OoO, 8-wide commit, 16-entry RAS • I-Cache: 32KB/2-way/64B, 2 cycles • D-Cache: 64KB/2-way/64B, 3 cycles • L2: 2MB/8-way/64B, 24 cycles

• Workloads:

• gem5 – gem5 running a spec benchmark (twolf) • HD-teraread – Hadoop: Big data search MapReduce job • HD-wdcnt – Hadoop: Word count • ssj – Tests Java performance in SPECpower • MC-friendfeed – Memcached:  “Facebook”-like app • MC-microblog – Memcached:  “Twitter”-like app

This work is partially supported by the National Science Foundation under grant CCF-0815457.

0.8

1

1.2

1.4

1.6

gem5 HD-teraread HD-wdcnt ssj MC-frinedfeed MC-microblog GMEAN

Rela

tive

Per

form

ance

No Prefetcher (NoP) Ideal

Constraints on I$ hit-latency and cycle time do not allow for increasing cache capacity Instruction prefetching

2. Why another instruction prefetcher?

Next-2-line (N2L) Proactive Instruction Fetch (PIF) [Ferdman  ‘11]

Industry standard State-of-the-art academic proposal

+ Low overhead, implementability + Best performance

- Modest performance gains - High overhead ( > 200kB), design complexity

3. Insights

Design steps Challenges 1. Use RAS signatures to represent program contexts Accurate representation

2. Map I$ misses to signature in Miss Table Minimize storage

3. Prefetch on next occurrence of signature Timely prefetching

Call: XOR contents of RAS after push onto RAS, append 0 Return: XOR contents of RAS before pop from RAS, append 1

Example:

A:funcX{

B:funcY{

}

…. C:funcY{

}

Dynamic Instructions RAS contents RAS signature

A

B A

B A

C A

C A

(A)0

(A⊕B)0

(A⊕B)1

(A⊕C)0

(A⊕C)1

• Same function called from different locations has different

signatures

• Large functions can be broken down into multiple contexts

• I$ misses correlate strongly to program context

• Program contexts are predictable

• RAS succinctly captures program context

6. Timely prefetching

S1

S2

S3

Execution Stream Signature

Prefetch Requests

Tim

e

Mapping I$ misses with corresponding signature late prefetches.

Mapping I$ misses with preceding signature timely prefetches

7. RDIP hardware summary

Push/Pop

RAS

Address

Overflows

Current Signature

Previous

Signature

Lookup

Miss Table

L2$

Miss Buffer

Update

8. RDIP practical design • RAS size for signature generation 4 top entries

• Miss Table 4k entries (4-way associative)

• Entry size 16B  (compaction  technique  [Ferdman  ‘11])

Total storage overhead: 64kB

0

50

100

150

200

250

300

N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP N2L PIF RDIP

gem5 HD-teraread HD-wdcnt ssj MC-friendfeed MC-microblog

Perc

enta

ge

Erroneous Prefetches Misses Prefetch Hits

Better Better

An ideal prefetcher tries to maximize coverage (def: prefetchHits over prefetchHits+misses)

and minimize erroneous prefetches.

0.8

1

1.2

1.4

1.6

gem5 HD-teraread HD-wdcnt ssj MC-friendfeed MC-microblog GMEAN

Rela

tive

Per

form

ance

N2L PIF RDIP Ideal

Better

Coverage: PIF ~ RDIP > N2L Erroneous prefetches: PIF > N2L > RDIP

Our goal: Build a low overhead, high accuracy prefetcher

Performance increase: N2L 5%, PIF 13%, RDIP 11.5%, Ideal 16%

12. Conclusion In this work, we show the correlation that exists between I$ misses and program contexts and develop a mechanism to use the RAS state to represent program contexts. Leveraging these insights we develop a prefetching mechanism, RDIP. RDIP performs comparably to the state-of- the-art prefetchers at one third storage costs and simpler design.