Many-Thread Aware Prefetching Mechanisms for GPGPU Applications. Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc. In the proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010. Paper presentation by Sankalp Shivaprakash.


Page 1: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Many-Thread Aware Prefetching Mechanisms for GPGPU Applications

Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, Richard Vuduc

In the proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010

Paper presentation by Sankalp Shivaprakash

Page 2: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Motivation

• Memory latency hiding through multi-threaded prefetching schemes
– Per-warp training and stride promotion
– Inter-thread prefetching
– Adaptive throttling

• Propose software and hardware prefetching mechanisms for a GPGPU architecture
– Scalable to a large number of threads
– Robust through feedback and throttling mechanisms that avoid degraded performance

Page 3: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Memory Latency Hiding Techniques

• Multithreading
– Thread-level and warp-level context switching

• Utilization of complex cache memory hierarchies
– Using L1, L2, and DRAM rather than accessing global memory each time

• Prefetching
– Useful when there is insufficient thread-level parallelism

• Memory request merging

(Figure: execution timeline interleaving Thread1, Thread2, and Thread3, hiding memory latency through context switching.)

Page 4: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Prefetching – Parallel Architectures

• Reason for prefetching: consider Warp1 and Warp2, each with three instructions (Add, Sub, Load)
• Without prefetch: while Load1 for Warp1 is outstanding, Warp2 runs ahead and then sits idle waiting for Load2
• With prefetch:
– Prefetch1: fetching for Load2
– Prefetch2: fetching for Load3

(Figure: execution timelines for Warp1, Warp2, and Warp3 with and without prefetching; prefetching removes the idle period spent waiting for Load2.)
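The idle period in this example can be illustrated with a toy timeline model; all latencies here are hypothetical round numbers chosen for illustration, not measurements from the paper.

```python
# Toy timeline: each warp executes two compute instructions (Add, Sub)
# and then a Load; compute and memory latencies are hypothetical.
def finish_time(n_warps, compute=2, mem=10, prefetch=False):
    """Cycle at which the last warp's load data returns, assuming one
    in-order core that round-robins warps and a memory system with
    unlimited outstanding requests (a deliberately simple model)."""
    t = 0
    ready = []
    for _ in range(n_warps):
        t += compute                      # this warp runs Add, Sub
        issue = 0 if prefetch else t      # prefetcher issues the load at t=0
        ready.append(max(issue + mem, t)) # data arrives, warp then proceeds
    return max(ready)

print(finish_time(2))                 # without prefetch: idle wait on Load2
print(finish_time(2, prefetch=True))  # with prefetch: idle wait removed
```

In this model the prefetched case finishes earlier because Warp2's load is already in flight while Warp1 computes, which is the effect the slide's timeline depicts.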

Page 5: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Prefetching (Contd)

• Software Prefetching
– Prefetching into registers
– Prefetching into cache
• Causes cache congestion if not controlled and accurate
• Can pollute the cache with useless data

Page 6: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Prefetching (Contd)

• Hardware Prefetching
– Stream prefetcher
• Monitors the direction of accesses within a memory region
• Once a constant access direction is detected, launches prefetches in that direction
– Stride prefetcher
• Tracks the difference between the addresses of consecutive accesses
• Launches prefetch requests using the delta once a constant difference is detected
– GHB (Global History Buffer) prefetcher
• Stores miss addresses in an n-entry FIFO table (the GHB)
• Each miss address points to another entry, which allows detection of stream, stride, and irregular repeating address patterns

(Example: miss addresses 0, 1000, 2000 yield a constant stride δ = 1000)

*Characterize aggressiveness
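The stride prefetcher described above can be sketched as follows; the two-confirmation training threshold, the table layout, and the `degree` parameter are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a PC-indexed stride prefetcher. A stride is trained
# per PC and prefetches are launched once the same delta is seen twice.
class StridePrefetcher:
    def __init__(self, degree=1):
        self.table = {}       # pc -> (last_addr, stride, confidence)
        self.degree = degree  # prefetches issued per triggering access

    def access(self, pc, addr):
        prefetches = []
        if pc in self.table:
            last, stride, conf = self.table[pc]
            delta = addr - last
            if delta == stride:
                conf += 1                 # same delta again: more confident
            else:
                stride, conf = delta, 0   # retrain on the new delta
            if conf >= 2:                 # stride confirmed: launch requests
                prefetches = [addr + stride * i
                              for i in range(1, self.degree + 1)]
            self.table[pc] = (addr, stride, conf)
        else:
            self.table[pc] = (addr, 0, 0)
        return prefetches

p = StridePrefetcher(degree=2)
for a in (0, 1000, 2000, 3000):           # the slide's δ = 1000 pattern
    out = p.access(pc=0x40, addr=a)
print(out)  # [4000, 5000]: next two lines along the confirmed stride
```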

Page 7: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Many-Thread aware prefetching

• Conventional Stride Prefetching
• Inter-thread Prefetching (IP)

MT-SWP

Page 8: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Many-Thread aware prefetching

Scalable versions of the traditional training policies for PC-based stride prefetchers

• Per-warp training
– Strong stride behavior exists within a warp
– Stride information trained per warp is stored in a PWS (Per-Warp Stride) table

MT-HWP
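A minimal sketch of per-warp training: the training table is indexed by (PC, warp id) rather than by PC alone, so warps with different access patterns do not corrupt each other's entries. The field layout and confirmation threshold are assumptions for illustration, not the paper's hardware design.

```python
# Per-warp stride (PWS) table sketch: training state is kept per
# (PC, warp id) pair, isolating each warp's stride pattern.
class PerWarpStrideTable:
    def __init__(self):
        self.pws = {}  # (pc, warp_id) -> (last_addr, stride, confidence)

    def train(self, pc, warp_id, addr):
        key = (pc, warp_id)
        if key not in self.pws:
            self.pws[key] = (addr, 0, 0)
            return None
        last, stride, conf = self.pws[key]
        delta = addr - last
        conf = conf + 1 if delta == stride else 0
        self.pws[key] = (addr, delta, conf)
        return delta if conf >= 2 else None  # confirmed per-warp stride

t = PerWarpStrideTable()
# Two warps at the same PC with different strides train independently:
for a in (0, 8, 16, 24):
    t.train(0x20, warp_id=0, addr=a)      # warp 0: stride 8
for a in (100, 228, 356, 484):
    t.train(0x20, warp_id=1, addr=a)      # warp 1: stride 128
```

With a single shared entry per PC, these two interleaved streams would keep resetting each other's confidence; keying by warp id is what makes the training scalable to many threads.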

Page 9: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Many-Thread aware prefetching

• Stride Promotion
– Since the stride pattern may be the same across all warps for a given PC, each PWS entry is monitored for three accesses
– If the same stride is found, the PWS entry is promoted to the Global Stride (GS) table; otherwise it is retained in the PWS table

• Inter-thread Prefetching
– Monitors the stride pattern across threads at the same PC for three memory accesses
– If the same stride is found, the stride information is stored in the IP table

MT-HWP
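The promotion policy can be sketched along these lines; the data structures and the three-observation check follow the slide's description, while the function shape is an illustrative assumption.

```python
# Sketch of stride promotion: a PC whose per-warp strides agree across
# several observations is promoted to a shared Global Stride (GS) table,
# freeing the per-warp entries.
def promote(pws_strides, gs_table, pc, observations=3):
    """pws_strides: confirmed strides for `pc`, one per warp observed."""
    if len(pws_strides) >= observations and len(set(pws_strides)) == 1:
        gs_table[pc] = pws_strides[0]   # one entry now serves every warp
        return True
    return False                        # keep training per warp in the PWS

gs = {}
promote([64, 64, 64], gs, pc=0x20)   # same stride in 3 warps -> promoted
promote([64, 32, 64], gs, pc=0x30)   # disagreement -> stays in the PWS
print(gs)  # only PC 0x20 is promoted
```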

Page 10: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Many-Thread aware prefetching – Implementation

• When there are hits in both the GS and IP tables, the GS table is given preference because:
– Strides within a warp are more common than strides across warps
– The GS entry has been trained for a longer period

MT-HWP
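The lookup priority described on this slide can be sketched as a simple ordered probe; the table shapes here are assumed for illustration, not taken from the paper.

```python
# Prefetch-address generation with the slide's priority: the Global
# Stride (GS) table is consulted before the Inter-thread Prefetching
# (IP) table, so a GS hit wins even when both tables hit.
def prefetch_addr(pc, addr, gs_table, ip_table):
    if pc in gs_table:              # GS hit: longer-trained, within-warp stride
        return addr + gs_table[pc]
    if pc in ip_table:              # otherwise fall back to the IP stride
        return addr + ip_table[pc]
    return None                     # no trained stride for this PC

gs = {0x20: 64}
ip = {0x20: 4, 0x28: 4}
print(prefetch_addr(0x20, 1000, gs, ip))  # GS preferred over IP
print(prefetch_addr(0x28, 1000, gs, ip))  # IP used when GS misses
```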

Page 11: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Useful vs. Harmful Prefetching

• MTAML (Minimum Tolerable Average Memory Latency)
– The minimum average number of cycles per memory request that does not lead to stalls

• MTAML_pref: the corresponding threshold when prefetching is enabled
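As a rough illustration of how such a threshold might be used (this simplified formula is an assumption for illustration, not the paper's derivation): with W active warps, the compute work of the other warps can hide roughly W times the compute cycles between memory requests.

```python
# Simplified MTAML model (illustrative assumption): the tolerable average
# latency per memory request scales with the number of warps times the
# compute cycles each warp executes between memory requests.
def mtaml(num_warps, compute_cycles_per_mem_request):
    return num_warps * compute_cycles_per_mem_request

avg_latency = 400  # hypothetical measured average memory latency (cycles)
threshold = mtaml(num_warps=24, compute_cycles_per_mem_request=12)
if avg_latency > threshold:
    print("latency not fully hidden -> prefetching may help")
```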

Page 12: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Useful vs. Harmful Prefetching

• Comparison of MTAML with the measured average memory latency (AVG Latency); the chart on the slide marks three regions:

1. AVG Latency < MTAML and AVG Latency(PREF) < MTAML_pref: latency is tolerable with or without prefetching
2. AVG Latency > MTAML: prefetching is beneficial, provided AVG Latency(PREF) is less than MTAML_pref
3. Otherwise, prefetching might turn out to be useful or harmful

• The measured AVG Latency(PREF) ignores successfully prefetched memory operations
• Contention, and hence delay, grows as the number of warps increases

Page 13: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Useful vs. Harmful Prefetching

• Harmful prefetch requests can be caused by:
– Queuing delays
– DRAM row-buffer conflicts
– Off-chip bandwidth wasted due to early eviction
– Off-chip bandwidth wasted due to inaccurate prefetches

Page 14: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Metrics for Adaptive Prefetch Throttling

• Early Eviction Rate

• Merge Ratio

Throttling on these metrics avoids:
• Consumption of system bandwidth by unnecessary prefetches
• Delaying of demand requests
• Occupation of the cache by unnecessary prefetches

Prefetch requests that merge with demand requests may arrive late, but this is compensated by context switching across warps
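The two metrics can drive a simple throttling loop along these lines; the thresholds and the aggressiveness scale are illustrative assumptions, not the paper's configuration.

```python
# Adaptive throttling sketch: a high early-eviction rate means prefetched
# lines are evicted before use (cache pollution), while a very low merge
# ratio means prefetches are neither merging with demand requests nor
# arriving usefully. Either condition throttles aggressiveness down.
def adjust_aggressiveness(level, early_eviction_rate, merge_ratio,
                          evict_hi=0.5, merge_lo=0.1):
    if early_eviction_rate > evict_hi:
        return max(0, level - 1)   # throttle down: pollution detected
    if merge_ratio < merge_lo:
        return max(0, level - 1)   # prefetches not helping demand requests
    return min(4, level + 1)       # metrics look healthy: ramp back up

level = 2
level = adjust_aggressiveness(level, early_eviction_rate=0.7, merge_ratio=0.3)
print(level)  # throttled down one step due to early evictions
```

This matches the slide's framing: aggressiveness is adjusted by feedback rather than prefetching being disabled outright.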

Page 15: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Metrics for Adaptive Prefetch Throttling

• Monitoring of Early Eviction and Merge Ratio

Page 16: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Methodology

• The baseline processor modeled is NVIDIA's 8800GT
• Application traces for the simulator are generated using GPUOcelot, a binary translation framework for PTX

Page 17: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Methodology

Page 18: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Results and Discussion

Page 19: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Results and Discussion

Page 20: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Results and Discussion

Page 21: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Conclusion

• The throttling mechanism proposed in this paper controls the aggressiveness of prefetching rather than curbing it completely

• The metrics considered go beyond accuracy alone: they avoid cache pollution caused by early eviction and exploit memory request merging

• Scalability and robustness were given importance
• The study does not consider complex cache memory hierarchies
• The overhead of prefetching is not clearly substantiated

Page 22: Many-Thread Aware Prefetching Mechanisms for GPGPU Application

Thank You