Software Pipelining on Multi-Core Architectures
Alban Douillet, Guang R. Gao
CISC 879 : Software Support for Multicore Architectures
Tom St. John
Dept of Electrical and Computer Engineering
University of Delaware
Outline
• Introduction
• Single-Dimension Software Pipelining
• Multi-Threaded Software Pipelining
• Experiments
Introduction
• Software pipelining is among the most successful optimizations
• Can it be applied to multi-core chips?
• What extensions are required?
Single-Dimension SWP
• Does not simply pipeline innermost loop
- Pipelines most profitable loop level
• Loop levels enclosing selected loop left to global scheduler
- Selected loop seen as outermost loop
- Inner loops executed sequentially
• Able to take advantage of ILP/data locality properties present in other loops
SSP Example
Multi-Threaded SWP
• Several Obstacles Exist
• Dependences/resource constraints must be respected
• Operation cannot be scheduled before all dependences are satisfied
• Memory dependences may exist between thread units
- Synchronization is required
Multi-Threaded Final Schedule
• Schedule each group of Sn iterations on a thread unit using round-robin approach
- Workload balance is fair
• Sn is max number of iterations that can be executed in parallel without resource conflict
• Thread units may share same instruction cache
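The round-robin assignment described above can be sketched as follows. This is only an illustration, not the authors' implementation: the function name `round_robin_groups` and its parameters (`n_iters`, `sn`, `n_threads`) are hypothetical, and `sn` is taken as a given input rather than derived from resource constraints.

```python
def round_robin_groups(n_iters, sn, n_threads):
    """Assign groups of `sn` consecutive outermost iterations to thread units.

    Hypothetical helper: group g (iterations g*sn .. g*sn + sn - 1)
    goes to thread unit g % n_threads.
    """
    schedule = {tu: [] for tu in range(n_threads)}
    for g, start in enumerate(range(0, n_iters, sn)):
        group = list(range(start, min(start + sn, n_iters)))
        schedule[g % n_threads].append(group)
    return schedule


if __name__ == "__main__":
    # 10 iterations, groups of 2, 3 thread units
    for tu, groups in round_robin_groups(10, 2, 3).items():
        print(f"thread unit {tu}: {groups}")
```

Because consecutive groups land on consecutive thread units, no unit receives more than one group beyond any other, which is one way to read the "fair" workload balance claim.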
Final Schedule Example
Data Dependencies
• Data dependencies may exist between outermost iterations
• Synchronization points are chosen to minimize code duplication during code generation
- WAIT is placed before each repeating pattern
- SIGNAL is placed after each pattern
• Synchronization delay guarantees the correctness of the schedule
Synchronization Delay Example
Synchronization
• Each thread has two counters
- Synchronization counter counts number of synchronization signals received
- Clock counter incremented after each WAIT
• When a thread reaches a WAIT, execution continues only if the synchronization counter is greater than or equal to the clock counter
• WAIT implemented with an active loop
• SIGNAL is a non-blocking atomic add-in-memory instruction
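A minimal sketch of the two-counter scheme, under stated assumptions: the class `ThreadUnit` and the function `run_demo` are hypothetical names, a Python lock stands in for the hardware atomic add-in-memory instruction, and the clock counter is assumed to start at 1 so that the k-th WAIT requires k received signals (the slides do not give initial values).

```python
import threading
import time


class ThreadUnit:
    """Hypothetical model of one thread unit's synchronization state."""

    def __init__(self):
        self.sync_counter = 0    # number of SIGNALs received so far
        self.clock_counter = 1   # assumption: the k-th WAIT needs k signals
        self._lock = threading.Lock()  # stands in for the atomic add

    def signal(self):
        # Non-blocking for the sender: bump the receiver's counter and return.
        with self._lock:
            self.sync_counter += 1

    def wait(self):
        # Active loop: spin until enough signals have arrived.
        while self.sync_counter < self.clock_counter:
            time.sleep(0)  # yield so the signalling thread can run
        self.clock_counter += 1  # incremented after each WAIT


def run_demo(n_steps=5):
    consumer = ThreadUnit()
    shared = [None] * n_steps
    seen = []

    def produce():
        for i in range(n_steps):
            shared[i] = i * i   # memory write the consumer depends on
            consumer.signal()   # SIGNAL placed after the write

    def consume():
        for i in range(n_steps):
            consumer.wait()     # WAIT placed before the dependent read
            seen.append(shared[i])

    threads = [threading.Thread(target=consume),
               threading.Thread(target=produce)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    print(run_demo())
```

The consumer never reads a slot before it has been written, because each read is guarded by a WAIT that only completes once the matching SIGNAL has been counted.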
Innermost Loop Tiling
• Allows for coarser-grain synchronization
• Execution of Nn - 1 instances of the innermost loop pattern is tiled into tiles of G iterations
• WAIT and SIGNAL are issued at the entrance and exit of each tile
• Gmin, value of G that minimizes final schedule length, can be approximated at compile time
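The placement of WAIT and SIGNAL around tiles can be illustrated with a small sketch. The function `tiled_events` and its arguments are hypothetical; it only lists the order of events for one thread and does not attempt to compute Gmin, for which the slides give no formula.

```python
def tiled_events(n_instances, g):
    """List the event order when `n_instances` instances of the innermost
    loop pattern are split into tiles of at most `g` instances.

    Hypothetical helper: WAIT is issued at each tile entrance and
    SIGNAL at each tile exit.
    """
    events = []
    for start in range(0, n_instances, g):
        end = min(start + g, n_instances)
        events.append("WAIT")
        events.extend(f"pattern[{i}]" for i in range(start, end))
        events.append("SIGNAL")
    return events


if __name__ == "__main__":
    print(tiled_events(5, 1))  # finest grain: one WAIT/SIGNAL per instance
    print(tiled_events(5, 2))  # coarser grain: fewer synchronizations
```

A larger G reduces the number of WAIT/SIGNAL pairs but makes the next thread unit wait longer for each signal, which is the trade-off behind choosing Gmin.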
Cross-Iteration Register Dependences
• Assume thread units do not share registers
• Insert copy operations to copy value from one thread unit to next
• Register dependence transformed into memory dependence
• Issue memory spill instruction to copy from register to scratch-pad memory of destination thread
• Value restored using local memory load
Cross-Iteration Register Dependences
• Memory spill instructions only need to be issued by the last iteration of an iteration group
• Memory restore instructions only need to be issued by the first iteration
• If distance of dependence is greater than 1, cascading copies and memory spills/restores will bring value to target iteration
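The spill/restore idea for a distance-1 dependence can be sketched with a sequential simulation. Everything here is an assumption made for illustration: the function `run_with_spills`, the single scratch-pad slot per thread unit, the running-sum loop body, and the pre-stored initial value for the first group. Groups are simulated one after another rather than on real threads.

```python
def run_with_spills(n_iters, sn, n_threads, init=0):
    """Simulate a value carried from each iteration to the next.

    Within a group of `sn` iterations the value stays in a register.
    At a group boundary the last iteration spills it into the scratch-pad
    of the destination thread unit, and the first iteration of the next
    group restores it with a local load.
    """
    scratch = [None] * n_threads   # one hypothetical slot per thread unit
    scratch[0] = init              # assumption: live-in value pre-stored
    results, spills = [], 0

    for g, start in enumerate(range(0, n_iters, sn)):
        tu = g % n_threads                 # round-robin thread unit
        reg = scratch[tu]                  # restore: first iteration only
        for i in range(start, min(start + sn, n_iters)):
            reg = reg + i                  # loop body: x_i = x_(i-1) + i
            results.append(reg)
        scratch[(tu + 1) % n_threads] = reg  # spill: last iteration only
        spills += 1

    return results, spills


if __name__ == "__main__":
    values, spills = run_with_spills(7, 3, 2)
    print(values, spills)
```

Only one spill and one restore occur per group rather than per iteration, which matches the observation that the copies are needed only at group boundaries.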
Cross-Iteration Dependence Example
Correctness & Properties
• The multi-core final schedule represented by the schedule function is correct
• The multi-threaded final schedule is deadlock-free
• The synchronization signal guarantees that the memory accesses preceding it on the same thread unit have been committed
Experimental Framework
• The MTS method has been implemented on the Open64 compiler retargeted for the IBM Cyclops64 architecture
• Loop nests from the Livermore Suite, SPEC2000 and NAS were used
Execution Time Speedup
• MTS schedules showed very good scalability, with relative speedup between 57.5 and 81 for 99 threads
Loop Tiling Factor
• Timing results using tiling factor Gmin match results using best empirical tiling factor