AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS
Chirag Dave and Rudolf Eigenmann, Purdue University


Page 1: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

Chirag Dave and Rudolf Eigenmann, Purdue University

Page 2: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

GOALS

• Automatic parallelization without loss of performance
  – Use automatic detection of parallelism
  – Automatic parallelization can be overzealous
  – Remove overhead-inducing parallelism
  – Ensure no performance loss over the original program

• Generic tuning framework
  – Empirical approach
  – Use program execution to measure benefits
  – Offline tuning

Page 3: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

AUTO vs. MANUAL PARALLELIZATION

[Diagram: a source program becomes a parallel program either by hand parallelization or through a parallelizing compiler.]

• Hand parallelization: significant development time; the user tunes the program for performance
• Automatic parallelization: state-of-the-art parallelizing compilers finish in the order of minutes

Page 4: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

AUTO-PARALLELISM OVERHEAD

int foo()
{
    #pragma omp private(i, j, t)
    for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma omp parallel for private(j, t)
        for (j = 0; j < 10; j++) {
            t = a[i-1];
            b[j] = (t*b[j])/2.0;
        }
    }
}

[Slide annotations around the code: the fork at entry to and the join at exit from the parallel inner loop (fork/join overheads), load balancing, the amount of work in the parallel section, and loop-level parallelism.]
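
The annotations above highlight the cost of entering and leaving the parallel region on every outer iteration. As a self-contained illustration (not taken from the slides), the snippet below times the nest with omp_get_wtime(); comparing a run against the same nest with the pragma removed makes the fork/join overhead visible, since each inner loop does only ten iterations of work. The array sizes, the i = 1 start, and the printed message are adjustments made here to keep the example well defined.

    #include <stdio.h>
    #include <omp.h>

    #define N 10
    double a[N + 1], b[N];

    int main(void)
    {
        double c = 1.0, t;
        int i, j;

        double start = omp_get_wtime();
        for (i = 1; i <= N; i++) {               /* outer loop stays serial */
            a[i] = c;
            #pragma omp parallel for private(t)  /* one fork and one join per outer iteration */
            for (j = 0; j < N; j++) {
                t = a[i - 1];
                b[j] = (t * b[j]) / 2.0;
            }
        }
        printf("elapsed: %f s\n", omp_get_wtime() - start);
        return 0;
    }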

Page 5: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

NEED FOR AUTOMATIC TUNING

• Identify, at compile time, the optimization strategy for maximum performance

• Beneficial parallelism
  – Which loops to parallelize
  – Parallel loop coverage

Page 6: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

OUR APPROACH

• Best combination of loops to parallelize
• Offline tuning
• Decisions based on actual execution time

Page 7: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

CETUS: VERSION GENERATION

Page 8: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

SEARCH SPACE NAVIGATION

• Search space -> the set of parallelizable loops

• Generic tuning algorithm
  – Capture interactions
  – Use program execution time as the decision metric

• Combined Elimination
  – Each loop is an on/off optimization
  – Selective parallelization

• Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: The 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006, pp. 319–330

Page 9: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

TUNING ALGORITHM

• Batch Elimination
  – Considers the effect of each optimization separately
  – Instant elimination
• Iterative Elimination
  – Considers interactions; each elimination establishes a new base case
  – More tuning time
• Combined Elimination (see the sketch after this list)
  – Considers interactions amongst a subset
  – Iterates over the smaller subset and performs batch elimination
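
To make the Combined Elimination step concrete, here is a minimal sketch in C, specialized to per-loop on/off decisions. It is not the authors' implementation: measure_time() is a hypothetical stand-in for the deck's version generation, back-end compilation, and timed run on the train data set, and the simple index-order re-check is a simplification of the ordering used in the cited CGO'06 paper.

    /* Hypothetical driver: builds the program version described by config[]
       (1 = keep the loop parallel, 0 = serialize it), runs it on the train
       data set, and returns the measured execution time. */
    extern double measure_time(const int config[], int n);

    /* RIP of tentatively serializing loop i, relative to the current base time.
       A negative value means the program ran faster without that parallel loop. */
    static double rip_when_off(int config[], int n, int i, double t_base)
    {
        config[i] = 0;
        double t = measure_time(config, n);
        config[i] = 1;
        return 100.0 * (t - t_base) / t_base;
    }

    void combined_elimination(int config[], int n)
    {
        double rip[n];                                /* C99 variable-length array */
        for (int i = 0; i < n; i++) config[i] = 1;    /* base case: all loops parallel */
        double t_base = measure_time(config, n);

        for (;;) {
            /* Measure each still-parallel loop's RIP when switched off. */
            int worst = -1;
            for (int i = 0; i < n; i++) {
                rip[i] = 0.0;
                if (!config[i]) continue;
                rip[i] = rip_when_off(config, n, i, t_base);
                if (rip[i] < 0.0 && (worst < 0 || rip[i] < rip[worst])) worst = i;
            }
            if (worst < 0) return;                    /* every remaining parallel loop pays off */

            /* Permanently serialize the most harmful loop; this becomes the new base case. */
            config[worst] = 0;
            t_base = measure_time(config, n);

            /* Re-check the other loops that looked harmful against the new base case. */
            for (int i = 0; i < n; i++) {
                if (i == worst || !config[i] || rip[i] >= 0.0) continue;
                if (rip_when_off(config, n, i, t_base) < 0.0) {
                    config[i] = 0;
                    t_base = measure_time(config, n);
                }
            }
        }
    }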

Page 10: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

CETUNE INTERFACE

int foo()
{
    #pragma cetus parallel…
    for (i = 0; i < 50; i++) {
        t = a[i];
        a[i+50] = t + (a[i+50] + b[i])/2.0;
    }

    for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma cetus parallel…
        for (j = 0; j < 10; j++) {
            t = a[i-1];
            b[j] = (t*b[j])/2.0;
        }
    }
}

cetus -ompGen -tune-ompGen="1,1"        (parallelize both loops)

cetus -ompGen -tune-ompGen="1,0"
cetus -ompGen -tune-ompGen="0,1"        (parallelize one loop, serialize the other)

cetus -ompGen -tune-ompGen="0,0"        (serialize both loops)
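
For illustration, this is roughly the version the tuner requests with "1,0": the first candidate loop keeps a work-sharing pragma while the second is emitted serially. The exact pragma text Cetus generates (shown here as an OpenMP parallel for with t privatized) is an assumption; the slide only states which loops are parallelized or serialized.

    int foo()
    {
        /* bit 1: first candidate loop kept parallel */
        #pragma omp parallel for private(t)
        for (i = 0; i < 50; i++) {
            t = a[i];
            a[i+50] = t + (a[i+50] + b[i])/2.0;
        }

        /* bit 0: second candidate loop serialized, no OpenMP pragma emitted */
        for (i = 0; i < 10; i++) {
            a[i] = c;
            for (j = 0; j < 10; j++) {
                t = a[i-1];
                b[j] = (t*b[j])/2.0;
            }
        }
    }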

Page 11: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

EMPIRICAL MEASUREMENT

[Tuning workflow shown in the slide's diagram:]

1. Input source code
2. Automatic parallelization using Cetus
3. Start configuration for the tuner
4. Version generation using the tuner input
5. Back-end code generation with ICC
6. Runtime performance measurement on the train data set (Intel Xeon dual quad-core machine)
7. Decision based on RIP (relative improvement percentage)
8. Next point in the search space; iterate until the final configuration is reached
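
A rough sketch of one measurement step of this loop, in C: it regenerates a version with the cetus command line from the previous slide, compiles it with ICC, times a run on the train input, and computes the RIP against the baseline time. The tool names and the RIP formula follow the slides and the cited CGO'06 paper; the file names, compiler flags, and timing mechanism are assumptions made for illustration.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Builds and times one program version; "bits" is the -tune-ompGen string. */
    static double time_version(const char *bits)
    {
        char cmd[512];
        snprintf(cmd, sizeof cmd,
                 "cetus -ompGen -tune-ompGen=\"%s\" foo.c && "
                 "icc -openmp cetus_output/foo.c -o foo_tuned", bits);
        if (system(cmd) != 0) return -1.0;            /* version generation + back end */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (system("./foo_tuned < train.in > /dev/null") != 0) return -1.0;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    }

    int main(void)
    {
        double t_base = time_version("1,1");          /* start configuration: all loops parallel */
        double t_new  = time_version("1,0");          /* one candidate point in the search space */
        /* RIP used for the tuning decision: negative means the new version is faster. */
        printf("RIP = %.2f%%\n", 100.0 * (t_new - t_base) / t_base);
        return 0;
    }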

Page 12: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

RESULTS

Page 13: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

RESULTS

Page 14: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

RESULTS

Page 15: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

CONTRIBUTIONS

• Described a compiler + empirical system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance

• Finding profitable parallelism can be done using a generic tuning method

• The method can be applied on a section-by-section basis, thus allowing fine-grained tuning of program sections

• Using a set of NAS and OMP 2001 benchmarks, we show that the auto-parallelized and tuned version comes close to or exceeds the performance of the original serial or parallel program

Page 16: AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS

THANK YOU!