
Statistical Modeling of Feedback Data in an Automatic Tuning System

Richard Vuduc, James Demmel (U.C. Berkeley, EECS) {richie,demmel}@cs.berkeley.edu

Jeff Bilmes (Univ. of Washington, EE) bilmes@ee.washington.edu

December 10, 2000
Workshop on Feedback-Directed Dynamic Optimization

Context: High Performance Libraries

Libraries can isolate performance issues
– BLAS/LAPACK/ScaLAPACK (linear algebra)
– VSIPL (signal and image processing)
– MPI (distributed parallel communications)

Can we implement libraries …
– automatically and portably
– incorporating machine-dependent features
– matching performance of hand-tuned implementations
– leveraging compiler technology
– using domain-specific knowledge

Generate and Search: An Automatic Tuning Methodology

Given a library routine

Write parameterized code generators
– parameters
  • machine (e.g., registers, cache, pipeline)
  • input (e.g., problem size)
  • problem-specific transformations
– output: high-level source (e.g., C code)

Search parameter spaces
– generate an implementation
– compile using native compiler
– measure performance (“feedback”)
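A minimal Python sketch of this generate-and-search loop follows; the generator script name (gen_matmul.py), benchmark harness (bench.c), compiler flags, and parameter names are illustrative assumptions, not part of any particular tuning system:

import random
import subprocess

def benchmark(params):
    """Generate, build, and time one candidate implementation."""
    # Hypothetical generator script: emits C source for the given parameters.
    subprocess.run(["python", "gen_matmul.py",
                    "--m0", str(params["m0"]), "--k0", str(params["k0"]),
                    "--n0", str(params["n0"]), "-o", "matmul.c"], check=True)
    # Compile with the native compiler (flags are illustrative).
    subprocess.run(["cc", "-O3", "matmul.c", "bench.c", "-o", "bench"], check=True)
    # The harness prints achieved Mflop/s; this measurement is the "feedback".
    out = subprocess.run(["./bench"], capture_output=True, text=True, check=True)
    return float(out.stdout)

def search(space, trials=500):
    """Randomly sample the parameter space and keep the fastest implementation."""
    best, best_perf = None, 0.0
    for _ in range(trials):
        params = {name: random.choice(values) for name, values in space.items()}
        perf = benchmark(params)
        if perf > best_perf:
            best, best_perf = params, perf
    return best, best_perf

# Example space: register-tile sizes for the unrolled core matmul.
space = {"m0": range(1, 9), "k0": range(1, 9), "n0": range(1, 17)}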

Tuning System Examples

Linear algebra
– PHiPAC (Bilmes, et al., 1997)
– ATLAS (Whaley and Dongarra, 1998)
– Sparsity (Im and Yelick, 1999)

Signal Processing
– FFTW (Frigo and Johnson, 1998)
– SPIRAL (Moura, et al., 2000)

Parallel Communications
– Automatically tuned MPI collective operations (Vadhiyar, et al., 2000)

Related: Iterative compilation (Bodin, et al., 1998)

Road Map

Context
The Search Problem
Problem 1: Stopping searches early
Problem 2: High-level run-time selection
Summary

The Search Problem in PHiPAC

PHiPAC (Bilmes, et al., 1997)
– produces dense matrix multiply (matmul) implementations
– generator parameters include
  • size and depth of fully unrolled “core” matmul
  • rectangular, multi-level cache tile sizes
  • 6 flavors of software pipelining
  • scaling constants, transpose options, precisions, etc.

An experiment
– fix software pipelining method
– vary register tile sizes
– 500 to 2500 “reasonable” implementations on 6 platforms
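To give a feel for where the hundreds to thousands of candidates come from, the sketch below enumerates register-tile sizes and keeps only those passing a register-capacity heuristic; the capacity formula and register count are assumptions for illustration, not PHiPAC's exact pruning rule:

def reasonable_register_tiles(max_dim=16, num_registers=32):
    """Enumerate register-tile sizes (m0, k0, n0) whose working set plausibly
    fits in the register file (a heuristic filter, not PHiPAC's exact rule)."""
    tiles = []
    for m0 in range(1, max_dim + 1):
        for k0 in range(1, max_dim + 1):
            for n0 in range(1, max_dim + 1):
                # Accumulators for the C tile plus a column of A and a row of B.
                regs_needed = m0 * n0 + m0 + n0
                if regs_needed <= num_registers:
                    tiles.append((m0, k0, n0))
    return tiles

print(len(reasonable_register_tiles()))  # on the order of 500-2500 candidates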

A Needle in a Haystack, Part I

Road Map

Context
The Search Problem
Problem 1: Stopping searches early
Problem 2: High-level run-time selection
Summary

Problem 1: Stopping Searches Early

Assume
– dedicated resources limited
– near-optimal implementation okay

Recall the search procedure
– generate implementations at random
– measure performance

Can we stop the search early?
– how early is “early”?
– guarantees on quality?

An Early Stopping Criterion

Performance scaled from 0 (worst) to 1 (best)

Goal: Stop after t implementations when

  Prob[ M_t <= 1 - ε ] < α

– M_t: maximum observed performance after t implementations
– ε: proximity to best
– α: error tolerance
– example: “find within top 5% with error 10%”
  • ε = 0.05, α = 0.1

Can show probability depends only on F(x) = Prob[ performance <= x ]

Idea: Estimate F(x) using observed samples
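A minimal sketch of this criterion in Python, assuming each measurement is already scaled to [0, 1] (e.g., as a fraction of machine peak, which is itself an assumption) and replacing F with the empirical distribution of the samples observed so far:

def should_stop(scaled_perfs, eps=0.05, alpha=0.10):
    """Stop when the estimated Prob[ M_t <= 1 - eps ] falls below alpha.

    For t independent samples, Prob[ M_t <= x ] = F(x)**t; here F is
    approximated by the empirical CDF of the observations so far."""
    t = len(scaled_perfs)
    if t == 0:
        return False
    f_hat = sum(p <= 1.0 - eps for p in scaled_perfs) / t   # empirical F(1 - eps)
    return f_hat ** t < alpha

# Usage in a random search (benchmark(), random_params(), and peak are assumed):
# perfs = []
# while not should_stop(perfs):
#     perfs.append(benchmark(random_params()) / peak)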

Stopping time (300 MHz Pentium-II)

Stopping Time (Cray T3E Node)

Road Map

Context
The Search Problem
Problem 1: Stopping searches early
Problem 2: High-level run-time selection
Summary

Problem 2: Run-Time Selection

Assume
– one implementation is not best for all inputs
– a few, good implementations known
– can benchmark

How do we choose the “best” implementation at run-time?

Example: matrix multiply, tuned for small (L1), medium (L2), and large workloads

[Figure: blocked matrix multiply C = C + A*B, with C of size M x N, A of size M x K, and B of size K x N]

Truth Map (Sun Ultra-I/170)

A Formal Framework

Given
– m implementations: A = {a_1, a_2, ..., a_m}
– n sample inputs (training set): S = {s_1, s_2, ..., s_n}
– execution time T(a, s), for a ∈ A, s ∈ S

Find
– decision function f : S → A
– returns “best” implementation on input s
– f(s) cheap to evaluate
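To make the framework concrete, the sketch below scores a candidate decision function f on the sample inputs; the two metrics used here (misclassification rate and average slowdown over the true best) are assumed definitions chosen to mirror the comparison reported later, not necessarily the exact definitions from the original work:

import numpy as np

def evaluate_decision_function(f, T, inputs):
    """T[i, a] = measured time of implementation a on inputs[i];
    f maps an input to an implementation index."""
    chosen = np.array([f(s) for s in inputs])
    best = T.argmin(axis=1)                          # true best implementation per input
    idx = np.arange(len(inputs))
    t_chosen, t_best = T[idx, chosen], T[idx, best]
    misclass = np.mean(chosen != best)               # fraction of wrong picks
    avg_slowdown = np.mean(t_chosen / t_best - 1.0)  # average extra time vs. the best
    return misclass, avg_slowdown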

Solution Techniques (Overview)

Method 1: Cost Minimization
– minimize overall execution time on samples (boundary modeling)
  • pro: intuitive, f(s) cheap
  • con: ad hoc, geometric assumptions

Method 2: Regression (Brewer, 1995)
– model run-time of each implementation, e.g., T_a(N) = b3*N^3 + b2*N^2 + b1*N + b0
  • pro: simple, standard
  • con: user must define model

Method 3: Support Vector Machines
– statistical classification
  • pro: solid theory, many successful applications
  • con: heavy training and prediction machinery

Results 1: Cost Minimization

Results 2: Regression

Results 3: Classification

Quantitative Comparison

Method       Misclass.   Average error   Best 5%   Worst 20%   Worst 50%
Regression   34.5%       2.6%            90.7%     1.2%        0.4%
Cost-Min     31.6%       2.2%            94.5%     2.8%        1.2%
SVM          12.0%       1.5%            99.0%     0.4%        ~0.0%

Note:
Cost of regression and cost-min prediction ~ O(3x3 matmul)
Cost of SVM prediction ~ O(32x32 matmul)

Road Map

Context
The Search Problem
Problem 1: Stopping searches early
Problem 2: High-level run-time selection
Summary

Conclusions

Search beneficial

Early stopping
– simple (random + a little bit)
– informative criteria

High-level run-time selection
– formal framework
– error metrics

To do
– other stopping models (cost-based)
– large design space for run-time selection

Extra Slides

More detail (time and/or questions permitting)

PHiPAC Performance (Pentium-II)

PHiPAC Performance (Ultra-I/170)

PHiPAC Performance (IBM RS/6000)

PHiPAC Performance (MIPS R10K)

Needle in a Haystack, Part II

Performance Distribution (IBM RS/6000)

Performance Distribution (Pentium II)

Performance Distribution (Cray T3E Node)

Performance Distribution (Sun Ultra-I)

Cost Minimization

Minimize overall execution time on samples:

  C(θ_1, ..., θ_m) = Σ_{a ∈ A} Σ_{s ∈ S} w_a(s) · T(a, s)

Softmax weight (boundary) functions:

  w_a(s) = exp(θ_a^T s + θ_{a,0}) / Z,   where Z normalizes the weights to sum to 1 over a ∈ A

Decision function:

  f(s) = argmax_{a ∈ A} w_a(s)
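A rough sketch of this method, assuming inputs are encoded as feature vectors with a trailing 1 to carry the bias term θ_{a,0}, and using plain gradient descent on C; the optimizer and feature encoding are assumptions, not necessarily what the original system used:

import numpy as np

def train_cost_min(X, T, steps=2000, lr=0.1):
    """X: (n, d) input features, last column all ones for the bias term.
    T: (n, m) measured times; returns softmax parameters theta of shape (m, d)."""
    n, d = X.shape
    m = T.shape[1]
    theta = np.zeros((m, d))
    for _ in range(steps):
        z = X @ theta.T                           # (n, m): theta_a^T s per sample
        z -= z.max(axis=1, keepdims=True)         # shift for numerical stability
        w = np.exp(z)
        w /= w.sum(axis=1, keepdims=True)         # softmax weights w_a(s)
        c_s = (w * T).sum(axis=1, keepdims=True)  # weighted execution time per sample
        grad_z = w * (T - c_s)                    # dC/dz for the softmax-weighted cost
        theta -= lr * (grad_z.T @ X) / n          # gradient step on the total cost C
    return theta

def f_cost_min(theta, x):
    """Decision function f(s) = argmax_a w_a(s); softmax is monotone in theta_a^T s."""
    return int(np.argmax(theta @ x))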

Regression

Decision function:

  f(s) = argmin_{a ∈ A} T_a(s)

Model implementation running time, e.g., for square matmul of dimension N:

  T_a(N) = b3*N^3 + b2*N^2 + b1*N + b0

For general matmul with operand sizes (M, K, N), we generalize the above to include all product terms:
– MKN, MK, KN, MN, M, K, N
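A sketch of the regression method for matmul inputs (M, K, N), fitting one model per implementation over the product terms above; the use of ordinary least squares is an assumption (the slide specifies only the model form):

import numpy as np

def features(M, K, N):
    """All product terms of the operand sizes, plus a constant term."""
    return np.array([M * K * N, M * K, K * N, M * N, M, K, N, 1.0])

def fit_runtime_models(sizes, T):
    """sizes: list of (M, K, N); T[i, a] = time of implementation a on sizes[i].
    Returns one coefficient vector per implementation, fit by least squares."""
    X = np.array([features(*s) for s in sizes])
    coefs, *_ = np.linalg.lstsq(X, T, rcond=None)   # shape (8, m): one column per model
    return coefs

def f_regression(coefs, M, K, N):
    """Decision function: pick the implementation with the smallest predicted time."""
    return int(np.argmin(features(M, K, N) @ coefs))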

Support Vector Machines

Binary classifier:

  L(s) = Σ_i y_i α_i K(s_i, s) + b,   with s_i ∈ S, y_i ∈ {-1, 1}, α_i ≥ 0

Decision function:

  f(s) = argmax_{a ∈ A} L_a(s)
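A sketch of evaluating this rule with a Gaussian (RBF) kernel, assuming the support vectors, labels y_i, coefficients α_i, and bias b of each implementation's one-vs-rest classifier come from an off-the-shelf SVM trainer (training itself is not shown, and the kernel choice is an assumption):

import numpy as np

def rbf_kernel(s_i, s, gamma=0.1):
    """Gaussian kernel K(s_i, s)."""
    return np.exp(-gamma * np.sum((s_i - s) ** 2))

def svm_score(s, support_vecs, y, alpha, b):
    """Binary classifier L(s) = sum_i y_i * alpha_i * K(s_i, s) + b."""
    return sum(y_i * a_i * rbf_kernel(s_i, s)
               for s_i, y_i, a_i in zip(support_vecs, y, alpha)) + b

def f_svm(s, classifiers):
    """Decision function: one trained classifier per implementation;
    pick the implementation whose classifier gives the largest score."""
    scores = [svm_score(s, c["sv"], c["y"], c["alpha"], c["b"]) for c in classifiers]
    return int(np.argmax(scores))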

Proximity to Best (300 MHz Pentium-II)

Proximity to Best (Cray T3E Node)
