

Using Machine Learning to Simplify the Identification of Code Optimization
Rohith Kumar Uppala; John Dennis; Youngsung Kim

National Center for Atmospheric Research, Boulder, CO
Bridgewater State University, Bridgewater, MA

OVERVIEW

DATASET AND FEATURES

System Architecture

RESULTS

FUTURE WORK

Chart 1. Hyper Parameter Tuning with Grid Search CV

Chart 2. Confusion Matrix; 0: Non-Vectorized, 1: Vectorized

ABSTRACT

CONTACT

Figure 1. Project Motivation

Rohith Kumar Uppala
NCAR
Email: [email protected]

Many of the scientific applications that execute on large-scale parallel computing platforms run in a sub-optimal fashion. Frequently, modest changes or optimizations to the internal calculations of an application can significantly reduce the time-to-solution and improve code quality. While it is frequently easy to make these code modifications, it is non-trivial to know exactly which modification should be made. This optimization process requires a great deal of analysis and human effort. Using simulation or manual techniques to measure performance and identify which parts of the source code need to be optimized is often too slow. Our idea is to reduce the human effort and the time needed to identify code optimizations by using machine learning techniques to determine which code changes should be applied to particular sections of code, based on a detailed performance analysis of the application.

• Since the problem can be cast as either a classification or a regression task, we chose a Random Forest Classifier, because it is:

• Less prone to overfitting
• Efficient on large datasets
• Able to de-correlate the features

• To get the best estimator, we need to tune hyperparameters such as the number of trees, the maximum number of features, and the maximum tree depth.

• Hyperparameter tuning can be done by two methods: Grid Search CV and Randomized Search CV.
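The tuning step above can be sketched with scikit-learn's GridSearchCV. This is a minimal illustration: the parameter grid values and the synthetic data are assumptions, not the grid or dataset used in this project.

```python
# Sketch of Random Forest hyperparameter tuning with Grid Search CV.
# The parameter grid below is illustrative, not the project's actual grid.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the hardware-counter dataset (90 features).
X, y = make_classification(n_samples=200, n_features=90, random_state=0)

param_grid = {
    "n_estimators": [50, 100],       # number of trees
    "max_features": ["sqrt", None],  # features considered per split
    "max_depth": [5, None],          # maximum tree depth
}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Randomized Search CV follows the same pattern via `RandomizedSearchCV`, sampling a fixed number of parameter combinations instead of exhaustively trying all of them.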

The dataset for this project is created by building and running the source code, then using the Extrae tool (which uses interposition and sampling techniques) to generate the raw data. With the Folding tool we obtain the final dataset, which contains the hardware counters and the event value of each counter. There are 90 features in total. Each sample is labelled either 1 or -1; if vectorization is required, the label is -1. Data analysis and manipulation are done using Pandas.
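A minimal sketch of the labelling convention in Pandas. The column names and counter values below are invented for illustration; they are not the actual counters produced by the Extrae and Folding tools.

```python
import pandas as pd

# Toy stand-in for the folded hardware-counter dataset; column names
# are hypothetical, not the real counter names from the tools.
df = pd.DataFrame({
    "line_id": [101, 102, 103],
    "counter_total_ins": [5.2e6, 1.1e7, 3.3e6],
    "counter_vec_ops":   [4.0e5, 1.0e2, 6.0e5],
})

# Labels follow the poster's convention: -1 = vectorization required.
needs_vectorization = [True, False, True]  # would come from expert labelling
df["label"] = [-1 if v else 1 for v in needs_vectorization]
print(df[["line_id", "label"]])
```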

Introduction: In this project, we use machine learning techniques to suggest the line number and file name where we may gain performance by vectorizing the code. We want the machine to mimic what a human expert does, by training a Random Forest on hardware-counter data.

Models: As this is a supervised classification task, we employed a Random Forest model.

Results: We are able to suggest the line numbers and file names where performance can be improved, by using machine learning.

• After predicting the labels with the Random Forest, we use the raw data to obtain details about the inefficient and expensive code, such as the line number and file name.

• The ordering of the results is determined by the Random Forest predictions together with the sample frequencies per line ID.

• Some of the results are:
  clubb_intr.F90: 2801
  lapack_wrap.F90: 265
  saturation.F90
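The ranking step described above can be sketched as follows. The file names, line IDs, and sample counts here are illustrative stand-ins, not actual measured results from the project.

```python
from collections import Counter

# (file, line) of each sample the model flagged as needing vectorization;
# the entries and counts below are invented for illustration.
flagged_samples = [
    ("clubb_intr.F90", 2801), ("clubb_intr.F90", 2801),
    ("clubb_intr.F90", 2801), ("lapack_wrap.F90", 265),
    ("lapack_wrap.F90", 265), ("saturation.F90", 112),
]

# Rank suggestions by per-line sample frequency (hottest lines first).
ranking = Counter(flagged_samples).most_common()
for (fname, line), count in ranking:
    print(f"{fname}:{line}  samples={count}")
```

Sorting by sample frequency puts the most heavily sampled (and therefore most expensive) flagged lines at the top of the suggestion list.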

Figure 2. Project Steps and Tools Used. (Pipeline: hardware counter data → dimensionality reduction and feature extraction → Random Forest model, with model selection and tuning → suggestions (line number, file name) → detailed analysis report or suggestion.)

(Confusion-matrix axes: actual labels vs. predicted labels.)

[1] Gérard Biau, "Analysis of a Random Forest Model", Journal of Machine Learning Research.
[2] Extrae tool: https://tools.bsc.es/extrae
[3] Folding analysis tool: https://tools.bsc.es/folding

References

• Currently we suggest only the vectorization method; we want to add other optimization techniques.

• Model ensembling: combining classifiers by voting or averaging to improve performance.

• We have a relatively small dataset to work with. More data, covering more optimization techniques, would give us a better model.