Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Using Machine Learning to Simplify theIdentification of Code OptimizationRohith Kumar Uppala; John Dennis; Youngsung Kim
National Center for Atmospheric Research, Boulder, COBridgewater State University, Bridgewater, MA
OVERVIEW
DATASET AND FEATURES
System Architecture
RESULTS
FUTURE WORKChart 1. Hyper Parameter Tuning with Grid Search CV
Chart 2 : Confusion Matrix ; 0 : Non Vectorized, 1: Vectorized
ABSTRACT
CONTACT
Figure 1. Project Motivation
Rohith Kumar UppalaNCAREmail: [email protected]
Many of the scientific applications that execute on large scale parallel computing platforms run in a sub-optimal fashion. Frequently, modest changes or optimizations to the internal calculation of the applications can significantly reduce the time-to-solution and improves the code quality. While it is frequently easy to make these code modifications, it is non-trivial to know exactly which modification should be made. This optimization process requires lot of analysis and human efforts. Using simulation or manual techniques to measure the performance and identify the part of code need to be optimized of the whole source code is often too slow. Our idea is to decrease the human efforts and reduce the time for identification of code optimization by utilizing machine learning techniques to identify which code changes should be applied to certain sections of code based on a detailed performance analysis of an application.
• As problem is both classification and regression task, we chose Random Forest Classifier, because it is
• Less prone to overfitting• Efficient on large datasets• De-correlates the features
• To get the best estimator we need to tune the hyper parameters such as Number of trees, Max Features and Max depth etc.
• Hyper parameter tuning can be done by two methods Grid Search CV and Randomized Search CV
Dataset for this project is created by building and running the source code then using Extrea tool (which uses interposition and Sampling techniques), we can generated Raw Data. By folding tool we are able to get final dataset which contains Hardware Counters, Event value of respective hardware counters. There are totally 90 features present in the dataset. Labelling is done as either 1 or -1. If Vectorization requires then it is -1. Data analysis and manipulations are done using Pandas
Introduction: In this project, by using machine learning techniques to suggest the line number and file name where we may get performance gain by vectorizing the code. We want to mimic what human does by machine through training Random Forest by using hardware counters data.
Models: As this task is supervised classification task, we employed Random Forest model.Results: We are able to suggest lines numbers and file name of the file where we can improve performance by using Machine Learning.
• After predicting the labels from RF, by using raw data we are able get information about the inefficient and expensive code details like Number and File Name.
• Order of results are influenced by predictions from RF with the sample frequencies per line ID.
• Some of the results areclubb_intr.F90 2801:
lapack_wrap.F90 265:
saturation.F90
Data
Dimensionality Reduction
Feature Extraction
Random Forest Model
Hardware Counter Data
Suggestions• Line Number• File Name
Detailed Analysis Report or
Suggestion
Machine Learning Model
Model Selection and Tuning Figure 2. Project Steps and Tools used
ACTU
AL L
ABEL
S
PREDICTED LABELS
[1] Gerard Biau, “Analysis of a random Forest Model”, Journal of Machine Learning Research[2] Extrea Tool : https://tools.bsc.es/extrae[3] Folding analysis tools: https://tools.bsc.es/folding
References
• Currently we are suggesting using only vectorization method., we want to add other optimization techniques.
• Model ensembling : combining classifier by voting or averaging to improve performance.
• We have relatively small dataset to work on. More data covering more optimization techniques give us a better model.