
2011 IEEE Ninth International Symposium on Parallel and Distributed Processing with Applications (ISPA), Busan, Korea (South), May 26-28, 2011.


A Novel Approach for Finding Optimization Opportunities in Multicore Architectures

Ching-Chi Lin
Institute of Information Science, Academia Sinica
Taipei, Taiwan
[email protected]

Pangfeng Liu
Dept. of CSIE, Graduate Institute of Networking and Multimedia, National Taiwan University
Taipei, Taiwan
[email protected]

Jan-Jan Wu
Institute of Information Science, Academia Sinica
Taipei, Taiwan
[email protected]

Abstract—Compiler techniques for program optimization have been well studied for single-threaded programs. With the advance of multi-core architectures, compiler optimizations for multi-threaded parallel programs have started to draw research attention in recent years. Optimizing multi-threaded parallel programs on multi-core architectures is much more difficult because of the complicated interactions and resource competition between threads. Therefore, identifying the appropriate code segments for optimization becomes one of the most challenging issues. In this work, we propose a novel technique to identify the code segments that exhibit unstable performance behavior, and show that applying appropriate optimizations to such code segments can improve the performance of the parallel program.

Our technique is based on a simple and efficient sampling method that analyzes variations in the performance of basic blocks to classify them as "stable" or "unstable". "Stable" basic blocks have a low average coefficient of variation (CoV), while "unstable" ones have a CoV above a threshold value. The analysis results can be used to determine the "unstable" code segments that may benefit from runtime optimization. Our experimental results on the SPEC OMP2001 benchmark suite demonstrate that the proposed method is effective in finding "unstable" code segments.

Keywords-parallel program behavior analysis, optimization opportunities for parallel programs, multi-core architecture, sampling-based runtime technique.

I. INTRODUCTION

Unlike static optimization, dynamic optimization works during a program's execution. The rationale behind dynamic optimization is that if a sequence of basic blocks, or trace, is "hot", meaning that it is executed frequently, then optimizing that trace should improve execution performance. Thus, dynamic optimization focuses on the real execution behavior of the program, rather than on information that can be obtained from the source code during static compilation. Several dynamic optimizers have been proposed in the literature, e.g., DynamoRIO [1] and ADORE [4].

In recent years, multi-core architectures have become the norm in computing devices. However, multi-core computing raises new issues as well as opportunities for dynamic program optimizers. For example, most multi-core architectures share the cache among cores. Competition for the shared cache among threads may affect their behavior and degrade overall performance. This is not an issue in single-core architectures. The competition for cache in multi-core architectures can cause unstable program behavior. Static optimizers cannot handle such competition because it cannot be assumed that the generated code will only run on multi-core architectures. Hence, the competition for cache can only be handled by a dynamic optimizer at runtime, by detecting unstable program behavior and applying runtime optimization to the appropriate code segments. We denote code segments that exhibit unstable behavior as dynamic optimization opportunities in multi-core architectures.

In this paper, our objective is to identify the parts of a program that are unstable during execution, and the parts that remain stable, on multi-core architectures. By making this distinction, we can apply dynamic optimization techniques to the stable parts, and use different techniques on the unstable parts during runtime. The distinction is important because we can apply appropriate optimizations only after we know the behavior of the program.

To obtain more information about the interference among different threads, we propose a novel method that observes the behavior of threads in the basic-block domain. Specifically, we determine whether the behavior of each basic block is stable or unstable over the whole execution period. We are interested in the blocks that exhibit very unstable behavior, i.e., with a large variance in the metrics, because they might provide good optimization opportunities.

The major contribution of this work is a simple, fast, and effective method for finding optimization opportunities in multi-threaded programs running on multi-core architectures. Both static and dynamic optimizers can use these opportunities to generate more efficient code or perform more aggressive runtime optimizations.

The remainder of the paper is organized as follows. Section II describes related work. In Section III, we explain how

Ninth IEEE International Symposium on Parallel and Distributed Processing with Applications

978-0-7695-4428-1/11 $26.00 © 2011 IEEE

DOI 10.1109/ISPA.2011.30



to choose a representative extended instruction pointer (EIP) for a set of samples. In Section IV, we describe our sampling methodology. Section V presents our experimental results and related analysis. Section VI gives some concluding remarks.

II. RELATED WORK

A substantial amount of research has been conducted on dynamic optimization. Dynamic optimizers, such as DynamoRIO [1] and ADORE [4], use hot traces to improve optimization performance. A hot trace, which is a series of frequently executed basic blocks, is put into a code cache so that it can be accessed quickly the next time it is required. Kistler and Franz [3] developed a continuous optimization framework that searches for stable phases in un-optimized code, as well as phase changes in previously optimized code, before optimizing the code. The above dynamic optimizers identify traces mainly based on the execution frequency of traces; however, performance information is not taken into consideration. The information about unstable program segments provided by our approach can be utilized by dynamic optimizers to improve optimization performance.

Phase detection is an important component of dynamic optimizers. A phase is defined as a set of intervals that execute the same parts of a program; hence, their runtime behavior is similar. The Basic Block Vector (BBV) is widely used as a code signature for an execution interval. The BBV of an interval records the number of times each basic block is executed during that interval. There have been many studies of phase detection and prediction in recent years. Jiang et al. [2], motivated by the strong correlations among different types of program behavior, proposed a regression-based framework that automatically identifies a small set of behavior patterns that can lead to accurate prediction of other types of behavior in a program. Sherwood et al. [7], [8] use the BBV as a code signature to find periodic behavior of program phases. By comparing BBVs between intervals, it is possible to detect stable phases and phase changes.

Our proposed approach differs from existing approaches in two ways. First, most phase detection and dynamic optimization approaches focus on sequential programs and evaluate their techniques with sequential benchmark suites, such as CPU2000 or CPU2006. In contrast, we aim to study the interaction and interference between multiple threads in parallel programs, such as the OMP2001 benchmark programs. Second, in existing approaches, the goal of phase detection is to find simulation points or to predict phases at runtime. Our goal is to find optimization opportunities (i.e., code segments that exhibit unstable behavior) in multi-threaded programs, which requires very different phase analysis techniques.

III. CHOOSING A REPRESENTATIVE FOR AN INTERVAL

Many works in the literature employ the notion of a basic block, i.e., a block of instructions with a single entry and a single exit point, for program behavior analysis. However, the overhead of observing individual basic blocks is extremely large, and thus it may interfere with the program behavior that we want to observe. Instead of considering every basic block, we take samples at a fixed interval. Previous studies have shown that taking a sample every 10^5 to 10^6 instructions, and grouping 100 samples together as an interval, is reasonable. Therefore, we use this sampling rate throughout the paper.

To represent an interval, most works use a Basic Block Vector (BBV), which records the number of times each basic block has been executed during the interval. Usually the vector is normalized by the total number of samples in the interval. In addition, K-means clustering is widely used as a classification method in phase detection. The objective is to classify basic block vectors into clusters so that each cluster contains intervals with similar execution behavior.

Applying K-means clustering to BBVs yields good classification results, but it also raises the following issues. First, it is difficult to pre-determine the number of clusters k. Second, K-means clustering can generate different results because of the initial random seeds. Third, before applying K-means clustering, we need to randomly project the BBVs into a lower dimension to avoid the "curse of dimensionality", i.e., the high overhead incurred when dealing with the very large number of dimensions in the original BBV. As a result, the clustering results are not directly comparable with each other or between different runs. Therefore, we need another metric to represent an interval.
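For concreteness, a normalized BBV of the kind described above can be sketched in a few lines; the block IDs and sample counts below are made up for illustration and are not from the paper's experiments:

```python
from collections import Counter

def normalized_bbv(block_ids):
    """Build a Basic Block Vector from the basic-block IDs sampled in one
    interval, normalized by the number of samples in the interval."""
    counts = Counter(block_ids)
    total = len(block_ids)
    return {block: n / total for block, n in counts.items()}

# A hypothetical interval of 10 samples hitting three basic blocks.
interval = [3, 3, 7, 3, 9, 7, 3, 3, 7, 3]
bbv = normalized_bbv(interval)
# block 3 accounts for 6/10 of the samples, block 7 for 3/10, block 9 for 1/10
```

Two such vectors can then be compared componentwise, which is the basis of the K-means classification discussed above.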

A. Using the Average EIP

Initially, we tried to use the average EIP (extended instruction pointer) to represent an interval. Each interval can be identified and compared with every other interval by taking the average EIP of its samples. This is fast and intuitive compared to the BBV; however, we found that, even though the method is simple and fast, the average EIP cannot be used to represent an interval, as explained in the following example.

For each interval, we collect its sampled BBV and apply the k-means algorithm. Each cluster is assigned a different color, as shown in Figure 1. The color of each point in the figure indicates the cluster that an interval belongs to. Intervals from the same cluster are given the same color.

From Figure 1, we conclude that it is not appropriate to use the average EIP to represent an interval. Ideally, data points with the same color should gather together and have similar EIPs as their representatives. However, as the figure shows, the points with the same color are scattered across the x-axis and form repeated patterns. Repeated patterns occur because, during the program's execution, the instruction pointer may jump to instructions with large EIPs, which can significantly increase the average EIP. The result shows that even if there are only a few samples with a large EIP


Figure 1. EIP-IPC relation after clustering for 314.mgrid_m of SPEC OMP2001

difference, the value of the average EIP in an interval will be affected significantly.

B. Using the Mode

To resolve the repeated-pattern problem in Figure 1, we propose using the mode (i.e., the value that occurs most frequently) of the EIPs sampled in an interval to represent that interval. We use PIN [5] to build a mapping table between EIPs and their corresponding basic blocks. The EIP of each sample in an interval can then be mapped to its basic block through the table. We choose the basic block that occurs most frequently and use the EIP of the first instruction of that basic block to represent the interval.

Using the mode is more appropriate than using the average EIP. As mentioned in the previous section, the average EIP can shift significantly even when there is only one very large EIP among the samples. By using the mode to choose a representative, instructions at large EIPs are treated as "noise" and filtered out, so the mode is not affected by them. Moreover, a representative chosen using the mode is guaranteed to have been executed during the interval.
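The contrast between the two representatives is easy to reproduce. In this sketch (hypothetical EIP values, not measured samples), one outlying sample at a high library address drags the mean far from the hot region, while the mode is unaffected:

```python
from statistics import mean, mode

# Hypothetical sampled EIPs: most fall in a small code region, but one
# sample lands in a library mapped at a much higher address.
samples = [0x4005A0, 0x4005A8, 0x4005A0, 0x4005B0, 0x4005A0, 0x7F35E2C01000]

avg = mean(samples)   # dragged far above the hot region by the single outlier
rep = mode(samples)   # still the most frequently sampled EIP in the hot region

print(hex(int(avg)), hex(rep))
```

Here the mean lands at an address that was never executed at all, whereas the mode picks an EIP that certainly was.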

IV. METHODOLOGY

Our goal is to find optimization opportunities in a program. If a set of basic blocks exhibits unstable behavior at runtime, we consider it an optimization opportunity. To compare the behavior of each basic block, we need to collect information during the program's execution.

A. Sampling and Grouping

We collect information about a program's performance and execution by sampling at runtime with Perfmon [6], a performance monitoring tool. A sample is taken after a fixed, pre-determined number of instructions. For each sample, Perfmon records one EIP, the number of clock cycles used, and some other relevant information. Then, a fixed number of consecutive samples are grouped together to form a data point, or interval. The total number of clock cycles used in an interval is divided by the total number of instructions retired to obtain the cycles per instruction (CPI), which represents the performance of the interval.
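The grouping step can be sketched as follows; the per-sample (cycles, instructions) pairs are invented for illustration and do not reflect Perfmon's actual record layout:

```python
def interval_cpi(samples, interval_size=100):
    """Group consecutive (cycles, instructions_retired) samples into
    fixed-size intervals and compute each interval's CPI."""
    cpis = []
    for start in range(0, len(samples) - interval_size + 1, interval_size):
        window = samples[start:start + interval_size]
        cycles = sum(c for c, _ in window)
        insns = sum(i for _, i in window)
        cpis.append(cycles / insns)
    return cpis

# Two toy intervals of 2 samples each (interval_size=2 for brevity):
# first interval:  300 cycles / 200 instructions -> CPI 1.5
# second interval: 220 cycles / 200 instructions -> CPI 1.1
print(interval_cpi([(100, 100), (200, 100), (120, 100), (100, 100)],
                   interval_size=2))  # -> [1.5, 1.1]
```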

B. Representative EIPs

The basic block that appears most frequently during an interval is selected as the representative block for the interval. In Section III, we explained the reasons for using the mode (the value that occurs most frequently), instead of the sample BBV or the average EIP, to choose the representative block.

Note that an instruction is too small to be an optimization unit; a basic block or a set of such blocks is much more suitable. As a result, for every EIP that Perfmon samples, the basic block that the EIP belongs to is important. We build a lookup table with PIN [5], which contains the EIP range of each basic block. Using the table, we can map each sampled EIP to the correct basic block. The basic block that appears most frequently in an interval is selected, and its first EIP is used as the representative of that interval.
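Assuming the PIN-generated table provides the start address of each basic block, the EIP-to-block mapping can be sketched with a binary search; the addresses below are hypothetical:

```python
import bisect

class BlockTable:
    """Map a sampled EIP to the basic block containing it, given the
    start addresses of all basic blocks (as extracted by a tool like PIN)."""
    def __init__(self, block_starts):
        self.starts = sorted(block_starts)

    def block_of(self, eip):
        # The containing block is the one with the greatest start <= eip.
        idx = bisect.bisect_right(self.starts, eip) - 1
        return self.starts[idx]

table = BlockTable([0x400500, 0x400520, 0x400560])
print(hex(table.block_of(0x400533)))  # inside the block starting at 0x400520
```

A real table would also store each block's end address to reject EIPs that fall in gaps between blocks; the sketch omits that check.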

C. Variation

To identify large performance variances, the CoV of the CPI is used as the metric to determine whether an EIP is unstable. However, this alone may not be sufficient, because some EIPs cover only a few intervals yet have large CPI variances. Optimizing those intervals at runtime may achieve very little performance gain at the expense of increased overhead. The execution coverage of the intervals for a representative EIP should therefore also be considered. EIPs deemed "unstable" should have a large CoV of the CPI and medium coverage during the program's execution. The evaluation function is defined as follows:

    Optimization Opportunity = (CoV of CPI) x Coverage    (1)

The larger the optimization opportunity, the higher the probability that runtime optimization will improve performance. A threshold T_oo is set, and EIPs with optimization opportunity values above T_oo are marked and noted by the optimizer.
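Equation (1) can be read as the following computation over the intervals sharing a representative EIP; the CPI values, coverage, and threshold below are illustrative only:

```python
from statistics import mean, pstdev

def optimization_opportunity(cpis, total_intervals):
    """CoV of the CPI values of one representative EIP's intervals,
    weighted by that EIP's execution coverage (Equation (1))."""
    cov = pstdev(cpis) / mean(cpis)          # coefficient of variation
    coverage = len(cpis) / total_intervals   # fraction of all intervals
    return cov * coverage

# Hypothetical block: covers 4 of 8 intervals, with noisy CPI values.
score = optimization_opportunity([1.0, 2.0, 1.0, 2.0], total_intervals=8)
T_oo = 0.01                       # threshold used in the paper's experiments
print(score, score > T_oo)        # marked "unstable" if above the threshold
```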

V. EXPERIMENTAL RESULTS

A. Experimental Setting

We conducted our experiments on three different architectures. The first is an Intel Nehalem Core i7, CPU model Intel(R) Core2(R) CPU 975 @ 3.33GHz. The second is an Intel Xeon, CPU model Intel(R) Xeon(R) CPU E5410 @ 2.33GHz. The third is an Intel Itanium 2, CPU model Dual-Core Intel(R) Itanium(R) 2 Processor 9140M. We would like to know how different cache-sharing conditions affect the performance and optimization opportunities of multi-threaded


programs. We use SPEC OMP2001 as our benchmark suite with the medium-size dataset. Each benchmark runs with four threads, one per core, and is monitored during execution by Perfmon [6]. The sampling rate is one sample every 10^5 instructions.

We analyze the monitoring information provided by Perfmon to derive the representative EIPs and performance data. The results are recorded in four text files, one per thread, with the EIP and performance data of each sample. Next, we use the basic-block lookup table to map the EIPs to the corresponding basic blocks. The lookup table is generated off-line using PIN [5]. After mapping the EIPs to basic blocks, the samples are grouped into intervals. The interval size is set to 100 samples. In each interval, the most frequently occurring basic block is chosen as the representative of that interval. The performance behavior, measured in terms of the CPI, is evaluated based on the performance data of the samples in an interval.

B. Classification Results

Before comparing the results of each thread, we first examine the effectiveness of the classification. After identifying the representative basic block of each interval, we group intervals with the same representative basic block into a cluster and form the average BBV of the cluster. Then, in each cluster, we calculate the Manhattan distance between each BBV and the average BBV. The rationale for using the Manhattan distance is that, if two BBVs have a large Manhattan distance, the basic blocks they execute should be very different. We calculate the coefficient of variation (CoV) of the Manhattan distance for each cluster in each thread.

Figure 2. The average CoV of the L1 distance between BBVs with the same representative block on the Core i7

Figure 2 shows the average CoV of the Manhattan distance for the clusters in each thread at the O2 optimization level. We observe that most of the benchmarks have small average CoVs (under 0.3), which indicates that the BBVs in the same cluster are similar. In most cases, the CoVs of the clusters are small. However, for some benchmarks, such as art, the value is higher than 0.5. The reason is that, for those benchmarks, there are many other "frequent" basic blocks in addition to the most frequent one. If two execution sequences have the same mode basic block, they will be placed in the same cluster, which increases the cluster's CoV.
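The consistency check described above (Manhattan distance of each BBV to the cluster's average BBV, then the CoV of those distances) can be sketched as follows, with toy BBVs rather than measured data:

```python
from statistics import mean, pstdev

def manhattan(u, v):
    """Manhattan (L1) distance between two BBVs over their union of blocks."""
    blocks = set(u) | set(v)
    return sum(abs(u.get(b, 0.0) - v.get(b, 0.0)) for b in blocks)

def cluster_cov(bbvs):
    """CoV of each BBV's Manhattan distance to the cluster's average BBV."""
    blocks = set().union(*bbvs)
    avg = {b: mean(v.get(b, 0.0) for v in bbvs) for b in blocks}
    dists = [manhattan(v, avg) for v in bbvs]
    return pstdev(dists) / mean(dists)

# A tight toy cluster: the three normalized BBVs are nearly identical,
# so their distances to the average are small and consistent.
cluster = [{1: 0.60, 2: 0.40}, {1: 0.50, 2: 0.50}, {1: 0.55, 2: 0.45}]
print(round(cluster_cov(cluster), 3))
```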

C. Performance Comparison

In this subsection, we compare the performance results for each cluster.

Figure 3. CPI results for each thread of the benchmark Swim in SPEC OMP2001 at the O2 optimization level

1) Optimization Level: O0 vs. O2: Figure 3 shows the CPI results for each thread in the benchmark Swim at the O2 optimization level. A data point represents an interval. The x-axis represents the first EIP of the representative basic block, and the y-axis represents the CPI value. The symbol of an interval indicates the thread it is taken from. Data points with the same x-coordinate have the same representative basic block. If such data points are spread over a wide area, i.e., form a long vertical line, we say the basic block is "unstable". Conversely, if the data points are close to each other, the basic block is "stable" in terms of the CPI.

In our experiments, for most benchmarks at the O0 optimization level, the CPI values lie within the range [0.5, 1.5]. However, at the O2 optimization level, both the average CPI and the variance increase, as shown in Figure 3. This is because, with O2 optimization, static optimization techniques such as loop unrolling, function inlining, and register renaming are applied to generate more efficient code. The instructions used in the executables are more complicated than those without optimization, which results in larger CPI values. We look for optimization opportunities at the O2 level because the CPI variances are much more significant at that level.

2) Unstable Intervals: Figure 4 shows the optimization opportunities of the representative blocks in the benchmark Swim. The threshold T_oo is set at 0.01. If the optimization opportunity of a cluster exceeds T_oo, the cluster is marked as "unstable". Note that the threshold values are


Figure 4. Optimization opportunities of the benchmark Swim

determined by observation; further investigation is needed to obtain appropriate thresholds.

Comparing the results in Figure 3 and Figure 4 indicates that some basic blocks have large CPI variances but small optimization opportunity values. This is due to the small overall coverage of those representative blocks. From Figure 4, we observe that, for most basic blocks, the optimization opportunity value is small. Those with large optimization opportunity values are regarded as "unstable".

D. Comparison of Different Architectures

Figure 5. Optimization opportunity results of the benchmark Swim on Itanium 2

Figure 5 and Figure 6 show the optimization opportunities of the benchmark Swim running on different architectures. The threshold T_oo is set to 0.01. As we can observe, whether the cache is shared or not, optimization opportunities still exist. The main reason is that the cache is not the only resource shared among cores. Resources such as the front-side bus (FSB) or memory are also shared, and can cause competition that degrades performance and produces unstable behavior. Our method can identify those opportunities whether or not the architecture has a shared cache.

Figure 6. Optimization opportunity results of the benchmark Swim on Xeon

E. Effect of Dynamic Optimization

In this section, we assess whether the information about unstable blocks generated by our method helps improve a program's performance. For lack of a dynamic optimizer, we performed the optimizations at the source-code level. First, we chose from our results the unstable basic blocks, i.e., those whose optimization opportunity values are higher than the threshold T_oo. Given a basic block, we can locate it in the assembly code using the first EIP of the basic block. The assembly code is generated from the source code compiled by the Intel compiler. Thus, we can find the source code corresponding to the basic block through the assembly code, and apply some optimizations by hand.

1) Benchmark Swim: The benchmark swim is used as our case study. From Figure 4, we observe that there are three representative basic blocks with optimization opportunity values over the threshold T_oo. After mapping these three basic blocks back to the source code, we find that the corresponding code belongs to three loops, each of which processes a number of large arrays inside its loop body. Here is the code of one of the loops:

!$OMP PARALLEL DO REDUCTION (+:PCHECK, UCHECK, VCHECK)
      DO 4500 JCHECK = 1, MNMIN
        DO 3500 ICHECK = 1, MNMIN
          PCHECK = PCHECK + ABS(PNEW(ICHECK,JCHECK))
          UCHECK = UCHECK + ABS(UNEW(ICHECK,JCHECK))
          VCHECK = VCHECK + ABS(VNEW(ICHECK,JCHECK))
 3500   CONTINUE
        UNEW(JCHECK,JCHECK) = UNEW(JCHECK,JCHECK)
     1    * ( MOD (JCHECK, 400) / 400.)
 4500 CONTINUE
!$OMP END PARALLEL DO

Loop distribution is manually applied to this loop. The arrays PNEW, UNEW, and VNEW are too big to fit into the cache simultaneously, so parallelizing this loop may lead to serious cache competition. We separate this loop into three smaller


loops, each of which processes only one of the arrays in the original loop.

We compared the execution time of the modified code with that of the original. The modified program runs about 11 seconds faster than the original program, which takes about 5 minutes; the speedup is steady at about 3.5% in this case. Still, we note that this optimization was done by hand, and we only modified the loops in the source code that correspond to the unstable blocks. We expect a dynamic optimizer to achieve larger performance gains at runtime.

F. Summary of Experimental Results

The experimental results demonstrate that the proposed method is quite effective. Most benchmarks have small average CoVs (under 0.3), which indicates that the BBVs in the same cluster are similar. Figure 4 shows that, for most basic blocks, the optimization opportunity value is small; the blocks with optimization opportunity values above the threshold T_oo are marked as "unstable". We also compared the performance and optimization opportunity values across different architectures. Whether the cache is shared, and at which level, both affect performance, but optimization opportunities remain because the cache is not the only shared resource that can cause competition. For the benchmark swim, we applied simple optimization techniques, such as loop distribution, to the unstable regions of source code identified by our method. The results show a steady improvement in the program's performance after the optimization.

VI. CONCLUSION

We propose a simple and fast method to find opportunities for dynamic optimization. The method can be applied to single- and multi-threaded programs. By mapping the EIP of each sample back to its basic block and choosing a representative, we can compare the performance of each thread in terms of the CPI, and calculate the performance variance of each basic block. If the variance is small, we say that the basic block is "stable". Conversely, if the variance is large, we mark the basic block as "unstable", a possible optimization opportunity. Large performance variances may occur for a number of reasons; for example, a basic block may contain many memory instructions, or there may be intense competition between threads for the cache. Dynamic optimizers can utilize the information about unstable regions to improve a program's performance during execution.

Our experimental results show that there are performance variances in multi-threaded programs. We define an "optimization opportunity" value to measure the variance. If a basic block's optimization opportunity value is over the threshold T_oo, then it is "unstable" and should be optimized at runtime. We compared the performance and optimization opportunity results on three architectures with different cache-sharing conditions. Unstable basic blocks can be found on all three architectures because the cache is not the only shared resource. The benchmark Swim is used as a case study: we apply loop distribution to the source-code segments that correspond to the unstable basic blocks identified by our method. The optimization yields about a 3.5% improvement in execution time.

There are still some issues to address. First, we need an optimizer to help us evaluate the overall performance gain achieved by applying optimization techniques to a program using our analysis information. Another issue involves finding a way to compare the performance of the same code segment at different optimization levels. The objective is to determine the optimization level that is most suitable for a procedure or a sequence of basic blocks in a program. Compilers might be able to use such information to perform more accurate optimizations. Analyzing other benchmarks on different architectures will also be an important part of our future work.

REFERENCES

[1] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In CGO '03: Proceedings of the International Symposium on Code Generation and Optimization, pages 265-275, Washington, DC, USA, 2003. IEEE Computer Society.

[2] Y. Jiang, E.Z. Zhang, K. Tian, F. Mao, M. Gethers, X. Shen, and T. Gao. Exploiting statistical correlations for proactive prediction of program behaviors. In CGO '10: Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, pages 248-256, New York, NY, USA, 2010. ACM.

[3] T. Kistler and M. Franz. Continuous program optimization: A case study. ACM Trans. Program. Lang. Syst., 25(4):500-548, 2003.

[4] J. Lu, H. Chen, P.C. Yew, and W.C. Hsu. Design and implementation of a lightweight dynamic optimization system. Journal of Instruction-Level Parallelism, 6, April 2004.

[5] C.K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V.J. Reddi, and K. Hazelwood. Pin: Building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proceedings of the 2005 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 190-200, New York, NY, USA, 2005. ACM.

[6] Perfmon. http://perfmon2.sourceforge.net/.

[7] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically characterizing large scale program behavior. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 45-57, New York, NY, USA, 2002. ACM.

[8] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder. Discovering and exploiting program phases. IEEE Micro, 23(6):84-93, 2003.
