Upload
rachel-fitzgerald
View
22
Download
0
Embed Size (px)
DESCRIPTION
Learning A Better Compiler. Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning. Predicting Unroll Factors. Loop Unrolling sensitive to unroll factor Current solution: expert design - PowerPoint PPT Presentation
Citation preview
Learning A Better Compiler
Predicting Unroll Factors using Supervised Classification
And
Integrating CPU and L2 Cache Voltage Scaling using Machine Learning
Predicting Unroll Factors
• Loop Unrolling sensitive to unroll factor
• Current solution: expert design– Difficult: Hand-tuned heuristics– Must be rewritten frequently
• Predict parameters with machine learning– Easy: data collection takes ~1wk
• No human time
– Algorithm does not change with compiler
Loop Unrolling
• Combines multiple iterations loop body
• Fewer Iterations Less Branching
• Allows other transformations:– Exposes adjacent memory locations– Allows instruction reordering across
iterations
Unroll Factors
• How many iterations to combine?
• Too few?– Provides little benefit
• Too large– Increased cache pressure– Increase live rangeregister pressure
QuickTime™ and a decompressor
are needed to see this picture.
Optimal Unroll Factors
Classification Problems
• Input a vector of features– E.g. nest depth, # of branches, # of ops
• Output a class– E.g. unroll factor, 1-8
• No prior knowledge required– Meaning of features/classes– Relevance of features– Relationships between features
Nearest Neighbors
• Paper describes Kernel Density Estimator
• All dimensions normalized to [0,1]
• Given a test point p:– Consider training points “close” to p
• Within fixed distance, e.g. 0.3
– Majority vote among qualifying training points
Nearest Neighbors
QuickTime™ and a decompressor
are needed to see this picture.
Support Vector Machine
• Assume two classes, easily generalized
• Transform data– Make classes linearly separable
• Find line to maximize sep. margin
• For test point:– Perform transformation– Classify based on learned line
Maximal Margin
QuickTime™ and a decompressor
are needed to see this picture.
Non-Linear SVM
QuickTime™ and a decompressor
are needed to see this picture.
Some Features
• # operands• Live range size• Critical path length• # operations• Known tripcount• # floating point ops• Loop nest level• # branches
• # memory ops• Instruction fan-in in
DAG• # instructions• Language: C, fortran• # memory ops• # Implicit instructions• & more (38 total)
Results: No Software Parallelism
QuickTime™ and a decompressor
are needed to see this picture.
Results: With Software Parallelism
QuickTime™ and a decompressor
are needed to see this picture.
Big Idea: Easy Maintenance• Performance improvements modest
– Sometimes worse, sometimes much better– Usually little change
• Requires no re-tuning to change compiler– Gathering data takes ~1wk, no human time
• General mechanism– Can be applied to all parameters– No model of system needed
• Can be applied to new transformations where expert knowledge is unavailable
Integrated CPU and L2 Cache Voltage Scaling using Machine
Learning
Dynamic Voltage Control
• Monitor system
• When activity is low, reduce power– Also reduces computational capacity– May need more energy if work takes longer
Multiple Clock Domains
• Adjust separate components independently
• Better performance/power– E.g. CPU-bound application may be able to
decrease power to memory and cache without affecting performance
• More complex DVM policy
Motivation
• Applications go through phases
• Frequency/voltages should change too
• Focus on core, L2 cache– Consume large fraction of total power
• Best policy may change over time– On battery: conserve power– Plugged in: maximize performance
Learning a DVM Policy
• Compiler automatically instruments code– Insert sampling code to record perf. Counters– Instrument code only to gather data
• Use machine learning to create policy
• Implement policy in microcontroller
ML Parameters
• Features– Clock cycles per instruction– L2 accesses per instruction– Memory access per instruction
• Select voltage to minimize:– Total energy– Energy*delay
Machine Learning Algorithm
• Automatically learn set of if-then rules– E.g: If (L2PI >= 1) and (CPI <=0) then
f_cache=1GHz
• Compact, expressive
• Can be implemented in hardware
QuickTime™ and a decompressor
are needed to see this picture.
Results
• Compared to independently managing core and L2:– Saves 22% on average, 46% max
• Learns effective rules from few features• Compiler modifications instrument code• Learned policy offline• Implemented policy in microcontroller
Conclusion
• Machine learning derives models from data automatically
• Allows easy maintenance of heuristics
• Creates models that are more effective than hand-tuned