Working Group on Methodology for Optimizing
Multilevel Parallelism
Fialho, Gimenez, Tallent, Welton, Morris, Malony, Montoya and Browne
Working Assumptions: “Optimal” Parallelism = Optimum Productivity
• Formulate the performance optimization problem as finding “optimal” parallelism
• Best possible balance of the several modes of parallelism:
  • Intra-core
  • Intra-chip
  • Intra-node
  • Inter-node
• Multiple interacting factors, each with many options:
  • Intra-chip memory access
  • Intra-node memory access
  • Concurrency (threading, vectorization, acceleration)
  • Internode communication
  • Load balance
• Optimization with consideration of interactions
Current Status of Tools
• Separate tools for optimizing each factor
• Separate tools for optimizing each mode of parallelism
• Several different tools for each factor or mode of parallelism are available
• Frameworks for integration of tools and/or creating “workflows” are available
• How do we determine appropriate and consistent workflows or framework instances from the tools?
Apply a Conceptual Process
1. Specify what is to be optimized
2. Specify the metrics needed to diagnose the bottleneck and recommend the optimization
3. Define the algorithms for diagnosing bottlenecks and recommending optimizations in terms of the metrics
4. Determine the information needed to evaluate those metrics
5. Specify how to obtain the information
Generate a methodology (workflow) from the conceptual process
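One way to read steps 2 through 4 of the conceptual process is as a table of diagnosis rules, each naming the metrics it needs. The sketch below is only an illustration of that reading: the `Rule` type, the metric names (`max_task_time`, `mean_task_time`, `vector_ops`, `total_ops`), and the thresholds are all hypothetical and are not taken from the working group's material.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Rule:
    """One diagnosis rule: a predicate over metrics plus a recommendation."""
    bottleneck: str
    needs: List[str]                     # step 4: information the rule depends on
    applies: Callable[[Metrics], bool]   # step 3: the diagnosis algorithm
    recommendation: str


# Hypothetical rules; metric names and thresholds are illustrative assumptions.
RULES = [
    Rule("load imbalance", ["max_task_time", "mean_task_time"],
         lambda m: m["max_task_time"] > 1.2 * m["mean_task_time"],
         "rebalance work across tasks"),
    Rule("poor vectorization", ["vector_ops", "total_ops"],
         lambda m: m["vector_ops"] / m["total_ops"] < 0.5,
         "restructure inner loops so the compiler can vectorize them"),
]


def diagnose(metrics: Metrics, rules: List[Rule] = RULES) -> List[str]:
    """Report a recommendation per triggered rule, or the metrics still to collect."""
    findings = []
    for rule in rules:
        missing = [name for name in rule.needs if name not in metrics]
        if missing:
            # Step 5 is still open for this rule: the data has not been obtained.
            findings.append(f"{rule.bottleneck}: need to collect {', '.join(missing)}")
        elif rule.applies(metrics):
            findings.append(f"{rule.bottleneck}: {rule.recommendation}")
    return findings
```

Listing the required metrics on each rule makes step 4 explicit: the union of all `needs` fields is the information a workflow built from these rules would have to gather.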
Two Cases
• “Optimize” parallelism of the application for a given execution environment and input data set with only “local” restructuring
  • Only “local” source code changes
  • No algorithm changes
• Restructure/re-engineer the application to attain “optimal” parallelism on (possible) execution environments
  • Componentize code
  • Choose different algorithms
  • Evaluate different component parts and optimize across “components”
• Workflows are different for each, but certainly overlap
Optimization Information Requirements
• Need to incorporate multiple types of information
• “Optimize” with only “local” modification
  • Source code
  • Execution environment
  • Runtime behavior
• Optimize with restructuring
  • Domain
  • Algorithm
  • Source code/execution environment/runtime behavior
Conceptual Workflow
Local (Inside out) Optimization Workflow
Assumptions: application structure, execution environment and initial conditions/inputs are fixed
1. Ensure load balance and choose optimal affinity mappings, etc.
2. Maximize intra-node efficiency
   1. Intra-core – Maximize vectorization and core-local memory access
   2. Intra-chip – Optimize chip-local memory access
   3. Intra-node – Minimize NUMA accesses
   4. Intra-node – Choose optimal number of tasks/threads
3. Minimize internode communication cost
4. If nodes are at “roofline” for computation or memory bandwidth, then optimize internode communication
5. If nodes are not bottlenecked on either computation or memory bandwidth, then reallocate data to minimize the number of nodes used
6. Go to step 2 and repeat
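The branch between steps 4 and 5 can be sketched with the standard roofline bound, in which attainable performance is the smaller of peak compute and peak memory bandwidth times arithmetic intensity. The function names and the `tolerance` cutoff for "at roofline" are assumptions made for illustration; the slides do not say how close to the roof a node must be.

```python
def attainable_gflops(peak_gflops: float, peak_bw_gbs: float, intensity: float) -> float:
    """Roofline bound: min(peak compute, memory bandwidth * arithmetic intensity).

    intensity is in FLOPs per byte moved to or from memory.
    """
    return min(peak_gflops, peak_bw_gbs * intensity)


def next_step(measured_gflops: float, peak_gflops: float, peak_bw_gbs: float,
              intensity: float, tolerance: float = 0.8) -> str:
    """Choose between workflow steps 4 and 5 for one node.

    tolerance is a hypothetical cutoff: a node counts as "at roofline" when it
    reaches at least that fraction of its attainable performance.
    """
    roof = attainable_gflops(peak_gflops, peak_bw_gbs, intensity)
    if measured_gflops >= tolerance * roof:
        # Step 4: the node is saturated, so remaining gains are between nodes.
        return "optimize internode communication"
    # Step 5: the node has headroom, so pack the data onto fewer nodes.
    return "reallocate data to use fewer nodes"
```

For example, a node with 100 GFLOP/s peak and 50 GB/s bandwidth running a kernel at 0.5 FLOPs/byte has a roof of 25 GFLOP/s, so a measured 10 GFLOP/s falls under step 5 with the assumed cutoff.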
Questions for Further Discussion
• What is the model for restructuring applications to attain “optimal” parallelism?
• Can we construct “roofline” analytical models for factors such as vectorization, threading and communication?
• How can we combine software restructuring tools with performance optimization tools to get an “optimal” restructuring workflow?
• Roles for offline and online optimization?