July 15, 2014
Architecting an autograder for parallel code
Razvan Carbunescu, Aditya Devarakonda, Jay Alameda,
James Demmel, Steven I. Gordon, Susan Mehringer
Talk Outline
• Course that motivated autograder
• Autograder concepts and challenges
• Autograder implementation
• Course results
XSEDE Parallel Computing Course
• Created from UC Berkeley course CS267
• Lectures converted for online use (quizzes added)
• Programming assignments require autograder
• Course offered in 2013 for ‘Certificate of Completion’
• Course offered in 2014 for credit at 18 universities in the US and abroad with local instructors
Universities offering course for credit
Programming assignments
• HW1 – Optimizing Matrix Multiply
• HW2 – Parallel Particle Simulator
• HW3 – Parallel Knapsack
* Bottom picture taken from Wikipedia article on Knapsack
C(i,j) = C(i,j) + A(i,:) * B(:,j)
HW 1 – Optimizing Matrix Multiply
• Naïve code has 3 nested loops but achieves only 3% of arithmetic peak (sketch below)
• Students given naïve and blocked code, must provide ‘efficient’ code
• Students learn about: memory access, caching, SIMD and using libraries
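For reference, a minimal sketch of the naïve three-loop kernel in the spirit of the starter code. The square_dgemm interface name comes from the assignment, but the exact signature and the column-major layout used here are assumptions:

// Naïve three-loop multiply: C(i,j) = C(i,j) + A(i,:) * B(:,j)
// Column-major storage and this exact signature are illustrative assumptions.
void square_dgemm(int n, const double* A, const double* B, double* C)
{
    for (int j = 0; j < n; ++j)          // columns of C
        for (int i = 0; i < n; ++i) {    // rows of C
            double cij = C[i + j * n];
            for (int k = 0; k < n; ++k)  // inner product over k
                cij += A[i + k * n] * B[k + j * n];
            C[i + j * n] = cij;
        }
}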
HW 2 – Parallel Particle Simulation
• Simplified particle simulator
• Introduces OpenMP, MPI and CUDA
• Students given working O(n^2) code and must provide O(n) code (binning sketch below)
• Students learn about: synchronization, locks and domain decomposition
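A rough sketch of the binning idea behind the O(n) version: particles are placed into bins at least one interaction cutoff wide, so each particle only has to be checked against its own bin and its neighbors. All names and the 2D layout are illustrative assumptions, not the assignment's actual code:

#include <algorithm>
#include <vector>

struct Particle { double x, y; };  // illustrative 2D particle

// Assign each particle to a square bin of side >= cutoff, so all interactions
// involve only a bin and its 8 neighbors.
void build_bins(const std::vector<Particle>& parts, double box_size, double cutoff,
                std::vector<std::vector<int>>& bins, int& nbins)
{
    nbins = std::max(1, static_cast<int>(box_size / cutoff));
    bins.assign(nbins * nbins, std::vector<int>());
    for (int i = 0; i < static_cast<int>(parts.size()); ++i) {
        int bx = std::min(nbins - 1, static_cast<int>(parts[i].x / box_size * nbins));
        int by = std::min(nbins - 1, static_cast<int>(parts[i].y / box_size * nbins));
        bins[bx + by * nbins].push_back(i);  // each particle lands in exactly one bin
    }
}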
HW 3 – Parallel Knapsack
• 0-1 Knapsack problem
• Introduces UPC
• Students given inefficient parallel UPC code
• Students learn about: analyzing/minimizing communication, pipeline parallelism
Talk Outline
• Course that motivated autograder
• Autograder concepts and challenges
• Autograder implementation
• Course results
Autograder Concepts
• Testing Correctness
• Testing Performance
• Feedback / automation
• Resource management
Correctness
• What is the right answer? Does it exist?
(Diagram: results accepted within a ±ε band around the reference; for some problems no single reference exists)
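One way to read the ±ε picture: two floating-point results count as the same answer if they agree to within a small relative tolerance. A generic sketch, with the tolerance value and function name as illustrative choices rather than the course's:

#include <algorithm>
#include <cmath>

// Accept `computed` if it lies within a relative band of +/- eps around `reference`.
bool close_enough(double computed, double reference, double eps = 1e-9)
{
    double scale = std::max({std::fabs(computed), std::fabs(reference), 1.0});
    return std::fabs(computed - reference) <= eps * scale;
}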
Correctness
Problems introduced by parallelism
• Race conditions (non-benign)
• Deadlock / livelock / starvation
• Floating Point and non-determinism
Problems exacerbated by parallelism
• Output size compared to input (gathering, testing)
• Input type and size (precomputed vs random)
Performance
• What is a ‘fast’ or ‘good’ parallel code?
(Plots: strong scaling vs. weak scaling)
Performance
• Sequential metrics: time, percentage of peak
• Strong scaling and speedup
• Weak scaling
• Input dependent performance
• Overhead of correctness check
• Overhead of I/O operations
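For concreteness, the standard definitions behind these metrics, as a sketch (the course may weight or combine them differently):

// T(1) = time on one processor/thread, T(p) = time on p processors/threads.
double speedup(double t1, double tp)                  { return t1 / tp; }        // S(p) = T(1) / T(p)
double strong_efficiency(double t1, double tp, int p) { return t1 / (p * tp); }  // fixed total problem size
double weak_efficiency(double t1, double tp)          { return t1 / tp; }        // problem size grows with p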
Feedback / automation
• Providing fast correctness answer
• Providing performance data
• Submission/grade feedback
• Multiple submission capability
• Need for adaptability
Resource Management
• Allocation time vs scaling tests
• Latency due to utilization
• Student limits on allocation
Talk Outline
• Course that motivated autograder
• Autograder concepts and challenges
• Autograder implementation
• Course results
Autograder implementation
• Split into 2 parts: autograder.cpp and grade.py
Autograder.cpp
• Focuses on correctness and performance
• Given to students at start of assignment
• Parts integrated in assignment starting code
• Uses other auxiliary files (job scripts, etc.)
• Instant feedback to student
• Limited scaling information
• Varies heavily from assignment to assignment
HW1 Autograder Implementation
• Floating point round-off meant using error norm instead of equalities for correctness checks
• Performance was determined from percentage of peak floating point rate
• Students required to provide defined interface function square_dgemm with compilation options included as comments
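A sketch of what an error-norm check for square_dgemm might look like: compare the student's result against a reference product and accept it if the relative Frobenius-norm error is small. The 1e-10 threshold and this interface are illustrative assumptions, not the autograder's actual bound:

#include <cmath>
#include <vector>

// Accept the student's C if it matches the reference product to within a small
// relative error norm, so benign floating-point round-off does not fail the check.
bool check_dgemm(int n, const std::vector<double>& C_student,
                 const std::vector<double>& C_reference)
{
    double err = 0.0, ref = 0.0;
    for (int i = 0; i < n * n; ++i) {
        double d = C_student[i] - C_reference[i];
        err += d * d;
        ref += C_reference[i] * C_reference[i];
    }
    return std::sqrt(err) <= 1e-10 * std::sqrt(ref);  // illustrative tolerance
}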
HW2 Autograder Implementation
• No previous correctness check existed except visual inspection
• Implemented empirical statistical checks based on the average and minimum interaction distances for particles
• I/O and correctness turned off for performance runs
• Performance determined from the fitted exponent x of the O(n^x) serial algorithm (sketch below), from average strong and weak scaling for 1-16 threads for OpenMP and MPI, and from speedup for different problem sizes for CUDA
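One simple way to estimate the exponent x in a T(n) ≈ c · n^x runtime model is the log-log slope between two timed problem sizes. This two-point fit and its interface are illustrative assumptions, not necessarily how the autograder computes it:

#include <cmath>

// Fit x in T(n) ~ c * n^x from two (n, time) measurements via the log-log slope.
double fit_exponent(double n1, double t1, double n2, double t2)
{
    return std::log(t2 / t1) / std::log(n2 / n1);
}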
HW3 Autograder Implementation
• Correctness was implemented via value check
• Used average strong and weak scaling efficiency for 1-16 threads and 16-256 threads to check the 2 different stages of UPC (shared and distributed); see the sketch below
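A sketch of averaging strong-scaling efficiency over a thread range such as 1-16 or 16-256, with the smallest count in the range as the baseline; the interface and averaging choice are illustrative assumptions:

#include <map>

// time_by_threads maps thread count p -> measured time T(p); assumed non-empty.
double average_strong_efficiency(const std::map<int, double>& time_by_threads)
{
    int p0 = time_by_threads.begin()->first;        // baseline thread count in the range
    double t0 = time_by_threads.begin()->second;
    double sum = 0.0;
    int count = 0;
    for (const auto& [p, t] : time_by_threads) {
        if (p == p0) continue;
        sum += (t0 * p0) / (t * p);                  // efficiency relative to the baseline
        ++count;
    }
    return count > 0 ? sum / count : 1.0;
}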
Grade.py
• Focuses on final runs and calculating grades
• Very easily modifiable
• Relatively few changes between assignments
• Uses a private copy of autograder.cpp for correctness/performance checks
• Not available to students
Talk Outline
• Course that motivated autograder
• Autograder concepts and challenges
• Autograder implementation
• Course results
Course results
• Universities used different grading schemes based on data from autograder
• High drop-off for undergraduate students (CS267 is a graduate course)
• Students worked individually or in groups of 2
• Most universities had HW3 marked as optional to allow for extra time for final projects
Homework results
• ~150 students started (includes audits)
• 75 HW1 submissions (Max: 94, Median: 41)
• 57 HW2 submissions (Max: 97, Median: 30)
• 17 HW3 submissions (Max: 10, Median: 5)
• 2013 had 345 students and 36/23/18 submissions, with 18 ‘Certificates of Completion’
• From the universities that finished and communicated data (4 out of 18): 38 starting students, 25 finished the course, with 17 A’s, 4 B’s, 2 C’s, and the rest auditing