July 15, 2014

Architecting an autograder for parallel code

Razvan Carbunescu, Aditya Devarakonda, Jay Alameda,

James Demmel, Steven I. Gordon, Susan Mehringer

Talk Outline

• Course that motivated autograder

• Autograder concepts and challenges

• Autograder implementation

• Course results

XSEDE Parallel Computing Course

• Created from UC Berkeley course CS267

• Lectures converted for online use (quizzes added)

• Programming assignments require autograder

• Course offered in 2013 for ‘Certificate of Completion’

• Course offered in 2014 for credit at 18 universities in the US and abroad with local instructors

Universities offering course for credit

Programming assignments

• HW1 - Optimizing Matrix Multiply

• HW2 - Parallel Particle Simulator

• HW3 – Parallel Knapsack

* Bottom picture taken from Wikipedia article on Knapsack

C(i,j) = C(i,j) + A(i,:) * B(:,j)

HW 1 – Optimizing Matrix Multiply

• Naïve code is only 3 loops but also reaches only 3% of arithmetic peak

• Students are given naïve and blocked code and must provide ‘efficient’ code

• Students learn about: memory access, caching, SIMD and using libraries
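To make the naïve and blocked starting points concrete, here is a minimal Python sketch of both loop structures. It is only an illustration: the course code is not shown in these slides, and the function names and the `block` parameter are assumptions. Both versions compute C = C + A * B; the blocked one works on small tiles so that operands are reused while they are still in cache.

```python
def naive_matmul(A, B, n):
    """Three nested loops over n x n matrices stored as lists of rows."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C


def blocked_matmul(A, B, n, block=2):
    """Same product, computed tile by tile (block size is a hypothetical knob)."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for jj in range(0, n, block):
            for kk in range(0, n, block):
                # Multiply one tile of A by one tile of B into a tile of C.
                for i in range(ii, min(ii + block, n)):
                    for j in range(jj, min(jj + block, n)):
                        acc = C[i][j]
                        for k in range(kk, min(kk + block, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C
```

In an interpreted language the blocking brings no speedup; the sketch only shows the loop ordering that a compiled implementation would tune.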

HW 2 – Parallel Particle Simulation

• Simplified particle simulator

• Introduces OpenMP, MPI and CUDA

• Students are given working O(n^2) code and must provide O(n) code

• Students learn about: synchronization, locks and domain decomposition
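One common way to get from O(n^2) to roughly O(n) work is to bin particles into cells the size of the interaction cutoff, so each particle only looks at its own and neighbouring cells. The slides do not say which approach students were expected to use, so the sketch below is an assumption. It is serial, in 2D, and all names (`pairs_naive`, `pairs_binned`, `cutoff`) are hypothetical.

```python
from collections import defaultdict


def _close(p, q, cutoff):
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * dx + dy * dy < cutoff * cutoff


def pairs_naive(points, cutoff):
    """O(n^2): test every pair of particles."""
    found = set()
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if _close(points[i], points[j], cutoff):
                found.add((i, j))
    return found


def pairs_binned(points, cutoff):
    """Near O(n) when particle density is bounded: test only adjacent cells."""
    bins = defaultdict(list)
    for index, (x, y) in enumerate(points):
        bins[(int(x // cutoff), int(y // cutoff))].append(index)
    found = set()
    for (bx, by), members in bins.items():
        for nx in (bx - 1, bx, bx + 1):
            for ny in (by - 1, by, by + 1):
                for j in bins.get((nx, ny), ()):
                    for i in members:
                        if i < j and _close(points[i], points[j], cutoff):
                            found.add((i, j))
    return found
```

The same cell structure is a natural unit for domain decomposition, since each thread or process can own a contiguous group of cells.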

HW 3 – Parallel Knapsack

• 0-1 Knapsack problem

• Introduces UPC

• Students are given inefficient parallel UPC code

• Students learn about: analyzing/minimizing communication, pipeline parallelism
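For reference, the standard serial dynamic program for 0-1 knapsack can be sketched as below. This is not the UPC code handed to students, and the function name is an assumption. Each item's update reads only the table left by the previous item, which is the dependency a parallel version has to communicate.

```python
def knapsack_01(weights, values, capacity):
    """Best total value using each item at most once, via a 1-D DP table."""
    best = [0] * (capacity + 1)
    for weight, value in zip(weights, values):
        # Go downward so each item is counted at most once.
        for c in range(capacity, weight - 1, -1):
            best[c] = max(best[c], best[c - weight] + value)
    return best[capacity]
```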

Autograder Concepts

• Testing Correctness

• Testing Performance

• Feedback / automation

• Resource management

Correctness

• What is the right answer? Does it exist?


Correctness

Problems introduced by parallelism

• Race conditions (non-benign)

• Deadlock / livelock / starvation

• Floating Point and non-determinism

Problems exacerbated by parallelism

• Output size compared to input (gathering, testing)

• Input type and size (precomputed vs random)
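A small illustration of the floating-point point above: adding the same numbers in a different order, as a parallel reduction may do, can change the last bits of the result. The helper name is hypothetical, and the sketch only shows why an exact equality test is fragile.

```python
import math


def sum_in_order(values, order):
    """Accumulate values in the given index order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


values = [0.1, 0.2, 0.3]
left_to_right = sum_in_order(values, [0, 1, 2])
right_to_left = sum_in_order(values, [2, 1, 0])

# The two sums differ in the last bit, yet agree within a tolerance.
differs_exactly = left_to_right != right_to_left
agrees_within_tolerance = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
```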

Performance

• What is a ‘fast’ or ‘good’ parallel code?

[Figure: strong scaling and weak scaling plots]

Performance

• Sequential metrics: time, percentage of peak

• Strong scaling and speedup

• Weak scaling

• Input dependent performance

• Overhead of correctness check

• Overhead of I/O operations
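The scaling metrics listed above have standard textbook definitions, sketched here. The function and parameter names are assumptions, and the exact formulas used by the course autograder are not given in the slides.

```python
def speedup(time_serial, time_parallel):
    """How many times faster the parallel run is than the serial run."""
    return time_serial / time_parallel


def strong_scaling_efficiency(time_one, time_p, p):
    """Fixed total problem size: ideal time on p workers is time_one / p."""
    return time_one / (p * time_p)


def weak_scaling_efficiency(time_one, time_p):
    """Problem size grows with p: ideal time stays equal to time_one."""
    return time_one / time_p
```

For example, a code that takes 8.0 s on one thread and 2.5 s on four threads has a strong scaling efficiency of 8.0 / (4 * 2.5) = 0.8.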

Feedback / automation

• Providing fast correctness answer

• Providing performance data

• Submission/grade feedback

• Multiple submission capability

• Need for adaptability

Resource Management

• Allocation time vs scaling tests

• Latency due to utilization

• Student limits on allocation

Autograder implementation

• Split into 2 parts: autograder.cpp and grade.py

Autograder.cpp

• Focuses on correctness and performance

• Given to students at start of assignment

• Parts integrated in assignment starting code

• Used other auxiliary files (job scripts, etc.)

• Instant feedback to student

• Limited scaling information

• Varies heavily from assignment to assignment

HW1 Autograder Implementation

• Floating point round-off meant using error norm instead of equalities for correctness checks

• Performance was determined from percentage of peak floating point rate

• Students required to provide defined interface function square_dgemm with compilation options included as comments
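A norm-based correctness check of this kind could look like the sketch below. It compares a student result against a reference using a relative max-norm. The norm, the tolerance value and the function names are all assumptions, not the check used in autograder.cpp.

```python
def relative_max_error(computed, reference):
    """Largest absolute difference, scaled by the largest reference entry."""
    worst_diff = 0.0
    largest_ref = 0.0
    for row_c, row_r in zip(computed, reference):
        for c, r in zip(row_c, row_r):
            worst_diff = max(worst_diff, abs(c - r))
            largest_ref = max(largest_ref, abs(r))
    return worst_diff / largest_ref if largest_ref > 0.0 else worst_diff


def passes_check(computed, reference, tolerance=1e-12):
    """Accept results that match the reference up to round-off (hypothetical tolerance)."""
    return relative_max_error(computed, reference) <= tolerance
```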

HW2 Autograder Implementation

• No previous correctness check except visual

• Implemented empirical statistical checks based on the average and minimum interaction distances for particles

• I/O and correctness turned off for performance runs

• Performance was determined from the coefficient x of the O(n^x) serial algorithm, the average strong and weak scaling for 1-16 threads for OpenMP and MPI, and the speedup for different problem sizes for CUDA
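One way to estimate x for an O(n^x) code from timings is a least-squares line fit in log-log space, where the slope is the exponent. The slides do not describe how the autograder computed it, so this sketch and its names are assumptions.

```python
import math


def fit_exponent(sizes, times):
    """Slope of log(time) against log(size), an estimate of x in O(n^x)."""
    logs_n = [math.log(n) for n in sizes]
    logs_t = [math.log(t) for t in times]
    mean_n = sum(logs_n) / len(logs_n)
    mean_t = sum(logs_t) / len(logs_t)
    covariance = sum((a - mean_n) * (b - mean_t) for a, b in zip(logs_n, logs_t))
    variance = sum((a - mean_n) ** 2 for a in logs_n)
    return covariance / variance
```

With timings that grow exactly as n^2 the fit returns a value close to 2, while a linear-time submission would give a value close to 1.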

HW3 Autograder Implementation

• Correctness was implemented via value check

• Performance used average strong and weak scaling efficiency for 1-16 threads and 16-256 threads to check the 2 different stages of UPC (shared and distributed)

Grade.py

• Focuses on final runs and calculating grades

• Very easily modifiable

• Relatively few changes between assignments

• Uses a private copy of autograder.cpp for correctness/performance checks

• Not available to students

Course results

• Universities used different grading schemes based on data from autograder

• High drop-off for undergraduate students (CS267 is a graduate course)

• Students worked individually or in groups of 2

• Most universities had HW3 marked as optional to allow for extra time for final projects

Homework results

• ~150 students started (includes audits)

• 75 HW1 submissions Max:94 Median:41

• 57 HW2 submissions Max:97 Median:30

• 17 HW3 submissions Max:10 Median:5

• 2013 had 345 students and 36/23/18 submissions with 18 ‘Certificates of Completion’

• From universities that finished and communicated data (4 out of 18) we have 38 starting students and 25 that finished the course, with 17 A’s, 4 B’s, 2 C’s and the rest auditing