
Fully Automatic Assessment of Programming Exercises

Riku Saikkonen, Lauri Malmi, and Ari Korhonen
Department of Computer Science and Engineering
Helsinki University of Technology, Finland
{rjs, lma, archie}@cs.hut.fi

Abstract

Automatic assessment of programming exercises has become an important method for grading students' exercises and giving them feedback in mass courses. We describe a system called Scheme-robo, which has been designed for assessing programming exercises written in the functional programming language Scheme. The system assesses individual procedures instead of complete programs. In addition to checking the correctness of students' solutions, the system provides several tools for analysing other properties of a program, such as its structure, its running time, and possible plagiarism. The system has been in production use on our introductory programming course, with some 300 students, for two years with good results.

1 Introduction

Introductory programming courses often include many small exercises designed to teach students to write small programs first. After solving these, the students can move on to larger assignments. This kind of course organisation has also been used at Helsinki University of Technology on the basic programming courses.

The course given for students studying computer science as their major is based on the functional programming language Scheme [1,6]. Scheme is not widely used in "practical programming", but over 250 colleges, universities, and secondary schools use Scheme in their curricula [7]. We use it on our course because Scheme provides better possibilities for presenting and teaching many important and complex concepts in computer science than more commonly used programming languages such as Java or C. Unfortunately, many practically oriented students have not been very motivated to learn an academic language. We have therefore used small but mandatory weekly programming exercises to keep the students active during the whole course, instead of letting them try to catch up with the course just before the exams.

In 1997, due to the rapidly increasing number of students, we did not have enough resources to assess the exercises manually. We had to stop assessing them and make the exercises voluntary.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ITiCSE 2001 6/01 Canterbury, UK
© 2001 ACM ISBN 1-58113-330-8/01/06...$5.00

This was a serious drawback for the results of the course, because voluntary exercises did not support very well the aforementioned aim of keeping students busy.

In order to return to the previous practice, the grading load needed to be cut down heavily. One solution to this problem is to assess the exercises automatically or semi-automatically. Many such systems have recently been presented for programming courses [2,4], data structures and algorithms courses [5], and others [3]. We have used the Ceilidh system [2] since 1994 to assess exercises written in C or Java on the basic programming course for students studying computer science as their minor.

The main principle of Ceilidh is to use string matching to compare the output of the student's program with the model output given by the teacher. This approach is not very suitable for our Scheme course, because in most exercises students write single Scheme procedures instead of complete programs. This is closer to the idea of implementing algorithms than to writing programs. Moreover, it seems unimportant to require the students to write trivial code just for reading input and writing output values. We would therefore have had to make artificial modifications to the exercises, which we did not want to do.

One solution would have been to use a semi-automatic approach such as the one presented by Jackson [4]. This approach requires the teacher to monitor the assessment process and to do some of it manually. However, even this would cause too much work on large courses with many exercises.

To solve the problem we developed Scheme-robo, an automatic assessment system for Scheme exercises. It assesses exercises completely automatically, without human intervention.

Scheme-robo has a number of features that we consider important. First, the assessment is carried out on-line, so that students get their results almost immediately and can resubmit a failed exercise after reconsidering their solution. Limiting the number of submissions forces them to think about the reason for the failure instead of plain trial-and-error debugging. Second, we avoid the problem of comparing the expected result with slightly different output from student code [4], since most of the exercises are Scheme procedures that return a value instead of printing it out. Return values can be analysed within the Scheme language, usually by simply comparing them to the correct answer using a standard equality predicate. Third, the Lisp-like syntax enables us to easily analyse the structure of the students' code. This is very useful in assignments where students fill in given exercise code; the same feature can be used to detect plagiarism as well. Fourth, we can set restrictions on what language constructs the students are allowed to use in particular exercises. Fifth, Scheme-robo allows us to use randomised tests. Finally, the use of Scheme allows us to run code in a fully safe "sandbox", so that malicious or non-terminating code does not harm the assessment. We can actually use the run-time control even for rough run-time analysis; i.e., if a linear-time solution is required, our system catches possible O(N²) algorithms.


The system has been in production use for two years, with about 60 exercises and 300 students per year, and it has worked surprisingly well. All exercises require the student to write either Scheme code or short verbal answers (for example, complexity figures).

The structure of this paper is the following. In the next section we explain briefly how students use the system. In Section 3 we consider more closely how the student code is analysed in various ways and give an example of setting up an exercise. In Section 4 we present a general view of the system architecture, and in Section 5 we discuss some results and observations we have made during these two years. The final section includes some concluding remarks and ideas for future work.

2 Student's point of view

Students submit their solutions to Scheme-robo via e-mail. The system automatically responds with a receipt containing a copy of the submission and a set of points and comments for the submitted exercises.

Currently Scheme-robo responds within about ten minutes. While short response times are usually preferred, in this case an instant response would effectively encourage the students to debug their code in the Scheme-robo system by trial and error. Ten minutes was selected to encourage the students to test their code themselves before submitting it.

We have a fixed deadline for finishing each set of exercises, and the students may submit a solution for each exercise up to 20 times. In practice, most students get it right within the first few tries.

New exercises are automatically sent to the students via e-mail. The students can also request copies of the exercises and the report of their current points by sending e-mail to the Scheme-robo system.

3 Teacher's point of view

To add a new exercise to the Scheme-robo system, the teacher needs to write a configuration file that describes how the solutions are assessed. The content of the file depends on the type of assessment. Usually the files are about 20 to 100 lines long and contain test runs, model solutions and/or other information.

Scheme-robo assesses solutions in several ways. The main method is to execute the code using test runs specified by the teacher and to examine the return values. The system can also do some analysis on the structure of the submitted code, e.g., to make sure that it conforms to a skeleton given in the problem statement. In addition, the system can check for specific keywords in the code. Finally, simple pattern matching based on regular expressions is used in exercises where students do not write code. For example, the problem might ask for a set of numbers or a complexity figure (e.g. O(N)), or it could simply be a multiple-choice question.

Each of these assessment methods has its own type of configuration, which is described in more detail below. Moreover, another configuration file can be included into the current one. This feature is used for combining different methods, for example when analysing code structure before executing test runs. It is also used for sharing parts of the configuration between different exercises, e.g., a set of test runs for all exercises having to do with a particular topic.

To help in testing the configuration files, we have written a set of simple scripts that automatically check whether our model answers are passed by the configuration files.

The Scheme-robo system gives points for each executed test. Points from individual tests are summed together to get the total points. There is also a possibility to give a fail value for the number of points, which means that the exercise is immediately rejected without executing further tests. In practice, we have assessed the exercises simply as passed or failed.

3.1 Test runs

The main method of assessing student code is to execute test runs and examine their results. All test runs are independent of each other.

Most of the test runs are "fixed"; that is, the expressions are given by the teacher and tested with every student. For example, we can test whether the Scheme expression (fact 5) returns 120, which is the factorial of 5, calculated using a fact procedure submitted by the student.

A general problem in automatic assessment is the differences in wording and layout in students' answers. For example, one program could print out "The factorial is 120." while another could say "5! = 120". We are able to avoid this problem by looking at return values from procedures instead of their output strings. Moreover, the students are also spared from the pedagogically unimportant effort of writing output routines.
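As an illustration, using the test-run notation shown in the example of Section 3.4, a fixed test for the factorial exercise above could be written roughly as follows; the point value and the comment text are hypothetical, not taken from an actual configuration file:

(test (fact 5)
  (r 120 1)
  (else fail "*expr* doesn't work"))

Here the r clause gives one point when the call returns 120, and any other result rejects the solution with a comment identifying the failing call.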

In addition to fixed tests, it is possible to introduce randomness into the tests. Consider, for example, an exercise where students implement a sorting algorithm. We can specify a test that generates five random lists of numbers, sorts them using the student's code, and compares the result to what the model solution returns for the same inputs.¹
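The configuration syntax for randomised tests is not shown here, but conceptually such a test behaves like the following plain Scheme sketch, in which student-sort and model-sort are hypothetical names standing for the submitted procedure and the model solution (random is not part of R5RS, but most Scheme implementations provide it):

;; Build a list of n pseudo-random integers below 100.
(define (random-list n)
  (if (= n 0)
      '()
      (cons (random 100) (random-list (- n 1)))))

;; Run five randomised comparisons between the student's sorting
;; procedure and the model solution; return #t only if they agree
;; on every generated input.
(define (run-random-sort-tests)
  (let loop ((i 0) (all-ok #t))
    (if (= i 5)
        all-ok
        (let ((input (random-list 10)))
          (loop (+ i 1)
                (and all-ok
                     (equal? (student-sort input)
                             (model-sort input))))))))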

Many exercises require the use of code given in the textbook. We therefore have the facility of including a fixed set of definitions to be executed before the student's code when running the test runs. This feature is useful for various support procedures, or when students need to modify large sections of code and it is more convenient to submit only the modified procedure definitions.

3.2 Testing program structure

Keyword search. For some exercises, we want to disallow the use of certain Scheme primitives. For example, there is an exercise to reimplement the reverse primitive that reverses a list. A solution that uses the same primitive to implement it should naturally not be accepted.
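Using the keyword-check notation shown later in Section 3.4, such a restriction could be expressed roughly as follows; the comment text is hypothetical:

(if (keyword reverse)
    (fail "The built-in reverse must not be used.")
    (0))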

The list of keywords to be rejected is set up in the configuration file. This is practical, because Scheme does not have a large set of primitives. For instance, we can require a purely functional implementation of an exercise simply by disallowing the primitives set!, set-car! and set-cdr!.

Analysis of program structure. In addition to keyword search, Scheme-robo can also do some simple analysis on the structure of student code. Scheme or Lisp code is trivially reduced into a list structure which represents a kind of abstract syntax tree. This tree can be examined to look for particular subpatterns or for a specific general structure.

In practice, this feature has mostly been used to check whether the student has followed a mandatory skeleton given in the problem statement.

¹ A model solution is actually needed only if there are random tests.


Detecting plagiarism. Plagiarism detection is important when assessing exercises automatically. Some preprocessing of the submitted code is carried out before this: individual solutions are reduced to a kind of abstract syntax tree, variable names are removed and, for example, definitions and arguments of commutative primitives are sorted into a particular order. The resulting trees are finally compared to each other.

We observed that our exercises are too small for detecting plagiarism from individual exercises. Therefore the system compares all exercises, looking for pairs of students that have many similar solutions. A plagiarism estimate is given for each pair of students, and suspicious pairs are examined manually.
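As a rough sketch of this kind of normalisation (not the actual Scheme-robo implementation), one can replace every symbol that is not a known keyword or primitive by a generic marker before comparing the trees; the real system additionally sorts the arguments of commutative primitives and compares all pairs of students over all exercises:

;; Keywords and primitive names that are kept as-is; the list is
;; illustrative, not the one used by Scheme-robo.
(define keywords
  '(define lambda if cond else let quote
    + - * / = < > car cdr cons null? eq?))

;; Replace every non-keyword symbol by a generic marker, so that merely
;; renaming variables or procedures does not hide a copied solution.
(define (normalise expr)
  (cond ((and (symbol? expr) (not (memq expr keywords))) 'VAR)
        ((pair? expr)
         (cons (normalise (car expr)) (normalise (cdr expr))))
        (else expr)))                  ; numbers, strings, booleans, ()

;; Two submissions are flagged as suspiciously similar when their
;; normalised trees are equal.
(define (similar? code1 code2)
  (equal? (normalise code1) (normalise code2)))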

3.3 Feedback to the student

The student is given various kinds of feedback on his or her solution. Each test has a set of possible outcomes given in the configuration file, for example a set of possible answers for a test run. The configuration file specifies points and comments for each outcome.

The comments are thus given on the basis of individual tests. In practice, we have mostly used them as error messages: the most common comments identify the test run that failed. Comments could also be used as warnings ("this didn't work, but that's ok") or as encouragement (when the student's solution includes something more than what was required in the problem statement).

3.4 An example

One of the first exercises on our course has been to write a procedure to compute the cube root of its argument using a method given in the textbook (exercise 1.8 of [1]).

The exercise requires students to use internal definitions (i.e., block structure). This is checked via structural analysis. In the following configuration file, ?? represents any symbol or list structure, and ??* zero or more such. The solution is rejected if internal definitions are not found; otherwise one point is given.

(if (structure
     (??*
      (define (cube-root ??)
        ??*
        (define (?? ??*) ??*)
        ??*)
      ??*))
    (1)
    (fail "No internal definitions."))

The Scheme primitive expt must not, of course, be used in the solution. This is tested by the following code, which either rejects the solution or gives 0 points.

(if (keyword expt)
    (fail)
    (0))

Test runs can be specified as follows. Calculating the cube root of 125 should return 5.0; the configuration below gives one point for this, and otherwise rejects the solution with a comment identifying the procedure call that did not work.

(test (cube-root 125)
  (r 5.0 1)
  (else fail "*expr* doesn't work"))

This is the essence of what is necessary for a configuration file. The configuration actually used for the cube-root problem contains some more test runs (including random ones); it has 29 lines, plus empty lines and comments. Creating and testing the configuration for a new exercise usually takes less than an hour for these kinds of small exercises.

4 Architecture

Scheme-robo reuses the core of the TRAKLA system [5] to handle the orthogonal tasks of course administration: handling the submitted answers, keeping track of points, calculating statistics, etc. The assessment itself is implemented as a "checker module" for TRAKLA. The module receives as input a submitted answer to a single exercise and returns a number of points as its result. Short configuration files are used to specify test runs and other information specific to each individual exercise.

4.1 Executing student code

Student code is executed in a safe "sandbox", a metacircular Scheme interpreter specifically created for this purpose. The interpreter was based on the analysing metacircular Scheme interpreter from [1], and it has been extended to implement most of the Scheme language specification [6].

The interpreter has some special features that distinguish it from a run-of-the-mill Scheme interpreter. In order to avoid infinite loops in student code, it includes a count of the number of evaluations done when running a particular test; when this count reaches an exercise-specific maximum, the submitted solution fails. The same count can also be used for very coarse complexity analysis. For example, we can measure whether an algorithm runs in approximately linear time by specifying a limit for the number of evaluations that is difficult to achieve with, for example, an O(N²) algorithm.
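As a minimal sketch of the idea (not the actual Scheme-robo interpreter), the following metacircular evaluator for a tiny Scheme subset (numbers, variables, quote, if, lambda and applications) counts evaluation steps and aborts when a hypothetical per-exercise limit is exceeded; error is not part of R5RS but is provided by most implementations:

(define eval-limit 10000)        ; hypothetical per-exercise maximum
(define eval-count 0)

;; Evaluate expr in an association-list environment, counting every
;; evaluation step and failing the submission when the limit is hit.
(define (safe-eval expr env)
  (set! eval-count (+ eval-count 1))
  (if (> eval-count eval-limit)
      (error "evaluation limit exceeded -- infinite loop or too slow?")
      (cond ((number? expr) expr)
            ((symbol? expr) (lookup expr env))
            ((eq? (car expr) 'quote) (cadr expr))
            ((eq? (car expr) 'if)
             (if (safe-eval (cadr expr) env)
                 (safe-eval (caddr expr) env)
                 (safe-eval (cadddr expr) env)))
            ((eq? (car expr) 'lambda)
             (list 'closure (cadr expr) (caddr expr) env))
            (else
             (apply-proc (safe-eval (car expr) env)
                         (map (lambda (e) (safe-eval e env))
                              (cdr expr)))))))

;; Apply either a closure created above or a host primitive.
(define (apply-proc proc args)
  (if (and (pair? proc) (eq? (car proc) 'closure))
      (safe-eval (caddr proc)
                 (extend-env (cadr proc) args (cadddr proc)))
      (apply proc args)))

(define (lookup var env)
  (cond ((null? env) (error "unbound variable" var))
        ((eq? (caar env) var) (cdar env))
        (else (lookup var (cdr env)))))

(define (extend-env vars vals env)
  (if (null? vars)
      env
      (cons (cons (car vars) (car vals))
            (extend-env (cdr vars) (cdr vals) env))))

For instance, (safe-eval '((lambda (x) (* x x)) 6) (list (cons '* *))) returns 36 after seven counted evaluation steps, whereas a non-terminating submission would hit the limit and fail.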

The customised metacircular interpreter provides a safe sandbox, because we did not include Scheme features for file I/O or other things that might compromise security when untrusted student code is executed. These features have not been necessary in practice in our exercises.

5 Observations from using the system

The automatic assessment has worked quite well, and we have been able to assess all small exercises automatically.² In practice, the system has been stable and has required very little manual administration.

Most of the problems that students have reported with Scheme-robo are due to additional non-standard features of the Scheme implementation we use on the course. The problems have been relatively minor, and we have been able to solve them by adding support for such features.

In addition to assessing the exercises, Scheme-robo keeps extensive logs of its operation. This data is useful for monitoring the progress of the students and, for example, the difficulty of individual exercises. On the fall 1999 course, we divided our exercises into five "rounds", so there were deadlines every two weeks. The logs clearly confirmed the assumption that many students do exercises just before deadlines. For instance, about 1000 exercises per day were submitted before the first deadline, and only about 60 per day just after it.

² There is also a larger programming project on the course, which is assessed manually.


5.1 The students' opinion

We had around 350 students on our course in the autumn term 2000. The exercises were divided into five two-week rounds, each having 12 exercises, 4 of which were mandatory. The students completed, on average, 5 out of the 12 exercises. The voluntary exercises gave extra points for the exam.

In October 2000, we asked the students to give feedback on the automatic assessment system. Out of 229 respondents, 80% thought that automatic assessment in general is a good or excellent idea. The Scheme-robo system was also praised: only 10% of the students thought that it assessed exercises badly. The rest thought that the Scheme-robo system assessed exercises moderately well (41%), well (44%) or excellently (6%).

We also asked about an area that we knew needed improvement: only 45% of the students thought that the error messages (comments) given by Scheme-robo are adequate or good.

6 Discussion

There are some things that need to be taken into consideration when creating exercises that are automatically assessed. First of all, exercises must be well specified, down to the level of specifying an order for function arguments, etc.

It is necessary to solve new exercises before one can be reasonably certain that the automatic assessment is done correctly. This is, of course, a good thing to do in any case, but it can often be postponed when assessing exercises manually.

The system is by no means complete. Better error messages are the most important area of improvement. In addition, automatic assessment of coding style would be a valuable aid. And finally, we should have a means for tailoring the exercises to be different for each student, as in TRAKLA [5]. But even now, Scheme-robo has proved its value as an important tool with great pedagogical value. Without it, we could not assess the exercises and give the students enough feedback on such a large course with 300 students. Moreover, no reasonable human resources would be enough to give feedback so quickly and to allow the students to resubmit their solutions and learn from their errors in this way.

We use Scheme on our course, but this approach to automatic assessment should also work well with other programming languages. Testing individual procedures instead of complete programs (thus avoiding the problems caused by I/O, as discussed above) is possible in any language with an interpreter (perhaps even in a traditional compiled language, if the assessment system writes a main() function to do the test runs). We used a custom-made metacircular interpreter, but even a normal run-of-the-mill interpreter could be used (for example, as a subprocess of the assessment system). A safe sandbox for student code is more difficult to implement in this case, but it could be done by, e.g., preanalysing the code and controlling the execution time, or by using the safety features of Java. It would be interesting to see whether this approach to automatic assessment would work in Java.

7 Acknowledgements

We thank Kenneth Kronholm and Timo Lilja for writing major parts of the Scheme-robo system.

References

[1] Abelson, H., Sussman, G. J., and Sussman, J.: Structure and Interpretation of Computer Programs. 2nd Edition. ISBN 0-262-51087-1, MIT Press, 1996.
[2] Benford, S., Burke, E., Foxley, E., Gutteridge, N., and Mohd Zin, A.: Ceilidh: A Course Administration and Marking System. Proceedings of the International Conference of Computer Based Learning, Vienna, 1993.
[3] English, J., and Siviter, P.: Experience with an Automatically Assessed Course. Proceedings of the 5th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, pp. 168-171, Helsinki, Finland, 2000.
[4] Jackson, D.: A Semi-Automated Approach to Online Assessment. Proceedings of the 5th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, pp. 164-167, Helsinki, Finland, 2000.
[5] Korhonen, A., and Malmi, L.: Algorithm Simulation with Automatic Assessment. Proceedings of the 5th Annual SIGCSE/SIGCUE Conference on Innovation and Technology in Computer Science Education, pp. 160-163, Helsinki, Finland, 2000.
[6] Kelsey, R., Clinger, W., and Rees, J. (eds.): Revised⁵ Report on the Algorithmic Language Scheme. ACM SIGPLAN Notices, 33(9), October 1998.
[7] Martin, E.: Schools Using Scheme. http://www.schemers.com/schools.html
