Assessment and Checking of SQL: Detecting and Preventing Plagiarism Dr. Gordon Russell, Andrew...

Preview:

Citation preview

Assessment and Checking of SQL: Detecting and Preventing Plagiarism

Dr. Gordon Russell, Andrew Cumming.

Napier University, Edinburgh, Scotland.

TLAD 2005, Sunderland, UK.

Introduction• We have an ILE for teaching Database modules.• Part of this system teaches SQL using an interactive system.• It allows users to:

– Follow about 70 problem-solving exercises.– Each exercise needs SQL statements to solve them.– The student can enter and run SQL statements.– The SQL is automatically assessed and feedback given.– Students use this feedback to learn

• The site also offers incremental assessments to students, with automatic feeback.

Tutorial Index

Accuracy

• The accuracy measure used is, put simply, how similar is the output of executing the student’s SQL to the SQL I wrote as the sample solution.

• The basic algorithm takes the table produced by the student’s SQL statement, and then compares each cell of that table against the sample solution.– If the cell is in both tables then score+1– If the cell is not in the sample solution score-1– Divide the final score by the number of cells in the

biggest of the sample solution table and the student’s SQL table.

Example: Accuracy 4/9 or 45%

SELECT *

FROM people

ID Lastname

1 Russell

7 Kennedy

ID Firstname Lastname

1 Gordon Russell

4 Andrew Cumming

7 Jessie Kennedy

SELECT ID, Lastname

FROM people

WHERE ID IN (1,7)

Student’s Attempt Correct Answer

Assessments

• When students complete a “tutorial” they can take an assessment.

• There are 4 assessments to complete.• Each assessment has 2 questions.• The assessments can be taken at any time, and

thus are unsupervised.• It is important that we are confident that

assessment results are measuring that student’s SQL ability…

System Evolution

• Many changes have been made to the system to– Improve student results– Manage a lack of student attendance at

tutorials– Keep good student interested– Give an impression of one-to-one attention– Removing the need to plagiarise

Why Plagiarise

• An online environment suffers from a number of issues:– Effort: “I am just cheating a machine”– Instant feedback: “It doesn’t say 100% yet”– Peers: “I got 100%, what did you get?”– Time: “Not enough time left to do it myself”

What is plagiarism?

• Here is simply refers to two or more students who have submitted SQL which has a common electronic source.– An emailed solution– Shared answer then distributed

• Some plagiarists have verbatim copies, while others spend time disguising the copies.

Detecting Plagiarism

• We have written a number of algorithms to detect similarity between SQL statements.

• Keeping algorithms separate aids understanding.• Combining the results of 2 algorithms is problematic.• So detection of likely plagiarists is done in first phase, result

in an XML report.• XML is processed and analysis made in phase 2.• Phase 2 produces “high confidence” plagiarist identification.

Equality Testing

• EQUALITY was our first detection algorithm• It looks at the SQL as a single string, and uses

string equality as the test.• It returns a number, with 10 being “they are

identical”, down to (currently) 4 “identical ignoring spaces, brackets, and case”.

• This test usually is all that’s needed.• It really detects “copy and paste” electronic

copying.

Shuffle Match

• SHUFFLE is our “intelligent” matcher.• It ignores all student-generated formatting.• Reformats SQL using a standard format.• It tolerates switching lines around.• It tolerates moving operands around an operator.

WHERE A = BWHERE B = A

Histogram Analysis

• There are only so many ways of writing SQL.• Students argue that copies are just co-incidence.• HISTOGRAM produces evidence to help stop this

argument.• It detects things irrelevant to the SQL standard.

– Trailing spaces at the end of a line– Use of spaces around commas and operators– Different cases of operands or clauses

Histogram

SELECT a,b FROM c where a=(select b from hia);

WHERE a= b AND c= d

CREATE View unique123

<equality quality='9' />

select distinct(d.d_name)from dressmaker d, construction c, material m, \_ dress_order o, order_line olwhere d.d_no = c.maker and c.order_ref = ol.order_ref and c.line_ref = ol.line_no and ol.order_ref = o.order_no and ol.ol_material = m.material_no and c.finish_date is not null and m.fabric = 'Silk' and o.completed = 'Y'

select distinct(d.d_name)from dressmaker d, construction c, material m, \_ dress_order o, order_line olwhere d.d_no = c.maker and c.order_ref = ol.order_ref and c.line_ref = ol.line_no and ol.order_ref = o.order_no and ol.ol_material = m.material_no and c.finish_date is not null and m.fabric = 'Silk' and o.completed = 'Y'

<equality quality='8' /> <signature op='0' comma='1' trail='0' /> <histogram mode='all'> <word str='grm' left='3' right='3' /> <word str='mtr' left='2' right='2' /> <word str='cost' left='3' right='3' /> <word str='Cost' left='1' right='1' /> </histogram>

SELECT distinct g.description, (g.labour_cost + q.quantity * m.cost) "Total \_ Cost" FROM garment g, material m, quantities q WHERE q.size_q = 8 AND m.fabric = 'Cotton' AND g.style_no = q.style_q AND (g.labour_cost + q.quantity * m.cost) > (\_ SELECT AVG(grm.labour_cost + (qty.quantity * mtr.cost)) FROM material mtr, quantities qty, garment grm WHERE qty.size_q = 8 AND grm.style_no = qty.style_q)

SELECT distinct g.description, (g.labour_cost + q.quantity * m.cost) "Total \_ Cost" FROM garment g, material m, quantities q WHERE q.size_q = 8 AND m.fabric = 'Cotton' AND g.style_no = q.style_q AND (g.labour_cost + q.quantity * m.cost) > (\_ SELECT AVG(grm.labour_cost + (qty.quantity * mtr.cost)) FROM material mtr, quantities qty, garment grm WHERE qty.size_q = 8 AND grm.style_no = qty.style_q)

<shuffle score='0.888888888888889' restarts="0" /> <signature op='0' comma='1' trail='1' /> <histogram mode='all'> <word str='labour_cost' left='2' right='2' /> <word str='TRAIL:1' left='2' right='2' /> <word str='LABOUR_COST' left='1' right='1' /> </histogram>

select a.description, b.fabric, b.colour,\_ b.pattern from garment a, material b,quantities c, \_order_line d where a.style_no = c.style_qand c.style_q = d.ol_styleand d.ol_material = b.material_noand (a.labour_cost+c.quantity*b.cost) > (select g.labour_cost from garment g where g.LABOUR_COST > 80);

select a.description, b.fabric, b.colour,\_ b.pattern from garment a, material b,quantities c, \_order_line d where a.style_no = c.style_qand c.style_q = d.ol_styleand d.ol_material = b.material_noand (a.labour_cost+c.quantity*b.cost) > (select g.labour_cost from garment g where g.LABOUR_COST >= 80);

<shuffle score='0.863636363636364' restarts="0" /> <signature op='0' comma='0' trail='1' /><histogram mode='all'> <word str='TRAIL:1' left='1' right='1' /></histogram>

select fabric,round(sum(quantity*cost*1.15),2) from order_line join quantities on(\_ ol_style=style_q and ol_size=size_q) join \_ material on (material_no=ol_material),\_ constructionwhere line_no=line_refand order_line.order_ref=construction.order_ref and FINISH_DATE between '01 jan 2002' and '31 dec 2002' group by fabric;

select fabric,round(sum(quantity*cost*1.15),2)as constructionfrom order_linejoin quantities on(ol_style=style_q and ol_size=size_q) join material on (material_no=ol_material),\_ construction

where line_no=line_refand order_line.order_ref=construction.order_ref and FINISH_DATE between '01 jan 2002' and '31 dec 2002' group by fabric;

Understanding Plagiarism

• With random question banks it is hard to understand plagiarism.

• Everyone doing the same questions it is easy!• Can they copy off of their lab buddy?• What about a friend of a friend?• Do students really have friends-of-friends-of-friends.• Increasing question bank size should help, but this is

expensive, and maintaining equal question difficulty is hard.• Definitely we only want the minimum bank size which

prevents plagiarism oppertunities.

System in 2002

• Students did the assessments when they liked.• They had 15 questions, and did any 5 they wanted

to.• Q1-5 12% each, Q6-10 16% each, Q11-15 20%.• Argument was that this would allow weak students

to do minimum, and good students to shine, without anyone feeling they failed something.

Collusion in 2002

Moving to 2003

• Target: Eliminate big groups.• Switch to random banks.• Force student to do tutorials before assessments.• Have 4 assessments (1 per tutorial).• Each assessment has 2 questions.• Each student will use use 40% of the bank

contents.

Collusion in 2003

Into 2004

• Target: Improve student moral• Students hated failing to get Ass 3 and 4.• Change difficulty level of questions slightly.• Now students get 1 question from previous

difficulty level in every assessment.• Result: Students do more assessments!

Collusion in 2004

Summary

• Plagiarism groups are back!• However, the nodes are quite far apart.• Big blob is an off-campus “support group”.• 2nd biggest is a local support network.• Mostly each student copied only from 1 other.• A slightly bigger bank size may fix this.• We are going to use a completely new bank in 2005/6 to

analyse the impact of answer sharing between year groups.

StatisticsCohort

2002 2003 2004

Total Number of students 312 301 263

Average number of questions attempted

5.7 4.6 5.8

Average Score 60% 51% 64%

Plagiarised questions 208 120 138

Percentage of dishonest answers submitted

12% 9% 9%

Conclusion

• From the table plagiarism is actually holding steady.

• Student marks have increased.• Question attempts have increased too.• This data covers semester 1. In semester 2,

plagiarism has decreased as a result of students realising the effectiveness of the new plagiarism detector.

• 2005/6 will be the big test.