Plagiarism Monitoring and Detection -- Towards an Open Discussion

Edward L. Jones
Computer Information Sciences
Florida A & M University
Tallahassee, Florida
Outline
What is Plagiarism, and Why Address It
Plagiarism Detection & Countermeasures
A Metrics-Based Detection Approach
Extending the Approach
Conclusions & Future Work
Why Tackle Plagiarism?
Plagiarism undermines educational objectives
Failure to address it sends the wrong message
A non-contrived ethical issue in computing
Plagiarism is hard to define
Plagiarism is costly to pursue/prosecute
An interesting problem for tinkering
What is Plagiarism?
“use of another’s ideas, writings or inventions as
one’s own” (Oxford American Dictionary, 1980)
Shades of Gray
– Theft of work
– Gift of work
– Collusion
– Collaboration
– Coincidence
Intent to Deceive
How is it Detected?
By chance
– Anomalies
– Temporal proximity when grading
Automation methods
– Direct text comparison (Unix diff)
– Lexical pattern recognition
– Structural pattern recognition
– Numeric profiling
Plagiarism Concealment Tactics
None
Change comments
Change formatting
Rename identifiers
Change data types
Reorder blocks
Reorder statements
Reorder expressions
Superfluous code
Alternative control structures
Prosecution -- DA in the House?
Course syllabus broaches the subject
– Concrete definition generally lacking
– Sense of “we’ll know it when we see it”
N? Tolerance Policy
Investigation Stage
Prosecution Stage
Missed opportunity to teach?
An Awareness Approach
Monitor closeness of student programs
– Objective measures
– Automated
Post anonymous closeness results in public
– Nonconfrontational awareness
– “A word to the wise … “
Benchmark student behavior
– Establishing thresholds
– Effects of course, language
Closeness Measures -- Physical
Program 1: (lines1, words1, characters1)
Program 2: (lines2, words2, characters2)
Euclidean Distance between the two profiles
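The physical closeness measure above can be sketched in a few lines. This is a minimal illustration, not the author's tool: it counts lines, words, and characters the way the Unix wc command does, then takes the Euclidean distance between two such profiles.

```python
import math

def physical_profile(source: str):
    """Physical profile (lines, words, characters), as Unix wc reports them."""
    lines = source.count("\n")     # wc counts newline characters
    words = len(source.split())    # whitespace-delimited tokens
    chars = len(source)
    return (lines, words, chars)

def euclidean_distance(p, q):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
```

Two programs with nearly identical profiles yield a small distance, which is what the closeness ranking exploits.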
Closeness Measures -- Halstead
Program 1: (length1, vocabulary1, volume1)
Program 2: (length2, vocabulary2, volume2)
Euclidean Distance between the two profiles
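The Halstead profile derives from operator and operand counts: length N = N1 + N2 (total tokens), vocabulary n = n1 + n2 (distinct tokens), and volume V = N * log2(n). The sketch below assumes the token streams have already been produced by a language-aware tokenizer (not shown); the tokenizer is the expensive part alluded to on the next slide.

```python
import math

def halstead_profile(operators, operands):
    """Halstead profile (length, vocabulary, volume) from token streams.

    operators/operands: lists of tokens from a language-aware tokenizer
    (hypothetical here) that skips comments and white space.
    """
    N = len(operators) + len(operands)             # length: total tokens
    n = len(set(operators)) + len(set(operands))   # vocabulary: distinct tokens
    V = N * math.log2(n) if n > 0 else 0.0         # volume
    return (N, n, V)
```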
Comparison of Measures
Physical profile ==> weight test
– Simple/cheap to compute (Unix wc command)
– Sensitive to character variations
Halstead profile ==> content test
– More complex/expensive to compute
– Ignores comments and white space
– Sensitive only to changes in program content
Detection effectiveness vs. plagiarism tactic
Closeness Computation
Normalization
– Establish upper bound for comparison (1.414)
– Distance computed on normalized (unit) vectors
Normalization I -- Self normalization
– p = (a, b, c) ==> (a/L, b/L, c/L)
– Largest component dominates
Normalization II -- Global scaling
– p = (a, b, c) ==> q = (a/aMAX, b/bMAX, c/cMAX)
– Self normalization applied to q
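The two normalization schemes above can be sketched as follows. This assumes L in Normalization I is the Euclidean norm of the profile (consistent with the slide's mention of unit vectors); Normalization II first divides each component by its maximum across all programs, then self-normalizes the result.

```python
import math

def self_normalize(p):
    """Normalization I: scale the profile to a unit vector.
    The largest component dominates the resulting direction."""
    L = math.sqrt(sum(x * x for x in p))
    return tuple(x / L for x in p) if L else p

def global_scale(p, maxima):
    """Normalization II: divide each component by its maximum over all
    programs, then apply self-normalization to the scaled profile q."""
    q = tuple(x / m for x, m in zip(p, maxima))
    return self_normalize(q)
```

After either scheme, all profiles lie on the unit sphere, so pairwise distances are bounded (hence the 1.414 upper bound for comparison).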
Distribution Of Closeness Values
[Figure 2. Distribution of Halstead Closeness: closeness measure (0.00000 to 0.04500) plotted for roughly 500 student program pairs]
Closeness Distribution
Closeness values vary by assignment
Programming language may lead to clustering at the lower end of the spectrum
Reuse of modules leads to clustering at the lower end of the spectrum
No a priori threshold pinpoints plagiarism
All measures exhibit these behaviors
Suspect Identification
Collaboration Suspects (5th Percentile)

Rank  Closeness   student1  student2
1     0.00000000  alpha     alpha
2     0.00000652  alpha     beta
3     0.00026963  beta      gamma
4     0.00026981  alpha     gamma
5     0.00031262  gamma     epsilon
6     0.00048815  sigma     delta
7     0.00049825  alpha     epsilon
8     0.00050169  beta      epsilon
9     0.00066481  gamma     theta
10    0.00073158  beta      theta
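A suspect list like the one above might be produced by ranking every pair of programs by closeness and keeping the lowest tail. This is an illustrative sketch (function names and the percentile cutoff are assumptions, not the paper's tool): profiles is a dict mapping student name to a normalized profile.

```python
import itertools
import math

def rank_pairs(profiles):
    """Rank all student pairs by closeness (smaller = more similar).

    profiles: dict mapping student name -> normalized profile tuple.
    """
    dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    pairs = [(dist(p1, p2), s1, s2)
             for (s1, p1), (s2, p2) in itertools.combinations(profiles.items(), 2)]
    return sorted(pairs)

def suspects(profiles, percentile=5):
    """Pairs whose closeness falls in the lowest `percentile` of all pairs."""
    ranked = rank_pairs(profiles)
    cutoff = max(1, len(ranked) * percentile // 100)
    return ranked[:cutoff]
```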
Independence Index
Student Independence Indices

Index  Student
1      alpha
2      beta
3      gamma
5      epsilon
6      sigma
6      delta
9      theta

Index = position at which student debuts on Closeness List
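Given a closeness-ranked list of pairs, the independence index as defined above is each student's first appearance on that list. A minimal sketch (the input format is assumed: tuples of closeness and the two student names, already sorted):

```python
def independence_index(ranked_pairs):
    """Independence index: the rank at which each student first appears
    on the closeness-ordered pair list (a higher index suggests more
    independent work).

    ranked_pairs: sorted list of (closeness, student1, student2) tuples.
    """
    index = {}
    for rank, (_, s1, s2) in enumerate(ranked_pairs, start=1):
        for s in (s1, s2):
            index.setdefault(s, rank)  # record only the first (debut) rank
    return index
```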
Preponderance of Evidence
Historical Record of Student Behavior
– Collaboration/partnering
– Independence indices
Profile and analyze other artifacts
– Compilation logs
– Execution logs
Another Approach
Make student demonstrate familiarity with
submitted program
– Seed errors into program
– Time limit for removing error and resubmitting
Holistic approach
– Intentional, not accidental
Conclusions
We can do something about plagiarism -- the first step is to develop eyes and ears
Simple metrics appear to be adequate
Tools are essential
Sophistication is not as necessary as automation
Students are curious to know how they compare with other students
On-Going & Future Work
Complete the toolset
– Student Independence Index
Incorporate other artifacts
– Compilation logs
– Execution logs
Integrate into Automated Grading
Disseminate Results
– Package tool as shareware