04/13/2023 1
Source-code Plagiarism
Presented by: Merin Paul
M.Tech CS-IS S1
Guide: Ms. Sangeetha Jamal, Dept. of Computer Science
Contents
• Introduction
• Types of Source-code Plagiarism
    • Textual Similarity
    • Functional Similarity
• Source Code Detection Algorithms
• Detecting Techniques
• Tools used for code-based plagiarism
• Conclusion
Introduction
Plagiarism in source-code files occurs when source code is copied and edited without proper acknowledgment of the original author.
Techniques for plagiarism: lexical changes and structural changes.
Lexical changes: changes that can be made to the source code without affecting the parsing of the program.
Introduction (contd.)
Structural changes: changes made to the source code that affect the parsing of the code and involve debugging the program.
Reasons for code copying:
• Code reuse
• Programmer limitations
• Coincidentally implementing the same logic
TYPES OF SOURCE CODE PLAGIARISM
Textual Similarity
Functional Similarity
Textual Similarity
Two individual source codes look similar based on their textual content.
Textual content means the words, letters, variable names, etc.
Three types: Type I, Type II and Type III.
Type I
The copied code fragment is the same as the original, with no modification other than changes to white space, comments and line layout.
Original fragment:
int a; // counter
// count five times
for(a = 0; a < 5; a++){
    printf("a = %d", a); // print value of a
}
return 0;
Type I (copied fragment)
int a;
/* Loop increasing of a and print a value of it */
for(a = 0; a < 5; a++){
    printf("a = %d", a);
}
return 0;
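A Type I copy can be checked mechanically by discarding exactly the things that are allowed to differ. The sketch below is a minimal illustration in Python, not the method of any particular tool: the function name `normalize_type1` is hypothetical, and the regular expressions assume C-style comments that do not occur inside string literals.

```python
import re

def normalize_type1(source: str) -> str:
    """Strip C comments and all white space so only the code characters remain."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)  # block comments
    source = re.sub(r"//[^\n]*", "", source)                    # line comments
    return re.sub(r"\s+", "", source)                           # spaces and line breaks

original = '''
int a; // counter
// count five times
for(a = 0; a < 5; a++){
    printf("a = %d", a); // print value of a
}
return 0;
'''

copied = '''
int a;
/* Loop increasing of a and print a value of it */
for(a = 0; a < 5; a++){
printf("a = %d", a);
}
return 0;
'''

# The two fragments differ only in comments and layout.
print(normalize_type1(original) == normalize_type1(copied))
```

Renaming a single variable is enough to defeat this check, which is why the remaining types need stronger normalization.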
Type II
Same as Type I, but also with modifications to variable names, function names and other user-defined identifiers.
Original fragment:
if(a > b){
    a = a - 1;
    b = b * a; // comment 1
}else{
    b = a; // comment 2
    a = 0;
}
Type II (copied fragment)
if(m > n)
{
    m=m - 5;
    n=n*m; //my comment 1
}
else
{
    n=m; //my comment 2
    m=0;
}
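One way to see through Type II edits is to replace every user-defined identifier and every number with a placeholder before comparing. The following Python sketch assumes a very small keyword list and a simplistic tokenizer; `normalize_type2`, `C_KEYWORDS` and the placeholders `ID` and `NUM` are names chosen for illustration only.

```python
import re

# Hypothetical, deliberately incomplete keyword list for the illustration.
C_KEYWORDS = {"if", "else", "for", "while", "return", "int", "char", "float", "double", "void"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def strip_comments(source):
    """Remove C block and line comments (assumes none appear inside strings)."""
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)

def normalize_type2(source):
    """Return the token list with identifiers mapped to ID and numbers to NUM."""
    tokens = []
    for tok in TOKEN_RE.findall(strip_comments(source)):
        if tok in C_KEYWORDS:
            tokens.append(tok)          # keywords are kept as they are
        elif tok[0].isalpha() or tok[0] == "_":
            tokens.append("ID")         # any user-defined name
        elif tok[0].isdigit():
            tokens.append("NUM")        # any integer literal
        else:
            tokens.append(tok)          # punctuation and operators
    return tokens

original = "if(a > b){ a = a - 1; b = b * a; // comment 1\n }else{ b = a; // comment 2\n a = 0; }"
copied = "if(m > n)\n{m=m - 5;\nn=n*m; //my comment 1\n}\nelse\n{n=m; //my comment 2\nm=0;\n}"

print(normalize_type2(original) == normalize_type2(copied))
```

Mapping all names to one placeholder is a blunt choice: it also equates fragments whose variables are used in different roles, so a match is a reason to look closer rather than proof of copying.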
Type III
The copied code fragment is altered by inserting or removing unnecessary statements.
Original fragment:
if(a > b) {
    a = a - 1;
    b = b * a;
}else {
    b = a;
    a = 0;
}
Type III (copied fragment)
if(a > b)
{
    a = a - 1;
    c = 0; // this statement is added
    b = b * a;
}
else
{
    b = a;
    a = 0;
}
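Because a Type III copy contains extra or missing statements, an exact comparison fails and a similarity score is needed instead. The sketch below uses Python's standard `difflib.SequenceMatcher` on lists of statements as a stand-in for a real clone detector; the helper names and the choice to compare whole lines are assumptions made for the illustration.

```python
from difflib import SequenceMatcher

def statements(source):
    """Split a fragment into stripped, non-empty lines."""
    return [line.strip() for line in source.splitlines() if line.strip()]

def statement_similarity(source_a, source_b):
    """Ratio in [0, 1]: twice the number of matching lines over the total line count."""
    matcher = SequenceMatcher(None, statements(source_a), statements(source_b), autojunk=False)
    return matcher.ratio()

original = """
if(a > b)
{
    a = a - 1;
    b = b * a;
}
else
{
    b = a;
    a = 0;
}
"""

# Same fragment with one inserted statement, as in the Type III example.
copied = original.replace("b = b * a;", "c = 0;\n    b = b * a;")

print(round(statement_similarity(original, copied), 3))
```

In practice the fragments would first be normalized as for Type I and Type II, and the score compared against a threshold chosen by the instructor.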
Functional similarity
It refers to code fragments that have the same semantics or functionality.
Fragment 1:
int i, j = 1;
for(i = 1; i <= VALUE; i++)
    j = j * i;
Fragment 2:
int factorial(int n)
{
    if(n == 0) return 1;
    else return factorial(n - 1) * n;
}
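Functionally similar fragments share little text, so a textual comparison will not flag them. One simple, imperfect check is to run both on the same inputs and compare the results. The sketch below ports the two fragments to Python for illustration; `same_behaviour` is a hypothetical helper, and agreement on a sample of inputs suggests equivalence without proving it.

```python
def factorial_iterative(value):
    """Python port of fragment 1: multiply 1..value in a loop."""
    j = 1
    for i in range(1, value + 1):
        j = j * i
    return j

def factorial_recursive(n):
    """Python port of fragment 2: recursive definition."""
    if n == 0:
        return 1
    return factorial_recursive(n - 1) * n

def same_behaviour(f, g, inputs):
    """True when f and g return equal results on every sampled input."""
    return all(f(x) == g(x) for x in inputs)

print(same_behaviour(factorial_iterative, factorial_recursive, range(10)))
```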
Source Code Detection Algorithms
• Text-based
• Token-based
• Parse tree-based
• PDG-based
• Metrics-based
• Hybrid approaches
CONTD..
Text based
• Find textual matches between two source codes.
• Simple and fast.
Token based
• Use a lexer to convert the program into tokens.
• Find matches in the token sequences.
• More robust to simple text replacements.
CONTD…
Parse Trees
• Build and compare parse trees.
• A parse tree contains the complete information about the source code.
• Tree comparison can normalize conditional statements.
Program Dependency Graphs (PDGs)
• Capture the control and data dependencies in a program.
• Allow higher-level equivalences to be located.
• More complex.
CONTD…
Metrics
• Capture 'scores' of code segments according to certain criteria.
• Metrics are simple to calculate.
• Can lead to false positives.
Hybrid
• Combination of two or more of the previous techniques.
Detecting Techniques
Detection via Lexical Similarities
• Lexical analysis takes source code and converts it into a stream of lexical tokens.
• The source code undergoes a series of transformations.
• Identifying reserved words, identifiers and numbers is beneficial for plagiarism detection.
CONTD…
int[] A = {1,2,3,4};
for(int i = 0; i < A.length; i++) {
    A[i] = A[i] + 1;
}

int[] B = {1, 2, 3, 4};
for(int j = 0; j < B.length; j++) {
    B[j] = B[j] + 1;
}
CONTD…
Both fragments produce the same token stream:
LITERAL_int LBRACK RBRACK IDENT ASSIGN LCURLY NUM_INT COMMA NUM_INT COMMA NUM_INT COMMA NUM_INT RCURLY SEMI
LITERAL_for LPAREN LITERAL_int IDENT ASSIGN NUM_INT SEMI IDENT LT IDENT DOT IDENT SEMI IDENT INC RPAREN LCURLY
IDENT LBRACK IDENT RBRACK ASSIGN IDENT LBRACK IDENT RBRACK PLUS NUM_INT SEMI
RCURLY
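The conversion from source text to a token stream can be sketched with a small regular-expression lexer. The Python code below is an illustration under assumptions: it covers only the symbols needed for the two Java fragments, the token names follow those shown on the slide, `PLUS` is an assumed name for the `+` operator, and characters it does not recognize are silently skipped.

```python
import re

# Order matters: INC must be tried before PLUS.
TOKEN_SPEC = [
    ("NUM_INT", r"\d+"),
    ("WORD", r"[A-Za-z_]\w*"),
    ("INC", r"\+\+"),
    ("PLUS", r"\+"),
    ("ASSIGN", r"="),
    ("LT", r"<"),
    ("LBRACK", r"\["), ("RBRACK", r"\]"),
    ("LCURLY", r"\{"), ("RCURLY", r"\}"),
    ("LPAREN", r"\("), ("RPAREN", r"\)"),
    ("COMMA", r","), ("SEMI", r";"), ("DOT", r"\."),
    ("SKIP", r"\s+"),
]
KEYWORDS = {"int", "for"}  # only the reserved words used in the example
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return the list of token names for a source fragment."""
    tokens = []
    for match in MASTER_RE.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":
            continue
        if kind == "WORD":
            # Reserved words keep their identity; every other name becomes IDENT.
            kind = f"LITERAL_{match.group()}" if match.group() in KEYWORDS else "IDENT"
        tokens.append(kind)
    return tokens

frag_a = "int[] A = {1,2,3,4};\nfor(int i = 0; i < A.length; i++) {\n    A[i] = A[i] + 1;\n}"
frag_b = "int[] B = {1, 2, 3, 4};\nfor(int j = 0; j < B.length; j++) {\n    B[j] = B[j] + 1;\n}"

print(tokenize(frag_a) == tokenize(frag_b))
print(" ".join(tokenize(frag_a)[:5]))
```

Since identifier names and spacing never reach the token stream, renaming `A` to `B` or `i` to `j` leaves the comparison unaffected.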
Detection via Parse Tree Similarities
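As a rough illustration of parse-tree comparison, the sketch below uses Python's built-in `ast` module on Python fragments, since the standard library has no C or Java parser. The class `Canonicalize` and the function `tree_fingerprint` are hypothetical names; identifiers are renamed in order of first appearance so that two trees with the same shape produce the same dump.

```python
import ast

class Canonicalize(ast.NodeTransformer):
    """Rename every identifier to v0, v1, ... in order of first appearance."""

    def __init__(self):
        self.names = {}

    def visit_Name(self, node):
        canonical = self.names.setdefault(node.id, f"v{len(self.names)}")
        return ast.copy_location(ast.Name(id=canonical, ctx=node.ctx), node)

def tree_fingerprint(source):
    """Dump the canonicalized parse tree; layout and comments are not part of it."""
    tree = Canonicalize().visit(ast.parse(source))
    return ast.dump(tree)

frag_a = "A = [1, 2, 3, 4]\nfor i in range(len(A)):\n    A[i] = A[i] + 1\n"
frag_b = "B = [1,2,3,4]\nfor j in range(len(B)):\n    B[j] = B[j] + 1  # add one\n"
frag_c = "A = [1, 2, 3, 4]\nfor i in range(len(A)):\n    A[i] = A[i] * 2\n"

print(tree_fingerprint(frag_a) == tree_fingerprint(frag_b))
print(tree_fingerprint(frag_a) == tree_fingerprint(frag_c))
```

Comparing whole dumps only detects trees that are identical after renaming; real tools compare subtrees so that partial copies are found as well.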
Detection via Metrics
• Calculate and compare attribute counts.
• Programs with similar attribute counts are potentially similar programs.
• Counts of operators and operands are typically used to construct attribute counts.
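Attribute counting can be sketched in a few lines. The Python example below assumes a simplified view in which names and numbers are operands and a small set of symbols are operators; the function name `attribute_counts` and the four-value profile are choices made for this illustration, and comments are assumed to have been removed beforehand.

```python
import re

# Names, integer literals, then two-character and one-character operators.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|--|[<>=!]=|[-+*/%<>=]")

def attribute_counts(source):
    """Return (distinct operators, distinct operands, total operators, total operands)."""
    operators, operands = [], []
    for tok in TOKEN_RE.findall(source):
        if tok[0].isalnum() or tok[0] == "_":
            operands.append(tok)
        else:
            operators.append(tok)
    return (len(set(operators)), len(set(operands)), len(operators), len(operands))

original = "a = a - 1; b = b * a;"
renamed = "m = m - 1; n = n * m;"
unrelated = "x = y * 2; y = y - x;"

print(attribute_counts(original))
print(attribute_counts(original) == attribute_counts(renamed))
# Different logic, identical counts: a false positive of the metric approach.
print(attribute_counts(original) == attribute_counts(unrelated))
```

The last comparison shows why equal counts only mark programs as potentially similar: the third fragment computes something else but has the same profile.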
Tools used for code-based plagiarism
JPlag
• Finds similarities among multiple sets of source code files.
• JPlag operates in two phases.
    • First phase: all programs to be compared are parsed and converted into token strings.
    • Second phase: the token strings are compared in pairs to determine the similarity of each pair.
• Because it compares token strings rather than raw text, it is more robust to simple text replacements.
• It supports Java, C#, C, C++ and natural language text.
CONTD..
MOSS (Measure Of Software Similarity)
• MOSS was developed in 1994 by Alex Aiken.
• It analyzes code written in languages such as C, C++, Python, Visual Basic, JavaScript, FORTRAN, Lisp and Ada.
• It is provided as an internet service that is given a list of source files to compare.
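MOSS is reported to build on the winnowing fingerprinting scheme of Schleimer, Wilkerson and Aiken listed in the references. The sketch below is a minimal Python rendering of that idea and not the service's actual implementation: the CRC32 hash, the parameters `k` and `w`, the white-space normalization and the Jaccard score are all assumptions made for the example.

```python
import zlib

def kgram_hashes(text, k):
    """Hash every substring of length k (CRC32 is a stand-in hash function)."""
    return [zlib.crc32(text[i:i + k].encode("utf-8")) for i in range(len(text) - k + 1)]

def winnow(text, k=5, w=4):
    """Select (hash, position) fingerprints: the minimum hash of each window of w hashes."""
    text = "".join(text.split()).lower()      # drop white space, ignore case
    hashes = kgram_hashes(text, k)
    selected = set()
    for start in range(len(hashes) - w + 1):  # texts with fewer than w hashes yield nothing
        window = hashes[start:start + w]
        smallest = min(window)
        # On ties, take the rightmost occurrence of the minimum.
        offset = w - 1 - window[::-1].index(smallest)
        selected.add((smallest, start + offset))
    return selected

def fingerprint_similarity(text_a, text_b, k=5, w=4):
    """Jaccard overlap of the selected hash values (positions are ignored)."""
    hashes_a = {h for h, _ in winnow(text_a, k, w)}
    hashes_b = {h for h, _ in winnow(text_b, k, w)}
    if not hashes_a or not hashes_b:
        return 0.0
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)

doc_a = "int total = 0; for(a = 0; a < 5; a++){ total += a; }"
doc_b = "for(a = 0; a < 5; a++){ total += a; } return total;"

print(fingerprint_similarity(doc_a, doc_b) > 0.0)
```

Any passage shared by both documents that is at least `w + k - 1` characters long after normalization contributes at least one common fingerprint, which is the guarantee that makes a small set of hashes sufficient.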
CONTD…
YAP (Yet Another Plague)
• Token-based system.
• YAP works in two phases.
    • The first phase generates a token file for each submission.
    • The second phase compares pairs of token files using a token-matching algorithm, the Running-Karp-Rabin Greedy-String-Tiling (RKR-GST) algorithm.
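The core of the second phase can be illustrated with plain Greedy String Tiling. The Python sketch below leaves out the Running-Karp-Rabin hashing that makes the real algorithm fast, so it is a slow reference version only; `min_match` and the similarity formula (twice the covered length over the combined length) are assumptions chosen for the example.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return tiles (start_a, start_b, length) of matching, non-overlapping token runs."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_match = min_match
        matches = []
        # Find the longest runs of equal, still unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k == max_match:
                    matches.append((i, j, k))
                elif k > max_match:
                    matches = [(i, j, k)]
                    max_match = k
        if not matches:
            break
        for i, j, k in matches:
            # Skip a match that overlaps a tile laid earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))
    return tiles

def tile_similarity(a, b, min_match=3):
    """Twice the number of covered tokens divided by the combined length."""
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2 * covered / (len(a) + len(b)) if a or b else 0.0

tokens_a = "if ( ID > ID ) { ID = ID - NUM ; ID = ID * ID ; }".split()
tokens_b = "if ( ID > ID ) { ID = ID - NUM ; ID = NUM ; ID = ID * ID ; }".split()

print(greedy_string_tiling(tokens_a, tokens_b))
print(round(tile_similarity(tokens_a, tokens_b), 3))
```

Because tiles may appear in any order, the approach also tolerates statements or whole functions being moved around, which a single longest-common-substring search would miss.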
Conclusion
• Plagiarism in programming assignments is an inevitable issue for most academics teaching programming.
• Plagiarism detection systems are built to support only a few languages.
• Most detection software checks submissions against a repository maintained within an organization.
• As the number of digital copies grows, the repository must be large, and the plagiarism detection software must be able to handle it.
Conclusion (contd.)
• Most popular plagiarism detection algorithms convert programs into token strings and compare them using string matching.
• The tokens of each document are compared pairwise to determine similar source-code segments between the files.
• String-matching systems are language-dependent: they handle only the programming languages supported by their parsers.
References
1) G. Cosma and M. Joy, "An Approach to Source-Code Plagiarism Detection and Investigation Using Latent Semantic Analysis," IEEE Trans. Computers, vol. 61, no. 3, pp. 379-391, March 2012.
2) Georgina Cosma, Mike Joy, Daniel White and Jane Yau, ICS, University of Ulster, 9th August 2007. http://www.ics.heacademy.ac.uk/resources/assessment/plagiarism/
3) Okiemute Omuta, "Electronic Source Code Plagiarism Detection," Computer Engineering Department, European University of Lefke, North Cyprus.
4) S. Schleimer, D. Wilkerson, and A. Aiken, "Winnowing: Local Algorithms for Document Fingerprinting," Proc. ACM SIGMOD Int'l Conf. Management of Data, pp. 76-85, 2003.
5) M.J. Wise, "YAP3: Improved Detection of Similarities in Computer Program and Other Texts," Proc. 27th SIGCSE Technical Symp., pp. 130-134, 1996.
Thank you!