Plagiarism Detection


INTRODUCTION

Plagiarism refers to the act of copying material (text, images, or code) without acknowledging the original source. It has become widespread in recent times: the growing volume of material available in electronic form and easy access to the Internet have made copying both extremely easy and tempting, opening ample opportunity for plagiarism to prosper. Nowadays plagiarism has turned into a serious problem for publishers, researchers, and educators.

Manual detection of plagiarism is difficult and time consuming because of the vast amount of content available. Plagiarism detection techniques now exist to help reveal it, and current research focuses on developing algorithms that can compare documents and detect plagiarism automatically.

REDUCING PLAGIARISM

The methods to fight plagiarism can be grouped into two classes: methods for plagiarism prevention and methods for plagiarism detection. Plagiarism prevention includes honesty policies and punishment systems for plagiarised work; plagiarism detection includes software tools that reveal plagiarism automatically. Each class has its own pros and cons. Plagiarism prevention is difficult to implement, and although its effect is long-term, it does not come immediately. Plagiarism detection methods, on the other hand, can be implemented in a shorter time, but their positive effect is momentary. We will discuss the second class of methods in detail.

Plagiarism Detection Methods

Plagiarism detection methods are usually based on comparison of two or more documents. This comparison can be either manual or software-assisted. Manual detection requires substantial effort and excellent memory, and is impractical when too many documents must be compared or the original documents are not available for comparison. Software-assisted detection allows vast collections of documents to be compared to each other, making successful detection much more likely. Culwin and Lancaster have defined plagiarism detection as a four-stage process: collection, analysis, verification, and investigation.

Collection stage: electronically collecting and pre-processing the submissions.
Analysis stage: submissions are compared with each other as well as with documents obtained from the web.
Verification stage: suspicious pairs of documents are investigated for possible disciplinary action.
Investigation stage: determining the extent of the alleged misconduct and deciding culpability.

Software-Based Detection Systems

Software systems designed to detect plagiarism implement either an external or an intrinsic detection approach.

External detection systems: compare a suspicious document with a reference collection (a set of documents assumed to be genuine).
Intrinsic detection systems: solely analyse the text and recognise changes in the unique writing style of an author as an indicator of potential plagiarism.

FINGERPRINTING

Fingerprinting is currently the most widely applied approach to plagiarism detection. This method forms representative digests of documents by selecting sets of substrings (k-grams) from them. A k-gram is a contiguous substring of length k; we divide a document into k-grams, where k is a parameter chosen by the user. The resulting sets represent the fingerprints, and their elements are called minutiae.

The steps to fingerprint a text are as follows:

Step 1: Remove irrelevant features such as spaces and punctuation marks.
Step 2: Fix a value of k and generate the k-grams of the resulting string.
Step 3: Hash the k-grams and select a particular subset of the hashes as the document's fingerprint.
Step 4: Report potential plagiarism if the reference document and the suspicious document share more minutiae than a threshold.

Example: generate the 4-grams of the string "My Name is abc xyz."

(a) String after Step 1: MyNameisabcxyz

(b) 4-grams of the string: MyNa yNam Name amei meis eisa isab sabc abcx bcxy cxyz

(c) Hash each of the 4-grams to obtain a sequence of hash values.

(d) Select a particular subset of the hashes (usually those that are 0 mod p) and check it against other documents' fingerprints for potential plagiarism.

Only a subset of the hashes should be retained as the document's fingerprint, for efficiency. One popular approach is to choose all hashes that are 0 mod p, for some fixed p. This is easy to implement and retains only 1/p of all hashes as fingerprints.
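The whole pipeline is small enough to sketch in code. The following is a minimal Python sketch of the four steps, not a production implementation: the choices k = 4, p = 4, and a 0.5 threshold are illustrative assumptions, and a truncated MD5 stands in for the rolling hash a real fingerprinting system would use.

```python
import hashlib
import re

def kgrams(text, k=4):
    """Steps 1 and 2: strip spaces and punctuation, then emit every
    contiguous substring of length k."""
    cleaned = re.sub(r"[^A-Za-z0-9]", "", text)
    return [cleaned[i:i + k] for i in range(len(cleaned) - k + 1)]

def fingerprint(text, k=4, p=4):
    """Step 3: hash every k-gram and keep only the hashes that are
    0 mod p, i.e. roughly 1/p of them, as the minutiae."""
    hashes = (int(hashlib.md5(g.encode()).hexdigest(), 16)
              for g in kgrams(text, k))
    return {h for h in hashes if h % p == 0}

def is_suspicious(reference, suspicious, k=4, p=4, threshold=0.5):
    """Step 4: flag the pair if the fraction of the suspicious
    document's minutiae also found in the reference exceeds the
    chosen threshold."""
    ref, sus = fingerprint(reference, k, p), fingerprint(suspicious, k, p)
    return bool(sus) and len(ref & sus) / len(sus) > threshold

print(kgrams("My Name is abc xyz."))  # ['MyNa', 'yNam', 'Name', ...]
print(is_suspicious("My Name is abc xyz.", "My Name is abc pqr."))
```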

A suspicious document is checked for plagiarism by computing its fingerprint and querying its minutiae against a pre-computed index of fingerprints for all documents of a reference collection. Minutiae that match those of other documents indicate shared text segments and suggest potential plagiarism if their number exceeds a chosen similarity threshold.

Computational resources and time are the limiting factors for fingerprinting, which is why this method typically compares only a subset of minutiae, to speed up the computation and allow checks against very large collections, such as the Internet.

STRING MATCHING

String matching is a prevalent approach in computer science. When applied to plagiarism detection, documents are compared for verbatim text overlaps (i.e. passages using exactly the same words). Numerous methods have been proposed for this task, some of which have been adapted to external plagiarism detection. Checking a suspicious document in this setting requires the computation and storage of efficiently comparable representations for all documents in the reference collection, so that they can be compared pairwise. Suffix document models, such as suffix trees or suffix vectors, are generally used for this task.

A suffix tree for a given text is a compressed trie of all suffixes of that text. A trie, also called a digital tree (and sometimes a radix tree or prefix tree, since it can be searched by prefixes), is an ordered tree data structure used to store a dynamic set or associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree defines the key with which it is associated. All the descendants of a node share a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.

For example, consider the following array of words:

{bear, bell, bid, bull, buy, sell, stock, stop}

A standard trie can be constructed over these words, with one edge per character. A compressed trie is obtained from the standard trie by joining chains of single-child nodes; its nodes can be stored compactly by keeping index ranges at the nodes.
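As a concrete illustration, here is a minimal Python sketch of the standard (uncompressed) trie built from that word array, using nested dictionaries; the "$" end-of-word marker is an implementation convention assumed here for the example, not part of the definition.

```python
def build_trie(words):
    """Insert each word into a trie of nested dicts; the special
    key '$' marks a node where a stored word ends."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

trie = build_trie(["bear", "bell", "bid", "bull", "buy",
                   "sell", "stock", "stop"])
# Shared prefixes are stored once: "bear" and "bell" share the
# chain b -> e, and "stock" and "stop" share s -> t -> o.
print(sorted(trie.keys()))       # ['b', 's']
print(sorted(trie["b"].keys()))  # ['e', 'i', 'u']
```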

The abstract steps to search for a pattern in the built suffix tree are as follows.

(1) Starting from the first character of the pattern and the root of the suffix tree, do the following for every character: if there is an edge for the current character of the pattern from the current node of the suffix tree, follow the edge; if there is no such edge, the pattern does not exist in the text.

(2) If all characters of the pattern have been processed, i.e. there is a path from the root spelling out the characters of the given pattern, then the pattern is found.

Let us consider the example pattern nan and see the searching process. The suffixes of banana\0 are:

banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0

Searching follows the path n, a, n from the root, which succeeds for nan (and continues one more edge for nana), since both are prefixes of the suffix nana\0.

Working

Every pattern that is present in the text (that is, every substring of the text) must be a prefix of one of its suffixes. So, if the pattern exists in the text, a walk down the suffix tree will find it.
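To make the search steps concrete, here is a minimal Python sketch using a plain (uncompressed) suffix trie rather than a true compressed suffix tree; the nested-dict representation and the "\0" terminator follow the example above. This is an illustration of the search, not an efficient construction (real suffix trees are built in linear time, e.g. with Ukkonen's algorithm).

```python
def build_suffix_trie(text):
    """Insert every suffix of text (terminated by '\\0') into an
    uncompressed trie represented as nested dicts."""
    text += "\0"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Follow one edge per pattern character from the root; if an
    edge is ever missing, the pattern does not occur in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "nan"))   # True: prefix of suffix nana\0
print(contains(trie, "nana"))  # True: prefix of suffix nana\0
print(contains(trie, "nab"))   # False: no suffix starts with nab
```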

Nonetheless, substring matching remains computationally expensive, which makes it a non-viable solution for checking large collections of documents.

BAG OF WORDS

Bag-of-words analysis represents the adoption of vector space retrieval, a traditional IR concept, to the domain of plagiarism detection. Documents are represented as one or more vectors, e.g. for different document parts, which are used for pairwise similarity computations. Each document is treated as a bag of words, meaning that the order of words is assumed to have no significance (the phrase home made has the same probability as made home).

Similarity computation may then rely on the traditional cosine similarity measure, cos(θ) = (A · B) / (‖A‖ ‖B‖), or on more sophisticated similarity measures. Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0, 1].

Given two documents, and a pre-defined list of words appearing in the documents (the dictionary), we can compute the vectors of frequencies (x,y) of the words as they appear in the documents. The angle between the two vectors is a widely used measure of closeness (similarity) between documents.

The following models text documents using bag-of-words. Here are two simple text documents:

John likes to watch movies. Mary likes movies too.

John also likes to watch football games.

Based on these two text documents, a dictionary is constructed as:

{"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

which has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector:

[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

where each entry of a vector is the count of the corresponding dictionary word in that document (this is also the histogram representation). This vector representation does not preserve the order of the words in the original sentences.
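Putting the two pieces together, the following is a minimal Python sketch that builds these histogram vectors and computes their cosine similarity; the simple regex tokenizer is an assumption made for this example, and real systems would normalise case and apply stemming.

```python
import math
import re

def bag_of_words(text, dictionary):
    """Count how often each dictionary word occurs in the text,
    ignoring word order (the histogram representation)."""
    words = re.findall(r"[A-Za-z]+", text)
    return [words.count(w) for w in dictionary]

def cosine_similarity(x, y):
    """cos(theta) = (x . y) / (|x| * |y|): 1 for identical
    orientation, 0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm

dictionary = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]
d1 = bag_of_words("John likes to watch movies. Mary likes movies too.",
                  dictionary)
d2 = bag_of_words("John also likes to watch football games.", dictionary)
print(d1)                         # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(d2)                         # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
print(cosine_similarity(d1, d2))  # ~0.52
```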

CITATION ANALYSIS

Citation-based plagiarism detection (CbPD) relies on citation analysis, and is the only approach to plagiarism detection that does not rely on textual similarity. CbPD examines the citation and reference information in texts to identify similar patterns in the citation sequences. As such, this approach is suitable for scientific texts and other academic documents that contain citations. Citation order analysis (COA) is similar to bibliographic coupling, but also analyses the order of citations within the document. This allows the creation of a citation-based digital fingerprint. By using tolerant sequence analysis algorithms, such as the Levenshtein distance, plagiarized text can be detected even if the order of citations has been slightly changed. The following steps are performed in the detection system:

1. The document is parsed and a series of heuristics is applied to process the citations, including their position within the document.
2. Citations are matched with their entries in the bibliography.
3. The citation-based similarity of the documents is calculated. In the basic version, only the order is considered; in the more advanced version, the distance between two citations is evaluated as well. Even if a document is translated, the order of citations within sentences or paragraphs might change due to different sentence structures or writing styles.

The underlying assumption is that the closer two citations are to each other, the more likely it is that they are related. Based on this proximity analysis, the citation proximity index (CPI) is calculated. If, for example, two citations are given in the same sentence, the probability that they are related is higher (CPI = 1) than if they are cited only within the same paragraph (a lower CPI value).
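To illustrate the tolerant sequence comparison, here is a minimal Python sketch that applies the Levenshtein distance to citation sequences. The single-letter citation identifiers and the two example documents are hypothetical, and a real CbPD system would also weight matches by proximity (the CPI) rather than use raw edit distance alone.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: the minimum number
    of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical citation sequences: each letter stands for one cited
# source, listed in the order the citations appear in each document.
original   = ["A", "B", "C", "D", "E"]
suspicious = ["A", "C", "B", "D", "E"]  # two citations swapped

print(levenshtein(original, suspicious))  # 2: a small distance,
                                          # i.e. a similar pattern
```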

Citation analysis to detect plagiarism is a relatively young concept. It has not been adopted by commercial software, but a first prototype of a citation-based plagiarism detection system exists. Similar order and proximity of citations in the examined documents are the main criteria used to compute citation pattern similarities. Citation patterns represent subsequences non-exclusively containing citations shared by the compared documents. Factors including the absolute number or relative fraction of shared citations in the pattern, as well as the probability that citations co-occur in a document, are also considered to quantify the patterns' degree of similarity.

STYLOMETRY

Stylometry is the study of how a person can be characterized by their writing style. Anyone with experience applies a kind of informal stylometry when asking questions such as: Which of two samples was written more recently? Were there two authors, or only one? Which sample was written by a native English speaker?

Stylometry subsumes statistical methods for quantifying an author's unique writing style and is mainly used for authorship attribution or intrinsic plagiarism detection. By constructing and comparing stylometric models for different text segments, passages that are stylistically different from the others, and hence potentially plagiarized, can be detected. The analysis of texts for evidence of authenticity and authorial identity has also advanced stylometric techniques. The English professor John Burrows concluded that the intellectual propensities of authors display themselves inherently, and that written texts have a particular style. By identifying the word-use patterns in a text of unknown authorship and then comparing and contrasting those patterns with the patterns in texts of known authorship, the similarities and dissimilarities of the textual patterns can provide evidence supporting or contradicting an assertion of authorship.

Stylometric tasks include predicting whether a paper is written by a native or a non-native speaker, by a male or a female, or in the style of a conference or of a workshop.

Forensic linguistics has a sub-field, forensic stylistics, in which stylistics is applied to author identification. It is based on two premises: no two writers write in the same pattern (even when they share a mother tongue), and a single writer does not write in the same pattern all the time.

Stylistics can be categorised into two different approaches: qualitative and quantitative. The qualitative approach assesses errors and personal habits of the authors, whereas the quantitative approach focuses on readily computable and countable language features, e.g. word length, sentence length, phrase length, vocabulary frequency, and the distribution of words of different lengths. A sketch of such features follows.
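The following is a minimal Python sketch of the quantitative approach, computing a few such countable features; the particular features chosen and the absolute-difference comparison are illustrative assumptions, not a standard stylometric model.

```python
import re
import statistics

def stylometric_profile(text):
    """Compute a few readily countable style features of the kind
    used by quantitative stylometry."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "avg_word_length": statistics.mean(len(w) for w in words),
        "avg_sentence_length": len(words) / len(sentences),
        "vocabulary_richness": len({w.lower() for w in words}) / len(words),
    }

def profile_distance(p, q):
    """Illustrative dissimilarity: sum of absolute feature
    differences. A segment whose profile is far from the rest of the
    document is a candidate for intrinsic plagiarism detection."""
    return sum(abs(p[k] - q[k]) for k in p)

a = stylometric_profile("Short words. Quick lines. He wrote fast.")
b = stylometric_profile("Convoluted formulations, characteristically "
                        "elaborate, permeate this solitary sentence.")
print(profile_distance(a, b))  # larger values = more dissimilar styles
```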

CONCLUSION

This paper has discussed various techniques that can be used to detect plagiarism. Plagiarism is rampant now: with most data available to us in digital format, the opportunities for plagiarism keep opening up. To discourage this kind of cheating and to acknowledge the originality of authors, new detection techniques must be created, systems that are not only fast but also able to collect information about plagiarism from the web or from large repositories. Because a large number of detection tools are available for text-based plagiarism, the number of copying incidents in this field has been reduced considerably. As we rely on ever more computer-based applications, new techniques to protect the intellectual property in documents must be developed and implemented.
