Upload
others
View
14
Download
0
Embed Size (px)
Citation preview
A Text Alignment Corpus for Persian Plagiarism Detection
Fatemeh Mashhadirajab, Mehrnoush Shamsfard
Razieh Adelkhah,
Fatemeh Shafiee, Chakaveh Saedi
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
Persian Plagdet 2016
Outline
Introduction
Text Alignment Corpus Construction
Strategies For Plagiarisms Types
Dataset Statistics
Conclusions
References
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
2 /29
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
4 /29
Documents Clustering
Source Documents Segmentation
Segment Extraction
Segment Obfuscation
Obfuscated Segment Insertion
Set of Suspicious and source Documents
source and suspicious document pairs selection
Data Source Preparation
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
5 /29
Data Source Preparation
articles or theses in the fields of computer science and engineering & electrical engineering
o 1,500 documents from articles and theses available from online stores
o 4,500 documents from Wikipedia articles
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
5 /29
Our corpus contains 11,089 documents
Data Source Preparation
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
6 /29
Documents Clustering
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
7 /29
Set of Suspicious and source Documents
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
8 /29
Set of Suspicious and source Documents Suspicious Documents Source Documents
They are randomly selected from each cluster
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
9 /29
source and suspicious document
pairs selection
Suspicious Documents Source Documents
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
9 /29
source and suspicious
document pairs selection
if the
similarity <
50%
Similarity
Detection
system
susp
Suspicious Documents Source Documents
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
9 /29
source and suspicious
document pairs selection
if the
similarity <
50%
Similarity
Detection
system
susp src
Suspicious Documents Source Documents
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
10/29
Source Documents Segmentation
if the
similarity <
50%
Similarity
Detection
system
src
Suspicious Documents Source Documents
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
11/29
Segment Extraction
if the
similarity <
50%
Similarity
Detection
system
src
Suspicious Documents Source Documents
Text Alignment Corpus Construction
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
12/29
Segment Obfuscation
if the
similarity <
50%
Similarity
Detection
system
src
Segment
Obfuscation
Suspicious Documents Source Documents
Text Alignment Corpus Construction
susp
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
13/29
Obfuscated Segment Insertion
if the
similarity <
50%
Similarity
Detection
system
src
Segment
Obfuscation
Suspicious Documents Source Documents
Text Alignment Corpus Construction
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
14/29
Exact Copy
Near Copy
Modified Copy
Text Manipulation (Paraphrasing)
Text Manipulation (Summarizing)
Automatic Translation
Manual Translation
Cyclic Translation
Idea Adoption (semantic-based meaning)
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
15/29
Exact Copy
Segment Obfuscation
susp src
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
16/29
Near Copy
Segment Obfuscation
susp src
Insertion
deletion
substitution
sentence split or join
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
17/29
Segment Obfuscation
susp src
the Persian sentence understanding
and generation system introduced
by Adelkhah et al. [7]
Modified Copy
semantic representation
(sentence understanding)
sentence production based on semantic
representation (sentence generation)
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
18/29
Segment Obfuscation
susp src
the Persian sentence understanding
and generation system introduced
by Adelkhah et al. [7]
Text Manipulation (Paraphrasing)
Each word is replaced with a
synonym retrieved from FarsNet or
FavaNet
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
19/29
Segment Obfuscation
susp src
Text Manipulation (Summarizing)
Persian summarizer
introduced by Shafiee et al. [6]
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
20/29
Segment Obfuscation
susp src
Automatic Translation
Google translate
Persian to English Hunspell
Spell checker
Persian English
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
21/29
Segment Obfuscation
susp src
Manual Translation
Translateion
Persian English
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
22/29
Segment Obfuscation
susp src
Cyclic Translation
Persian English
Google translate
Negar spell checker
Google translate
Hunspell
Strategies For Plagiarisms Types
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
23/29
Segment Obfuscation
susp src
Idea Adoption (semantic-based meaning)
Strategies For Plagiarisms Types
Dataset Statistics
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
24/29
documents 11089
plagiarism cases 11603
Document purpose
languages fa
source documents 48%
suspicious documents
with plagiarism 28%
w/o plagiarism 24%
Document length
short (<10 pages1) 64%
medium (10-100 pages) 35%
long (>100 pages) 1%
Plagiarism per document
hardly (<20%) 25%
medium (20%-50%) 20%
much (50%-80%) 26%
entirely (>80%) 29%
Case length
short (<1k characters) 37%
medium (1k-3k characters) 55%
long (>3k characters) 8%
Obfuscation synthesis approaches
Exact Copy 8%
Near Copy 12%
Modified Copy 12%
Paraphrasing 16%
Summary 13%
Manual Translation 18%
Automatic Translation 8%
Cyclic Translation 10%
semantic-based meaning 3%
1 A page is measured as 1500 chars.
Conclusion
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
25/29
This article describes a methodology for building a Persian corpus for
evaluating plagiarism detection systems.
This corpus is in PAN format.
To produce this corpus, the focus is on the simulation of different types of
plagiarism
Different strategies are employed to create obfuscation in each plagiarism
category
This corpus is a variety of plagiarism types in large volume are created.
References
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
26/29
1. Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Potthast, M. and Rosso, P. 2016. Overview of the
PAN@FIRE2016 Shared Task on Persian Plagiarism Detection and Text Alignment Corpus
Construction, Notebook Papers of FIRE 2016, FIRE-2016, CEUR-WS.org.
2. Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic patterns,
Textual features, and detection Methods. IEEE Trans. SYSTEMS, MAN, AND CYBERNETICS—
PART C: APPLICATIONS AND REVIEWS, vol. 42, no. 2.
3. Potthast, M., Stein, B. and et.al. 2010. An Evaluation Framework for Plagiarism Detection.
Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010
Beijing,_c ACL.
4. Shamsfard, M., and Kiani, S., and Shahedi, Y. STeP-1: Standard Text Preparation for Persian
Language. CAASL3 Third Workshop on Computational Approaches to Arabic Script- Languages.
5. Makrehchi, M. and Kamel, M. 2004. A fuzzy set approach to extracting keywords from abstracts.
North American Fuzzy Information Processing Society- NAFIPS 2003, Banf, Canada.
6. Shafiee, F. and Shamsfard, M. 2015. The automatic Persian summarizer. The 20st Computer
Society of Iran computer conference.
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
27/29
7. Adelkhah, R., Sadeghi, R. and Shamsfard, M. 2016. Persian sentence understanding and
generation: a mutual conversion. The 21st Computer Society of Iran computer conference.
8. Potthast, M., Göring, S. and et.al. 2015. Towards Data Submissions for Shared Task: First
Experiences for the Task of Text Alignment. Working Notes Papers of the CLEF 2015 Evaluation
Labs, CEUR Workshop Proceedings, (September 2015), ISSN 1613-0073.
9. Potthast, M., Hagen, M., Gollub, T. and et.al. 2013. Overview of the 5th International Competition
on Plagiarism Detection”, Working Notes Papers of the CLEF 2013Evaluation Labs and Workshop,
(September 2013), ISBN 978-88-904810-3-1.
10.Manku, G. S., Jain, A. and Sarma, A. D .2007. Detecting NearDuplicates for Web Crawling. Data
mining.
11.Kamran, K., Ahmadi, A. and Kazemivanhari, F. 2013. Plagiarism detection in Persian text using
Fingerprint algorithms. The 21st Iranian Conference on Electrical Engineering.
12.Davarpanah, M. R., sanji, M. and Aramideh, M. 2009. Farsi Lexical Analysis and StopWord List.
Library Hi Tech, vol. 27, pp 435–449.
References
NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University
28/29
13. Iran Telecommunication Research Center (ITRC), 2013. Buali Sina University.
http://217.218.62.234:8080/.
14.Shamsfard, M., Hesabi, A., Fadaei H. and et.al 2010. Semi Automatic Development of FarsNet;
The Persian WordNet. Proceedings of 5th Global WordNet Conference.
15.Mashhadirajab, F. and Shamsfard, M. 2014. Plagiarism Detection in Persian documents. Master's
thesis. Shahid Beheshti University.
16.Potthast, M., Eiselt, A and et.al. 2011. Overview of the 3rd International Competition on Plagiarism
Detection. Notebook Papers of CLEF 2011 Labs and Workshops, (September 2011), ISBN 978-88-
904810-1-7.
17.Potthast, M., Gollub, T. and et.al. 2012. Overview of the 4th International Competition on
Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers,
(September 2012), ISBN 978-88-904810-3-1.
18.Potthast, M., Hagen, M. and et.al. 2014. Overview of the 6th International Competition on
Plagiarism Detection. CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers,
(September 2014).
References