32
A Text Alignment Corpus for Persian Plagiarism Detection Fatemeh Mashhadirajab, Mehrnoush Shamsfard Razieh Adelkhah, Fatemeh Shafiee, Chakaveh Saedi NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University Persian Plagdet 2016

A Text Alignment Corpus for Persian Plagiarism …A Text Alignment Corpus for Persian Plagiarism Detection Fatemeh Mashhadirajab, Mehrnoush Shamsfard Razieh Adelkhah, Fatemeh Shafiee,

  • Upload
    others

  • View
    14

  • Download
    0

Embed Size (px)

Citation preview

A Text Alignment Corpus for Persian Plagiarism Detection

Fatemeh Mashhadirajab, Mehrnoush Shamsfard

Razieh Adelkhah,

Fatemeh Shafiee, Chakaveh Saedi

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

Persian Plagdet 2016

Outline

Introduction

Text Alignment Corpus Construction

Strategies For Plagiarisms Types

Dataset Statistics

Conclusions

References

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

2 /29

Introduction

3 /29

A taxonomy of plagiarism [2]

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

4 /29

Documents Clustering

Source Documents Segmentation

Segment Extraction

Segment Obfuscation

Obfuscated Segment Insertion

Set of Suspicious and source Documents

source and suspicious document pairs selection

Data Source Preparation

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

5 /29

Data Source Preparation

articles or theses in the fields of computer science and engineering & electrical engineering

o 1,500 documents from articles and theses available from online stores

o 4,500 documents from Wikipedia articles

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

5 /29

Our corpus contains 11,089 documents

Data Source Preparation

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

6 /29

Documents Clustering

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

7 /29

Set of Suspicious and source Documents

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

8 /29

Set of Suspicious and source Documents Suspicious Documents Source Documents

They are randomly selected from each cluster

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

9 /29

source and suspicious document

pairs selection

Suspicious Documents Source Documents

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

9 /29

source and suspicious

document pairs selection

if the

similarity <

50%

Similarity

Detection

system

susp

Suspicious Documents Source Documents

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

9 /29

source and suspicious

document pairs selection

if the

similarity <

50%

Similarity

Detection

system

susp src

Suspicious Documents Source Documents

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

10/29

Source Documents Segmentation

if the

similarity <

50%

Similarity

Detection

system

src

Suspicious Documents Source Documents

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

11/29

Segment Extraction

if the

similarity <

50%

Similarity

Detection

system

src

Suspicious Documents Source Documents

Text Alignment Corpus Construction

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

12/29

Segment Obfuscation

if the

similarity <

50%

Similarity

Detection

system

src

Segment

Obfuscation

Suspicious Documents Source Documents

Text Alignment Corpus Construction

susp

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

13/29

Obfuscated Segment Insertion

if the

similarity <

50%

Similarity

Detection

system

src

Segment

Obfuscation

Suspicious Documents Source Documents

Text Alignment Corpus Construction

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

14/29

Exact Copy

Near Copy

Modified Copy

Text Manipulation (Paraphrasing)

Text Manipulation (Summarizing)

Automatic Translation

Manual Translation

Cyclic Translation

Idea Adoption (semantic-based meaning)

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

15/29

Exact Copy

Segment Obfuscation

susp src

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

16/29

Near Copy

Segment Obfuscation

susp src

Insertion

deletion

substitution

sentence split or join

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

17/29

Segment Obfuscation

susp src

the Persian sentence understanding

and generation system introduced

by Adelkhah et al. [7]

Modified Copy

semantic representation

(sentence understanding)

sentence production based on semantic

representation (sentence generation)

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

18/29

Segment Obfuscation

susp src

the Persian sentence understanding

and generation system introduced

by Adelkhah et al. [7]

Text Manipulation (Paraphrasing)

Each word is replaced with a

synonym retrieved from FarsNet or

FavaNet

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

19/29

Segment Obfuscation

susp src

Text Manipulation (Summarizing)

Persian summarizer

introduced by Shafiee et al. [6]

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

20/29

Segment Obfuscation

susp src

Automatic Translation

Google translate

Persian to English Hunspell

Spell checker

Persian English

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

21/29

Segment Obfuscation

susp src

Manual Translation

Translateion

Persian English

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

22/29

Segment Obfuscation

susp src

Cyclic Translation

Persian English

Google translate

Negar spell checker

Google translate

Hunspell

Strategies For Plagiarisms Types

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

23/29

Segment Obfuscation

susp src

Idea Adoption (semantic-based meaning)

Strategies For Plagiarisms Types

Dataset Statistics

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

24/29

documents 11089

plagiarism cases 11603

Document purpose

languages fa

source documents 48%

suspicious documents

with plagiarism 28%

w/o plagiarism 24%

Document length

short (<10 pages1) 64%

medium (10-100 pages) 35%

long (>100 pages) 1%

Plagiarism per document

hardly (<20%) 25%

medium (20%-50%) 20%

much (50%-80%) 26%

entirely (>80%) 29%

Case length

short (<1k characters) 37%

medium (1k-3k characters) 55%

long (>3k characters) 8%

Obfuscation synthesis approaches

Exact Copy 8%

Near Copy 12%

Modified Copy 12%

Paraphrasing 16%

Summary 13%

Manual Translation 18%

Automatic Translation 8%

Cyclic Translation 10%

semantic-based meaning 3%

1 A page is measured as 1500 chars.

Conclusion

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

25/29

This article describes a methodology for building a Persian corpus for

evaluating plagiarism detection systems.

This corpus is in PAN format.

To produce this corpus, the focus is on the simulation of different types of

plagiarism

Different strategies are employed to create obfuscation in each plagiarism

category

This corpus is a variety of plagiarism types in large volume are created.

References

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

26/29

1. Asghari, H., Mohtaj, S., Fatemi, O., Faili, H., Potthast, M. and Rosso, P. 2016. Overview of the

PAN@FIRE2016 Shared Task on Persian Plagiarism Detection and Text Alignment Corpus

Construction, Notebook Papers of FIRE 2016, FIRE-2016, CEUR-WS.org.

2. Alzahrani, M., Salim, N. and Abraham, A. 2012. Understanding plagiarism linguistic patterns,

Textual features, and detection Methods. IEEE Trans. SYSTEMS, MAN, AND CYBERNETICS—

PART C: APPLICATIONS AND REVIEWS, vol. 42, no. 2.

3. Potthast, M., Stein, B. and et.al. 2010. An Evaluation Framework for Plagiarism Detection.

Proceedings of the 23rd International Conference on Computational Linguistics, COLING 2010

Beijing,_c ACL.

4. Shamsfard, M., and Kiani, S., and Shahedi, Y. STeP-1: Standard Text Preparation for Persian

Language. CAASL3 Third Workshop on Computational Approaches to Arabic Script- Languages.

5. Makrehchi, M. and Kamel, M. 2004. A fuzzy set approach to extracting keywords from abstracts.

North American Fuzzy Information Processing Society- NAFIPS 2003, Banf, Canada.

6. Shafiee, F. and Shamsfard, M. 2015. The automatic Persian summarizer. The 20st Computer

Society of Iran computer conference.

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

27/29

7. Adelkhah, R., Sadeghi, R. and Shamsfard, M. 2016. Persian sentence understanding and

generation: a mutual conversion. The 21st Computer Society of Iran computer conference.

8. Potthast, M., Göring, S. and et.al. 2015. Towards Data Submissions for Shared Task: First

Experiences for the Task of Text Alignment. Working Notes Papers of the CLEF 2015 Evaluation

Labs, CEUR Workshop Proceedings, (September 2015), ISSN 1613-0073.

9. Potthast, M., Hagen, M., Gollub, T. and et.al. 2013. Overview of the 5th International Competition

on Plagiarism Detection”, Working Notes Papers of the CLEF 2013Evaluation Labs and Workshop,

(September 2013), ISBN 978-88-904810-3-1.

10.Manku, G. S., Jain, A. and Sarma, A. D .2007. Detecting NearDuplicates for Web Crawling. Data

mining.

11.Kamran, K., Ahmadi, A. and Kazemivanhari, F. 2013. Plagiarism detection in Persian text using

Fingerprint algorithms. The 21st Iranian Conference on Electrical Engineering.

12.Davarpanah, M. R., sanji, M. and Aramideh, M. 2009. Farsi Lexical Analysis and StopWord List.

Library Hi Tech, vol. 27, pp 435–449.

References

NLP Research Lab, Faculty of Computer Science and Engineering, Shahid Beheshti University

28/29

13. Iran Telecommunication Research Center (ITRC), 2013. Buali Sina University.

http://217.218.62.234:8080/.

14.Shamsfard, M., Hesabi, A., Fadaei H. and et.al 2010. Semi Automatic Development of FarsNet;

The Persian WordNet. Proceedings of 5th Global WordNet Conference.

15.Mashhadirajab, F. and Shamsfard, M. 2014. Plagiarism Detection in Persian documents. Master's

thesis. Shahid Beheshti University.

16.Potthast, M., Eiselt, A and et.al. 2011. Overview of the 3rd International Competition on Plagiarism

Detection. Notebook Papers of CLEF 2011 Labs and Workshops, (September 2011), ISBN 978-88-

904810-1-7.

17.Potthast, M., Gollub, T. and et.al. 2012. Overview of the 4th International Competition on

Plagiarism Detection. CLEF 2012 Evaluation Labs and Workshop – Working Notes Papers,

(September 2012), ISBN 978-88-904810-3-1.

18.Potthast, M., Hagen, M. and et.al. 2014. Overview of the 6th International Competition on

Plagiarism Detection. CLEF 2014 Evaluation Labs and Workshop – Working Notes Papers,

(September 2014).

References