
GPLAG: Detection of Software Plagiarism by Program Dependence Graph Analysis

Chao Liu, Chen Chen,

Jiawei Han, Philip S. Yu

University of Illinois at Urbana-Champaign

IBM T.J. Watson Research Center

Presented by Chao Liu

Motivations

Blossom of open-source projects
  SourceForge.net: 125,090 projects as of July 2006
Convenience for software plagiarism?
  You can always find something online
Core-part plagiarism
  Ripping off GUIs and irrelevant parts
  (Illegally) reusing the implementations of core algorithms
Our goal
  Efficient detection of core-part plagiarism

Challenges

Effectiveness
  Professional plagiarists
  Automated plagiarism
Efficiency
  Only a small part of the code is plagiarized; how can it be detected efficiently?

Outline

Plagiarism Disguises
Review of Plagiarism Detection
GPLAG: PDG-based Plagiarism Detection
Efficiency and Scalability
Experiments
Conclusions

Original Program

static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  blank->nfields = count;
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);
  for (i = 0; i < count; i++){
    ...
  }
}

A procedure from a program called join

Disguise 1: Format Alteration

static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;

  blank->nfields = count; // initialization
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);

  for (i = 0; i < count; i++){
    ...
  }
}

Insert comments and blanks

Disguise 2: Identifier Renaming

static void
fill_content (struct line *fill, int num)
{
  int i;
  unsigned char *buffer;
  struct field *fields;

  fill->nfields = num; // initialization
  fill->buf.size = fill->buf.length = num + 1;
  fill->buf.buffer = (char*) xmalloc (fill->buf.size);
  buffer = (unsigned char *) fill->buf.buffer;
  fill->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * num);

  for (i = 0; i < num; i++){
    ...
  }
}

Rename variables consistently

Disguise 3: Statement Reordering

static void
...
  int i;
  ...
  for (i = 0; i < num; i++){
    ...

Reorder non-dependent statements

Disguise 4: Control Replacement

static void
...
  int i;
  ...
  i = 0;
  while (i < num){
    ...
    i++;

Use an equivalent control structure

Disguise 5: Code Insertion

static void
...
  int i;
  ...
  i = 0;
  while (i < num){
    ...
    for (int j = 0; j < i; j++);
    i++;

Insert immaterial code

Fully Disguised

Original Code

static void
make_blank (struct line *blank, int count)
{
  int i;
  unsigned char *buffer;
  struct field *fields;
  blank->nfields = count;
  blank->buf.size = blank->buf.length = count + 1;
  blank->buf.buffer = (char*) xmalloc (blank->buf.size);
  buffer = (unsigned char *) blank->buf.buffer;
  blank->fields = fields =
    (struct field *) xmalloc (sizeof (struct field) * count);
  for (i = 0; i < count; i++){
    ...
  }
}

Plagiarized Code

static void
fill_content(int num, struct line* fill)
{
  (*fill).store.size = fill->store.length = num + 1;
  struct field *tabs;
  (*fill).fields = tabs = (struct field *) xmalloc (sizeof (struct field) * num);
  (*fill).store.buffer = (char*) xmalloc (fill->store.size);
  (*fill).ntabs = num;
  unsigned char *pb;
  pb = (unsigned char *) (*fill).store.buffer;
  int idx = 0;
  while(idx < num){ // fill in the storage
    ...
    for(int j = 0; j < idx; j++)
      ...
    idx++;
  }
}


Review of Plagiarism Detection

String-based [Baker 1995]
  A program is represented as a string
  Blanks and comments are ignored
AST-based [Baxter et al. 1998, Kontogiannis et al. 1995]
  A program is represented as an Abstract Syntax Tree (AST)
  Fragile to statement reordering, control replacement and code insertion
Token-based [Kamiya et al. 2002, Prechelt et al. 2002]
  Variables of the same type are mapped to the same token
  A program is represented as a token string
  Fingerprints of token strings are used for robustness [Schleimer et al. 2003]
  Partially robust to statement reordering, control replacement and code insertion
  Representatives: Moss and JPlag


Graphic representation of source code

int sum(int array[], int count)
{
  int i, sum;
  sum = 0;
  for(i = 0; i < count; i++){
    sum = add(sum, array[i]);
  }
  return sum;
}

int add(int a, int b)
{
  return a + b;
}

Graphic representation of source code
[Figure: PDG vertices for sum and add, including "int i, sum;", "sum = 0;", "return sum;" and "return a + b;"]

Control Dependency
[Figure: the same vertices connected by control-dependency edges]

Data Dependency
[Figure: the same vertices connected by data-dependency edges]
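A PDG of this kind can be sketched in C as a vertex array plus a typed edge list. The block below is only an illustration for sum(): the way statements are split into vertices, the KIND_RETURN kind, and the selection of edges are assumptions of this sketch, not the output of any PDG tool.

```c
#include <assert.h>
#include <stddef.h>

/* Vertex kinds. The first five mirror the labels used in the PDG figure
   (decl., assign, inc., control, call-site); KIND_RETURN is an assumption. */
enum kind { KIND_DECL, KIND_ASSIGN, KIND_INC, KIND_CONTROL, KIND_CALL, KIND_RETURN };

enum dep { DEP_CONTROL, DEP_DATA };

struct vertex { enum kind kind; const char *text; };
struct edge   { int from, to; enum dep dep; };  /* "to" depends on "from" */

/* Vertices of sum(); the split into vertices is hand-made for illustration. */
static const struct vertex sum_vertices[] = {
    { KIND_DECL,    "int i, sum" },               /* 0 */
    { KIND_ASSIGN,  "sum = 0" },                  /* 1 */
    { KIND_ASSIGN,  "i = 0" },                    /* 2 */
    { KIND_CONTROL, "i < count" },                /* 3 */
    { KIND_CALL,    "sum = add(sum, array[i])" }, /* 4 */
    { KIND_INC,     "i++" },                      /* 5 */
    { KIND_RETURN,  "return sum" },               /* 6 */
};

/* A representative, not exhaustive, selection of dependencies. */
static const struct edge sum_edges[] = {
    { 3, 4, DEP_CONTROL },  /* the call runs only while i < count */
    { 3, 5, DEP_CONTROL },  /* so does the increment */
    { 1, 4, DEP_DATA },     /* sum = 0 is read by the call */
    { 2, 3, DEP_DATA },     /* i = 0 is read by the loop test */
    { 5, 3, DEP_DATA },     /* i++ is read by the next loop test */
    { 4, 6, DEP_DATA },     /* the accumulated sum is returned */
};

#define SUM_VERTEX_COUNT (sizeof sum_vertices / sizeof sum_vertices[0])
#define SUM_EDGE_COUNT   (sizeof sum_edges / sizeof sum_edges[0])

/* Number of vertices in the example graph. */
size_t sum_vertex_count(void)
{
    return SUM_VERTEX_COUNT;
}

/* Count the edges of one dependency type. */
size_t count_edges(const struct edge *edges, size_t n, enum dep dep)
{
    size_t count = 0;
    assert(edges != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (edges[i].dep == dep)
            count++;
    return count;
}
```

Calling count_edges(sum_edges, SUM_EDGE_COUNT, DEP_CONTROL) separates the control-dependency view from the data-dependency view of the same vertex set.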

Plagiarism Detectable?


Corresponding PDGs

PDG for the Original Code (vertex id: kind, label; vertex labels only)
  3: decl., line* blank
  8: decl., int count
  12: decl., int i
  13: assign, i = 0
  14: inc., i++
  15: control, i < count
  9: assign, blank->buf.size = blank->...
  7: assign, blank->nfields = count
  4: assign, blank->buf.buffer = (char*) xm..
  0: assign, blank->fields = fields = ...
  10: assign, buffer = (unsigned) ...
  11: decl., char* buffer
  5: decl., struct field* fields
  1: assign, fields = (struct ...
  2: call-site, xmalloc()

PDG for the Plagiarized Code (vertex id: kind, label; vertex labels only)
  3: decl., line* fill
  8: decl., int num
  12: decl., int idx
  13: assign, idx = 0
  14: inc., idx++
  15: control, while(idx < num)
  9: assign, (*fill).store.size
  7: assign, (*fill).ntabs =
  4: assign, (*fill).store.buf = (char*) ...
  0: assign, (*field).fields = tab = ...
  10: assign, pb = (unsigned char*) (*fill)...
  11: decl., char* pb
  5: decl., struct field*
  1: assign, tabs = (struct
  16: decl., int j
  17: assign, j = 0
  18: inc., j++
  19: control, j < idx

PDG-based Plagiarism Detection

A program is represented as a set of PDGs
  Let g be the PDG of procedure P in the original program
  Let g' be the PDG of procedure P' in the plagiarism suspect

Subgraph isomorphism implies plagiarism
  If g is subgraph isomorphic to g', P' is likely plagiarized from P
  γ-isomorphism: graph g is γ-isomorphic to g' if there exists a subgraph s of g such that s is subgraph isomorphic to g' and |s| ≥ γ|g|

If g is γ-isomorphic to g', the pair (g, g') is regarded as a plagiarized PDG pair and is returned for human examination.
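To make the test concrete, here is a naive backtracking check for the simplest case, where the whole of g must embed in g' (γ = 1, so s = g). It is only a sketch: the adjacency-matrix layout, the MAXV bound, and the rule that a vertex may only map to a vertex of the same kind are assumptions of this example, the mapping preserves edges without requiring an induced subgraph, and it is not the VFLib algorithm the deck relies on.

```c
#include <assert.h>

#define MAXV 16  /* assumed upper bound on vertices, for this sketch only */

struct pdg {
    int n;                          /* number of vertices */
    int kind[MAXV];                 /* vertex kind, e.g. decl or assign */
    unsigned char adj[MAXV][MAXV];  /* adj[u][v] = 1 if there is an edge u -> v */
};

/* Try to map vertex u of g, given that vertices 0..u-1 are already mapped. */
static int extend(const struct pdg *g, const struct pdg *h,
                  int u, int map[], int used[])
{
    if (u == g->n)
        return 1;  /* every vertex of g has an image in h */
    for (int v = 0; v < h->n; v++) {
        if (used[v] || g->kind[u] != h->kind[v])
            continue;
        int ok = !(g->adj[u][u] && !h->adj[v][v]);  /* self-loop must survive */
        for (int w = 0; w < u && ok; w++) {
            /* each edge between u and an earlier vertex must exist in h */
            if (g->adj[u][w] && !h->adj[v][map[w]]) ok = 0;
            if (g->adj[w][u] && !h->adj[map[w]][v]) ok = 0;
        }
        if (!ok)
            continue;
        map[u] = v;
        used[v] = 1;
        if (extend(g, h, u + 1, map, used))
            return 1;
        used[v] = 0;  /* backtrack and try the next candidate */
    }
    return 0;
}

/* Returns 1 if g embeds in h with kinds and edges preserved, else 0. */
int subgraph_isomorphic(const struct pdg *g, const struct pdg *h)
{
    int map[MAXV];
    int used[MAXV] = {0};
    assert(g->n >= 0 && g->n <= MAXV && h->n >= 0 && h->n <= MAXV);
    if (g->n > h->n)
        return 0;
    return extend(g, h, 0, map, used);
}
```

Extra vertices and edges in g', such as those added by code insertion, do not stop the embedding from being found. The search is exponential in the worst case, which is why pruning the candidate pairs matters.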

Advantages

Robust because it is hard to overhaul PDGs
  Dependencies encode program logic
  Reusing that logic is the incentive for plagiarism


Efficiency and Scalability

Search space
  If the original program has n procedures and the plagiarism suspect has m procedures, n × m subgraph isomorphism tests are needed
Pruning the search space
  Lossless filter
  Statistical lossy filter

Lossless filter

Interestingness
  PDGs smaller than an interesting size K are excluded from both sides
γ-isomorphism definition
  A PDG pair (g, g') is discarded if |g'| < γ|g|
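Both rules need only graph sizes, so they can run before any isomorphism test. The second rule loses nothing: a subgraph s with |s| ≥ γ|g| cannot fit inside g' when |g'| < γ|g|. The sketch below assumes |g| is the vertex count and uses hypothetical function names; the values of K and γ are left to the caller.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the pair (g, g') survives the lossless filter, 0 if pruned.
   g_size and gp_size are the vertex counts of g and g' (an assumption). */
int passes_lossless_filter(size_t g_size, size_t gp_size, size_t K, double gamma)
{
    assert(gamma > 0.0 && gamma <= 1.0);
    if (g_size < K || gp_size < K)
        return 0;  /* too small to be interesting */
    if ((double)gp_size < gamma * (double)g_size)
        return 0;  /* g' cannot host a large enough subgraph of g */
    return 1;
}

/* Counts how many of the n * m candidate pairs remain after filtering. */
size_t count_surviving_pairs(const size_t *orig, size_t n,
                             const size_t *susp, size_t m,
                             size_t K, double gamma)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++)
            kept += (size_t)passes_lossless_filter(orig[i], susp[j], K, gamma);
    return kept;
}
```

With illustrative sizes {5, 20, 40} for the original, {8, 18, 50} for the suspect, K = 10 and γ = 0.5, three of the nine candidate pairs remain.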

Lossy Filter

Observation
  If procedure P' is plagiarized from procedure P, its PDG g' should look similar to g
  So discard dissimilar PDG pairs
Requirement
  This filter must be lightweight

Vertex Histogram

Represent PDG g by h(g) = (n_1, n_2, …, n_k), where n_i is the number of vertices of the i-th kind.

Similarly, represent PDG g' by h(g') = (m_1, m_2, …, m_k).

Direct similarity measurement?
  How to define a proper similarity threshold?
  Is a threshold defined this way program-independent?
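Computing h(g) is a single pass over the vertices. The sketch below assumes k = 5 kinds, named after the labels in the PDG figure (decl., assign, inc., control, call-site); a real PDG generator may distinguish more kinds.

```c
#include <assert.h>
#include <stddef.h>

/* Vertex kinds taken from the figure labels; NUM_KINDS is k. */
enum kind { KIND_DECL, KIND_ASSIGN, KIND_INC, KIND_CONTROL, KIND_CALL, NUM_KINDS };

/* Fill hist[i] with the number of vertices of kind i. */
void vertex_histogram(const enum kind *kinds, size_t n, unsigned hist[NUM_KINDS])
{
    for (int k = 0; k < NUM_KINDS; k++)
        hist[k] = 0;
    for (size_t i = 0; i < n; i++) {
        assert((int)kinds[i] >= 0 && (int)kinds[i] < NUM_KINDS);
        hist[kinds[i]]++;
    }
}
```

Applied to the 15 vertex labels listed for the original code's PDG, this gives h(g) = (5, 7, 1, 1, 1): five declarations, seven assignments, one increment, one control vertex and one call site.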

Hypothesis Testing-based Approach

Basic idea
  Estimate a k-dimensional multinomial distribution from h(g)
  Test whether h(g') is likely an observation from that distribution
  If it is, g' looks similar to g, and an isomorphism test is needed
  Otherwise, (g, g') is discarded
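One way to realize such a test is a goodness-of-fit statistic compared against a critical value. The sketch below uses Pearson's chi-square statistic as a stand-in, because the deck's technical details are not preserved here and the statistic GPLAG actually uses may differ. The add-one smoothing and the function names are likewise assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* Pearson chi-square statistic of the observed counts m[0..k-1] against the
   multinomial estimated from n[0..k-1]. Add-one smoothing (an assumption of
   this sketch) keeps every estimated probability positive. */
double chi_square_stat(const unsigned *n, const unsigned *m, size_t k)
{
    double total_n = 0.0, total_m = 0.0, stat = 0.0;
    assert(k > 0);
    for (size_t i = 0; i < k; i++) {
        total_n += n[i];
        total_m += m[i];
    }
    for (size_t i = 0; i < k; i++) {
        double p = (n[i] + 1.0) / (total_n + (double)k);  /* smoothed estimate */
        double expected = total_m * p;
        double diff = m[i] - expected;
        if (expected > 0.0)
            stat += diff * diff / expected;
    }
    return stat;
}

/* Returns 1 if (g, g') should go on to isomorphism testing, 0 if discarded.
   critical is the chi-square critical value chosen by the caller. */
int passes_lossy_filter(const unsigned *n, const unsigned *m, size_t k,
                        double critical)
{
    return chi_square_stat(n, m, k) <= critical;
}
```

For k = 5 there are 4 degrees of freedom, and 9.488 is the chi-square critical value at the 5% significance level. A looser threshold discards fewer pairs at the cost of more isomorphism tests; because the filter is lossy, a tighter one may drop genuinely plagiarized pairs.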

Technical Details

Technical Details (cont’d)

Work-flow of GPLAG

PDGs are generated with CodeSurfer.

Isomorphism testing is implemented with VFLib.


Experiment Design

Subject programs

Effectiveness
Filter efficiency
Core-part plagiarism detection

Effectiveness

2 hours of manual plagiarism; could it be automated?
GPLAG detects all plagiarized PDG pairs within 1 second
PDG isomorphism also reveals which plagiarism disguises were applied

Efficiency

Subject programs
  bc, less and tar
  An exact copy serves as the plagiarism
Lossless and lossy filters
  Pruning PDG pairs
  Implication for overall time cost

Pruning Uninteresting PDG Pairs

[Figure panels: "Lossless only" and "Lossless and lossy"]

Implication for Overall Time Cost

Subgraph isomorphism tests that run into the time-out are the time hogs.

The lossless filter does not save much time.

The lossy filter significantly reduces the time cost.

The major time saving comes from avoiding time hogs.

Detection of Core-part Plagiarism

Lower time cost with the lossy filter.
Fewer false positives with the lossy filter.


Conclusions

We developed a new algorithm, GPLAG, for software plagiarism detection

It is more effective against "professional" plagiarists

We developed a statistical lossy filter, which improves the efficiency of GPLAG

We experimentally verified the effectiveness and efficiency of GPLAG

Thank You!

References

[1] B. S. Baker. On finding duplication and near duplication in large software systems. In Proc. of 2nd Working Conf. on Reverse Engineering, 1995.

[2] I. D. Baxter, A. Yahin, L. Moura, M. Sant'Anna, and L. Bier. Clone detection using abstract syntax trees. In Proc. of Int. Conf. on Software Maintenance, 1998.

[3] K. Kontogiannis, M. Galler, and R. DeMori. Detecting code similarity using patterns. In Working Notes of 3rd Workshop on AI and Software Engineering, 1995.

[4] T. Kamiya, S. Kusumoto, and K. Inoue. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans. Softw. Eng., 28(7), 2002.

[5] L. Prechelt, G. Malpohl, and M. Philippsen. Finding plagiarisms among a set of programs with JPlag. J. of Universal Computer Science, 8(11), 2002.

[6] S. Schleimer, D. S. Wilkerson, and A. Aiken. Winnowing: local algorithms for document fingerprinting. In SIGMOD, 2003.

[7] V. B. Livshits and T. Zimmermann. Dynamine: Finding common error patterns by mining software revision histories. In Proc. of 13th Int. Symp. on the Foundations of Software Engineering, 2005.

[8] C. Liu, X. Yan, and J. Han. Mining control flow abnormality for logic error isolation. In Proc. 2006 SIAM Int. Conf. on Data Mining, 2006.

[9] C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu. Mining behavior graphs for "backtrace" of noncrashing bugs. In SDM, 2005.
