Analysis and Preliminary Thoughts in Model Clone Detection


Wenjun Luo, Xiaochi Ma, Jinglei Xu

College of Computer Science and Software Engineering

Shenzhen University

Dec. 18, 2011

Abstract

This work presents the idea of a fingerprint-based algorithm for model clone detection in graph-based dataflow models. Exact clone detection means enumerating all maximal, isomorphic, disjoint and connected subgraphs. Some subgraphs are not isomorphic but have the same or a similar structure; we call these approximate clones. Our algorithm targets both exact and approximate clone detection. Because clone detection in graphs is known to be NP-complete, no polynomial-time algorithm for it is known.

Since we have not yet fully implemented our algorithm, this work mostly presents its ideas and compares it with other existing algorithms in this field.

Keywords: model clone, clone detection, fingerprint-based, LSH

Contents

1. Introduction
1.1 Occurrence of clone and clone detection
1.2 Clones in models
1.3 Advantages and disadvantages of cloning
1.4 Differences between model cloning and code cloning
2. Existing Algorithm Analysis
2.1 Index-based Model Clone Detection
2.2 Automotive Model Clone Detection
2.3 E-Scan and A-Scan
2.4 Analysis
3. Preliminary Thought
3.1 Locality-Sensitive Hashing
3.2 Fingerprint
3.3 Preliminary Approach
3.4 Analysis
4. Conclusion
5. References

1. Introduction

The concept of a clone was first introduced by biologists doing research on genes. Later, phenomena and operations with the same characteristics as gene cloning also came to be called clones. The clones discussed here are those of the IT industry, not the biological ones.

1.1 Occurrence of clone and clone detection

When you analyze your code after writing it, you may find that many fragments are reused in many places in the program. Such fragments can be turned into a function and put into a package for later use. The package is then called the mother body, and each copy of it in your program is a clone. The procedure of finding these clones is called clone detection.

1.2 Clones in models

Some large industrial enterprises, such as BMW, a well-known German car manufacturer, have difficulty improving and updating the core systems of their cars, because a small change can force a rebuild of the whole system, which costs a great deal of money and time. BMW designs its cars from many different models, which are the components of a car. Each model exists in several versions, and different classes of car may use different versions of the same model. A change in one place of a model may therefore have to be repeated in every version and copy of it. This is much more complicated than clones in code.

1.3 Advantages and disadvantages of cloning

There is little doubt that reuse through cloning can make code clearer and easier to fix, and enterprises reduce their costs by cloning their products. However, it also creates the problem that people may use code without authorization, which we call plagiarism. The problem becomes much worse when updating a product line in a factory: because this goes beyond the code, people have to find the clones one by one by themselves.

1.4 Differences between model cloning and code cloning

In fact, model cloning and code cloning share the same property: replication. The main difference between them is the level at which cloning occurs. Code clones mostly occur at a low level, while model clones occur at a higher one. Since model-driven software engineering and OOP receive so much attention nowadays, cloning UML diagrams or structures from one project to another is common in the IT industry.

2. Existing Algorithm Analysis

So far, only a few algorithms exist in this field. The major algorithms for model clone detection come from the U.S. and Germany: the index-based model clone detection algorithm [1], the automotive model clone detection algorithm [2], and eScan and aScan [3]. We briefly introduce how they work and then analyze them.

2.1 Index-based Model Clone Detection

In [1], an algorithm called index-based model clone detection is presented. It performs the detection in the following steps. First, a graph is extracted from a MATLAB/Simulink model and normalized into a directed, labeled multigraph. Blocks and lines in the original MATLAB/Simulink model correspond to nodes and edges in the normalized graph. This graph is then processed into a list of subgraphs of a specific size k. After that, each subgraph in the list is inserted into a hash table. Finally, the last step extracts the maximal clone groups as the final result. The whole procedure is shown in Figure 1.

Figure 1: Index-based Model Clone Detection [1]
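
The enumerate-and-hash steps described above can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not the implementation from [1]: the function names are hypothetical, the model is given as a node-to-label map plus a list of directed edges, and the hash key is a simple label-based signature rather than a true canonical form, so every candidate group would still need an isomorphism check and a disjointness check afterwards.

```python
from collections import defaultdict


def connected_subgraphs(adj, k):
    """Return every connected node set of size k as a set of frozensets."""
    frontier = {frozenset([n]) for n in adj}
    for _ in range(k - 1):
        # Grow each connected set by one neighboring node.
        frontier = {nodes | {m}
                    for nodes in frontier
                    for n in nodes
                    for m in adj[n]
                    if m not in nodes}
    return frontier


def signature(nodes, labels, edges):
    """Label-based hash key (assumption: not a complete canonical form)."""
    node_part = tuple(sorted(labels[n] for n in nodes))
    edge_part = tuple(sorted((labels[u], labels[v])
                             for u, v in edges
                             if u in nodes and v in nodes))
    return node_part, edge_part


def build_index(labels, edges, k):
    """Hash all size-k subgraphs; keep buckets holding more than one."""
    adj = {n: set() for n in labels}
    for u, v in edges:
        # Connectivity ignores edge direction; the signature keeps it.
        adj[u].add(v)
        adj[v].add(u)
    index = defaultdict(list)
    for nodes in connected_subgraphs(adj, k):
        index[signature(nodes, labels, edges)].append(nodes)
    return {sig: group for sig, group in index.items() if len(group) > 1}


# Hypothetical toy model: two Gain->Sum chains feeding one output block.
labels = {1: "Gain", 2: "Sum", 3: "Gain", 4: "Sum", 5: "Out"}
edges = [(1, 2), (3, 4), (2, 5), (4, 5)]
groups = build_index(labels, edges, 2)
```

For the toy model, `groups` holds two candidate groups of size-2 subgraphs: the two Gain-to-Sum pairs and the two Sum-to-Out pairs. The second group overlaps at node 5, which is why a later filtering step is still needed.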

2.2 Automotive Model Clone Detection

In [2], automotive model clone detection is presented, on which the index-based model clone detection is based. It first preprocesses and normalizes the input models. It then extracts clone pairs and clusters these pairs, so as to also find substructures that are used more than twice in the models. To obtain a polynomial-time algorithm for enumerating maximal clone pairs in large cases, a heuristic approach, shown in Figure 2, was built.

Figure 2: Heuristic for detecting clone pairs [2]

2.3 E-Scan and A-Scan

In [3], two algorithms are presented: eScan for exact clone detection and aScan for approximate clone detection. They are the core algorithms of ModelCD, an open-source model clone detection tool integrated into CONQAT. The preprocessing and normalization steps of the two algorithms are similar, so we do not describe them in detail. The steps of eScan are shown in Figure 3, and those of aScan in Figure 4.

Figure 3: eScan performing exact clone detection [3]

Figure 4: aScan performing approximate clone detection [3]

2.4 Analysis

The approaches presented above are the main approaches in the field of model clone detection. Before presenting our preliminary thoughts on how to perform detection and how to improve the existing algorithms, we compare them. Since clone detection in models has been proven to be NP-complete, the feature we care about most is the time complexity of the detection algorithm.

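Table 1: Time complexity of index-based clone detection, automotive clone detection, eScan and aScan

Approaches               Time Complexity (Core Algorithm)
Index-based Detection    [two expressions, not recoverable from the source]
Automotive Detection     [expression not recoverable from the source]
eScan                    [expression not recoverable from the source]
aScan                    [expression not recoverable from the source]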

We have analyzed all these algorithms; the results are shown in Table 1. The time complexity of index-based detection has two main parts: the first is the complexity of enumerating the subgraphs of a given size k, and the second is the complexity of finding the maximal clone groups. For every algorithm, the complexity listed covers only one or two parts of its core. From the analysis it is easy to see that if we focus on the correctness of the detection, we may lose time efficiency.

We also performed a feature analysis of these algorithms; the results are listed in Table 2.

Table 2: Feature analysis of index-based detection, automotive detection, eScan and aScan

Features                            Index-based   Automotive   eScan     aScan
                                    Detective     Detective
Exact Clone Detect                  O             O            O         X
Approximate Clone Detect            X             O            X         O
Minimal Clone Detect Size Support   O             O            O         O
Maximal Clone Detect Size Support   O             O            O         O
Speed                               GENERAL       FAST         GENERAL   GENERAL
Incremental Detect Support          O             X            O         O
Detect Correctness                  GENERAL       GOOD         BEST      NOT GOOD
Completeness                        GENERAL       NOT GOOD     BEST      GOOD
Stability                           GENERAL       BEST         GOOD      GOOD

In the table, “O” stands for yes and “X” stands for no. “GENERAL” is the lowest level in this analysis result.

In our view, although all these algorithms can handle some large cases within their own limitations, no solution yet exists that is perfect in every aspect. We are therefore considering a better approach to clone detection, both exact and approximate.

3. Preliminary Thought

Here we present a preliminary idea for model clone detection that builds on the algorithms above and adds new features by drawing on Locality-Sensitive Hashing (LSH) and fingerprints. In this section, we first take a brief look at LSH and fingerprints (Sections 3.1 and 3.2). We then explain how the existing algorithms can be improved (Section 3.3). Finally, we briefly analyze our idea (Section 3.4).

3.1 Locality-Sensitive Hashing

Locality-Sensitive Hashing (LSH) is an algorithm for solving the (approximate or exact) near neighbor search problem in high-dimensional spaces [4]. The main algorithm of LSH is shown in Figure 5.

Figure 5: Algorithms for initializing a hash function h from the LSH hash family, and for computing h(p) for a point p ∈ R^d [4]
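
To make the two steps named in the caption concrete, the following sketch initializes a hash function and evaluates it on a point. It uses the classic random-projection family for Euclidean distance, h(p) = floor((a . p + b) / w), which is our choice for illustration and not necessarily the construction shown in Figure 5. The function names and the parameters `w` (bucket width), `m` (number of concatenated hashes) and `seed` are assumptions.

```python
import math
import random


def init_hash(d, w, rng):
    """Draw one hash function h(p) = floor((a . p + b) / w) on R^d."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # random projection direction
    b = rng.uniform(0.0, w)                      # random offset in [0, w]

    def h(p):
        projection = sum(ai * pi for ai, pi in zip(a, p))
        return math.floor((projection + b) / w)

    return h


def make_bucket_key(d, w, m, seed=0):
    """Concatenate m independent hashes into one bucket key function."""
    rng = random.Random(seed)
    hashes = [init_hash(d, w, rng) for _ in range(m)]
    return lambda p: tuple(h(p) for h in hashes)


g = make_bucket_key(d=3, w=4.0, m=4, seed=1)
bucket = g([1.0, 2.0, 3.0])  # a tuple of 4 integers
```

Points that are close in R^d fall into the same bucket with higher probability than distant points. A wider `w` makes collisions more likely, while a larger `m` makes each bucket more selective.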

3.2 Fingerprint

Fingerprints were first used in biology because of their identifying nature. In code clone detection, a fingerprint stands for the recognizable features of each fragment that is created [6]. Fingerprints may take the form of a dynamic array list, and they are used to build clone clusters. They can be stored in the file system, in a database, or just temporarily in RAM.
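
As an illustration of a fingerprint in array form, the sketch below turns a subgraph into a small numeric vector. The feature choice (block-type counts plus the number of edges) and the vocabulary `BLOCK_TYPES` are assumptions made only for this example; the source does not fix which features a fingerprint contains.

```python
import math
from collections import Counter

# Hypothetical vocabulary of block types; a real model defines its own.
BLOCK_TYPES = ["Gain", "Sum", "Product", "Delay"]


def fingerprint(node_labels, edge_label_pairs):
    """Return a feature vector: one count per block type, then edge count."""
    counts = Counter(node_labels)
    vector = [counts[t] for t in BLOCK_TYPES]
    vector.append(len(edge_label_pairs))
    return vector


def distance(fp_a, fp_b):
    """Euclidean distance between two fingerprints of equal length."""
    return math.dist(fp_a, fp_b)


fp1 = fingerprint(["Gain", "Sum", "Gain"],
                  [("Gain", "Sum"), ("Gain", "Sum")])
fp2 = fingerprint(["Gain", "Sum", "Gain", "Delay"],
                  [("Gain", "Sum"), ("Gain", "Sum")])
```

Equal fingerprints mark candidates for exact clones, while a small distance between two fingerprints marks candidates for approximate clones. Both still have to be confirmed on the graphs themselves, since different subgraphs can share a fingerprint.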

3.3 Preliminary Approach

Our approach brings in LSH and fingerprints. Building on the existing algorithms, it may achieve a lower time complexity and fully support incremental detection without sacrificing detection correctness. An overview of our system is shown in Figure 6.

Figure 6: Overview of our system based on our preliminary thoughts

In this system we first, as usual, parse the model into a graph, called the Original Graph, whose nodes and labeled edges store the information in the model. We then enumerate and normalize to obtain subgraphs of size k. After that, we use LSH to create a fingerprint for each subgraph. These fingerprints are stored in a database, which enables the system to support incremental detection. The grouping process starts after all the fingerprints have been stored in the database. We also need a filter to discard useless clones. Finally, when no new fingerprint is added to the database, which means that all clone pairs have been maximally grouped, we obtain the final result.
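
The storage and grouping steps could look roughly like the sketch below. It is only an illustration of the idea, assuming SQLite as the database and a string bucket key per fingerprint; the table layout, the function names and the `min_size` filter are hypothetical and not part of the design described above.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the fingerprint database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS fingerprints ("
        " bucket TEXT NOT NULL,"
        " model TEXT NOT NULL,"
        " subgraph TEXT NOT NULL,"
        " PRIMARY KEY (bucket, model, subgraph))"
    )
    return db


def add_fingerprint(db, bucket, model, subgraph):
    """Store one fingerprint; return True only if it was not stored before."""
    cur = db.execute(
        "INSERT OR IGNORE INTO fingerprints VALUES (?, ?, ?)",
        (bucket, model, subgraph),
    )
    db.commit()
    return cur.rowcount == 1


def clone_groups(db, min_size=2):
    """Group subgraphs by bucket and filter out groups that are too small."""
    rows = db.execute(
        "SELECT bucket, model, subgraph FROM fingerprints"
        " ORDER BY bucket, model, subgraph"
    ).fetchall()
    groups = {}
    for bucket, model, subgraph in rows:
        groups.setdefault(bucket, []).append((model, subgraph))
    return {b: g for b, g in groups.items() if len(g) >= min_size}


db = open_store()
add_fingerprint(db, "b1", "engine_v1", "s1")
add_fingerprint(db, "b1", "engine_v2", "s7")
add_fingerprint(db, "b2", "engine_v1", "s2")
groups = clone_groups(db)
```

Because fingerprints persist in the database, a later run only has to add the fingerprints of new or changed models and regroup, which is the sense in which such a store supports incremental detection. When `add_fingerprint` returns False for every fingerprint of a run, nothing new was added and the grouping is stable.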

3.4 Analysis

This detection system is only a conceptual model based on our preliminary thoughts. We do not yet have a complete, runnable implementation of the approach. Theoretically, however, the system may offer some new features compared with the other approaches, as shown in Table 3.

Table 3: Feature Comparison between Our Work and the Existing Approaches

Features                            Index-based   Automotive   eScan      aScan      Our Work
                                    Detective     Detective
Exact Clone Detect                  O             O            O          X          O
Approximate Clone Detect            X             O            X          O          O
Minimal Clone Detect Size Support   O             O            O          O          O
Maximal Clone Detect Size Support   O             O            O          O          O
Speed                               GENERAL       FAST         GENERAL    GENERAL    NOT FAST
Incremental Detect Support          O             X            O          O          O
Detect Correctness                  GENERAL       GOOD         BEST       NOT GOOD   GOOD
Completeness                        GENERAL       NOT GOOD     BEST       GOOD       GOOD
Stability                           GENERAL       BEST         GOOD       GOOD       N/A
Extra Storage                       NOT NEED      NOT NEED     NOT NEED   NOT NEED   NEED

By bringing fingerprints into clone detection, we expect to obtain a clearer structure for the later grouping and clustering. Also, because the fingerprints can be stored in the database, we keep a record of the processing status, which gives us incremental detection support. Besides, by changing the fingerprint vector, we can perform detection at different depths.

In the preprocessing and normalization of the input models, we apply a data mining process so that we can build a proper and correct structure for the model. By using LSH, we expect a lower time complexity in cases with many subsystems and tens of thousands of blocks. LSH may also help us build the fingerprints.

4. Conclusion

The approach we presented is still being implemented and tested. We need to find out whether, with all these algorithms combined, preprocessing and normalization will cost more time than in other approaches. Theoretically, the approach will have good correctness and completeness if we construct a proper vector for comparing clone pairs. Although it does not have polynomial time complexity, we believe it is an improvement over the existing approaches.

5. References

[1] D. Steidl. Index-based Model Clone Detection. Technische Universität München, 2010.

[2] F. Deissenboeck, B. Hummel, E. Juergens, B. Schätz, S. Wagner, J. F. Girard, and S. Teuchert. Clone detection in automotive model-based development. In Proc. of ICSE '08, pages 603-612, 2008.

[3] N. H. Pham, H. A. Nguyen, T. T. Nguyen, J. M. Al-Kofahi, and T. N. Nguyen. Complete and accurate clone detection in graph-based models. In Proc. of ICSE '09, pages 276-286, 2009.

[4] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS '06), 2006.

[5] F. V. Rysselberghe and S. Demeyer. Evaluating clone detection techniques from a refactoring perspective. In Proc. of the 19th International Conference on Automated Software Engineering (ASE '04), 2004.

[6] M. Chilowicz, E. Duris, and G. Roussel. Syntax tree fingerprinting: a foundation for source code similarity detection. In Proc. of the IEEE 17th International Conference on Program Comprehension (ICPC '09), 2009.
