
The Briefest Reduct of Rough Sets Based On Genetic Algorithm

GUAN Hong-bo, YANG Bao-an

1. Glorious Sun School of Business and Management, Donghua University, 200051 Shanghai, China
2. College of Economics and Management, Shanghai Ocean University, 201306 Shanghai, China

[email protected]; [email protected]

Abstract

This paper focuses on extracting the briefest reduct of rough sets with a genetic algorithm. The fitness function is designed by combining the relying degree of rough sets with the number of attributes selected in a seed. Standard genetic operators are applied, and the algorithm is tested on UCI data sets. The analysis and discussion show that the algorithms RGA and RGA2 are effective. Empirical values for the crossover and mutation probabilities are obtained, and increasing the number of seeds and preserving the optimized seeds of the last generation improve the efficiency of the algorithm.

1. Introduction

The rough sets approach was first introduced by Pawlak in 1982 [1], [2], [3] and has developed rapidly in recent years. It provides a framework for reasoning about vagueness and uncertainty.

Currently, rough set theory is widely used in machine learning, fault diagnosis, access-control algorithms, process control, and database applications such as knowledge acquisition, and it has achieved great success.

Rough set theory uses the two concepts of lower and upper approximation to describe vague concepts. It has been proved that methods derived from rough set theory can find the briefest set of attributes that separates one class from the others. Moreover, rough set theory can handle noise in the data set, i.e. data that may be categorized into different classes at the same time.

However, rough set attribute reduction is an NP-hard problem [3], [4], [5], [6]. Since reduction is an important issue in rough set theory, it is a topic of real significance. In this paper, a genetic algorithm is designed to solve this problem, and good results are achieved.

2. Genetic algorithm

The genetic algorithm [7], [8], [9], which is based on the theory of evolution, is an efficient stochastic search and optimization method. Its main features are a population-based search strategy and the exchange of information between individuals; the search does not depend on gradient information. The genetic algorithm is so far the best-known evolutionary algorithm.

Population: also called the chromosome group or individual pool; it represents a subset of the solution space.
String and string space: the string is the form in which an individual is expressed; it corresponds to a chromosome in genetics and to a solution of the practical problem.
Population size: the number of individuals in the chromosome group.
Gene: a fragment of a chromosome; a gene can be a number, an array of numbers, or a string.
Crossover: under certain conditions, one or more genes on two chromosomes exchange positions.
Crossover probability: a number less than 1 that serves as the threshold for deciding whether an exchange takes place.
Mutation: under certain conditions, one or more genes on a chromosome are changed randomly.
Mutation probability: a threshold less than 1 that decides whether a mutation takes place.
Offspring: a new individual produced by crossover and mutation.
Fitness: a numerical value that measures the quality of an individual.
Selection: according to the fitness of the chromosomes, individuals with higher fitness are selected and those with lower fitness are discarded.
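To make these definitions concrete, the following minimal sketch shows how they fit together in a standard generational genetic algorithm. It is an illustration of the general scheme only, not the authors' implementation; the function names and default parameter values are assumptions.

import random

# Minimal generational GA skeleton illustrating the terms defined above.
# Individuals are 0/1 strings; `fitness` is supplied by the caller.
def evolve(fitness, n_genes, popsize=60, pcross=0.8, pmutation=0.01, ga_num=100):
    # Population: a pool of randomly initialized chromosomes.
    population = [[random.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(popsize)]
    for generation in range(ga_num):
        # Selection: individuals with higher fitness are more likely to be kept.
        weights = [fitness(ind) for ind in population]
        parents = random.choices(population, weights=weights, k=popsize)
        offspring = []
        for a, b in zip(parents[::2], parents[1::2]):
            a, b = a[:], b[:]
            # Crossover: with probability pcross, exchange the genes after a cut point.
            if random.random() < pcross:
                cut = random.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            # Mutation: with probability pmutation, flip each gene.
            for child in (a, b):
                for i in range(n_genes):
                    if random.random() < pmutation:
                        child[i] = 1 - child[i]
            offspring += [a, b]
        population = offspring
    # Return the fittest individual of the final generation.
    return max(population, key=fitness)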


3. The RS genetic reduction algorithm

Rough set attribute reduction is a combinatorial optimization problem, and the characteristics of the genetic algorithm are well suited to such problems. Therefore, this paper applies a genetic algorithm to the reduction problem and names the new algorithm RGA.

3.1 RGA algorithm design

First, consider the coding problem. In the attribute group, a selected attribute is represented as 1 and an unselected attribute as 0, so a reduct (a solution) can be expressed as a 0/1 string whose length is the number of attributes in the group. The briefest reduct of a decision table can then be expressed in the same 0/1 gene form.
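For example (an illustrative sketch in Python; the attribute names are hypothetical), a decision table with nine condition attributes encodes a candidate reduct as a bit string of length nine:

# Nine condition attributes of a decision table (illustrative names).
attributes = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8", "a9"]

# The candidate reduct {a1, a3, a5, a6} as a 0/1 chromosome.
chromosome = [1, 0, 1, 0, 1, 1, 0, 0, 0]

# Decode a chromosome back into the selected attribute subset.
def decode(chromosome, attributes):
    return [a for a, bit in zip(attributes, chromosome) if bit == 1]

print(decode(chromosome, attributes))  # ['a1', 'a3', 'a5', 'a6']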

The fitness function is the key to the algorithm design; the relying degree [1], [3], [4], [6] is used to measure the importance of a seed.

In rough set theory, let U denote the universe of objects; a finite union of elementary sets associated with an attribute subset Q is called a Q-definable set. Let P and R be two attribute subsets of the decision table; the importance of an attribute a can then be measured through the relying degree of P on R, written γ_R(P) and defined as follows:

γ_R(P) = card(POS_R(P)) / card(U),

POS_R(P) = ∪ { apr_R(X) | X ∈ U/P },

where apr_R(X) is the R-lower approximation of X, U/P is the partition of U induced by P, and card denotes the number of elements of a set.

Although the relying degree can measure the quality of a seed, a seed with a high relying degree is not necessarily the briefest (or even a relatively brief) solution, so the length of the seed is also considered when defining the fitness value.

Let γ_R(P) denote the relying degree and let L (L > 0) be the length of the seed, i.e. the number of selected attributes. The fitness function is then

Fitness = γ_R(P) / L.

The selection, crossover, and mutation operators are designed as in the traditional genetic algorithm, and the search stops after a fixed number of generations given in advance.
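A minimal sketch of this fitness computation, assuming the decision table is given as rows of discrete condition-attribute values plus a decision value per object; the function and variable names are illustrative, not the authors':

from collections import defaultdict

def relying_degree(rows, decisions, selected):
    # gamma_R(P): fraction of objects whose equivalence class under the
    # selected attributes (R) is contained in a single decision class.
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        key = tuple(row[j] for j in selected)
        blocks[key].append(i)
    # An equivalence class lies in POS_R(P) if all its objects share one decision.
    pos = sum(len(ids) for ids in blocks.values()
              if len({decisions[i] for i in ids}) == 1)
    return pos / len(rows)

def fitness(chromosome, rows, decisions):
    # Fitness = gamma_R(P) / L, where L is the number of selected attributes.
    selected = [j for j, bit in enumerate(chromosome) if bit == 1]
    if not selected:          # an empty seed gets zero fitness
        return 0.0
    return relying_degree(rows, decisions, selected) / len(selected)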

The algorithm flow is shown in Graph 1.

3.2 Experiments with the RGA algorithm

The experimental data are taken from the UCI database. Breast_cancer is a 699 × 9 data set; after deleting records with missing values, the final data set is 683 × 9.

The initial settings are 18 seeds, crossover probability 0.8, mutation probability 0.05, and 20 iterations; the briefest reduct found is {1, 3, 5, 6}.

Graph 1. RGA Algorithm Flow (selection; while generation <= GA_num: interchange (crossover), mutation, generation++; output the solution)

Monk1 is another data set, of size 127 × 7. With an initial population of 28 seeds, crossover probability 0.8, mutation probability 0.05, and 200 iterations, the briefest reduct found is {1 2 5}.

Monk2 is another data set, of size 432 × 7. With an initial population of 28 seeds, crossover probability 0.8, mutation probability 0.05, and 200 iterations, the briefest reduct found is {1 2 3 4 5 6}.

Numerous experiments show that the RGA algorithm successfully finds the briefest reduct.
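Putting the sketches above together, a run with the parameters reported for the Breast_cancer data set might look like the following. The file name and column layout (an id column, nine condition attributes, and a class label, with '?' marking missing values) are assumptions about the UCI file, and evolve, fitness, and decode are the illustrative helpers defined earlier:

import csv, random

rows, decisions = [], []
with open("breast-cancer-wisconsin.data") as f:
    for record in csv.reader(f):
        if "?" in record:          # drop records with missing values
            continue
        rows.append([int(v) for v in record[1:10]])
        decisions.append(int(record[10]))

random.seed(0)
best = evolve(lambda ind: fitness(ind, rows, decisions),
              n_genes=9, popsize=18, pcross=0.8, pmutation=0.05, ga_num=20)
print(decode(best, [str(i) for i in range(1, 10)]))   # e.g. ['1', '3', '5', '6']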

3.3 Analysis of RGA

The key step of RGA is the fitness function; combining the relying degree with the number of attributes proves feasible and effective. The relying degree measures the classification ability of a seed, and the number of selected attributes ensures that the seed stays short. Seeds that have strong classification ability and shorter length therefore remain to take part in the evolution, crossover, and mutation operations. Because RGA converges, the briefest reduct can be found.

For the Breast_cancer data set, the number of iterations is 20, and the computation process is shown in Table 1.

From Table 1, the best seeds in the first generations have length 6, and by the 15th generation the optimal solution {1 3 5 6} is found. The length of the solution becomes progressively smaller. This analysis shows that the algorithm converges steadily and gradually finds the briefest reduct.

This combinatorial optimization problem could also be solved by exhaustive search. For example, the Breast_cancer data set contains nine attributes, so exhaustive enumeration requires 2^9 = 512 evaluations, whereas the genetic algorithm needed only 20 generations, a reduction of about 96% in computation.


For another example, the UCI data set "Zoo" contains 17 attributes, so more than 130,000 (2^17) evaluations would be needed to obtain the optimal solution exhaustively, whereas the genetic algorithm needs only about 200 generations. Compared with the exhaustive algorithm, RGA is efficient.
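For reference, the exhaustive search sizes quoted above are simply the number of candidate attribute subsets:

# Number of candidate attribute subsets for exhaustive search.
print(2 ** 9)     # 512 subsets for Breast_cancer (9 attributes)
print(2 ** 17)    # 131072 subsets for Zoo (17 attributes)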

Table 1. Computing process

No  Optimal solution  Generation
1   1 2 4 6 7 8       1
2   1 4 5 6 8 9       1
3   1 2 3 5 6 7       1
4   1 2 3 5 6 8       2
5   1 4 5 6 8         2
6   1 3 5 6           15
7   1 3 5 6           17
8   1 3 5 6           19

4. Problems and improvement of RGA

Using the UCI data set "Zoo", we discuss the remaining issues of RGA and give an improved algorithm. The algorithm parameters, the relying degree, and the length of the seeds are discussed further. This data set was chosen because it contains many briefest reducts, yet the original algorithm often fails to find one.

Data: ZOO; attributes: 16; records: 101.

4.1 Change initial population

Experimental parameters: pcross = 0.8; pmutation = 0.01; q = 0.5; GA_num = 100.

The experimental output is shown in Table 2; the convergence of the population can be seen clearly. In the 100th generation the best solution is seed 8. However, this population did not find the true best solution: the software RSES [10], [11], [12] finds a best solution of length 5, whereas the best solution here has length 6. Moreover, in numerous experiments the briefest reduct is often not found at all. Analysis of why a solution is or is not found shows that the initial population is very important. If the population contains all or part of the optimal solutions, convergence is fast and the optimal solution can be found; otherwise convergence is slow and there is no guarantee of finding a solution within 100 generations.

Comparing the data set "Zoo" with "Monk" and "Breast cancer", "Zoo" has more attributes than the other two. The RGA algorithm sets the population size only according to the number of records in the data and does not relate it to the number of attributes, so in the initial population a positive correlation between the population size and the number of attributes is added.

(Population-size formula: popsize is defined as a step function of row × column of the data set, with 20 000 cells as a breakpoint and step values on the order of 80 to 380.)
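One way to realize such a rule is sketched below; the breakpoint and sizes are illustrative only and do not reproduce the paper's exact formula:

def popsize(n_rows, n_columns):
    # Heuristic population size that grows with the size of the decision
    # table (rows x columns); the breakpoint and values are illustrative.
    cells = n_rows * n_columns
    if cells <= 20000:
        return 100
    return 380    # larger tables get a larger initial population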

Table 2. Population of generation 100

No  Optimal solution
1   3 4 6 7 8 9 13
2   3 6 7 8 9 11 13 15 16
3   3 4 6 7 8 9 13
4   3 5 6 7 9 13 14 16
5   3 4 6 7 8 13 15 16
6   3 4 6 7 8 9 13
7   3 4 6 7 9 11 13 14
8   3 4 6 7 8 13
9   3 4 6 7 8 13 15 16
10  3 4 6 7 8 9 13
11  1 3 4 6 7 9 13 16

Further experiments prove that increasing the population size greatly improves the probability that the RGA algorithm finds a solution.

4.2 Algorithm’s optimization

Although the initial population is enlarged, there are still cases in which the optimal solution is not found. To find the reason, the algorithm was traced; it turns out that an optimal solution does emerge during the computation but is not preserved, because it is destroyed by a crossover or mutation operation. Since the purpose of RGA is to find the optimal solution, saving the best solution of every generation should also be part of the algorithm: if the best solution of a generation is equal to or better than that of the previous generation, it is carried into the next population. In this way the best solution over all generations is recorded, and the optimal solutions are obtained.

The optimized algorithm is still named RGA. Tests of the new algorithm show that for the data set "Zoo" the optimal solutions {2 3 5 8 12} and {2 5 8 12 15} can be found. The new RGA finds not just one optimal solution but several.
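A self-contained sketch of this elitist variant (again an illustration with assumed names, not the authors' code): the best seed found so far is re-inserted whenever the new generation fails to match it, so crossover and mutation cannot destroy an optimal solution once it has appeared.

import random

def evolve_elitist(fitness, n_genes, popsize=60, pcross=0.8,
                   pmutation=0.01, ga_num=100):
    pop = [[random.randint(0, 1) for _ in range(n_genes)] for _ in range(popsize)]
    best = max(pop, key=fitness)
    for generation in range(ga_num):
        # Selection, crossover, and mutation as in the earlier sketch.
        weights = [fitness(ind) for ind in pop]
        parents = random.choices(pop, weights=weights, k=popsize)
        pop = []
        for a, b in zip(parents[::2], parents[1::2]):
            a, b = a[:], b[:]
            if random.random() < pcross:
                cut = random.randrange(1, n_genes)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            for child in (a, b):
                for i in range(n_genes):
                    if random.random() < pmutation:
                        child[i] = 1 - child[i]
            pop += [a, b]
        # Elitism: preserve the best seed across generations.
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) >= fitness(best):
            best = gen_best
        else:
            pop[0] = best[:]
    return best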


Graph 2. The Flow Diagram after Optimizing (while Generation <= GA_num: select, exchange (crossover), mutation, save the optimal solutions, Generation++; output the optimal solutions)

4.3 The relationship between the relying degree and the length of the seeds

4.3.1 Change fitness function. Consider again the fitness function

Fitness = γ_R(P) / L.

If γ_R(P) = 1 and L = 5 for one seed, while γ_R(P) = 0.4 and L = 2 for another, the two fitness values are equal (both 0.2). Which seed is more important? In other words, which matters more in the fitness function, the relying degree or the length of a seed? To answer this, weights are introduced for the two factors, and suitable weight values are determined by experiment. The new fitness function is written as

Fitness = q × γ_R(P) + (1 - q) × (|A| - L) / |A|,

where A is the attribute set and |A| is the number of attributes in A. By changing the weight q we can test the relative importance of the two factors. To facilitate comparison, the algorithm with this fitness function is named RGA2.
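A sketch of the RGA2 fitness function, reusing the illustrative relying_degree helper from Section 3 (the default q = 0.2 anticipates the value found best in the experiment below):

def fitness_rga2(chromosome, rows, decisions, q=0.2):
    # Fitness = q * gamma_R(P) + (1 - q) * (|A| - L) / |A|
    n_attributes = len(chromosome)            # |A|
    selected = [j for j, bit in enumerate(chromosome) if bit == 1]
    if not selected:
        return 0.0
    gamma = relying_degree(rows, decisions, selected)
    length = len(selected)                    # L
    return q * gamma + (1 - q) * (n_attributes - length) / n_attributes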

4.3.2 Experiment. Data set: Zoo. Parameters: pcross = 0.8; pmutation = 0.01; GA_num = 30; popsize = 60.

The weight q is adjusted from 0 to 1. It is found that as q becomes larger, i.e. as the relying degree is weighted more heavily, the convergence rate slows down and ultimately the algorithm does not converge. As q decreases, convergence becomes faster, and the optimal solution is obtained most easily when q is 0.2. But when q is too small, close to zero, the algorithm again does not converge. Therefore, with q between 0.15 and 0.3 the algorithm performs well.

When q is equal to 0.2, the optimal solutions are {3, 4, 6, 9, 13} and {4, 6, 9, 12, 13}.
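A sketch of how such a sweep over q could be run with the pieces above; loading the Zoo data into rows and decisions is assumed to follow the same pattern as before, and the reducts refer to the 16 condition attributes:

# Sweep the weight q and record the best reduct found for each value.
for q in [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.5, 0.8]:
    best = evolve_elitist(lambda ind: fitness_rga2(ind, rows, decisions, q),
                          n_genes=16, popsize=60, pcross=0.8,
                          pmutation=0.01, ga_num=30)
    print(q, decode(best, [str(i) for i in range(1, 17)]))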

4.3.3 Comparison with RGA. Experiments show that the two algorithms perform almost equally when RGA2 uses q between 0.15 and 0.3. On the "Zoo" test, both algorithms find the optimal solution in fewer than 30 generations.

4.4 Crossover probability and mutation probability

The crossover probability and the mutation probability affect the convergence of the algorithm. Experiments show that a crossover probability from 0.4 to 0.8 and a mutation probability from 0.005 to 0.05 work well. If the crossover probability is too small or too large, convergence is poor; the same problem exists for the mutation probability.

5. Conclusions

This paper has focused on the briefest reduct in rough set theory. On the basis of the relying degree, the genetic algorithms RGA and RGA2 were designed, analyzed, and discussed in depth. Testing proved the algorithms effective. Problems remain concerning the speed of the algorithm on very large data sets and how to improve its performance further; addressing them is the next step of this work.

This work was supported by the Shanghai Education Commission Key Subject "Food Economic Management" (No. J50703).

6. References

[1] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Dordrecht: Kluwer Academic Publishers, 1991.

[2] W. Ziarko and N. Shan, "KDD-R: A comprehensive system for knowledge discovery in databases using rough sets", in T. Y. Lin (ed.), Proceedings of the Third International Workshop on Rough Sets and Soft Computing (RSSC'94), San Jose, California, USA, pp. 164-173, 1994.

[3] Huang ZL, Rough Sets Theory & Application: A New Method for Data Reasoning, Chongqing University Press, 1996.

[4] Hu X, "Knowledge Discovery in Databases: An Attribute-Oriented Rough Set Approach", Ph.D. thesis, University of Regina, Canada, 1995.

[5] Lu RQ, Knowledge Engineering and Science at Centenary, Tsinghua University Press, September 2001.


[6] Guan HB, Tian DG, "Rule Abstracting Algorithm by Decisive Tree Based on the Importance of Attribute", System Engineering & Electronic Technology, Vol. 26, No. 3, pp. 334-337, 2004.

[7] Pan ZJ, Kang LS, Evolvement Computing, Tsinghua University Press / Guangxi Science & Technology Press, April 2000.

[8] Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Machinery Industry Press, August 2001.

[9] Shi ZZ, Knowledge Discovery, Tsinghua University Press, 2002.

[10] Wang Jue, Miao Duoqian, "Analysis on Attribute Reduction Strategies of Rough Set", Journal of Computer Science and Technology, Vol. 13, No. 2, pp. 189-192, 1998.

[11] Wang Jue, Cui Jia, Zhao Kai, "Investigation on AQ11, ID3 and the Principle of Discernibility Matrix", Journal of Computer Science and Technology, Vol. 16, No. 1, pp. 1-12, 2001.

[12] Bjorvand A T, "'Rough Enough' - A System Supporting the Rough Sets Approach", http://home.sn.no/~torvill.
