MCS680: Foundations Of Computer Science

Brian Mitchell - Drexel University MCS680-FCS


Case Study: Automatic Techniques For Software Modularization

int MSTWeight(int **graph, int size)
{
    int i, j;
    int weight = 0;                    /* O(1) */

    for (i = 0; i < size; i++)         /* n iterations */
        for (j = 0; j < size; j++)     /* n iterations */
            weight += graph[i][j];     /* O(1) */

    return weight;                     /* O(1) */
}

Running Time = 2·O(1) + O(n^2) = O(n^2)


Introduction

• This topic reinforces the concepts of set and graph theory by demonstrating a current research area

– Algorithms for Automatic Software Modularization

• This research was conducted by Drexel faculty:

– Brian Mitchell

– Spiros Mancoridis

– Chris Rorres



Software Engineering Problem

• Software maintenance is an arduous task because of the difficulties associated with understanding the intricate relationships that exist between the source code components

– Design document is inaccurate

– Original system architect/designer is no longer available for consultation

• With no mechanism for gaining insight into the system design and structure, the software maintenance practitioner is often forced to make modifications to the source code without a thorough understanding of the system's organization

• Also, heavily used software systems change rapidly

– Use of an “ad-hoc” maintenance approach will negatively affect the system design



Software Engineering Problem

• Software engineers have long known of the difficulties associated with maintaining software systems whose only current documentation is the source code

• This leads to decay in the design due to source code changes that are made without an understanding of the system structure

– Size of modern-day software systems is beyond a programmer's cognitive ability to determine the effect of a local change on the entire system

– Changes made to the source code without an understanding of its organization usually contradict one or more aspects of the original design

• Goal is to give the programmer a tool that visualizes the modularization of the system



Other Work In Field

• Top-Down Approaches

– Tools such as “Rigi” and “Arch” have been developed to perform a modularization of a software system

  • Still require somebody familiar with the system to provide feedback and/or set system-specific parameters

• Bottom-Up Approaches

– Software Reflection Model

  • Used to capture and exploit the differences that exist between the actual source code organization and the designer's high-level model of the system's modularization

  • Streamlines the learning process

– The Orphan Adoption Problem

  • Given the name of a new software resource (an orphan), this tool emits as output the name of the subsystem that has been chosen as the parent for the orphan



Our Automatic Modularization Tool

• Implements algorithms that we developed that

– Are fully automatic

– Recursively generate a hierarchical view of the system organization based solely on information extracted from the source code

• Fully automatic techniques are useful not only to programmers who lack familiarity with the system; the system architect can also use them to compare the documented modularization with the one created by our tool and learn from the differences



Software System Organization

• Software systems contain a finite set of software components and a collection of relationships that govern how the software components interact with each other

• Typical software components

– Classes, Modules

– Variables, Macros

– Structures

• Typical software relationships

– Import

– Export

– Inherit

• Can represent the system structure as a resource dependency graph

– The information required to build this graph can be obtained by parsing the source code
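A resource dependency graph of this kind can be held in memory as a simple adjacency map. The sketch below is illustrative only: the component names and the `build_dependency_graph` helper are hypothetical, and a real tool would take the relations from a source code analyzer rather than from a hand-written list.

```python
from collections import defaultdict

def build_dependency_graph(relations):
    """Return {component: set of components it depends on}."""
    graph = defaultdict(set)
    for source, target in relations:
        graph[source].add(target)
        graph[target]  # accessing the key registers components with no outgoing edges
    return dict(graph)

# Hypothetical "importer -> imported" relations, as a source code scanner might report them.
relations = [("parser", "scanner"), ("codegen", "parser"), ("codegen", "symtab")]
graph = build_dependency_graph(relations)
```

Each key is a node of the graph and each set holds the targets of its outgoing edges, so components that are only ever imported still appear as nodes with an empty set.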



Example Resource Dependency Graph: Plan9

• The following resource dependency graph was automatically generated by scanning the source code of the file system of the Plan9 operating system

– Access to source code provided by AT&T Labs



Goals of Research

• Goal of our research is to automatically partition the components of a system into clusters that maximize cohesion and minimize coupling

• The clusters, once discovered, represent a higher-level abstraction of the system's organization by grouping related software components into subsystems

• Each subsystem contains a collection of modules that either

– Cooperate to perform some high-level function in the overall system

  • Scanner, parser, code generator

– Provide a set of related services that are used throughout the system

  • Import Library

  • File manager, memory manager



Automatically Modularized Visualization of Plan9 OS

• The following graph was derived by our clustering utility

• Formal definitions for cohesion, coupling and modularization quality must now be developed in order to illustrate our process



Architecture of our Clustering Environment

[Figure: clustering environment pipeline. Source Code Modules → CIA Utility (scans and parses the source code) → XREF Database → Awk Script (query, format) → Clustering Engine → DOT File → DOTTY Utility (reads the DOT file) → Clustered Graph (display)]



Quantifying Cohesion

• Cohesion is an indication of the strength of the relationships that exist between modules that are grouped into a cluster.

– High cohesion = Strong Encapsulation.

• We define cohesion (H) as a measurement of intra-edge dependencies between the components in a particular cluster.

– Formally, the cohesion $H_i$ of cluster $i$ consisting of $N_i$ components and $\mu_i$ intra-edge dependencies is:

$$H_i = \frac{\mu_i}{N_i^2}$$

• This measurement is the fraction of the maximum possible number of intra-edge dependencies, which is $N_i^2$.

Example: Subsystem 1 contains modules M1, M2 and M3

– Number of modules in subsystem, $N_1 = 3$

– Number of intra-edge dependencies, $\mu_1 = 2$

– Maximum intra-edge dependencies, $N_1^2 = 9$

$$H_1 = \frac{\mu_1}{N_1^2} = \frac{2}{9} = 0.2222$$


Quantifying Coupling

• Coupling (C) is a measurement of inter-edge dependencies between the components of two distinct clusters

• The coupling $C_{i,j}$ between clusters $i$ and $j$, consisting of $N_i$ and $N_j$ components respectively and with $\varepsilon_{i,j}$ inter-edge dependencies, is:

$$C_{i,j} = \begin{cases} 0 & \text{if } i = j \\ \dfrac{\varepsilon_{i,j}}{2 N_i N_j} & \text{if } i \neq j \end{cases}$$

• This measurement is the fraction of the maximum number of inter-edge dependencies between clusters $i$ and $j$, which is $2 N_i N_j$

Example: Subsystem 1 contains modules M1, M2 and M3; Subsystem 2 contains modules M4 and M5

– Number of modules in subsystem 1, $N_1 = 3$

– Number of modules in subsystem 2, $N_2 = 2$

– Number of inter-edge dependencies, $\varepsilon_{1,2} = 2$

$$C_{1,2} = \frac{\varepsilon_{1,2}}{2 N_1 N_2} = \frac{2}{12} = 0.1666$$



Modularization Quality

• Modularization Quality (MQ) is defined as the measurement of the “goodness” of a particular system modularization.

– Specifically, the MQ of a modularization of $k$ clusters, where $H_i$ is the cohesion of the $i$th cluster and $C_{i,j}$ is the coupling between the $i$th and $j$th clusters, is:

$$MQ = \begin{cases} \dfrac{1}{k}\sum_{i=1}^{k} H_i - \dfrac{1}{\frac{k(k-1)}{2}}\sum_{i,j=1}^{k} C_{i,j} & \text{if } k > 1 \\ H_1 & \text{if } k = 1 \end{cases}$$

– This measurement shows the trade-off between cohesion and coupling by

  • Rewarding many small highly-cohesive clusters

  • Penalizing too many inter-edges
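The cohesion, coupling and MQ definitions can be combined into one short function. This is a minimal sketch, not the authors' implementation: it assumes a clustering is given as a `{module: cluster_label}` dictionary, that the graph is a collection of directed `(u, v)` edges, and that the coupling sum runs once over each unordered pair of clusters (which matches the k(k-1)/2 normalization). The module names and edges are made up for illustration.

```python
from itertools import combinations

def mq(labels, edges):
    """Modularization Quality of a clustering given as {module: cluster_label}."""
    clusters = sorted(set(labels.values()))
    size = {c: sum(1 for label in labels.values() if label == c) for c in clusters}

    def count(a, b):
        # Edges whose endpoints lie in clusters a and b (intra-edges when a == b).
        return sum(1 for u, v in edges if {labels[u], labels[v]} == {a, b})

    k = len(clusters)
    cohesion = sum(count(c, c) / size[c] ** 2 for c in clusters)
    if k == 1:
        return cohesion                      # MQ = H_1 for a single cluster
    coupling = sum(count(a, b) / (2 * size[a] * size[b])
                   for a, b in combinations(clusters, 2))
    return cohesion / k - coupling / (k * (k - 1) / 2)

# Hypothetical system: two subsystems joined by one inter-edge (M3 -> M4).
labels = {"M1": 0, "M2": 0, "M3": 0, "M4": 1, "M5": 1}
edges = [("M1", "M2"), ("M2", "M3"), ("M4", "M5"), ("M3", "M4")]
quality = mq(labels, edges)
```

Here H_1 = 2/9, H_2 = 1/4 and C_{1,2} = 1/12, so MQ = (2/9 + 1/4)/2 - 1/12 = 11/72, roughly 0.153.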



Modularization Quality Example

[Figure: Subsystem 1 (M1, M2, M3), Subsystem 2 (M4, M5), Subsystem 3 (M6, M7, M8)]

$$MQ = \frac{0.22 + 0.25 + 0.33}{3} - \frac{0.16 + 0.0 + 0.083}{3} = 0.186$$



Partitions of a Set

• Must construct a data model to represent a partition (a clustering) of a software system

• Consider the source code organization for system S.

– S = {M1, M2, …, Mn}

– Let a collection $\Pi = \{A_1, A_2, \ldots, A_n\}$ be a set of non-empty subsets such that each $A_i \subseteq S$. $\Pi$ is a partition of S if:

  • The subsets are a covering of S: $\bigcup_{i=1}^{n} A_i = S$

  • The subsets are mutually exclusive: $A_i \cap A_j = \emptyset$ for $i \neq j$

• Each subset $A_i$ is called a cluster of the partition

• A partition of S into k non-empty clusters is called a k-partition of S
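The two partition conditions translate directly into a check on a collection of sets. The following is a small sketch under the assumption that clusters are given as Python sets of module names; `is_partition` is a hypothetical helper name.

```python
def is_partition(clusters, system):
    """Return True if `clusters` is a partition of the set `system`."""
    clusters = [set(c) for c in clusters]
    system = set(system)
    union = set().union(*clusters)
    non_empty = all(len(c) > 0 for c in clusters)
    covering = union == system
    # The clusters are pairwise disjoint exactly when no element is counted twice.
    disjoint = sum(len(c) for c in clusters) == len(union)
    return non_empty and covering and disjoint

system = {"M1", "M2", "M3", "M4"}
ok = is_partition([{"M1", "M2"}, {"M3", "M4"}], system)
overlapping = is_partition([{"M1", "M2"}, {"M2", "M3", "M4"}], system)
```

The first call satisfies both conditions; the second fails the mutual-exclusion test because M2 appears in two clusters.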


Number of k-Partitions of a Set

• Let S be a set of n elements. The number of k-partitions of an n-set satisfies the recurrence equation:

$$S_{n,k} = \begin{cases} 1 & \text{if } k = 1 \text{ or } k = n \\ S_{n-1,k-1} + k\,S_{n-1,k} & \text{otherwise} \end{cases}$$

• The entries $S_{n,k}$ are called Stirling numbers (of the second kind)

• Stirling numbers govern the number of k-partitions of a set.

• Stirling numbers grow exponentially with respect to the size of S.
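The recurrence can be evaluated directly with memoization. This is a straightforward sketch of the equation above; the only addition is returning 0 for values of k outside 1..n, which the recurrence itself never reaches.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling(n, k):
    """Number of k-partitions of an n-element set."""
    if k < 1 or k > n:
        return 0          # no such partition exists
    if k == 1 or k == n:
        return 1
    return stirling(n - 1, k - 1) + k * stirling(n - 1, k)

row = [stirling(4, k) for k in range(1, 5)]
```

For n = 4 the four entries are 1, 7, 6 and 1, which sum to the 15 partitions of a four-element set.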



Clustering: Optimal Solution

• Algorithm

– Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system

– Let G be the graph representing the relationships between the modules in S

– Generate every partition of set S

– Evaluate MQ for each partition

– The partition with the largest MQ is the optimal solution

• The algorithm works well for sets of up to 15 elements; beyond that, the number of partitions becomes too large to enumerate in a reasonable timeframe

• Clearly, sub-optimal techniques must be employed for large sets
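The exhaustive search can be sketched in a few lines for very small systems. The code below is an illustration under stated assumptions rather than the authors' implementation: clusterings are `{module: label}` dictionaries, every partition is generated exactly once by only allowing a module to join an existing cluster or open the next unused label, and the coupling sum runs over unordered cluster pairs. The four-module graph is hypothetical.

```python
from itertools import combinations

def mq(labels, edges):
    """Modularization Quality of a clustering given as {module: cluster_label}."""
    clusters = sorted(set(labels.values()))
    size = {c: sum(1 for label in labels.values() if label == c) for c in clusters}

    def count(a, b):
        return sum(1 for u, v in edges if {labels[u], labels[v]} == {a, b})

    k = len(clusters)
    cohesion = sum(count(c, c) / size[c] ** 2 for c in clusters)
    if k == 1:
        return cohesion
    coupling = sum(count(a, b) / (2 * size[a] * size[b])
                   for a, b in combinations(clusters, 2))
    return cohesion / k - coupling / (k * (k - 1) / 2)

def all_clusterings(modules):
    """Yield every partition of `modules` exactly once."""
    modules = list(modules)

    def extend(prefix, used):
        if len(prefix) == len(modules):
            yield dict(zip(modules, prefix))
            return
        for label in range(used + 1):        # join cluster 0..used-1 or open a new one
            yield from extend(prefix + [label], max(used, label + 1))

    yield from extend([], 0)

# Hypothetical graph: two tightly linked pairs joined by a single edge.
edges = [("A", "B"), ("B", "A"), ("C", "D"), ("D", "C"), ("B", "C")]
best = max(all_clusterings("ABCD"), key=lambda c: mq(c, edges))
best_mq = mq(best, edges)
```

For this graph the search visits all 15 partitions and selects {A, B}, {C, D} with MQ = 0.375. The same loop becomes impractical as the number of modules grows, which is the limitation noted above.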


How many k-partitions are there?

• The following table shows the total number of partitions (over all values of k) of a system that has N modules.

 N   Number of partitions
 1   1
 2   2
 3   5
 4   15
 5   52
 6   203
 7   877
 8   4140
 9   21147
10   115975
11   678570
12   4213597
13   27644437
14   190899322
15   1382958545
16   10480142147
17   82864869804
18   682076806159
19   5832742205057
20   51724158235372
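The totals for each N can be reproduced by summing the Stirling numbers S(N, k) over all k. The sketch below builds one row of Stirling numbers at a time; `count_partitions` is a hypothetical helper name.

```python
def count_partitions(n):
    """Total number of partitions of an n-element set (sum of S(n, k) over k)."""
    row = [0, 1]                              # row[k] = S(1, k); index 0 is unused
    for m in range(2, n + 1):
        new_row = [0] * (m + 1)
        for k in range(1, m + 1):
            if k == 1 or k == m:
                new_row[k] = 1
            else:
                new_row[k] = row[k - 1] + k * row[k]
        row = new_row
    return sum(row)

totals = [count_partitions(n) for n in range(1, 11)]
```

Python integers have arbitrary precision, so the same function also handles the larger entries such as N = 20.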



Sub-Optimal Modularization Strategy

• The search space required for enumerating all possible partitions is too large in most software systems

– We need to develop a search strategy that quickly discovers an acceptable sub-optimal clustering

• Generic Sub-Optimal Algorithm

– Construct a resource dependency graph G that represents the relationships between the modules in S.

– Generate uniformly distributed random clusterings of S. We use a combinatorial algorithm to accomplish this task because our sub-optimal techniques require the generation of many random clusterings.

– Iteratively improve a randomly generated clustering, by measuring its MQ, until no further improvement is possible. This task is accomplished by heuristically moving modules in S between the generated clusters.

– Repeat this process until an acceptable sub-optimal result is determined.
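One way to draw clusterings uniformly at random is to place modules one at a time and weight each choice by the number of ways the remaining modules can still be placed. The slides do not say which combinatorial algorithm the authors use, so the sketch below is an assumption chosen for illustration; `completions` and `random_clustering` are hypothetical names.

```python
import random
from functools import lru_cache

@lru_cache(maxsize=None)
def completions(remaining, used):
    """Ways to place `remaining` modules when `used` clusters already exist."""
    if remaining == 0:
        return 1
    # The next module either joins one of the `used` clusters or opens a new one.
    return used * completions(remaining - 1, used) + completions(remaining - 1, used + 1)

def random_clustering(modules, rng=random):
    """Return a uniformly random partition of `modules` as {module: label}."""
    modules = list(modules)
    labels, used = {}, 0
    for index, module in enumerate(modules):
        remaining = len(modules) - index - 1
        join = completions(remaining, used)        # weight of joining one given cluster
        fresh = completions(remaining, used + 1)   # weight of opening a new cluster
        pick = rng.randrange(used * join + fresh)
        if pick < used * join:
            labels[module] = pick // join
        else:
            labels[module] = used
            used += 1
    return labels

sample = random_clustering(["M1", "M2", "M3", "M4", "M5"], random.Random(0))
```

Because every choice is weighted by the number of partitions that remain reachable, each complete partition is produced with the same probability. Assigning every module an independent random label would be simpler but would not be uniform over partitions.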



Neighboring Partition

• We need a way to improve a partition's MQ

• We define a partition NP to be a neighbor of a partition P if and only if:

– NP is exactly the same as P except that a single element of P is in a different cluster in partition NP

[Figure: a partition of modules M1, M2 and M3 and three neighboring partitions]

Original Partition: MQ = -0.666

Neighbor 1: MQ = -0.25

Neighbor 2: MQ = -0.625

Neighbor 3: MQ = -0.625
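Neighboring partitions can be generated by moving one module at a time. In this sketch a clustering is a `{module: label}` dictionary with integer labels, and a module may move either into another existing cluster or into a brand-new cluster of its own. Allowing the move into a new cluster is an assumption; the definition above does not say whether it counts.

```python
from collections import Counter

def neighbors(labels):
    """Yield every clustering obtained by moving exactly one module."""
    sizes = Counter(labels.values())
    fresh = max(sizes) + 1                     # label for a new, empty cluster
    for module, current in labels.items():
        targets = [c for c in sizes if c != current]
        if sizes[current] > 1:                 # a lone module gains nothing from a new cluster
            targets.append(fresh)
        for target in targets:
            yield {**labels, module: target}

original = {"M1": 0, "M2": 0, "M3": 1}
moves = list(neighbors(original))
```

Two different moves can lead to the same partition under different labels (splitting M1 or M2 out of the first cluster both yield three singleton clusters), so the five moves here cover four distinct neighboring partitions.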



Generic Sub-Optimal Algorithm




– Generate a random partition P of set S

– If possible, find a neighboring partition NP that has an improved MQ over P

– If an improved neighboring partition is found

• Let P = NP

– P is the sub-optimal solution

• A variety of algorithms for finding sub-optimal solutions are possible, depending on how “improved” is defined


Steepest-Ascent Hill Climbing (SAHC) Algorithm





– Repeat

  • Find the best neighboring partition BNP that has MQ(BNP) > MQ(P)

  • If an improved BNP is found such that MQ(BNP) > MQ(P)

    – Let P = BNP

– Until no further “improved” BNPs can be found

• BNP may be expensive to calculate

– All neighboring partitions of P must be examined
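The steepest-ascent loop can be sketched as follows. The code is a minimal illustration, not the authors' implementation: clusterings are `{module: label}` dictionaries with integer labels, the neighbor set also allows moving a module into a new cluster, the coupling sum runs over unordered cluster pairs, and the example graph and starting partition are hypothetical (a real run would start from a random partition).

```python
from collections import Counter
from itertools import combinations

def mq(labels, edges):
    """Modularization Quality of a clustering given as {module: cluster_label}."""
    clusters = sorted(set(labels.values()))
    size = Counter(labels.values())

    def count(a, b):
        return sum(1 for u, v in edges if {labels[u], labels[v]} == {a, b})

    k = len(clusters)
    cohesion = sum(count(c, c) / size[c] ** 2 for c in clusters)
    if k == 1:
        return cohesion
    coupling = sum(count(a, b) / (2 * size[a] * size[b])
                   for a, b in combinations(clusters, 2))
    return cohesion / k - coupling / (k * (k - 1) / 2)

def neighbors(labels):
    """Yield every clustering obtained by moving exactly one module."""
    sizes = Counter(labels.values())
    fresh = max(sizes) + 1
    for module, current in labels.items():
        targets = [c for c in sizes if c != current]
        if sizes[current] > 1:
            targets.append(fresh)
        for target in targets:
            yield {**labels, module: target}

def steepest_ascent(labels, edges):
    """Move to the best neighbor until no neighbor improves MQ."""
    current, current_mq = dict(labels), mq(labels, edges)
    while True:
        best = max(neighbors(current), key=lambda n: mq(n, edges))  # examines all neighbors
        best_mq = mq(best, edges)
        if best_mq <= current_mq:
            return current, current_mq
        current, current_mq = best, best_mq

edges = [("A", "B"), ("B", "A"), ("C", "D"), ("D", "C"), ("B", "C")]
start = {"A": 0, "B": 1, "C": 2, "D": 3}       # every module in its own cluster
final, final_mq = steepest_ascent(start, edges)
```

Starting from four singleton clusters, the climb ends at {A, B}, {C, D} with MQ = 0.375. Each iteration scores every neighbor, which is the cost noted above.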



Next-Ascent Hill Climbing (NAHC) Algorithm

• Algorithm

– Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system



– Repeat

  • Find a better neighboring partition bNP that has MQ(bNP) > MQ(P)

  • If an improved bNP is found such that MQ(bNP) > MQ(P)

    – Let P = bNP

– Until no further “improved” bNPs can be found

• A bNP is discovered by randomly searching the set of neighboring partitions until a partition with a higher MQ is found

– Usually, not all NPs will have to be examined
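Next-ascent hill climbing differs from the steepest-ascent variant only in how the next partition is chosen: the neighbors are examined in random order and the first improvement is accepted. The sketch below makes the same assumptions as before (clusterings as `{module: label}` dictionaries with integer labels, moves into a new cluster allowed, coupling summed over unordered cluster pairs) and uses a hypothetical graph and starting partition.

```python
import random
from collections import Counter
from itertools import combinations

def mq(labels, edges):
    """Modularization Quality of a clustering given as {module: cluster_label}."""
    clusters = sorted(set(labels.values()))
    size = Counter(labels.values())

    def count(a, b):
        return sum(1 for u, v in edges if {labels[u], labels[v]} == {a, b})

    k = len(clusters)
    cohesion = sum(count(c, c) / size[c] ** 2 for c in clusters)
    if k == 1:
        return cohesion
    coupling = sum(count(a, b) / (2 * size[a] * size[b])
                   for a, b in combinations(clusters, 2))
    return cohesion / k - coupling / (k * (k - 1) / 2)

def neighbors(labels):
    """Yield every clustering obtained by moving exactly one module."""
    sizes = Counter(labels.values())
    fresh = max(sizes) + 1
    for module, current in labels.items():
        targets = [c for c in sizes if c != current]
        if sizes[current] > 1:
            targets.append(fresh)
        for target in targets:
            yield {**labels, module: target}

def next_ascent(labels, edges, rng):
    """Accept the first improving neighbor found in a random scan."""
    current, current_mq = dict(labels), mq(labels, edges)
    while True:
        candidates = list(neighbors(current))
        rng.shuffle(candidates)                 # examine neighbors in random order
        better = next((c for c in candidates if mq(c, edges) > current_mq), None)
        if better is None:                      # all neighbors examined, none is better
            return current, current_mq
        current, current_mq = better, mq(better, edges)

edges = [("A", "B"), ("B", "A"), ("C", "D"), ("D", "C"), ("B", "C")]
start = {"A": 0, "B": 1, "C": 2, "D": 3}
final, final_mq = next_ascent(start, edges, random.Random(42))
```

The result is always a local maximum, but which one depends on the random order. On this graph both {A, B}, {C, D} (MQ = 0.375) and the single cluster {A, B, C, D} (MQ = 0.3125) are local maxima, so different seeds may stop at either.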



A Genetic Algorithm Framework

• Our experimentation with the SAHC and NAHC algorithms has shown that, given an initial random starting partition,

– The algorithms will converge to a local maximum

– However, not all initial partitions converge to an acceptable result

• Therefore we must either:

– Run the experiment many times using different initial partitions and pick the experiment that results in the largest MQ

– Or, devise an approach that works with a population of randomly generated initial partitions and concurrently improves them until all of the initial samples converge

• The partition in the final population with the largest MQ is the sub-optimal solution

• This approach lends itself to being implemented with a Genetic Algorithm



Genetic Algorithms

• Genetic algorithms were first developed by John Holland et al. at the University of Michigan

• Genetic algorithms have been applied to many problems that involve exploring large search spaces

• Characteristics of GAs

– Combine survival-of-the-fittest techniques with a structured and randomized information exchange

• Facilitates innovative algorithms that parallel the natural human selection process

• GAs are more than a randomized search; they exploit historical data to speculate on new information that is expected to yield improved results



Genetic Search Sub-Optimal Clustering Algorithm

• Algorithm

– Let S = {M1, M2, …, Mn}, where each Mi is a module in the software system



– Repeat

  • Randomly select a percentage of partitions from the population and improve them using the SAHC or NAHC technique

  • Generate a new population (from the current one) by using a biased wheel that favors partitions with larger MQ

– Let P = bNP

– Until no improvement is seen for t generations, until the population has converged, or until the maximum number of generations has been executed

– P in the final generation with the largest MQ is the sub-optimal solution
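The biased-wheel step can be illustrated with weighted sampling. The slides do not specify how MQ values are turned into wheel weights, so this sketch assumes a simple shift: since cohesion and coupling both lie between 0 and 1, MQ lies between -1 and 1, and MQ + 1 is a non-negative weight. The partition names and MQ values are placeholders.

```python
import random

def biased_wheel(population, scores, rng=random):
    """Draw a new population of the same size, favoring larger MQ values."""
    weights = [score + 1.0 for score in scores]   # shift MQ from [-1, 1] to [0, 2]
    if sum(weights) == 0:
        weights = [1.0] * len(population)          # degenerate case: fall back to uniform
    return rng.choices(population, weights=weights, k=len(population))

population = ["P1", "P2", "P3"]                    # stand-ins for actual partitions
scores = [-1.0, 0.2, 0.6]                          # hypothetical MQ values
next_generation = biased_wheel(population, scores, random.Random(1))
```

Partitions are drawn with replacement, so a partition with a large MQ can appear several times in the next generation, while one with MQ = -1 (weight 0) is never selected.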



Agglomerative Clustering

• The previous algorithms discovered subsystems based on the graph that was formed by recovering the relationships that exist among the source code components

• In most systems, however, we are interested in finding a hierarchy of subsystems that captures the higher-order relationships that exist in the software

• Wrapping our algorithms with an agglomerative clustering engine solves this problem



Agglomerative Clustering Algorithm

• Algorithm

– Let S = {M1, M2, …, Mn}

– Let G be the resource dependency graph

– Let Q be a queue

– Repeat

  • Find a maximal partition (Pmax) of S using the Optimal, SAHC or NAHC algorithm

  • Save partition Pmax on Q

  • Now let S = {C1, C2, …, Cn} where each Ci is a cluster in Pmax

  • Build a new graph G by treating each cluster in Pmax as a single element. Furthermore, if there is at least one edge between any two clusters in Pmax, then there is an edge between their representative nodes in G

– Until Pmax has coalesced into a single cluster

– Q contains a hierarchy of partitions
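The graph-rebuilding step of the loop is the only part that is specific to agglomerative clustering, and it is short enough to sketch. The example assumes a clustering stored as a `{module: cluster_name}` dictionary and directed edges; keeping the edge direction in the collapsed graph is an assumption, and the module and cluster names are hypothetical.

```python
def collapse(labels, edges):
    """Build the next-level graph: one node per cluster, one edge per linked cluster pair."""
    return {(labels[u], labels[v]) for u, v in edges if labels[u] != labels[v]}

labels = {"A": "C1", "B": "C1", "C": "C2", "D": "C2", "E": "C3"}
edges = [("A", "B"), ("B", "C"), ("A", "D"), ("D", "E")]
cluster_edges = collapse(labels, edges)
```

Using a set means that several module-level edges between the same two clusters (here B→C and A→D) collapse into a single cluster-level edge, and intra-cluster edges such as A→B disappear. The resulting graph is then clustered again on the next pass of the loop.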



Where to Get the Clustering Engine

• We have implemented and applied the clustering engines to many examples

• The system can be downloaded on the Web from the Drexel University Software Engineering Research Group (SERG) homepage at:

– http://www.mcs.drexel.edu/~serg

• The clustering engine was developed using the Java 1.1 programming language



Compiler Example


Boxer (Autolayout Utility) Example
