Algorithm for finding one of the largest common subgraphs of two three-dimensional graph structures


Sumio Masuda

Faculty of Engineering, Kobe University, Kobe, Japan 657

Hiroyuki Yoshioka

Graduate School of Science and Technology, Kobe University, Kobe, Japan 657

Eiichi Tanaka

Faculty of Engineering, Kobe University, Kobe, Japan 657

SUMMARY

Let $G_a = (V_a, E_a)$ and $G_b = (V_b, E_b)$ be two connected graphs with three-dimensional structures. Let $n_a = |V_a|$, $m_a = |E_a|$, $n_b = |V_b|$, and $m_b = |E_b|$, and let the maximum order (degree) of a vertex in $G_a$ ($G_b$) be $l_a$ ($l_b$). This paper first offers a method to find a largest common subgraph of $G_a$ and $G_b$ in $O(l_a m_a^2 l_b m_b \log n_b)$ time. Then, an algorithm with $O(n_a^{1.5} n_b \log n_a \log n_b)$ time is proposed for the case where $G_a$ is a planar graph with a three-dimensional structure and the following two conditions are satisfied. Condition 1: In $G_a$ and $G_b$, no vertex has order exceeding a specified constant $c$. Condition 2: In $G_a$, no two adjacent edges are located on the same straight line. In particular, we show that when $G_a$ is a tree and Conditions 1 and 2 are satisfied, a largest common subgraph of $G_a$ and $G_b$ can be found in $O(n_a n_b \log n_a \log n_b)$ time. © 1998 Scripta Technica, Electron Comm Jpn Pt 3, 81(9): 48-53, 1998

Key words: Graph; largest common subgraph; algorithm; computational complexity.

1. Introduction

The largest common subgraph problem is a problem

in which a connected subgraph with the largest number of

edges is to be determined among all subgraphs common to

a given collection of graphs. This problem is generally

NP-hard, even if the number of given graphs is restricted to

two [1].

Let a vertex correspond to an atom and an edge

correspond to a bond between atoms. Then, a graph represents

a chemical compound. If a common subgraph can be found

for multiple given compounds with similar properties, this

is useful in locating the structure that is responsible for

those properties [2]. From this perspective, various studies have been made on the largest common subgraph problem [3-5]. Several studies observe that many graphs corresponding to compounds are trees and that a vertex in a graph representing a compound usually has only a small order, and they discuss the largest common subgraph problem for the case where the class of graphs is restricted accordingly [6-8].

Since a compound generally has a three-dimensional structure, there is a possibility that a more effective method can be developed if the structural information is adequately utilized. With this viewpoint, several methods have been proposed that utilize the inter-atomic distance to determine the largest subgraph common to two compounds [9, 10].

Electronics and Communications in Japan, Part 3, Vol. 81, No. 9, 1998. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J80-A, No. 5, May 1997, pp. 803-808.

In this paper we discuss the problem of determining a largest common subgraph of graphs with three-dimensional structures (i.e., graphs whose vertices have coordinates). In Ref. 11 this problem is discussed in terms of an approximation for the case where the number of graphs is arbitrary. In this paper we present an algorithm for the case where the number of graphs is 2.

In the first half of this paper, we present a method that obtains a largest connected common subgraph of two connected graphs $G_a = (V_a, E_a)$ and $G_b = (V_b, E_b)$ with three-dimensional structures in $O(l_a m_a^2 l_b m_b \log n_b)$ time, where $m_a = |E_a|$, $n_b = |V_b|$, $m_b = |E_b|$, and $l_a$ ($l_b$) is the maximum order of a vertex in $G_a$ ($G_b$).

According to Ref. 5, most of the graphs representing compounds are planar graphs. The latter half of this paper considers the case where $G_a$ is a tree or a planar graph with a three-dimensional structure and the following two conditions are satisfied.

Condition 1: In $G_a$ and $G_b$, no vertex has an order exceeding a constant $c$.

Condition 2: In $G_a$, no two adjacent edges are located on a straight line.

For this case, an algorithm with a smaller asymptotic

time complexity is proposed.

This paper is organized as follows. First, section 2 gives several definitions. In section 3 we present an algorithm for the case where $G_a$ is an arbitrary graph with three-dimensional structure. Section 4 presents an algorithm for the case where $G_a$ is a planar graph or a tree and satisfies Conditions 1 and 2. In section 5 we discuss the case where the largest common subgraph need not be strict, and a discrepancy is tolerated to some extent between the subgraphs. Finally, section 6 summarizes the results in this paper.

2. Definitions

A graph that can be drawn in a plane with no two

edges intersecting or contacting except at a common end

point is called a planar graph. A connected graph without a

cycle is called a tree. Every tree is a planar graph.

Let $(V, E)$ be a simple graph, i.e., a graph that has neither self-loops nor multiple edges. Place the vertices of $V$ in three-dimensional space under a mapping $p: V \to \mathbb{R}^3$ and represent the edges by straight-line segments between their end points. The resulting structure is called a three-dimensional graph structure and is represented by $(V, E, p)$ ($p$ is omitted if there is no danger of confusion). It is assumed that no two vertices are placed at the same point. It is also assumed that no edge passes through any vertex other than its end points, and that no two edges intersect or touch except at shared end points. Among three-dimensional graph structures, a structure for which the graph $(V, E)$ is a planar graph is written 3DPG, and a structure for which $(V, E)$ is a tree is written 3DT.

In the three-dimensional graph structure $G = (V, E, p)$, the length of an edge $e \in E$ is written $\mathrm{leng}(e)$. For two adjacent edges $e_1$ and $e_2$, the angle formed between them is written $\mathrm{angle}(e_1, e_2)$. When $e_1$ and $e_2$ form a straight line, $\mathrm{angle}(e_1, e_2) = 180°$.
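The two quantities defined above can be computed directly from the vertex coordinates. The following is a minimal Python sketch, assuming a placement stored as a dictionary `pos` from vertex names to coordinate triples and edges stored as pairs of vertex names; these representations and the function names are illustrative assumptions, not part of the paper.

```python
import math

def leng(pos, e):
    """Euclidean length of edge e = (u, v) under the placement pos."""
    u, v = e
    return math.dist(pos[u], pos[v])

def angle(pos, e1, e2):
    """Angle in degrees between two edges that share exactly one end point."""
    (shared,) = set(e1) & set(e2)
    (x,) = set(e1) - {shared}
    (y,) = set(e2) - {shared}
    a = [p - q for p, q in zip(pos[x], pos[shared])]
    b = [p - q for p, q in zip(pos[y], pos[shared])]
    cos_t = sum(p * q for p, q in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_t))))

# Hypothetical placement: x and z lie on one straight line through v.
pos = {"v": (0.0, 0.0, 0.0), "x": (1.0, 0.0, 0.0),
       "y": (0.0, 2.0, 0.0), "z": (-3.0, 0.0, 0.0)}
```

With this placement, the edges (v, x) and (v, z) form a straight line, so their angle is 180 degrees, while (v, x) and (v, y) are perpendicular.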

Throughout this paper, let $G_a = (V_a, E_a, p_a)$ and $G_b = (V_b, E_b, p_b)$ be connected three-dimensional graph structures. Also let $n_a = |V_a|$, $m_a = |E_a|$, $n_b = |V_b|$, and $m_b = |E_b|$, and let the maximum order of a vertex in $G_a$ ($G_b$) be $l_a$ ($l_b$).

Among the connected common subgraphs of $G_a$ and $G_b$, one with the largest number of edges is called a largest common subgraph. This paper discusses the problem of finding a largest common subgraph of $G_a$ and $G_b$. It is easy to solve this problem for the case where $n_a < 3$ or $n_b < 3$. In what follows, we assume that $n_a \ge 3$ and $n_b \ge 3$.

3. An Algorithm to Find the Largest Common Subgraph of a Three-Dimensional Graph Structure

Among the common connected subgraphs of $G_a$ and $G_b$, one in which any two adjacent edges are located on a straight line is called a straight-line common subgraph. Among all straight-line common subgraphs, one with the largest number of edges is called a largest straight-line common subgraph. It is possible to find a largest straight-line common subgraph of $G_a$ and $G_b$ in $O(l_a m_a + l_b m_b + m_a m_b)$ time (the proof is omitted).

Among the common connected subgraphs of $G_a$ and $G_b$, one that is not a straight-line structure, i.e., one that contains two adjacent edges not on a straight line, is called a non-straight-line common subgraph. Among all non-straight-line common subgraphs, one with the largest number of edges is called a largest non-straight-line common subgraph. Of a largest straight-line common subgraph and a largest non-straight-line common subgraph, the one with the greater number of edges is obviously a largest common subgraph of $G_a$ and $G_b$.

The method to find a largest non-straight-line common subgraph of $G_a$ and $G_b$ is, briefly, as follows. Select from $G_a$ a vertex $v$ with order not less than 2 and two edges $e_1$ and $e_2$ adjacent to $v$. Similarly, select from $G_b$ a vertex $w$ with order not less than 2 and two adjacent edges $f_1$ and $f_2$. If the following three conditions are satisfied, $v$ is placed on $w$, $e_1$ is mapped to $f_1$, and $e_2$ is mapped to $f_2$ by a translation and a rotation.

When the above transformation is applied to $G_a$, the coordinates of all vertices other than $v$ are also uniquely determined. Thus, the subgraph that is common to $G_a$ and $G_b$ under this placement can be easily determined. The procedure that executes the above process is written Match($G_a$, $G_b$, $v$, $e_1$, $e_2$, $w$, $f_1$, $f_2$). This procedure is applied to all combinations of $v$, $e_1$, $e_2$, $w$, $f_1$, $f_2$. By selecting a subgraph with the largest number of edges among the common connected subgraphs obtained, a largest non-straight-line common subgraph of $G_a$ and $G_b$ is found.
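The superposition step can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the caller is assumed to have already checked that the two corners are congruent and not collinear; instead of moving $G_a$, both graphs are expressed in a local frame attached to the chosen corner, which is equivalent to applying the translation and rotation; and a hash table with rounded coordinates replaces the search structure on $G_b$ that the paper mentions without detail. All function names are hypothetical.

```python
import math
from collections import deque

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def local_frame(pos, v, x1, x2):
    """Right-handed orthonormal frame at v built from edges (v, x1) and (v, x2).
    The two edges must not lie on one straight line."""
    u1 = _unit(_sub(pos[x1], pos[v]))
    d2 = _sub(pos[x2], pos[v])
    t = _dot(d2, u1)
    u2 = _unit((d2[0] - t * u1[0], d2[1] - t * u1[1], d2[2] - t * u1[2]))
    u3 = (u1[1] * u2[2] - u1[2] * u2[1],
          u1[2] * u2[0] - u1[0] * u2[2],
          u1[0] * u2[1] - u1[1] * u2[0])
    return pos[v], (u1, u2, u3)

def local_coords(p, frame, ndigits=6):
    """Coordinates of p in the frame, rounded so they can serve as dictionary keys."""
    origin, axes = frame
    d = _sub(p, origin)
    return tuple(round(_dot(d, u), ndigits) for u in axes)

def match(pos_a, edges_a, pos_b, edges_b, v, x1, x2, w, y1, y2):
    """Edges of G_a in the common connected subgraph containing v after
    placing corner (v, x1, x2) of G_a on corner (w, y1, y2) of G_b."""
    fa = local_frame(pos_a, v, x1, x2)
    fb = local_frame(pos_b, w, y1, y2)
    where_b = {local_coords(p, fb): u for u, p in pos_b.items()}
    image = {}
    for u, p in pos_a.items():
        key = local_coords(p, fa)
        if key in where_b:
            image[u] = where_b[key]
    eb = {frozenset(e) for e in edges_b}
    adj = {}
    for s, t in edges_a:
        if s in image and t in image and frozenset((image[s], image[t])) in eb:
            adj.setdefault(s, []).append(t)
            adj.setdefault(t, []).append(s)
    seen, queue, common = {v}, deque([v]), set()
    while queue:  # breadth-first search over coinciding edges, starting at v
        s = queue.popleft()
        for t in adj.get(s, []):
            common.add(frozenset((s, t)))
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return common

# Hypothetical data: G_b is G_a rotated by 90 degrees about the z axis and shifted.
pos_a = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1)}
edges_a = [(0, 1), (0, 2), (0, 3)]
pos_b = {"a": (5, 5, 5), "b": (5, 6, 5), "c": (4, 5, 5), "d": (5, 5, 6)}
edges_b = [("a", "b"), ("a", "c"), ("a", "d")]
```

Rounding coordinates to form keys can miss matches that straddle a rounding boundary, so a production version would need a more careful point-location structure. The sketch is only meant to show why fixing one vertex and two non-collinear edges determines the whole placement.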

Although the details are omitted, by storing the edge set of $G_b$ in a suitable data structure, one execution of the procedure Match takes $O(m_a \log n_b)$ time. Since the total number of combinations of $v$, $e_1$, $e_2$, $w$, $f_1$, $f_2$ is $O(l_a m_a l_b m_b)$, the time required to find a largest non-straight-line common subgraph is $O(l_a m_a^2 l_b m_b \log n_b)$. Thus, the following theorem is obtained.

Theorem 1. A largest common subgraph of two three-dimensional graph structures $G_a$ and $G_b$ can be found in $O(l_a m_a^2 l_b m_b \log n_b)$ time.
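The count of combinations behind this bound can be checked on a small example. In one graph, the number of choices of a vertex with an ordered pair of adjacent edges is the sum of $d(d-1)$ over all vertex orders $d$, which is at most $l \cdot 2m$. The following Python sketch enumerates these choices; the function name `corners` and the use of ordered pairs are assumptions made for illustration.

```python
from itertools import permutations

def corners(edges):
    """Yield (v, x1, x2): a vertex v with an ordered pair of distinct neighbours."""
    adj = {}
    for s, t in edges:
        adj.setdefault(s, []).append(t)
        adj.setdefault(t, []).append(s)
    for v, nbrs in adj.items():
        for x1, x2 in permutations(nbrs, 2):
            yield v, x1, x2

# Hypothetical graph with 4 edges and maximum order 3 (at vertex 0).
edges_a = [(0, 1), (0, 2), (0, 3), (3, 4)]
corner_list = list(corners(edges_a))
max_order = 3
```

Pairing every such choice in $G_a$ with every such choice in $G_b$ gives the $O(l_a m_a l_b m_b)$ calls to Match used in the analysis above.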

Whether or not $G_a$ and $G_b$ contain the same subgraph is an especially important problem in substructure searches of a chemical compound database [2]. The following lemma was derived for this purpose (the proof is omitted).

Lemma 1. It is possible to decide whether or not $G_b$ contains the same subgraph as $G_a$ in $O(m_a l_b m_b \log n_b)$ time.

4. Algorithm for a Planar Graph Ga

Initially in this section, we consider the case where $G_a$ is a 3DPG and the following two conditions hold.

Condition 1: In $G_a$ and $G_b$, no vertex has order exceeding a specified constant $c$.

Condition 2: In $G_a$, no two adjacent edges are located on a straight line.

It follows from Condition 1 that $m_a = O(n_a)$ and $m_b = O(n_b)$. In this case, a largest common subgraph of $G_a$ and $G_b$ can be found in $O(n_a^2 n_b \log n_b)$ time by executing the procedure Match described in section 3 $O(n_a n_b)$ times in total. It is shown in the following that the time complexity can be reduced to $O(n_a^{1.5} n_b \log n_a \log n_b)$ by iteratively applying the graph separation process. The case where $G_a$ is a 3DT is discussed at the end of this section.

4.1. Separation of a graph

The following lemma is known for planar graphs

[12].

Lemma 2. Let $G = (V, E)$ be an arbitrary planar graph with $n$ vertices. Then, the vertex set $V$ can be separated into subsets $V_1$, $V_2$, and $S$ that satisfy the following three conditions.

(i) $V_1 \cup V_2 \cup S = V$, and $V_1$, $V_2$, $S$ are pairwise disjoint.

(ii) There is no edge in $G$ that connects a vertex in $V_1$ and a vertex in $V_2$.

(iii) $|V_1| \le 2n/3$, $|V_2| \le 2n/3$, and $|S| \le 2\sqrt{2n}$.

In this case, the vertex set $S$ is called a separator of $G$. Each of the two graphs $G_1$ and $G_2$ defined as follows is called a separated component with respect to $S$.

Figure 1 shows a simple case of a planar graph and a separator. For a graph $G$, a separator can be determined in $O(n)$ time [12].
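Conditions (i) to (iii) of Lemma 2 are easy to verify for a proposed split. The Python sketch below only checks a given triple $(V_1, V_2, S)$; it does not compute a separator, which is the subject of Ref. 12. The function name and the set-based representation are assumptions for illustration.

```python
import math

def is_separator(vertices, edges, v1, v2, s):
    """Check conditions (i)-(iii) of Lemma 2 for a proposed split (v1, v2, s)."""
    vertices, v1, v2, s = set(vertices), set(v1), set(v2), set(s)
    n = len(vertices)
    # (i) the three sets cover V and are pairwise disjoint
    partition = (v1 | v2 | s == vertices) and (len(v1) + len(v2) + len(s) == n)
    # (ii) no edge joins a vertex of v1 to a vertex of v2
    no_cross = all(not ((a in v1 and b in v2) or (a in v2 and b in v1))
                   for a, b in edges)
    # (iii) size bounds
    sizes = (len(v1) <= 2 * n / 3 and len(v2) <= 2 * n / 3
             and len(s) <= 2 * math.sqrt(2 * n))
    return partition and no_cross and sizes

# Hypothetical example: a path on 7 vertices split at its middle vertex.
path_vertices = range(7)
path_edges = [(i, i + 1) for i in range(6)]
```

For the path above, the middle vertex alone is a valid separator, whereas a split that leaves an edge running between the two sides violates condition (ii).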

Fig. 1. An example of a separator.

The algorithm proposed in this section is based on the following lemma.

Lemma 3. Assume that the largest common subgraph of two three-dimensional graph structures $G$ and $G'$ has two or more edges. Assume that $G$ is a 3DPG. Consider an arbitrary separator $S$, and let the separated components with respect to $S$ be $G_1$ and $G_2$. Let $C_1$ be an arbitrary largest common connected subgraph of $G_1$ and $G'$, and let $C_2$ be an arbitrary largest common connected subgraph of $G_2$ and $G'$. Next, consider the common connected subgraphs of $G$ and $G'$. Among those containing a vertex of $S$ and two or more edges adjacent to that vertex, let $C_S$ be an arbitrary one with the largest number of edges. Then, one of $C_1$, $C_2$, and $C_S$ is a largest common subgraph of $G$ and $G'$.

Proof. Let $C$ be an arbitrary largest common subgraph of $G$ and $G'$, and let $\alpha$ be the number of its edges. Let the numbers of edges of $C_1$, $C_2$, and $C_S$ be $\alpha_1$, $\alpha_2$, and $\alpha_S$, respectively. Obviously, $\alpha \ge \max\{\alpha_1, \alpha_2, \alpha_S\}$.

If $C$ contains a vertex $v$ of $S$ and two or more edges adjacent to that vertex, then $\alpha \le \alpha_S$. If $C$ does not contain such a vertex $v$ and two adjacent edges, it follows from $\alpha \ge 2$ and the connectivity of $C$ that $C$ does not contain an edge connecting two vertices of $S$. By the definitions of the separator and the separated components, $C$ in this case is a common connected subgraph of $G_1$ and $G'$, or of $G_2$ and $G'$. Thus, $\alpha \le \max\{\alpha_1, \alpha_2\}$. It follows that $\alpha = \max\{\alpha_1, \alpha_2, \alpha_S\}$. In other words, among $C_1$, $C_2$, and $C_S$, the one with the largest number of edges is a largest common subgraph of $G$ and $G'$.

For $G$ defined as in Lemma 3, assume that no two adjacent edges are located on a straight line. In this case, in order to determine $C_S$, it suffices to apply the procedure Match described in section 3 to all combinations of a vertex of $S$ with two edges adjacent to it and a vertex of $G'$ with two edges adjacent to it. In order to determine $C_1$ and $C_2$, it suffices to find a largest common subgraph of $X$ and $G'$ for each connected component $X$ (with three or more vertices) of $G_1$ and $G_2$. Each $X$ is obviously a 3DPG, and it is possible to apply Lemma 3 recursively to $X$ and $G'$.

4.2. Algorithm

Based on the reasoning in the previous section, our algorithm for the case where $G_a$ is a 3DPG and Conditions 1 and 2 hold is constructed as follows.

Initially, the existence of edges $e \in E_a$ and $f \in E_b$ with the same length is examined. If no such pair of edges exists, the algorithm ends, deciding that the largest common subgraph of $G_a$ and $G_b$ is a single vertex. If there are two edges of the same length, they are recorded. Then, a common subgraph containing two or more edges is sought. For this purpose, the following procedure is executed, and among all derived common subgraphs, one with the largest number of edges is determined.

Procedure Find_CSS

(1) Let $\mathrm{G\_SET}_0 := \{G_a\}$, $j := 0$, $M := 89$.

(2) If $\mathrm{G\_SET}_j = \emptyset$, the algorithm ends.

(3) Let $\mathrm{G\_SET}_j = \{G_1^j, G_2^j, \ldots, G_{k_j}^j\}$ and $\mathrm{G\_SET}_{j+1} := \emptyset$.

(4) For each $i = 1, 2, \ldots, k_j$, if the number of vertices in $G_i^j$ is less than $M$, determine the largest common subgraph of $G_i^j$ and $G_b$. If the number is not less than $M$, execute steps (4-1) and (4-2).

(4-1) Determine a separator of $G_i^j$, and let it be $S_i^j$. The separated components of $G_i^j$ with respect to $S_i^j$ are determined. Each of these is separated into connected components, and all connected components with three or more vertices are added to the set $\mathrm{G\_SET}_{j+1}$.

(4-2) Among the common subgraphs that contain a vertex of $S_i^j$ and two edges adjacent to that vertex, one with the largest number of edges is determined.

(5) Let $j := j + 1$ and go to step (2).
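The level-by-level splitting performed by steps (1) to (5) can be sketched in Python as follows. The sketch omits all matching work and only builds the successive sets. It relies on two assumptions that are not confirmed by the text: a separated component is taken to be $V_i \cup S$ together with the edges having at least one end point in $V_i$, and the separator routine is passed in as a hypothetical callable `find_separator` (here a trivial one for path graphs, not the planar separator algorithm of Ref. 12).

```python
def components(vertices, edges):
    """Connected components as (vertex set, edge list) pairs."""
    adj = {v: [] for v in vertices}
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen, out = set(), []
    for root in sorted(vertices):
        if root in seen:
            continue
        stack, comp = [root], {root}
        while stack:
            u = stack.pop()
            for w in adj[u]:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        out.append((comp, [(a, b) for a, b in edges if a in comp]))
    return out

def split_levels(vertices, edges, find_separator, m=89):
    """Return [G_SET_0, G_SET_1, ...] as lists of (vertex set, edge list)."""
    level, levels = [(set(vertices), list(edges))], []
    while level:                                   # step (2)
        levels.append(level)
        nxt = []                                   # step (3)
        for vs, es in level:                       # step (4)
            if len(vs) < m:
                continue        # small graph: solved directly (omitted here)
            v1, v2, s = find_separator(vs, es)     # step (4-1)
            for part in (v1, v2):
                # Assumed definition of a separated component (see lead-in).
                part_edges = [(a, b) for a, b in es if a in part or b in part]
                for comp in components(part | s, part_edges):
                    if len(comp[0]) >= 3:
                        nxt.append(comp)
            # step (4-2), matching at the separator vertices, is omitted
        level = nxt                                # step (5)
    return levels

def path_separator(vs, es):
    """Hypothetical separator for a path on consecutive integers: the middle vertex."""
    order = sorted(vs)
    mid = len(order) // 2
    return set(order[:mid]), set(order[mid + 1:]), {order[mid]}

# Demo on a 16-vertex path with a small threshold so that several levels appear.
levels = split_levels(range(16), [(i, i + 1) for i in range(15)], path_separator, m=4)
```

Under the assumed component definition, no edge appears in two graphs of the same level, which is the property used in the time analysis of section 4.3.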

4.3. Time complexity

Among the sets $\mathrm{G\_SET}_0, \mathrm{G\_SET}_1, \mathrm{G\_SET}_2, \ldots$ constructed by procedure Find_CSS, let the first empty set be $\mathrm{G\_SET}_K$. For a planar graph with $n$ vertices, the number of vertices in a separated component is at most $2n/3 + 2\sqrt{2n}$. When $n \ge 89$, there holds $2n/3 + 2\sqrt{2n} \le 29n/30$. Consequently, if $n_a \ge 89$, any graph belonging to $\mathrm{G\_SET}_j$ has no more than $n_a \times (29/30)^j$ vertices, for $j = 0, 1, \ldots$. Let $J$ satisfy $n_a \times (29/30)^J = 88$; then $K = O(\log n_a)$, since $K \le \lceil J \rceil + 1$.
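The threshold 89 can be checked numerically: the inequality $2n/3 + 2\sqrt{2n} \le 29n/30$ is equivalent to $0.3n \ge 2\sqrt{2n}$, that is, $n \ge 800/9 \approx 88.9$. The short Python sketch below confirms this and computes the resulting depth bound; the function names are illustrative assumptions.

```python
import math

def shrinks(n):
    """True when 2n/3 + 2*sqrt(2n) <= 29n/30."""
    return 2 * n / 3 + 2 * math.sqrt(2 * n) <= 29 * n / 30

def depth_bound(n_a):
    """Smallest integer J with n_a * (29/30)**J <= 88 (0 when n_a <= 88)."""
    j, size = 0, float(n_a)
    while size > 88:
        size *= 29 / 30
        j += 1
    return j
```

Because each level shrinks the largest graph by a constant factor, `depth_bound` grows only logarithmically in $n_a$, which is the source of the $\log n_a$ factor in the final bound.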

Let $j$ be an integer such that $0 \le j \le K$. For each $i = 1, 2, \ldots, k_j$, let $G_i^j = (V_i^j, E_i^j)$. Since no two graphs of $\mathrm{G\_SET}_j$ share an edge, and $G_a$ is a planar graph, $\sum_{i=1}^{k_j} |E_i^j| \le |E_a| < 3n_a$ holds. Furthermore, $\sum_{i=1}^{k_j} |V_i^j| = O(n_a)$, since each $G_i^j$ is connected.

The execution time of procedure Find_CSS is dominated by step (4). In the following, the time spent in this step is evaluated for each integer $j$ ($0 \le j < K$). Among the graphs of $\mathrm{G\_SET}_j$, let $Y_j$ be the set of graphs with fewer than 89 vertices. In order to determine the largest common subgraph of each graph $G_i^j$ of $Y_j$ and $G_b$, the procedure Match is executed $O(|V_i^j| \cdot n_b) = O(n_b)$ times. The time for one execution is $O(|E_i^j| \log n_b)$. Consequently, since $\sum_{i=1}^{k_j} |E_i^j| < 3n_a$, the processing time for all graphs of $Y_j$ is $O(n_a n_b \log n_b)$.

Since $\sum_{i=1}^{k_j} |V_i^j| = O(n_a)$, when the separator is derived by the method in Ref. 12, the sum of the times required in step (4-1) is $O(n_a)$ for each $j$.

In step (4-2), the procedure Match is executed $O(|S_i^j| \cdot n_b) = O(\sqrt{n_a} \cdot n_b)$ times for each graph $G_i^j$, and one execution takes $O(|E_i^j| \log n_b)$ time. For each $j$, the sum of the times required for step (4-2) is therefore $O(\sqrt{n_a} \, n_b \log n_b \sum_{i=1}^{k_j} |E_i^j|) = O(n_a^{1.5} n_b \log n_b)$.

Thus, step (4) is executed in $O(n_a^{1.5} n_b \log n_b)$ time for each $j$. As already pointed out, $K = O(\log n_a)$. Consequently, the time complexity of the entire procedure Find_CSS is $O(n_a^{1.5} n_b \log n_a \log n_b)$.

Theorem 2. When $G_a$ is a 3DPG and Conditions 1 and 2 hold, it is possible to find a largest common subgraph of $G_a$ and $G_b$ in $O(n_a^{1.5} n_b \log n_a \log n_b)$ time.

4.4. Case where Ga is a tree

It is known that a tree has a separator composed of a

single vertex, and it is possible to find such a separator in

time proportional to the number of vertices in the tree [13].

Consequently, when $G_a$ is a 3DT and Conditions 1 and 2 hold, the algorithm in section 4.2 can be applied. In this case, even if the constant $M$ is set to 6, it is possible to guarantee that the number of iterations $K$ of steps (3) and (4) of procedure Find_CSS is $O(\log n_a)$ (since $2n/3 + 1 \le 5n/6$ holds for $n \ge 6$). Also, since $|S_i^j| = 1$ ($1 \le i \le k_j$) for each $j$, the time required for step (4-2) can be reduced to $O(n_a n_b \log n_b)$. As a result, the time complexity of the algorithm in this case is $O(n_a n_b \log n_a \log n_b)$.

Theorem 3. When $G_a$ is a 3DT and Conditions 1 and 2 hold, it is possible to find a largest common subgraph of $G_a$ and $G_b$ in $O(n_a n_b \log n_a \log n_b)$ time.
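One standard way to obtain a single-vertex separator of a tree in linear time is to take a centroid, i.e., a vertex whose removal leaves components of at most $n/2$ vertices, which is within the $2n/3$ bound. The Python sketch below illustrates this well-known technique; it is not necessarily the method described in Ref. 13, and the adjacency-list representation is an assumption.

```python
def centroid(adj):
    """Return a vertex of the tree whose removal leaves components of
    at most n // 2 vertices. adj maps each vertex to its neighbour list."""
    n = len(adj)
    root = next(iter(adj))
    parent, order = {root: None}, [root]
    for u in order:                      # breadth-first order; the list grows as we go
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                order.append(w)
    size = dict.fromkeys(adj, 1)
    for u in reversed(order):            # accumulate subtree sizes bottom-up
        if parent[u] is not None:
            size[parent[u]] += size[u]
    for u in order:
        # Components after removing u: the child subtrees and the rest of the tree.
        child_sizes = [size[w] for w in adj[u] if parent.get(w) == u]
        if max([n - size[u]] + child_sizes) <= n // 2:
            return u

# Hypothetical trees: a path on 5 vertices and a star with 4 leaves.
path_tree = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
star_tree = {0: [1, 2, 3, 4], 1: [0], 2: [0], 3: [0], 4: [0]}
```

For the path the middle vertex is returned, and for the star the center is returned. Each call visits every vertex a constant number of times, consistent with the linear-time claim above.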

5. Case Where Error Is Tolerated

In applications such as the extraction of the common subgraph between chemical structures, it is important to have available an algorithm, not for the strictly largest common subgraph, but for a subgraph in which errors are tolerated to some extent between the corresponding vertices. For the largest non-straight-line common subgraph algorithm in section 3, as well as for the algorithm in section 4, the procedure Match is used. When an error can be tolerated, this procedure must be modified. The following offers one possible approach.

Suppose that a vertex $v$ and two edges $e_1$, $e_2$ adjacent to $v$ are selected from $G_a$, and a vertex $w$ and two adjacent edges $f_1$, $f_2$ are selected from $G_b$. Then, the condition for overlapping these is

where $t_l$ and $t_a$ are the tolerances for the length and the angle, respectively.

When $v$, $e_1$, $e_2$, $w$, $f_1$, and $f_2$ satisfy all of these conditions, they are overlapped by the method in 11.1.1 of Ref. 10. In that method, $v$ is overlapped with $w$. A rotation is applied so that the plane containing both $e_1$ and $e_2$ coincides with the plane containing both $f_1$ and $f_2$. Then, a rotation is applied with the straight line perpendicular to the above plane and passing through $v$ as the axis, so that edges $e_1$ and $f_1$ overlap. The common subgraph is sought after the above transformation. For two edges $e = (v', v'') \in E_a$ and $f = (w', w'') \in E_b$, if the distance between $v'$ and $w'$, as well as the distance between $v''$ and $w''$, are both within a certain threshold $t_d$, it is accepted that $e$ and $f$ coincide.
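The tolerance tests described in this section might be coded as below. The exact overlap condition is not reproduced in the text above, so the first function is an assumption: it supposes that corresponding edge lengths must agree within $t_l$ and that the two corner angles must agree within $t_a$. The second function follows the end-point rule with threshold $t_d$ stated above, reading "within" as "at most". Function names and argument layouts are hypothetical.

```python
import math

def corner_compatible(len_e1, len_e2, ang_e, len_f1, len_f2, ang_f, t_l, t_a):
    """Assumed overlap test: edge lengths within t_l and angles (degrees) within t_a."""
    return (abs(len_e1 - len_f1) <= t_l
            and abs(len_e2 - len_f2) <= t_l
            and abs(ang_e - ang_f) <= t_a)

def edges_coincide(e_ends, f_ends, t_d):
    """After superposition, e = (v', v'') and f = (w', w'') are accepted as
    coinciding when both pairs of corresponding end points are within t_d."""
    (v1, v2), (w1, w2) = e_ends, f_ends
    return math.dist(v1, w1) <= t_d and math.dist(v2, w2) <= t_d
```

For instance, two corners with lengths 1.50 and 1.52, 1.10 and 1.09, and angles 109.5 and 111.0 degrees pass the first test for $t_l = 0.05$ and $t_a = 2.0$, but fail it when $t_a$ is tightened to 1.0.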

The above method is very simple, but the following problem arises because $v$ and $w$ are overlapped with zero error, especially when the method is applied to the procedure Find_CSS in section 4.2. As already described, when a separator $S$ is determined in $G_a$ or its subgraph in this procedure, the common subgraph is sought in step (4-2) by overlapping the vertices of $S$ with the corresponding vertices of $G_b$. Once this step is complete, the procedure never again derives a common subgraph that contains two edges adjacent to a vertex of $S$. As a result, it may not be possible to derive a largest common subgraph in which a vertex $v$ of $S$, together with two edges adjacent to $v$, coincides with a vertex of $G_b$ within an error less than $t_d$, even if such a subgraph exists. In order to cope with this problem, further elaboration is required.

6. Conclusions

We offer a method that can find a largest common connected subgraph of two connected graphs $G_a$ and $G_b$ with three-dimensional structures. An algorithm with smaller asymptotic time complexity is presented for the case where $G_a$ is a tree or a planar graph with three-dimensional structure and the following two conditions are satisfied.

Condition 1: No vertex in $G_a$ or $G_b$ has order exceeding a certain constant $c$.

Condition 2: No two adjacent edges in $G_a$ are located on a straight line.

By a simple extension, the methods presented can be applied also to the case where the vertices and edges are labelled. When an error can be tolerated between the corresponding vertices, further investigation is required, as was pointed out in section 5. It is also left for future study to attempt, using computer experiments, an application to the extraction of common subgraphs between chemical structures.

Acknowledgement. The authors thank the reviewer for useful comments.

REFERENCES

1. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, CA (1979).

2. S. Ono (ed.). Computed Chemistry. Maruzen Co. (1988).

3. M.M. Cone, R. Venkataraghavan, and F.W. McLafferty. Molecular structure comparison program for the identification of maximal common substructures. J. Am. Chem. Soc., 99, No. 23, pp. 7668-7671 (Nov. 1977).

4. Y. Takahashi, Y. Satoh, H. Suzuki, and S. Sasaki. Recognition of largest common structural fragment among a variety of chemical structures. Analytical Sciences, 3, pp. 23-28 (Feb. 1987).

5. D.M. Bayada, R.W. Simpson, and A.P. Johnson. An algorithm for the multiple common subgraph problem. J. Chem. Inf. Comput. Sci., 32, No. 6, pp. 680-685 (1992).

6. T. Akutsu. An RNC algorithm for finding a largest common subtree of two trees. I.E.I.C.E. Trans. Inf. & Syst., E75-D, No. 1, pp. 95-101 (Jan. 1992).

7. T. Akutsu. A polynomial time algorithm for finding a largest common subgraph of almost trees of bounded degree. I.E.I.C.E. Trans. Fundamentals, E76-A, No. 9, pp. 1488-1493 (Sept. 1993).

8. S. Masuda, I. Mori, and E. Tanaka. An algorithm to find the largest common subgraph of two trees. Trans. I.E.I.C.E.J. (A), J77, No. 3, pp. 460-470 (March 1994).

9. Y. Takahashi, S. Maeda, and S. Sasaki. Automated recognition of common geometrical patterns among a variety of three-dimensional molecular structures. Analytica Chimica Acta, 200, pp. 363-377 (1987).

10. J.-P. Doucet and J. Weber. Computer-Aided Molecular Design. Academic Press, London (1996).

11. T. Akutsu. On approximability of the largest common point set of multiple point sets. Information Processing Society of Japan Research Reports, AL33-9 (May 1993).

12. R.J. Lipton and R.E. Tarjan. A separator theorem for planar graphs. SIAM J. Appl. Math., 36, No. 2, pp. 177-189 (April 1979).

13. T. Nishizeki and N. Chiba. Planar graphs: theory and algorithms. Annals of Disc. Math., 32, North-Holland, Amsterdam (1988).

AUTHORS

Sumio Masuda (member) graduated in 1979 from Dept. Inf., Fac. Eng. Sci., Osaka Univ. Completed 2nd Half of Doctor's Program in 1984 Grad. Sch. Dr. Eng. Served as Assistant and Lecturer, Dept. Inf., Fac. Eng. Sci., Osaka Univ. Since 1991, Assoc. Prof., Dept. Elect. Elect. Eng., Fac. Eng., Kobe Univ. Mostly engaged in research on algorithm design and graph theory. Member, IEEE and Inf. Proc. Soc.

Hiroyuki Yoshioka (student member) graduated in 1996 from Dept. Elect. Elect. Eng., Fac. Eng., Kobe Univ. Presently, Grad Student in 1st Half of Doctor's Program. Interested in design of graph algorithms.

Eiichi Tanaka (member) graduated in 1962 from Dept. Electrical Eng., Fac. Eng., Osaka Pref. Univ. Satisfied credit requirement and graduated from Doctor's Program in 1967, Grad. Sch., Osaka Univ., and became Assistant, Dept. Electrical Eng., Fac. Eng., Osaka Pref. Univ. Prof. 1977 Dept. Inf., Fac. Eng., Utsunomiya Univ. Presently, Prof. Dept. Elect. Elect. Eng., Fac. Eng., Kobe Univ. Engaged in research on graph theory and file composition method. Dr. Eng. Member, IEEE, Inf. Proc. Soc. and Math. Soc.
