Algorithm for Finding One of the Largest Common Subgraphs
of Two Three-Dimensional Graph Structures
Sumio Masuda
Faculty of Engineering, Kobe University, Kobe, Japan 657
Hiroyuki Yoshioka
Graduate School of Science and Technology, Kobe University, Kobe, Japan 657
Eiichi Tanaka
Faculty of Engineering, Kobe University, Kobe, Japan 657
SUMMARY
Let $G_a = (V_a, E_a)$ and $G_b = (V_b, E_b)$ be two connected graphs with three-dimensional structures, and let $n_a = |V_a|$, $m_a = |E_a|$, $n_b = |V_b|$, and $m_b = |E_b|$. Let the maximum order (degree) of a vertex in $G_a$ ($G_b$) be $l_a$ ($l_b$). This paper first presents a method that finds a largest common subgraph of $G_a$ and $G_b$ in $O(l_a m_a^2 l_b m_b \log n_b)$ time. It then proposes an algorithm with $O(n_a^{1.5} n_b \log n_a \log n_b)$ running time for the case where $G_a$ is a planar graph with a three-dimensional structure and the following two conditions hold. Condition 1: In $G_a$ and $G_b$, no vertex has order exceeding a specified constant $c$. Condition 2: In $G_a$, no two adjacent edges are located on the same straight line. In particular, we show that when $G_a$ is a tree and Conditions 1 and 2 are satisfied, a largest common subgraph of $G_a$ and $G_b$ can be found in $O(n_a n_b \log n_a \log n_b)$ time. © 1998 Scripta Technica, Electron Comm Jpn Pt 3, 81(9): 48–53, 1998
Key words: Graph; largest common subgraph; algorithm; computational complexity.
1. Introduction
The largest common subgraph problem is a problem
in which a connected subgraph with the largest number of
edges is to be determined among all subgraphs common to
a given collection of graphs. This problem is generally
NP-hard, even if the number of given graphs is restricted to
two [1].
Let a vertex correspond to an atom and an edge to a bond between atoms. A graph then represents a chemical compound. If a common subgraph can be found for several compounds with similar properties, it helps locate the structure that is responsible for those properties [2]. From this perspective, various studies have addressed the largest common subgraph problem [3–5]. Other studies observe that many graphs corresponding to compounds are trees and that a vertex in such a graph usually has only a small order, and discuss the largest common subgraph problem for restricted classes of graphs [6–8].
Since a compound generally has a three-dimensional structure, there is a possibility that a more effective method can be developed if the structural information is adequately utilized. With this viewpoint, several methods have been proposed that utilize the inter-atomic distance to determine the largest subgraph common to two compounds [9, 10].
CCC1042-0967/98/090048-06
© 1998 Scripta Technica
Electronics and Communications in Japan, Part 3, Vol. 81, No. 9, 1998. Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J80-A, No. 5, May 1997, pp. 803–808
In this paper we discuss the problem of determining a largest common subgraph of graphs with three-dimensional structures (i.e., graphs whose vertices have coordinates). In Ref. 11 this problem is discussed in terms of approximation for the case where the number of graphs is arbitrary. In this paper we present an algorithm for the case where the number of graphs is two.
In the first half of this paper, we present a method that obtains a largest connected common subgraph of two connected graphs $G_a = (V_a, E_a)$ and $G_b = (V_b, E_b)$ with three-dimensional structures in $O(l_a m_a^2 l_b m_b \log n_b)$ time, where $m_a = |E_a|$, $n_b = |V_b|$, $m_b = |E_b|$, and $l_a$ ($l_b$) is the maximum order of a vertex in $G_a$ ($G_b$).
According to Ref. 5, most graphs representing compounds are planar. The latter half of this paper considers the case where $G_a$ is a tree or a planar graph with a three-dimensional structure and the following two conditions hold.
Condition 1: In $G_a$ and $G_b$, no vertex has an order exceeding a constant $c$.
Condition 2: In $G_a$, no two adjacent edges are located on a straight line.
For this case, an algorithm with a smaller asymptotic time complexity is proposed.
This paper is organized as follows. Section 2 gives several definitions. Section 3 presents an algorithm for the case where $G_a$ is an arbitrary graph with a three-dimensional structure. Section 4 presents an algorithm for the case where $G_a$ is a planar graph or a tree and satisfies Conditions 1 and 2. Section 5 discusses the case where the common subgraph need not match exactly and some discrepancy between the subgraphs is tolerated. Finally, section 6 summarizes the results of this paper.
2. Definitions
A graph that can be drawn in a plane with no two
edges intersecting or contacting except at a common end
point is called a planar graph. A connected graph without a
cycle is called a tree. Every tree is a planar graph.
Let $(V, E)$ be a simple graph, i.e., a graph that has neither self-loops nor multiple edges. Place the vertices of $V$ in three-dimensional space under a mapping $p: V \to \mathbb{R}^3$ and represent the edges by straight-line segments between their end points. The resulting structure is called a three-dimensional graph structure and is denoted by $(V, E, p)$ ($p$ is omitted if there is no danger of confusion). It is assumed that no two vertices are placed at the same point, that no edge passes through any vertex other than its end points, and that no two edges intersect or touch except at shared end points. Among three-dimensional graph structures, one for which the graph $(V, E)$ is planar is called a 3DPG, and one for which $(V, E)$ is a tree is called a 3DT.
In the three-dimensional graph structure $G = (V, E, p)$, the length of an edge $e \in E$ is written $\mathrm{leng}(e)$. For two adjacent edges $e_1$ and $e_2$, the angle formed between them is written $\mathrm{angle}(e_1, e_2)$. When $e_1$ and $e_2$ form a straight line, $\mathrm{angle}(e_1, e_2) = 180^\circ$.
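As an illustration of these two quantities, the following minimal Python sketch computes them for a placement stored as a dictionary. The representation (a dict `p` from vertex to a coordinate triple, edges as vertex pairs) and the function names are assumptions made for this example only.

```python
import math

def leng(p, e):
    """Euclidean length of edge e = (u, v) under the placement p."""
    u, v = e
    return math.dist(p[u], p[v])

def angle(p, e1, e2):
    """Angle in degrees between two edges sharing exactly one end point."""
    shared = set(e1) & set(e2)
    if len(shared) != 1:
        raise ValueError("edges must be adjacent")
    (c,) = shared
    a = e1[0] if e1[1] == c else e1[1]   # other end point of e1
    b = e2[0] if e2[1] == c else e2[1]   # other end point of e2
    va = [x - y for x, y in zip(p[a], p[c])]
    vb = [x - y for x, y in zip(p[b], p[c])]
    dot = sum(x * y for x, y in zip(va, vb))
    cos = dot / (math.hypot(*va) * math.hypot(*vb))
    # clamp against rounding noise before taking the arc cosine
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))
```

Two edges lie on a straight line exactly when `angle` returns 180 (up to floating-point error), which is the case excluded by Condition 2.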
Throughout this paper, let $G_a = (V_a, E_a, p_a)$ and $G_b = (V_b, E_b, p_b)$ be connected three-dimensional graph structures. Also let $n_a = |V_a|$, $m_a = |E_a|$, $n_b = |V_b|$, and $m_b = |E_b|$. Let the maximum order of a vertex in $G_a$ ($G_b$) be $l_a$ ($l_b$).
Among the connected common subgraphs of $G_a$ and $G_b$, one with the largest number of edges is called a largest common subgraph. This paper discusses the problem of finding a largest common subgraph of $G_a$ and $G_b$. This problem is easy to solve when $n_a < 3$ or $n_b < 3$. In what follows, we assume that $n_a \ge 3$ and $n_b \ge 3$.
3. An Algorithm to Find the Largest
Common Subgraph of a Three-Dimensional
Graph Structure
Among the common connected subgraphs of $G_a$ and $G_b$, one in which any two adjacent edges are located on a straight line is called a straight-line common subgraph. Among all straight-line common subgraphs, one with the largest number of edges is called a largest straight-line common subgraph. A largest straight-line common subgraph of $G_a$ and $G_b$ can be found in $O(l_a m_a + l_b m_b + m_a m_b)$ time (the proof is omitted).
Among the common connected subgraphs of $G_a$ and $G_b$, one that is not a straight-line structure, i.e., one that contains two adjacent edges not on a straight line, is called a non-straight-line common subgraph. Among all non-straight-line common subgraphs, one with the largest number of edges is called a largest non-straight-line common subgraph. Of a largest straight-line common subgraph and a largest non-straight-line common subgraph, the one with more edges is obviously a largest common subgraph of $G_a$ and $G_b$.
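The method for finding a largest non-straight-line common subgraph of $G_a$ and $G_b$ is, briefly, as follows. Select from $G_a$ a vertex $v$ of order at least 2 and two edges $e_1$ and $e_2$ incident to $v$. Similarly, select from $G_b$ a vertex $w$ of order at least 2 and two edges $f_1$ and $f_2$ incident to $w$. If the following three conditions are satisfied, a translation and a rotation are applied to $G_a$ so that $v$ is placed on $w$, $e_1$ is mapped onto $f_1$, and $e_2$ is mapped onto $f_2$.
(i) $\mathrm{leng}(e_1) = \mathrm{leng}(f_1)$
(ii) $\mathrm{leng}(e_2) = \mathrm{leng}(f_2)$
(iii) $\mathrm{angle}(e_1, e_2) = \mathrm{angle}(f_1, f_2) \neq 180^\circ$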
When the above transformation is applied to $G_a$, the coordinates of all vertices other than $v$ are also uniquely determined. Thus, the subgraph that is common to $G_a$ and $G_b$ under this placement can be easily determined. The procedure that executes the above process is written Match($G_a$, $G_b$, $v$, $e_1$, $e_2$, $w$, $f_1$, $f_2$). This procedure is applied to all combinations of $v$, $e_1$, $e_2$, $w$, $f_1$, $f_2$. By selecting, among the common connected subgraphs so obtained, one with the largest number of edges, a largest non-straight-line common subgraph of $G_a$ and $G_b$ is found.
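The idea behind Match can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: instead of building a rotation matrix, it expresses both structures in a right-handed local frame attached to the chosen vertex and edge pair, so that vertices that would coincide after the translation and rotation get equal local coordinates. Edge lengths and the enclosed angle are compared implicitly through these coordinates, positions are compared after rounding (a stand-in for the data structure mentioned in the text), and all names (`local_coords`, `match`, `digits`) are hypothetical.

```python
import math

def _sub(a, b): return tuple(x - y for x, y in zip(a, b))
def _dot(a, b): return sum(x * y for x, y in zip(a, b))
def _cross(a, b):
    return (a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0])
def _unit(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)

def local_coords(p, v, u1, u2, digits=6):
    """Coordinates of every vertex in the right-handed frame at v whose x-axis
    points to u1 and whose xy-plane contains u2 (u1, v, u2 must not be collinear)."""
    x = _unit(_sub(p[u1], p[v]))
    z = _unit(_cross(x, _sub(p[u2], p[v])))
    y = _cross(z, x)
    return {t: tuple(round(_dot(_sub(q, p[v]), ax), digits) for ax in (x, y, z))
            for t, q in p.items()}

def match(pa, ea, pb, eb, v, u1, u2, w, x1, x2):
    """Edges of the common connected subgraph obtained by placing v on w,
    edge (v, u1) on (w, x1) and edge (v, u2) on (w, x2); [] if they do not fit."""
    ca = local_coords(pa, v, u1, u2)
    cb = local_coords(pb, w, x1, x2)
    if ca[u1] != cb[x1] or ca[u2] != cb[x2]:
        return []                                  # lengths or angle differ
    where = {c: t for t, c in cb.items()}          # position -> vertex of Gb
    image = {t: where[c] for t, c in ca.items() if c in where}
    eb_set = {frozenset(e) for e in eb}
    common = [(s, t) for s, t in ea
              if s in image and t in image
              and frozenset((image[s], image[t])) in eb_set]
    # keep only the connected part that contains v
    adj = {}
    for s, t in common:
        adj.setdefault(s, []).append(t)
        adj.setdefault(t, []).append(s)
    seen, stack = {v}, [v]
    while stack:
        s = stack.pop()
        for t in adj.get(s, []):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return [(s, t) for s, t in common if s in seen]
```

Because both frames are right-handed, the implied transformation is a proper rotation plus a translation; mirror images are not matched. The dictionary lookup used here does not reproduce the per-call time bound discussed in the text.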
Although the details are omitted, by storing the edge set of $G_b$ in a suitable data structure, one execution of the procedure Match takes $O(m_a \log n_b)$ time. Since the total number of combinations of $v$, $e_1$, $e_2$, $w$, $f_1$, $f_2$ is $O(l_a m_a l_b m_b)$, the time required to find a largest non-straight-line common subgraph is $O(l_a m_a^2 l_b m_b \log n_b)$. Thus, the following theorem is obtained.
Theorem 1. A largest common subgraph of two three-dimensional graph structures $G_a$ and $G_b$ can be found in $O(l_a m_a^2 l_b m_b \log n_b)$ time. □
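The count behind this bound can be spelled out as follows. The notation $d(v)$ for the order of a vertex $v$ is introduced here for illustration only; ordered pairs of edges at $v$ are counted.

```latex
\begin{align*}
\#\{(v, e_1, e_2)\} &= \sum_{v \in V_a} d(v)\,\bigl(d(v) - 1\bigr)
  \;\le\; l_a \sum_{v \in V_a} d(v) \;=\; 2\, l_a m_a \;=\; O(l_a m_a), \\
\#\{(w, f_1, f_2)\} &= O(l_b m_b), \\
\text{total time} &= O(l_a m_a) \cdot O(l_b m_b) \cdot O(m_a \log n_b)
  \;=\; O(l_a m_a^2\, l_b m_b \log n_b).
\end{align*}
```

The straight-line case costs $O(l_a m_a + l_b m_b + m_a m_b)$, which is dominated by the bound above.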
Whether or not $G_b$ contains a subgraph identical to $G_a$ is an especially important problem in substructure searches of chemical compound databases [2]. The following lemma was derived for this purpose (the proof is omitted).
Lemma 1. It is possible to decide whether or not $G_b$ contains a subgraph identical to $G_a$ in $O(m_a l_b m_b \log n_b)$ time. □
4. Algorithm for a Planar Graph Ga
In this section, we first consider the case where $G_a$ is a 3DPG and the following two conditions hold.
Condition 1: In $G_a$ and $G_b$, no vertex has order exceeding a specified constant $c$.
Condition 2: In $G_a$, no two adjacent edges are located on a straight line.
It follows from Condition 1 that $m_a = O(n_a)$ and $m_b = O(n_b)$. In this case, a largest common subgraph of $G_a$ and $G_b$ can be found in $O(n_a^2 n_b \log n_b)$ time by executing the procedure Match described in section 3 $O(n_a n_b)$ times in total. In the following, it is shown that the time complexity can be reduced to $O(n_a^{1.5} n_b \log n_a \log n_b)$ by iteratively applying a graph separation process. The case where $G_a$ is a 3DT is discussed at the end of this section.
4.1. Separation of a graph
The following lemma is known for planar graphs
[12].
Lemma 2. Let $G = (V, E)$ be an arbitrary planar graph with $n$ vertices. Then, the vertex set $V$ can be separated into subsets $V_1$, $V_2$, and $S$ that satisfy the following three conditions.
(i) $V_1 \cup V_2 \cup S = V$ and $V_1$, $V_2$, $S$ are pairwise disjoint.
(ii) There is no edge in $G$ that connects a vertex in $V_1$ and a vertex in $V_2$.
(iii) $|V_1| \le 2n/3$, $|V_2| \le 2n/3$, and $|S| \le 2\sqrt{2n}$. □
In this case, the vertex set $S$ is called a separator of $G$. Each of the two graphs $G_1$ and $G_2$ defined as follows is called a separated component with respect to $S$.
$$G_i = (V_i \cup S,\ E_i), \qquad E_i = \{(u, v) \in E \mid u \in V_i,\ v \in V_i \cup S\} \qquad (i = 1, 2)$$
Figure 1 shows a simple example of a planar graph and a separator. For a planar graph $G$, a separator can be determined in $O(n)$ time [12].
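The three conditions of Lemma 2 are easy to check for a proposed split, as the following Python sketch shows. It does not compute a separator (the linear-time construction of Ref. 12 is considerably more involved). The helper `separated_components` rests on an assumption about the separated components: component $G_i$ is taken to consist of $V_i \cup S$ together with the edges that have at least one end point in $V_i$. All function names are hypothetical.

```python
def is_separator(vertices, edges, v1, v2, s):
    """Check conditions (i)-(iii) of Lemma 2 for a proposed split (V1, V2, S)."""
    v1, v2, s = set(v1), set(v2), set(s)
    n = len(vertices)
    partition = (v1 | v2 | s == set(vertices)
                 and not (v1 & v2) and not (v1 & s) and not (v2 & s))
    no_cross = all(not ((a in v1 and b in v2) or (a in v2 and b in v1))
                   for a, b in edges)
    # |S| <= 2*sqrt(2n) is tested in the squared form |S|^2 <= 8n
    balanced = (3 * len(v1) <= 2 * n and 3 * len(v2) <= 2 * n
                and len(s) ** 2 <= 8 * n)
    return partition and no_cross and balanced

def separated_components(edges, v1, v2, s):
    """Assumed form of the two separated components: vertex set V_i plus S,
    and every edge with at least one end point in V_i."""
    v1, v2, s = set(v1), set(v2), set(s)
    e1 = [(a, b) for a, b in edges if a in v1 or b in v1]
    e2 = [(a, b) for a, b in edges if a in v2 or b in v2]
    return (v1 | s, e1), (v2 | s, e2)
```

Under this assumed form, an edge joining two separator vertices belongs to neither component, so the two components never share an edge.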
Fig. 1. An example of a separator.
The algorithm proposed in this section is based on the following lemma.
Lemma 3. Assume that a largest common subgraph of two three-dimensional graph structures $G$ and $G'$ has two or more edges, and that $G$ is a 3DPG. Consider an arbitrary separator $S$ of $G$, and let the separated components with respect to $S$ be $G_1$ and $G_2$. Let $C_1$ be an arbitrary largest common connected subgraph of $G_1$ and $G'$, and let $C_2$ be one of $G_2$ and $G'$. Further, among the common connected subgraphs of $G$ and $G'$ that contain a vertex of $S$ and two or more edges incident to that vertex, let $C_S$ be an arbitrary one with the largest number of edges. Then, one of $C_1$, $C_2$, and $C_S$ is a largest common subgraph of $G$ and $G'$.
Proof. Let $C$ be an arbitrary largest common subgraph of $G$ and $G'$, and let $\alpha$ be the number of its edges. Let the numbers of edges of $C_1$, $C_2$, and $C_S$ be $\alpha_1$, $\alpha_2$, and $\alpha_S$, respectively. Obviously, $\alpha \ge \max\{\alpha_1, \alpha_2, \alpha_S\}$.
If $C$ contains a vertex $v$ of $S$ and two or more edges incident to $v$, then $\alpha \le \alpha_S$. If $C$ contains no such vertex, it follows from $\alpha \ge 2$ and the connectivity of $C$ that $C$ does not contain an edge connecting two vertices of $S$. By the definitions of the separator and the separated components, $C$ is in this case a common connected subgraph of $G_1$ and $G'$, or of $G_2$ and $G'$. Thus, $\alpha \le \max\{\alpha_1, \alpha_2\}$. It follows that $\alpha = \max\{\alpha_1, \alpha_2, \alpha_S\}$. In other words, among $C_1$, $C_2$, and $C_S$, one with the largest number of edges is a largest common subgraph of $G$ and $G'$. □
For $G$ defined as in Lemma 3, assume that no two adjacent edges are located on a straight line. In this case, in order to determine $C_S$, it suffices to apply the procedure Match described in section 3 to all combinations of a vertex of $S$ with two edges incident to it and a vertex of $G'$ with two edges incident to it. In order to determine $C_1$ and $C_2$, it suffices to find a largest common subgraph of $X$ and $G'$ for each connected component $X$ (with three or more vertices) of $G_1$ and $G_2$. Each $X$ is obviously a 3DPG, and Lemma 3 can be applied recursively to $X$ and $G'$.
4.2. Algorithm
Based on the reasoning in the previous section, our algorithm for the case where $G_a$ is a 3DPG and Conditions 1 and 2 hold is constructed as follows.
First, it is examined whether there exist edges $e \in E_a$ and $f \in E_b$ with the same length. If no such pair of edges exists, the algorithm ends, deciding that a largest common subgraph of $G_a$ and $G_b$ is a single vertex. If such a pair exists, it is recorded. Then, a common subgraph containing two or more edges is sought. For this purpose, the following procedure is executed, and among all derived common subgraphs, one with the largest number of edges is determined.
Procedure Find_CSS
(1) Let $\mathrm{G\_SET}_0 := \{G_a\}$, $j := 0$, $M := 89$.
(2) If $\mathrm{G\_SET}_j = \emptyset$, the algorithm ends.
(3) Let $\mathrm{G\_SET}_j = \{G_1^j, G_2^j, \ldots, G_{k_j}^j\}$ and $\mathrm{G\_SET}_{j+1} := \emptyset$.
(4) For each $i = 1, 2, \ldots, k_j$, if the number of vertices in $G_i^j$ is less than $M$, determine a largest common subgraph of $G_i^j$ and $G_b$. If the number is not less than $M$, execute steps (4-1) and (4-2).
(4-1) Determine a separator of $G_i^j$, and let it be $S_i^j$. Determine the separated components of $G_i^j$ with respect to $S_i^j$. Separate each of these into connected components and add all connected components with three or more vertices to the set $\mathrm{G\_SET}_{j+1}$.
(4-2) Among the common subgraphs that contain a vertex of $S_i^j$ and two edges incident to that vertex, determine one with the largest number of edges.
(5) Let $j := j + 1$ and go to step (2). □
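The control flow of the procedure can be sketched as a level-by-level loop. In this Python skeleton every graph operation is passed in as a callback, so that only the structure of steps (1) to (5) is shown. All parameter names (`size`, `small_case`, `find_separator`, `split`, `match_at_separator`) are hypothetical placeholders rather than parts of the paper, and candidates are compared simply by the length of the returned edge list.

```python
def find_css(ga, size, small_case, find_separator, split,
             match_at_separator, m=89):
    """Skeleton of procedure Find_CSS; returns the best candidate found.

    size(g)                  -> number of vertices of g
    small_case(g)            -> largest common subgraph of g and G_b (edge list)
    find_separator(g)        -> a separator of g
    split(g, s)              -> connected components of the separated components
    match_at_separator(g, s) -> best common subgraph through a separator vertex
    """
    best = []
    level = [ga]                                  # step (1): G_SET_0 = {G_a}
    while level:                                  # step (2)
        next_level = []                           # step (3): G_SET_{j+1}
        for g in level:                           # step (4)
            if size(g) < m:
                cand = small_case(g)
            else:
                s = find_separator(g)             # step (4-1)
                next_level.extend(c for c in split(g, s) if size(c) >= 3)
                cand = match_at_separator(g, s)   # step (4-2)
            if len(cand) > len(best):
                best = cand
        level = next_level                        # step (5)
    return best
```

Keeping the separator routine as a parameter means the same loop serves both the planar case and the tree case of section 4.4; only `find_separator` and the threshold `m` change.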
4.3. Time complexity
Among the sets $\mathrm{G\_SET}_0$, $\mathrm{G\_SET}_1$, $\mathrm{G\_SET}_2$, $\ldots$ constructed by procedure Find_CSS, let the first empty set be $\mathrm{G\_SET}_K$. For a planar graph with $n$ vertices, the number of vertices in a separated component is at most $2n/3 + 2\sqrt{2n}$. When $n \ge 89$, $2n/3 + 2\sqrt{2n} \le 29n/30$ holds. Consequently, if $n_a \ge 89$, any graph belonging to $\mathrm{G\_SET}_j$ has no more than $n_a (29/30)^j$ vertices, for $j = 0, 1, \ldots$. Let $J$ satisfy $n_a (29/30)^J = 88$; then $K = O(\log n_a)$, since $K \le \lceil J \rceil + 1$.
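The threshold 89 can be verified by a short calculation, given here as a worked check for positive $n$:

```latex
\begin{align*}
\frac{2n}{3} + 2\sqrt{2n} \le \frac{29n}{30}
 &\iff 2\sqrt{2n} \le \frac{29n}{30} - \frac{20n}{30} = \frac{3n}{10} \\
 &\iff 8n \le \frac{9n^2}{100} \\
 &\iff n \ge \frac{800}{9} \approx 88.9 .
\end{align*}
```

Hence the inequality holds for every integer $n \ge 89$ and fails for $n = 88$. Solving $n_a (29/30)^J = 88$ gives $J = \log_{30/29}(n_a / 88) = O(\log n_a)$.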
Let $j$ be an integer such that $0 \le j \le K$. For each $i = 1, 2, \ldots, k_j$, let $G_i^j = (V_i^j, E_i^j)$. Since no two graphs of $\mathrm{G\_SET}_j$ share an edge, and $G_a$ is a planar graph, $\sum_{i=1}^{k_j} |E_i^j| \le |E_a| < 3n_a$ holds. Furthermore, $\sum_{i=1}^{k_j} |V_i^j| = O(n_a)$, since each $G_i^j$ is connected.
The execution time of procedure Find_CSS is dominated by step (4). In the following, the time spent in this step is evaluated for each integer $j$ ($0 \le j < K$). Among the graphs of $\mathrm{G\_SET}_j$, let the set of graphs with fewer than 89 vertices be $Y_j$. In order to determine a largest common subgraph of each graph $G_i^j$ of $Y_j$ and $G_b$, the procedure Match is executed $O(|V_i^j| \cdot n_b) = O(n_b)$ times. The time for one execution is $O(|E_i^j| \log n_b)$. Consequently, the processing time for all graphs of $Y_j$ is
$$\sum_{G_i^j \in Y_j} O\bigl(n_b\, |E_i^j| \log n_b\bigr) = O(n_a n_b \log n_b).$$
Since $\sum_{i=1}^{k_j} |V_i^j| = O(n_a)$, when the separator is derived by the method in Ref. 12, the total time required in step (4-1) is $O(n_a)$ for each $j$.
In step (4-2), the procedure Match is executed $O(|S_i^j| \cdot n_b) = O(\sqrt{n_a} \cdot n_b)$ times for each graph $G_i^j$. For each $j$, the sum of the times required for step (4-2) is
$$\sum_{i=1}^{k_j} O\bigl(\sqrt{n_a}\, n_b\, |E_i^j| \log n_b\bigr) = O(n_a^{1.5} n_b \log n_b).$$
Thus, step (4) is executed in $O(n_a^{1.5} n_b \log n_b)$ time for each $j$. As already pointed out, $K = O(\log n_a)$. Consequently, the time complexity of the entire procedure Find_CSS is $O(n_a^{1.5} n_b \log n_a \log n_b)$.
Theorem 2. When $G_a$ is a 3DPG and Conditions 1 and 2 hold, it is possible to find a largest common subgraph of $G_a$ and $G_b$ in $O(n_a^{1.5} n_b \log n_a \log n_b)$ time. □
4.4. Case where Ga is a tree
It is known that a tree has a separator composed of a single vertex, and that such a separator can be found in time proportional to the number of vertices in the tree [13]. Consequently, when $G_a$ is a 3DT and Conditions 1 and 2 hold, the algorithm in section 4.2 can be applied. In this case, even if the constant $M$ is set to 6, the number of iterations $K$ of steps (3) and (4) of procedure Find_CSS is guaranteed to be $O(\log n_a)$ (since $2n/3 + 1 \le 5n/6$ holds for $n \ge 6$). Also, since $|S_i^j| = 1$ ($1 \le i \le k_j$) for each $j$, the time required for step (4-2) can be reduced to $O(n_a n_b \log n_b)$. As a result, the time complexity of the algorithm in this case is $O(n_a n_b \log n_a \log n_b)$.
Theorem 3. When $G_a$ is a 3DT and Conditions 1 and 2 hold, it is possible to find a largest common subgraph of $G_a$ and $G_b$ in $O(n_a n_b \log n_a \log n_b)$ time. □
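A single-vertex separator of a tree can be obtained, for example, as a centroid. The Python sketch below uses the standard subtree-size computation; it is not necessarily the method of Ref. 13, and the function name and the vertex numbering `0..n-1` are assumptions. A centroid leaves components of at most $n/2$ vertices, which is within the $2n/3$ bound needed here.

```python
def tree_centroid(n, edges):
    """Return a vertex whose removal leaves components of at most n // 2
    vertices. Vertices are 0..n-1 and `edges` must form a tree."""
    adj = [[] for _ in range(n)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    parent = [-1] * n
    seen = [False] * n
    seen[0] = True
    order, stack = [], [0]
    while stack:                       # iterative DFS from vertex 0
        u = stack.pop()
        order.append(u)
        for w in adj[u]:
            if not seen[w]:
                seen[w] = True
                parent[w] = u
                stack.append(w)
    size = [1] * n
    for u in reversed(order):          # children are processed before parents
        if parent[u] >= 0:
            size[parent[u]] += size[u]
    for u in range(n):
        heaviest = n - size[u]         # the part containing u's parent
        for w in adj[u]:
            if parent[w] == u:         # w is a child of u
                heaviest = max(heaviest, size[w])
        if 2 * heaviest <= n:
            return u
```

The function runs in time linear in the number of vertices, in line with the bound quoted from Ref. 13.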
5. Case Where Error Is Tolerated
In applications such as the extraction of a common subgraph between chemical structures, it is important to have available an algorithm, not for the strictly largest common subgraph, but for a subgraph in which errors are tolerated to some extent between the corresponding vertices. Both the largest non-straight-line common subgraph algorithm in section 3 and the algorithm in section 4 use the procedure Match. When an error can be tolerated, this procedure must be modified. The following offers one possible approach.
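Suppose that a vertex $v$ and two edges $e_1$, $e_2$ incident to $v$ are selected from $G_a$, and a vertex $w$ and two edges $f_1$, $f_2$ incident to $w$ are selected from $G_b$. Then, the conditions for overlapping these are
$$|\mathrm{leng}(e_1) - \mathrm{leng}(f_1)| \le t_l, \quad |\mathrm{leng}(e_2) - \mathrm{leng}(f_2)| \le t_l, \quad |\mathrm{angle}(e_1, e_2) - \mathrm{angle}(f_1, f_2)| \le t_a,$$
where $t_l$ and $t_a$ are the tolerances for the length and the angle, respectively.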
When $v$, $e_1$, $e_2$, $w$, $f_1$, and $f_2$ satisfy all of these conditions, they are overlapped by the method in 11.1.1 of Ref. 10. In that method, $v$ is overlapped with $w$. A rotation is applied so that the plane containing both $e_1$ and $e_2$ coincides with the plane containing both $f_1$ and $f_2$. Then, a rotation is applied about the straight line that is perpendicular to this plane and passes through $v$, so that edges $e_1$ and $f_1$ overlap. The common subgraph is sought after the above transformation. For two edges $e = (v', v'') \in E_a$ and $f = (w', w'') \in E_b$, if the distance between $v'$ and $w'$ and the distance between $v''$ and $w''$ are both within a certain threshold $t_d$, it is accepted that $e$ and $f$ coincide.
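The two tolerance tests can be written down directly. The Python sketch below assumes that the overlap conditions bound the absolute differences of the edge lengths by $t_l$ and of the enclosed angles by $t_a$, and that end points are compared in the order given; the function and parameter names are hypothetical.

```python
import math

def can_overlap(len_e1, len_e2, ang_e, len_f1, len_f2, ang_f, tl, ta):
    """Assumed precondition for overlapping (v, e1, e2) with (w, f1, f2):
    lengths agree within tl and the enclosed angles agree within ta."""
    return (abs(len_e1 - len_f1) <= tl
            and abs(len_e2 - len_f2) <= tl
            and abs(ang_e - ang_f) <= ta)

def edges_coincide(e_pts, f_pts, td):
    """e_pts = (v1, v2) and f_pts = (w1, w2) are the end-point coordinates of
    two edges after the transformation; the edges are accepted as coinciding
    if both pairs of corresponding end points are within distance td."""
    (v1, v2), (w1, w2) = e_pts, f_pts
    return math.dist(v1, w1) <= td and math.dist(v2, w2) <= td
```

Because `edges_coincide` compares positions only within $t_d$, exact-lookup structures such as hash tables on coordinates no longer suffice; a spatial index would be needed to keep each lookup fast.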
The above method is very simple, but the following problem arises because $v$ and $w$ are overlapped with zero error, especially when the method is applied to the procedure Find_CSS in section 4.2. As already described, when a separator $S$ is determined in $G_a$ or its subgraph in this procedure, the common subgraph is sought in step (4-2) by overlapping the vertices of $S$ with the corresponding vertices of $G_b$. Once this step is complete, no common subgraph containing two edges incident to a vertex of $S$ can be derived in later steps. As a result, it may not be possible to derive a largest common subgraph in which a vertex $v$ of $S$ with two incident edges coincides with a vertex of $G_b$ within an error less than $t_d$, even if such a subgraph exists. In order to cope with this problem, further elaboration is required.
6. Conclusions
We have presented a method that finds a largest common connected subgraph of two connected graphs $G_a$ and $G_b$ with three-dimensional structures. An algorithm with smaller asymptotic time complexity has been presented for the case where $G_a$ is a tree or a planar graph with a three-dimensional structure and the following two conditions are satisfied.
Condition 1: No vertex in $G_a$ or $G_b$ has order exceeding a certain constant $c$.
Condition 2: No two adjacent edges in $G_a$ are located on a straight line.
By a simple extension, the methods presented can also be applied to the case where the vertices and edges are labelled. When an error can be tolerated between the corresponding vertices, further investigation is required, as pointed out in section 5. Applying the methods to the extraction of common subgraphs between chemical structures, and evaluating them by computer experiments, is also left for future study.
Acknowledgement. The authors thank the reviewer for useful comments.
REFERENCES
1. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, San Francisco, CA (1979).
2. S. Ono (ed.). Computed Chemistry. Maruzen Co. (1988).
3. M.M. Cone, R. Venkataraghavan, and F.W. McLafferty. Molecular structure comparison program for the identification of maximal common substructures. J. Am. Chem. Soc., 99, No. 23, pp. 7668–7671 (Nov. 1977).
4. Y. Takahashi, Y. Satoh, H. Suzuki, and S. Sasaki. Recognition of largest common structural fragment among a variety of chemical structures. Analytical Sciences, 3, pp. 23–28 (Feb. 1987).
5. D.M. Bayada, R.W. Simpson, and A.P. Johnson. An algorithm for the multiple common subgraph problem. J. Chem. Inf. Comput. Sci., 32, No. 6, pp. 680–685 (1992).
6. T. Akutsu. An RNC algorithm for finding a largest common subtree of two trees. I.E.I.C.E. Trans. Inf. & Syst., E75-D, No. 1, pp. 95–101 (Jan. 1992).
7. T. Akutsu. A polynomial time algorithm for finding a largest common subgraph of almost trees of bounded degree. I.E.I.C.E. Trans. Fundamentals, E76-A, No. 9, pp. 1488–1493 (Sept. 1993).
8. S. Masuda, I. Mori, and E. Tanaka. An algorithm to find the largest common subgraph of two trees. Trans. I.E.I.C.E.J. (A), J77, No. 3, pp. 460–470 (March 1994).
9. Y. Takahashi, S. Maeda, and S. Sasaki. Automated recognition of common geometrical patterns among a variety of three-dimensional molecular structures. Analytica Chimica Acta, 200, pp. 363–377 (1987).
10. J.-P. Doucet and J. Weber. Computer-Aided Molecular Design. Academic Press, London (1996).
11. T. Akutsu. On approximability of the largest common point set of multiple point sets. Information Processing Society of Japan Research Reports, AL33-9 (May 1993).
12. R.J. Lipton and R.E. Tarjan. A separator theorem for planar graphs. SIAM J. Appl. Math., 36, No. 2, pp. 177–189 (April 1979).
13. T. Nishizeki and N. Chiba. Planar Graphs: Theory and Algorithms. Annals of Disc. Math., 32, North-Holland, Amsterdam (1988).
AUTHORS
Sumio Masuda (member) graduated in 1979 from Dept. Inf., Fac. Eng. Sci., Osaka Univ. Completed 2nd Half of Doctor's Program in 1984 Grad. Sch. Dr. Eng. Served as Assistant and Lecturer, Dept. Inf., Fac. Eng. Sci., Osaka Univ. Since 1991, Assoc. Prof., Dept. Elect. Elect. Eng., Fac. Eng., Kobe Univ. Mostly engaged in research on algorithm design and graph theory. Member, IEEE and Inf. Proc. Soc.
Hiroyuki Yoshioka (student member) graduated in 1996 from Dept. Elect. Elect. Eng., Fac. Eng., Kobe Univ. Presently, Grad Student in 1st Half of Doctor's Program. Interested in design of graphic algorithm.
Eiichi Tanaka (member) graduated in 1962 from Dept. Electrical Eng., Fac. Eng., Osaka Pref. Univ. Satisfied credit requirement and graduated from Doctor's Program in 1967, Grad. Sch., Osaka Univ., and became Assistant, Dept. Electrical Eng., Fac. Eng., Osaka Pref. Univ. Prof. 1977 Dept. Inf., Fac. Eng., Utsunomiya Univ. Presently, Prof. Dept. Elect. Elect. Eng., Fac. Eng., Kobe Univ. Engaged in research on graph theory and file composition method. Dr. Eng. Member, IEEE, Inf. Proc. Soc. and Math. Soc.