
Cluster Analysis, Graphs, and Branching Processes as New Methodologies for Intelligent Systems on Example of Bibliometric and Social Network Data

Maria Nowakowska
The Ohio State University, Columbus, OH 43210

This article presents (1) a general formalism for cluster analysis, allowing a systemic study of simulation research, in particular its dynamic aspects, (2) a model of small bibliographical clusters, allowing inference (among others) on the connectivity of domains, and (3) an outline of new theories of networks with randomly changing nodes and edges, applicable for analysis of different types of relations, e.g., communication between scientists, etc. These models may be useful for analysis of large databases in artificial intelligence. They may also have significance as new approaches to neural network analysis.

I. INTRODUCTION

The aim of bibliometrics is to obtain qualitative information about various disciplines of science through an analysis of publications, but without reading the contents of these publications.

The central idea is that some easily accessible attributes of published papers, such as their titles, keywords, or lists of references, might, under suitable analysis, provide information about the structure of the discipline and its relation to other disciplines.

In Section II we present a brief outline of one of the leading bibliometric methodologies, namely cluster analysis, and discuss briefly the restrictions of its applicability. Cluster analysis is presented there as a formal system. This approach may serve as a foundation for unified, rigorous, and refined simulation analysis of cluster techniques.

In Section III we present a possibility of a new cluster methodology for the analysis of bibliometric data, where the main source of information is the shape of small clusters. This methodology uses the theory of branching processes and graphs to generate the clusters. The intuitions for this analysis are taken from polymer chemistry.



One of the main difficulties of bibliometric analyses is the formidable amount of data which have to be stored in computer memory. In contrast to that, the approach suggested here is empirically feasible.

In Section IV we show new methods of analysis of large networks, with nodes interpreted as persons. This section has an inference structure similar to that of the preceding one, in the sense that instead of studying the whole data set, one studies only a subset and uses mathematical theorems (in this case on random graphs) for inference on the whole process. An additional novelty here is the study of networks with changing sets of nodes and edges. This section presents results on inference about communication networks analyzed in Ref. 1.

II. CLUSTER ANALYSIS

A. General Scheme

The purpose of cluster analysis [2-6] is to organize sets of objects into clusters, given the data about these objects in the form of the values of selected attributes. Typically, the sets of objects are large, which presents additional technical problems [7].

The objective of cluster analysis is usually to create a taxonomy, verify some existing taxonomy, or possibly formulate and/or test some hypotheses.

The central principle used is that objects which are close to one another should be included in the same cluster. It follows therefore that the results will crucially depend on:

(a) the set of attributes selected,
(b) the chosen measure of closeness, and
(c) the algorithm of forming clusters out of sets of objects, given the data on their "distances".

As regards (a), any measure of closeness or "proximity" is built on the degrees of similarity of objects as given by the values of their attributes. The choice of these attributes is therefore quite crucial: on the one hand, an important attribute may be omitted by an oversight or by lack of knowledge that it is important for the phenomenon analyzed. It is also possible that two scientists differ in their opinion about the importance or relevance of an attribute for the phenomenon studied, so that some attributes may be omitted, so to speak, "by design."

On the other hand, there is a (rather common) tendency to take into account many attributes, so as to be sure to "cover" the domain. The argument here is that inclusion of inessential features does not do any harm, while one gets a fair chance of including all essential features. This is especially common when a powerful computer is available to handle large amounts of data, and the latter are easily collected. However, the argument that there is no harm in including attributes which may later turn out to be irrelevant, hence starting from very large attribute sets, is not quite convincing: the danger is to include attributes


that are closely related, which in effect amounts to giving too much weight to some features.

Let us mention here that the notions of "object" and "attribute" may be interpreted so that the same object is observed on consecutive occasions (at various times), with possibly changing sets of attributes observed. In this case, the notion of object is in fact interpreted as a "history" of an object. As a result, one may study the dynamics of clusters under changing (possibly randomly) classificatory schemes. One problem here is to find invariants of changing clusters, and the second is the optimization of classificatory schemes.

As regards (b), even after a choice of the set of attributes is made, there is a multitude of proximity measures, starting from correlation coefficients (especially useful when there are many attributes observed), metrics of various kinds (not necessarily Euclidean), nonmetric proximity measures (e.g., ones that are not symmetric), and so forth.

As regards (c), clustering methods fall into a number of classes, depending mostly on the objective function to be optimized [9-11], and application of various algorithms to the same set of data may lead to disturbingly different cluster configurations [8]. The choice of the "proper" or "best" algorithm depends on the results of validation.

The obtained configuration of clusters is typically described in terms of a number of parameters, such as the "density" of points within the cluster, the degree of "separation" of distinct clusters, their "hierarchical" or "nonhierarchical" structure (e.g., are there subclusters within clusters or not?), and so on. These terms carry a clear intuitive content, and it is typically easy to suggest a working definition for each of them. The trouble again is that it is also easy to suggest another, equally good working definition, which gives a different, competing description.

Finally, let us remark that there are no statistical foundations of cluster analysis, in the sense that the significance, stability, etc. of clusters have so far been studied only through simulation methods (for some exceptions, however, see Ref. 12).

B. A Formal Scheme

In this section, we shall outline a conceptual framework for cluster analysis, with the primary objective being to provide a possibility of systematic analysis of processes of validation.

From the preceding section it is clear that in the description of cluster analysis one may distinguish two components, which we shall write symbolically as

(A0, A1)  (1)

with the following intended interpretation.

Firstly, A0 is a system describing the "truth," that is, a set of objects partitioned into clusters and a mapping of objects into points of a metric space. Thus,

A0 = (Z, U1, . . . , Uk, R, d, h),  (2)


where Z is a nonempty set (of objects to be clustered), U1, . . . , Uk are disjoint nonempty sets whose union is Z (referred to as "true" clusters), (R, d) is a metric space, and h: Z → R is a representation of objects as points in R. The system A0 must satisfy some "axioms of clustering." These axioms may vary, depending on a specific purpose of analysis, but qualitatively speaking, they assert that the distance

d[h(z'), h(z'')]  (3)

between representations of objects z' and z'' tends to be shorter when z' and z'' are in the same cluster. The choice of a particular axiom determines, to a large extent, the clustering algorithm. For instance, one may require that the average distance between points within a cluster must be less than the average distance between points in different clusters, a requirement that could be repeatedly checked by an algorithm which re-allocates points to clusters until the condition is met.
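As an illustration of how such an axiom can drive an algorithm, the sketch below (a hypothetical helper, not a procedure from the article) repeatedly re-allocates each point to the cluster whose members are, on average, closest to it, assuming a Euclidean representation of the objects:

```python
import numpy as np

def reallocate(points, labels, k, max_iter=100):
    """Illustrative sketch: repeatedly move each point to the cluster whose
    members are, on average, closest to it, until no point changes cluster."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels).copy()
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(points):
            avg = []
            for c in range(k):
                # average Euclidean distance from x to the other members of cluster c
                members = points[(labels == c) & (np.arange(len(points)) != i)]
                avg.append(np.linalg.norm(members - x, axis=1).mean()
                           if len(members) else np.inf)
            best = int(np.argmin(avg))
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:        # no re-allocation improves average closeness: stop
            break
    return labels
```

On artificial data with known "true" clusters, the resulting partition can be compared with the true one, which is exactly the validation setting formalized below.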

Next, the second component of the system is designed to express the ideas that (1) objects are described through values of their attributes, (2) the experimenter may choose the attributes he or she wishes to observe, (3) the observations may contain errors, and (4) the values of attributes of objects z' and z'' provide information about the distance between points h(z') and h(z'').

Formally,

A1 = (N, V, T, Q),  (4)

where the intended interpretation is as follows. Firstly, N is the set of names (labels) of attributes of objects, while V is a family of sets

{V(i), i ∈ N}  (5)

with V(i) being the set of possible values of attribute i [in the simplest case all attributes may be numerical, so that each V(i) is the real line, but one can easily consider also qualitative attributes, such as colors, etc].

Next, T represents the true attributes of objects, so that T is a mapping

T : Z × N → V  (6)

which to every object z and attribute i in N assigns the value T(z, i) in V(i), interpreted as the "true" value of the ith attribute of z.

Finally, Q is a family of probability distributions,

Q = {P_{N',z} : N' ⊆ N, z ∈ Z},  (7)

where N' is a nonempty subset of N and z is an object in Z, with P_{N',z} being a probability distribution on the Cartesian product of the sets V(i) with i in N'. Here the interpretation is that P_{N',z}(M) is the probability that observation of the attributes in N' of object z will give values in the set M.

The axioms concerning this part of the system are of two kinds. Firstly, one must postulate that observations provide, despite possible distortions, information about the true values of attributes. Qualitatively speaking, this means that the probabilities of observing values of attributes that are close to the true values must be in some sense high (perhaps upon repetition of observation,


etc.). For instance, one may postulate here the existence of consistent estimators of true values of attributes.

Secondly, there must be axioms that relate the distance between representations h(z') and h(z'') of objects (in space R) and the values of their attributes. Qualitatively, the more attributes of z' and z'' we observe, the better our information about the distance (3) between h(z') and h(z''). Note that here the increase in information is not related so much to repetition of observations as to observing new attributes.

Any instance of cluster analysis may now be regarded as follows: the experimenter chooses a set N' of attributes that he or she wishes to observe (one may also regard this choice as random, by adding a suitable probability distribution to the system, if such an approach seems suitable in a given instance). Then the elements of Z are sampled (or the whole set Z is analyzed, depending on the context), and the values of the attributes from N' of the sampled objects are observed, possibly with observation errors, as specified by the probability measures in Q. This provides the raw data.

The next step consists of the choice of an estimator of the distance d between objects in space R. This is to be regarded in a wide sense: sometimes one makes inference only about order relations between distances, through some proximity measure which is not necessarily a distance, etc.

Finally, one applies one of a number of algorithms, which ultimately provides a partition of the selected objects into clusters.

The validity of the clustering procedure may now be assessed by comparing the true partition into clusters with the one obtained empirically.

Each instance of testing of a clustering procedure through simulation consists of specifying the components A0 and A1 of the scheme above to the extent that enables one to generate artificial observations about some postulated "objects" with an imposed structure.

To make such a simulation, one must specify all of the postulates (axioms) that have been qualitatively sketched above: regarding the "true" metric structure of clusters, and regarding the relationships between attributes, observations, and the metric.
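A minimal sketch of one such specification, with all names and distributions chosen purely for illustration: the A0 component is realized by random cluster centers in a Euclidean space and a mapping of objects to points around them, and the Q component of A1 by Gaussian observation error added to the true attribute values.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_instance(n_clusters=3, objects_per_cluster=20,
                      n_attributes=5, noise_sd=0.3):
    """Hypothetical specification of (A0, A1): true clusters are point clouds
    around random centers (the mapping h into a Euclidean space R); observations
    are the true attribute values plus independent Gaussian errors (the family Q)."""
    centers = rng.normal(0.0, 3.0, size=(n_clusters, n_attributes))
    true_values, true_labels = [], []
    for c, center in enumerate(centers):
        pts = center + rng.normal(0.0, 0.5, size=(objects_per_cluster, n_attributes))
        true_values.append(pts)
        true_labels += [c] * objects_per_cluster
    true_values = np.vstack(true_values)
    observed = true_values + rng.normal(0.0, noise_sd, size=true_values.shape)
    return true_values, observed, np.array(true_labels)
```

Feeding the observed values to a clustering algorithm and comparing the resulting partition with the true labels is then one simulated instance of the validation procedure described above.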

Such a specification may provide a useful taxonomy of validation methods as used so far for cluster analysis.

C. Application in Bibliometrics

In addition to the difficulties above, which are connected with every application of cluster analysis, there are difficulties specific to bibliometric studies. The main such difficulty lies in selecting the set of attributes which would later serve for determining the measure of proximity between publications. There seem to be only two candidates for such attributes, if one excludes the attributes related to semantics (which require reading the papers and passing a judgment about the similarity of the results, their interrelations, etc.). These two candidates are lists of keywords and lists of references. Both may provide measures of proximity of papers based on the qualitative idea that a big overlap


in reference lists and/or lists of keywords is likely to indicate a correspondingly big semantic overlap.

For technical reasons, reference lists (intuitively, a better source of information about proximity than keywords) are not usable in larger-scale research.

On the other hand, lists of keywords can provide, at best, a very coarse proximity measure (not to mention the fact that some journals do not require keywords at all, which puts a bias into the analysis).

In the remaining sections, we present a method of inference from bibliometric data which uses small clusters, and therefore does not utilize algorithms of cluster analysis, and which is empirically feasible as regards the amount of technical work.

III. POLYMERIC ANALYSIS OF CITATIONS

A. Introductory Remarks

Formation of polymers may be roughly described as a process of successive attachments of free-floating particles to the appropriate locations of the existing polymer molecule. In this way, the latter grows into a larger and larger cluster.

This phenomenon has been analyzed by a simple application of the theory of branching processes: when a polymer molecule stops growing, it may be, at least conceptually, viewed as a tree, and therefore as a sample path of a branching process (for a presentation of the theory of branching processes see, for instance, Refs. 13 and 14).

Now, a branching process is completely characterized by the probability distribution of the number of “offspring” of a member of the population, which in this case translates into the number of branches leaving a node of the tree.

Formally, this distribution is of the form

{p(0), p(1), p(2), . . .},  (8)

where p(k) is the probability of exactly k offspring. Such distributions are conveniently represented by their probability generating functions

f(s) = Σ_k p(k) s^k.  (9)

We have here the mean number of offspring m given by

m = Σ_k k p(k) = f'(1).  (10)

One of the main facts about branching processes is that if the mean m exceeds 1, then there is a positive probability that the process will grow to infinity (interpreted in terms of polymer chemistry as the fact that the whole substance will solidify, forming one big molecule; this, for instance, is the principle of operation of most types of glue). On the other hand, if the mean m of the offspring distribution is less than 1, the process will terminate after a finite number of steps with probability one, so that, in terms of polymers, the


molecules will be formed, but each will be a cluster of particles of a finite size. What will be observed in reality is a thickening of the substance, without reaching the point of solidification. In this case, various properties of the branching process in question may be interpreted as the appropriate properties of the resulting polymer "soup." For instance, the average size of the branching process to extinction, being the average cluster size, determines the density of the polymer, while the average number of generations to extinction (cluster length) allows one to distinguish the case of polymers with long and thin molecules from the case of short and "fat" ones (e.g., long molecules of mayonnaise, which may be broken just by reversing the mixing direction, an effect known to housewives who make mayonnaise at home, as opposed to, say, the thickening of gravy or jelly, resulting from formation of bigger and bigger, but not necessarily longer, molecules).

The distribution of the total size of the process until extinction, as well as the number of generations to extinction, may be obtained from the probability generating function f(s). Specifically, if G(s) is the probability generating function of the total size until extinction (that is, the sum of sizes of all generations), then G(s) is the solution of the equation

G(s) = s f[G(s)].  (11)

On the other hand, to obtain the probability distribution of the number of generations to extinction, define the iterations of f(s) by

f_1(s) = f(s),   f_{n+1}(s) = f[f_n(s)].  (12)

Then f_n(0) is the probability that the nth generation is zero, hence the probability that extinction occurred at the nth generation or earlier. Consequently, the difference

f_n(0) - f_{n-1}(0)  (13)

is the probability of extinction in exactly the nth generation.
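As a small numerical sketch (the offspring distribution below is invented for illustration), the iterations (12) can be computed directly, and the total size until extinction, whose generating function solves (11), can be checked by simulation:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.55, 0.30, 0.10, 0.05])   # illustrative offspring distribution p(k); mean < 1

def f(s):
    """Probability generating function f(s) = sum_k p(k) s^k, Eq. (9)."""
    return sum(pk * s**k for k, pk in enumerate(p))

# Iterates f_n(0): probability that extinction occurred by generation n (Eq. 12);
# successive differences give the distribution of generations to extinction (Eq. 13).
fn0, gen_dist = 0.0, []
for n in range(1, 16):
    prev, fn0 = fn0, f(fn0)
    gen_dist.append(fn0 - prev)

def total_size():
    """Simulate one branching process and return its total size until extinction."""
    size = current = 1
    while current:
        current = rng.choice(len(p), size=current, p=p).sum()
        size += current
    return size

m = sum(k * pk for k, pk in enumerate(p))          # mean offspring number
sizes = [total_size() for _ in range(10_000)]      # empirical check of the total size, Eq. (11)
print(m, np.mean(sizes), gen_dist[:5])
```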

B. Application to Citations

The aim of the analysis would be mostly to study how much the citation clusters deviate from the predictions based on the theory of branching processes. In this section, the term "citation cluster" or "citation tree" is used loosely, meaning either the graph of citations "forward" or citations "backward," i.e., either citations or bibliographical coupling. The choice depends, of course, on the specific aim of analysis and on the availability of the appropriate data.

Now, deviations between the predictions and the actual clusters are due to the same paper being cited by many other papers, which violates the assumption of independence, crucial in branching processes. The deviations therefore measure the degree of "connectivity" of a given discipline.

If one tries to apply the theory of branching processes to the citation clusters, a difficulty is immediately encountered, since the average number of


citations in a paper is much higher than one. The same might be true about the number of papers that cite a given paper. Thus, a direct application would lead only to trivial results, since the modeling processes would always be supercritical (that is, with mean offspring number m exceeding 1).

To use the polymer analogy, one therefore needs to reduce the phenomenon analyzed to a subcritical branching process, yet retain the basic structural properties of citation graphs.

Now, to bring the analysis to the realm of a subcritical branching process, one may simply take some random sample of citations, that is, include a citation in the sample only with a certain probability. This probability, say p, ought to be adjusted so that the mean number of citations analyzed per paper (the number of its "offspring" or "ancestors") is less than one.

The parameter p might be kept constant, or it may be varied in such a way that the mean number of citations used per paper approaches the critical value 1.

The restriction of citations may be obtained in several ways, which can also be combined. Thus, one may use some selection of semantic type, for instance consider only papers which have the word “NEW” in the title, that is, papers in which the author announces that he or she invented or discovered some new theory, fact, method, hypothesis, and so forth.

Secondly, the selection in the sampling of citations may be attained simply by considering only authors whose names start with a given letter or combination of letters. Of course, one has to be careful here: choosing, say, only authors whose names start with "AND" would result in a bias towards Scandinavians (the name Anderson being more frequent there than in other countries), while, say, choosing scientists with names starting with "SZ" would restrict the analysis exclusively to Poles and Hungarians. Also, a decision will have to be made regarding joint authorship.

At any rate, after choosing a selection principle, e.g., a combination of initial letters such that (1) it would lead to an unbiased choice, and also (2) reduce the average number of citations per paper to a value less than one, the empirical analysis will become manageable, owing to a great reduction of the material to be studied. The probability distribution of the number of "offspring" will be easily accessible, with p(k) being estimated as the frequency of papers which cite (or are cited in) exactly k papers of authors whose names start with the chosen letters.

The estimates of the probabilities p(k) will allow us to estimate the probability generating function f(s), and hence also the distribution of total size to extinction, or the distribution of the number of generations to extinction. The results of these computations will then be compared with empirical data on sizes, lengths, etc., of citation clusters.

As mentioned, the deviations from the predictions would provide information about the degree of "connectivity" of the discipline. Of more interest, however, is the possibility of being able to distinguish here various types, or phases, of a discipline, characterized in terms of the polymer analogy.

Of course, the value of this approach lies in the possibility of a meaningful


and interesting interpretation of the findings. Dropping the suggestive but somewhat facetious culinary terms, one can use the term "fibrous" if the clusters are long and thin (as in mayonnaise), or "lumpy" otherwise. The question then is: what does it mean that a given discipline is in the "fibrous" phase, as opposed to the "lumpy" phase?

As a first try, one may hypothesize that a discipline is in the "fibrous" phase if its development thus far was such that, generation after generation, there was always a leading scientist who would be cited often. Such a situation would tend to give "thin and long" citation clusters, that is, branching process populations which persist for many generations, but with rather small sizes.

On the other hand, disciplines that develop rather rapidly in many centres would tend to give "lumpy" clusters, having fewer, but larger, generations.

As hypothetical examples, one might expect that a discipline such as fuzzy set theory would be in the "lumpy" phase, while to look for a "fibrous" discipline one might try, say, a body of knowledge with a long tradition but not too popular, such as the study of comets or meteors.

C. An Example

To illustrate the suggested approach, suppose that a choice of initial letters is made, and that this choice satisfies the requirements specified in the preceding section. To fix the ideas, suppose that we are performing the (simpler) analysis, looking backward in time, that is, looking at the papers cited in a given paper. We restrict attention to scientists whose names start with some selected combination of letters; we shall refer to them as L-scientists. The average number of L-scientists cited by an L-scientist is m, and we know that m < 1 (this means that in practice one should disregard self-citations: by definition, a self-reference by an L-author is a citation of an L-author, and if we were to count it, we would very likely have m > 1).

Differentiating both sides of formula (11) and putting s = 1, we obtain an equation for M = G'(1), which is the average total size of the cluster starting from a paper by an L-scientist. This equation yields, under the assumption m < 1:

M = 1/(1 - m).  (14)

Now, empirical estimation of m, the average number of references to L-scientists per paper, presents no problem (a few hours of work in a good library, perhaps: if m is to be less than 1, then most papers will have no names of L-scientists at all in their lists of references!). This will give a predicted value of M, the average size of the total "cascade" originated by a paper by an L-scientist. Another couple of hours of work in a library would give an estimate of M (through recording the sizes of, say, 50 such clusters, and taking their average size).

Comparison of the estimated M and predicted M will give us an estimate of the measure of connectedness of the domain.
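A sketch of this comparison with made-up numbers (the arrays stand for data that would be collected in the library; none of these values come from the article):

```python
import numpy as np

# Hypothetical library data (purely illustrative values).
refs_per_paper = np.array([0, 1, 0, 0, 2, 0, 1, 0, 0, 1])    # L-citations per L-paper
observed_cluster_sizes = np.array([1, 3, 1, 2, 5, 1, 2, 1])  # traced cascade sizes

m_hat = refs_per_paper.mean()            # estimate of the mean offspring number m
M_pred = 1.0 / (1.0 - m_hat)             # predicted mean total size, Eq. (14)
M_obs = observed_cluster_sizes.mean()    # empirical mean cascade size

# An excess of observed over predicted size suggests "connectivity" of the domain.
print(m_hat, M_pred, M_obs, M_obs - M_pred)
```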


D. Statistical Tests

A natural question arises concerning the statistical significance of the obtained results. In other words, given the difference between the observed and predicted M, we would like to know whether this difference may be attributed solely to sampling variability (i.e., be insignificant). To solve this problem, one needs information about the sampling variability.

As regards m, we have E(X) = m = f'(1) and f''(1) = E[X(X - 1)], so that

Var(X) = f''(1) + f'(1)[1 - f'(1)],  (15)

where X is the random variable representing the number of L-citations in a paper. The data used to estimate m also provide estimates of the probabilities p(k), k = 0, 1, 2, . . ., of L-citations; hence we have an estimator of the generating function f and of the variance of X.

As regards the variance of the size Y of a citation cluster originated by an L-scientist, observe that we have

Var(Y) = G''(1) + M(1 - M),  (16)

where M = G'(1) = 1/(1 - m). To get G''(1) we differentiate formula (11) twice, and after some algebra we find Var(Y). Consequently, we may obtain reasonable estimates of both variances, and build procedures for testing the statistical significance of the empirical results.
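Continuing the illustrative sketch, the variances (15) and (16) can be estimated from the same counts; the z-statistic at the end is an assumed, approximate way of judging significance, not a procedure prescribed in the article:

```python
import numpy as np

def pgf_derivatives(counts):
    """Estimate f'(1) and f''(1) from observed L-citation counts."""
    counts = np.asarray(counts, dtype=float)
    f1 = counts.mean()                       # f'(1) = E[X]
    f2 = (counts * (counts - 1)).mean()      # f''(1) = E[X(X - 1)]
    return f1, f2

counts = np.array([0, 1, 0, 0, 2, 0, 1, 0, 0, 1])     # illustrative L-citation counts
cluster_sizes = np.array([1, 3, 1, 2, 5, 1, 2, 1])    # illustrative cascade sizes

f1, f2 = pgf_derivatives(counts)
var_X = f2 + f1 * (1 - f1)                   # Eq. (15)
M = 1.0 / (1.0 - f1)                         # predicted mean cluster size, Eq. (14)
G2 = 2 * f1 * M**2 + f2 * M**3               # G''(1), obtained by differentiating Eq. (11) twice
var_Y = G2 + M * (1 - M)                     # Eq. (16)

# Approximate z-statistic for the difference between observed and predicted M.
z = (cluster_sizes.mean() - M) / np.sqrt(var_Y / len(cluster_sizes))
print(var_X, var_Y, z)
```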

IV. ANALYSIS OF LARGE NETWORKS

A. A New Approach to Communication Networks

Let us remark first that the term “communication” need not be treated literally. Formally, the analysis will concern a binary relation, say C, in the set S of all scientists working on a given domain, with C representing some sort of a scientific contact. The symbol xCy may mean that y subscribes, or usually reads, the journal in which x published a paper; it may mean that y was in the audience of x’s lecture; it may also mean exchange of reprints between x and y, etc., as well as all of the above together.

It is assumed that a fixed scientist serves as a source of some new idea, result, fact, etc. and the information about it spreads along the edges of the graph of relation C. The main problem will be to study the proportion of persons in S whom the new idea eventually reaches, speed of the spread, and so on.

There are basically two approaches possible here. One (very much explored, as the problem is one of the oldest in sociometry, and already has a long research tradition) is to make some assumptions about the relation C, and then to deduce the properties of interest from these assumptions. Such an approach is reasonable for small groups. However, for a scientific community concerned with one discipline, the set S will typically consist of a number of individuals so large that enumerating the relation C will be impossible.


Thus, an alternative is to assume that the relation C is random and analyze the probabilities of various events, especially those which have probability close to 1 (such events are of primary interest because they allow us to infer that analogous events must occur in the real scientific communities, regardless of the specific form of the relation C ) .

Now, there is a variety of sampling schemes which produce random graphs, of which three deserve special attention.

Let us consider graphs of relation C which is irreflexive, that is, xCx does not hold for any x. In such cases (when one considers directed graphs, that is, such that xCy does not necessarily imply yCx) , for a graph with n nodes there are n(n - 1) possible edges. Various sampling schemes specify the probability of a given edge being actually in the graph, and interrelations between these events.

Under scheme 1, for each of the n(n - 1) pairs of distinct nodes, the probability that these nodes are connected is p, independently of other pairs. The scheme depends, therefore, on one parameter p, and the total number of edges is random, with a binomial distribution. The average number of edges is n(n - 1)p and the variance is n(n - 1)p(1 - p).

Under scheme 2, pairs of distinct nodes are sampled without replacement from the n(n - 1) possible pairs, and each sampled pair is connected with an edge. The sampling continues until k pairs are chosen, so that the number of edges is fixed and equals k.

Finally, under scheme 3, one chooses a function f mapping S into S such that f(x) is never equal to x. This may best be described by an urn scheme. Imagine an urn with n balls, labeled 1, 2, . . . , n. In the kth sampling one first removes the ball labeled k from the urn, and then samples one ball. Let the result be x(k). Then the graph contains an edge leading from k to x(k), and the sampled function is defined by f(k) = x(k).
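A sketch of the three sampling schemes as generators of directed edge sets (the representation and helper names are illustrative choices, with nodes labeled 0, . . . , n - 1):

```python
import numpy as np

rng = np.random.default_rng(2)

def scheme1(n, p):
    """Scheme 1: each of the n(n-1) ordered pairs becomes an edge with probability p."""
    return {(i, j) for i in range(n) for j in range(n)
            if i != j and rng.random() < p}

def scheme2(n, k):
    """Scheme 2: exactly k distinct ordered pairs, sampled without replacement."""
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    chosen = rng.choice(len(pairs), size=k, replace=False)
    return {pairs[i] for i in chosen}

def scheme3(n):
    """Scheme 3 (urn scheme): a random function f with f(x) != x,
    i.e., exactly one outgoing edge per node."""
    return {(i, int(rng.choice([j for j in range(n) if j != i])))
            for i in range(n)}
```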

To formulate the conjectures, let T denote the transitive extension of C. This means that xTy if either xCy or there are nodes z, z', . . . , z'' such that xCz, zCz', . . . , z''Cy (in other words, one can reach y starting from x and going along edges of relation C).

Let us now fix x and put

A(x) = {y : xTy}  (17)

and

B(x) = {y : yTx},  (18)

so that A(x) is the set of all nodes which can be reached from x, and B(x) is the set of all nodes from which one can reach x.

Finally, for a set of nodes D let

A(D) = ∪_{x∈D} A(x),   B(D) = ∪_{x∈D} B(x)  (19)

be the sets of points which may be reached from any of the nodes in D and from which D may be reached.

Now, if the relation C is random so are the sets A(D) and B(D). One of the


interesting questions is: what can one say about the probable sizes of these sets under each of the considered sampling schemes? In particular, what is the probability that the size of A(D) or B(D) will be close to n (that is, almost everyone will be reached)?
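For any sampled edge set, A(D) and B(D) are just forward and backward reachability, computable by breadth-first search; a minimal sketch (the seed nodes themselves are included in the returned set, a simplification that changes the sizes by at most |D|):

```python
from collections import defaultdict, deque

def reachable(edges, seeds, forward=True):
    """A(D) for forward=True, B(D) otherwise: all nodes reachable from (or
    reaching) the seed set D along directed edges; the seeds are included."""
    adj = defaultdict(list)
    for u, v in edges:
        if forward:
            adj[u].append(v)
        else:
            adj[v].append(u)
    seen, queue = set(seeds), deque(seeds)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen
```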

In the sequel, m will stand for the size of the "seeding" set D. As before, n will denote the size of S, p is the probability of an edge in scheme 1, and q = k/n is the average number of edges leaving a node in scheme 2.

Firstly, one may conjecture that in schemes 1 and 2 there exist thresholds in the density of edges. To formulate this conjecture, for simplicity for sets A(D) only, consider the probability

P(|A(D)| > cn)  (20)

that the size of the "infected" set A(D) will exceed the fraction c of the whole set of n nodes (here c is a number such that 0 < c < 1). One has to keep in mind that probability (20) depends on m, n, and also on either p or q, depending on the chosen sampling scheme.

We may formulate the following hypothesis.

Hypothesis A. For sufficiently large n, fixed m and c, there exists a threshold in p (resp. q); that is, probability (20) is close to 0 if p (resp. q) lies below this threshold, and close to 1 if it is above it.

To put it differently, let us state this conjecture in terms of the spread of a new idea (gossip, etc.) across the population. It is postulated that we have an "all-or-nothing" effect: if (say) the average number of edges per node q is small (lies below a threshold), only negligibly few persons will learn about the new idea. On the other hand, if q is above the threshold, the news will reach practically everyone. There is no chance (or only a negligibly small chance) that the news will reach a substantial number of persons while a substantial number also remains unaware of it.

Further, it may be conjectured that schemes 1 and 2 are essentially the same for larger n. This is due to the fact that for larger n the average number of edges in scheme 1 will be close to the corresponding number of edges in scheme 2, that is, nq edges. This yields the approximate relation pn(n - 1) = nq, hence approximately p = q/n, and one may state the following hypothesis.

Hypothesis B. As n increases, the probabilities of events in scheme 1 for n, m, and p = q/n should be approximately equal to the probabilities of the same events in scheme 2 for n, m, and q.

In a similar way one may conjecture that there exists a threshold in the size m of the “seed” set D. Accordingly, we have:

Hypothesis C. For sufficiently large n, fixed p (resp. q) and c, probability (20) is close to 0 if m lies below a certain threshold, and close to 1 if m lies above it.
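These threshold conjectures lend themselves to direct Monte Carlo checking; the self-contained sketch below (all parameter values are illustrative) estimates probability (20) under scheme 1 for a grid of edge probabilities p:

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(3)

def prob_spread(n=200, m=3, c=0.5, p=0.01, trials=200):
    """Monte Carlo estimate of P(|A(D)| > c*n) under scheme 1 with |D| = m."""
    hits = 0
    for _ in range(trials):
        adj = rng.random((n, n)) < p            # scheme 1: independent directed edges
        np.fill_diagonal(adj, False)
        seeds = [int(s) for s in rng.choice(n, size=m, replace=False)]
        seen, queue = set(seeds), deque(seeds)
        while queue:                            # forward reachability A(D)
            u = queue.popleft()
            for v in map(int, np.flatnonzero(adj[u])):
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        hits += len(seen) > c * n
    return hits / trials

for p in [0.002, 0.005, 0.01, 0.02]:
    print(p, prob_spread(p=p))   # a sharp 0-to-1 rise with p would support Hypothesis A
```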

It appears that scheme 3 is most interesting. As regards sets A(D) one may conjecture the following.

Hypothesis D. Under scheme 3, probability (20) is close to 1 for sufficiently large n, as long as m > 0.


This hypothesis means that the spread will likely be the same whether it starts from a single source or from multiple sources.

Regarding sets B(D) for scheme 3, we have:

Hypothesis E. For all sufficiently large n and all 0 < c < c' < 1, the probability

P(cn < |B(D)| < c'n)  (21)

is close to zero.

This hypothesis asserts that the distribution of the size of B(D) is U-shaped: it is unlikely that the size of this set will form a fraction of the whole society which is neither close to 0 nor close to 1. This is due to the fact that there is a positive probability that each of the sets B(x) will be empty, and there is also a positive probability that there will be several edges leading to x.

If these hypotheses were true, they would have some very interesting social implications concerning the optimal choice of the density of the network, in the case when it is reasonable to assume that the network arises from either scheme 1 or 2, and also the optimal choice of the size of the group which is to spread new ideas effectively across the community.

The suggested approach also offers numerous other possibilities. To mention just a few of them, suppose that one wants to study resistance to new ideas, results, etc. before passing them further. Such a scheme may be obtained simply by assuming that information is accepted by a person only if it is received from at least r sources, so that r is a measure of resistance. Thus, the spread proceeds as before, except that information leaves a node only if there are at least r edges leading to it, each originating from a node with the same property, and so on. Here one may also expect some threshold effects under the change of the resistance level r.

One can also try to use this approach to study the effects of valuation on the spread of scientific information. The central idea here would consist of assuming some kind of filtering effect.

Such a "filter" may generally be described in terms of the probability of a specific change of the information arriving at a node. To present it formally, assume that the graph has already been sampled. Then every node is characterized by a pair (m, k), where m is the number of edges leading to the node, and k is the number of edges leading from it. Clearly, we are interested only in cases when both m and k are positive, since nodes of the type (0, k) never receive any information, while a node of the type (m, 0) never sends any.

To observe the effects of valuation, assume that information is transmitted at times 0, 1, 2, . . . : information which reaches a node at time t leaves it at time t + 1 (possibly transformed).

Let I stand for the class of all possible information items that may be transmitted along the edges (including the no-information message 0), and let I(n) be the n-fold Cartesian product of I with itself. Then any (m, k) node receives an m-dimensional message X, being an element of I(m), with one component received along each of the m edges, and sends out a k-dimensional message Y, being an element of I(k).



The most general description of the operation of a node, which covers both the deterministic and the stochastic case, is that an (m, k) node is represented by a family of conditional probability distributions P(Y|X) of sending out Y when X was received, with X being in I(m) and Y being in I(k).

One can now characterize various types of nodes. Firstly, if P(0|0) = 1 (with 0 interpreted appropriately as a vector with components 0), then the node is nongenerative: it sends no message if it does not receive any. Similarly, a "black-hole" node is one that never sends any message, i.e., P(0|X) = 1 for any X.

The specification of the behavior of an (m, k) node requires assigning a probability distribution over the set of all k-vectors to each m-vector. Unless there are very few nodes, such a specification is, in general, unmanageable, and it is therefore important to introduce some simplifications.

Let g be any permutation of the coordinates of the vector X, and let gX be the vector X with its coordinates permuted according to g. We say that a node is reception nondiscriminating if P(Y|X) = P(Y|gX). On the other hand, it is transmission nondiscriminating if P(Y|X) = 0 unless all coordinates of Y are equal. A node is nondiscriminating if it is nondiscriminating with regard to both reception and transmission. Thus, a nondiscriminating node is one that sends the same message along all edges, and this message depends on what was received only through the total counts of the values of the coordinates. In different words, the probability of "reaction" is the same for two "stimuli" if they differ only by the order of coordinates.
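As an illustration (the transmission rule is invented, not taken from the article), a nondiscriminating (m, k) node can be coded as a map that looks only at the counts of the received values and repeats a single output value on all k outgoing edges:

```python
import random
from collections import Counter

random.seed(0)

def nondiscriminating_node(received, k):
    """Sketch of a nondiscriminating (m, k) node: the response depends on the
    received m-vector only through the counts of its values (reception), and
    the same value is sent along all k outgoing edges (transmission).
    0 denotes the no-information message; the rule itself is invented."""
    counts = Counter(received)
    if counts.get(0, 0) == len(received):        # nongenerative: nothing in, nothing out
        return tuple(0 for _ in range(k))
    top = max((v for v in counts if v != 0), key=lambda v: counts[v])
    out = top if random.random() < 0.9 else 0    # usually resend the most frequent item
    return tuple(out for _ in range(k))

print(nondiscriminating_node((1, 0, 2, 1), k=3))
```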

To formulate the conjecture about information flow, assume that at time t = 0 a message was sent from a fixed node of a network generated according to one of the schemes 1-3. Let Z(t) be the set of all messages transmitted at time t, and let Z(t, i) be the number of elements of the set Z(t) which are equal to i (so that Z(t, i) is simply the number of times message i was transmitted at time t).

Hypothesis F. Assume that all nodes are nondiscriminating and no node is a black hole. Then, under any of the sampling schemes 1-3, for all sufficiently large n, if the size of Z(t) does not become 0, the ratios |Z(t, i)|/|Z(t)| tend to some limits.

In other words, it is conjectured that if the network is large enough in the sense of number of nodes and edges, the signal may spread to a sufficiently large set of nodes to persist, and in this case, the frequency of various types of signals will stabilize.

It appears that the above type of behavior of random graphs constitutes a novel and fruitful line of research, especially convenient for simulation. The flexibility of assumptions allows us to model here a variety of situations.

B. Networks with Randomly Varying Nodes and Edges

A possible alternative approach to random graphs is as follows. For most sociological research, the connections in the social networks are relatively stable. However, it is of great interest to study also the dynamics of networks, allowing for the possibility that the relation C is of transient nature. Moreover,


one may assume that not only edges, but also nodes are random, in the sense that they may appear or disappear, and hence last for some random time.

Given a graph, it will now be convenient to represent it as a matrix, with entries a(i, j ) , with a(i, j) being 1 or 0, depending on whether or not the nodes i and j are connected (that is, are in relation C). Now, at any time t the following events may occur:

-event A(i, j ) defined as the disappearance of the edge from i to j (if such an edge exists),

-event B(i, j) defined as the appearance of the edge from i to j (if such an edge does not exist),

-event C(i) defined as disappearance of ith node (if such a node exists), implying also disappearance of all edges leading to and from this node,

-event D, defined as the appearance of a new node (which is then assigned an available label i).

There may be a technical problem with labeling the nodes, but this problem is important only for simulation. One may assume that when a node disappears, all other nodes and edges are relabeled, so that the numbering starts always from 1 and runs consecutively. As a consequence, there may be several nodes, all with the same label, existing at various times.

Let S(t) stand for the set of all nodes existing at time t, and let N(t) be the size of S(t). Furthermore, let R(i, t, +) and R(i, t, -) be the sets of nodes in S(t) which are connected to node i by an edge leading from it, and by an edge leading to it, respectively.

Assumption 1. The process of changes of the network is a Markov process: given the state at time t, the probabilities of various events at times following t do not depend on the history of the process at times prior to t.

Assumption 2. If the event C(i) occurs at time t, then all events A(i, j) for j in R(i, t, +) and all events A(j, i) for all j in R(i, t, -) also occur.

This means that a disappearance of a node implies also the disappearance of all edges leading to and from this node.

Assumption 3. The intensity of event D (new node) is θ.

An alternative to this assumption is

Assumption 3'. The intensity of event D is λN(t).

Under Assumption 3, the "arrival rate" of new nodes is constant, while

under Assumption 3’ it is proportional to the actual number of nodes (so that in this case it is more appropriate to refer to it as “birth rate”).

The situation in which the first of these assumptions may be reasonable is when we consider some special networks, for instance a classroom as it evolves during the school year. Here the new nodes would be new pupils, as they come to the class when their parents move to the neighborhood. In scientific applications, such a network may occur in a relatively stable department, with few new appointments.

On the other hand, Assumption 3' may be more adequate when we consider, say, networks of infections in an epidemic, where the arrivals of new nodes (infectives) may depend on the total size of the epidemic.


Assumption 4. The intensity of occurrence of events C(i) is μN(t).

Thus, the "death rate" for nodes is proportional to the number of existing

nodes. To formulate a class of assumptions concerning the remaining types of

events, namely the appearance or disappearance of an edge, assume, for instance, that a(i, j) = 0, so that there is no edge from i to j. Then the probability that such an edge will appear is proportional to some function of five arguments: the sizes of the sets R(i, t, +), R(i, t, -), R(j, t, +), and R(j, t, -), and the value of a(j, i). This is a natural assumption: the sizes of the first four sets describe the "popularity" and "expansiveness" of the two nodes in question, while a(j, i) provides information on whether the reciprocal edge, from j to i, exists or not.

The simplest part of the analysis, which may be carried out without simulation, concerns the process N(t), namely the total number of nodes.

Under Assumption 3 of constant arrival rate, this is simply a process of death and immigration, while under Assumption 3' it is a linear birth and death process [13, 14].

The objective is to find the probabilities P_n(t) = P{N(t) = n}, or perhaps only the stationary distribution (if it exists), that is, the limits p(n) of the probabilities P_n(t) as time t grows to infinity. Under Assumption 3, of constant arrival rate, the stationary distribution exists, and is given by

p(n) = (λ^n / n!) exp(-λ),   n = 0, 1, 2, . . .  (22)

where λ = θ/μ, so that the distribution is Poisson with mean λ.

Under Assumption 3' the situation is different. First of all, the population of nodes will die out with probability one if λ < μ or if λ = μ. If λ > μ, the set of nodes either dies out or expands to infinity. Thus, no stationary distribution exists.

The expected number of nodes grows exponentially, and we have an interesting question: what will happen to the network in such a case? If the process of nodes grows very fast, then perhaps the edges will be born so slowly that the graph will have a relatively small number of edges, hence fewer and fewer connections relative to its size, not enough to sustain information flow.
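The node-count process N(t) alone is easy to simulate; the sketch below (rates and horizon are illustrative) covers both assumptions and, under Assumption 3, can be checked against the Poisson stationary distribution (22) with mean λ = θ/μ:

```python
import numpy as np

rng = np.random.default_rng(4)

def simulate_N(theta=2.0, mu=0.5, t_max=200.0, proportional_births=False, n0=0):
    """Simulate the node-count process N(t): arrivals at rate theta (Assumption 3)
    or theta*N (Assumption 3'), node deaths at rate mu*N (Assumption 4)."""
    t, n, history = 0.0, n0, []
    while t < t_max:
        birth = theta * n if proportional_births else theta
        death = mu * n
        total = birth + death
        if total == 0:                       # absorbed at N = 0 under Assumption 3'
            break
        t += rng.exponential(1.0 / total)
        n += 1 if rng.random() < birth / total else -1
        history.append(n)
    return np.array(history)

ns = simulate_N()                            # Assumption 3: immigration-death process
# Rough (event-sampled) check: the long-run mean should be near lambda = theta/mu = 4.
print(ns[len(ns) // 2:].mean())
```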

Finally, let us sketch some further possibilities of research. Thus, one could try to connect the present approach to networks with the

notion of filters from the preceding section, describing a specific way in which a node operates. Such an approach, the most comprehensive so far, meets with very serious difficulties, at least in the following two aspects.

Firstly, to describe the operation of a filter, one needs to specify the probability distribution of the "response" given the "input." Since now a node may have randomly changing numbers of edges leading to and from it, a description would involve not just one, but a whole family of probability distributions.

Secondly, even for relatively small networks, such a description would be too complicated for simulations. Nevertheless, it is of conceptual value, allowing us to isolate the sources of complexity inherent in communication networks. In addition, it allows us to formulate the problems of control of the flow of information.



Here the control consists of selecting a set of nodes, which have the power to influence or modify the performance of other nodes. Of course, to formulate the control problem, one has to formulate the ways the control nodes may affect the performance of other nodes, and also the criteria to be optimized. Such criteria may be the qualitative requirements that information should flow at a certain minimal rate, or reach the destination with minimum distortion.

This approach constitutes a change in graph theory, by introducing controlled graphs. The theory may have applications in biology, e.g., in brain research, cancer research, etc., as well as in computer science, artificial intelligence, etc.

Also, in theoretical research on random graphs, this approach might initiate new directions, allowing for more adequate descriptions of the modeled phenomena.

References

1. M. Nowakowska, Theories of Research: Modeling Approaches, Intersystems Publ., Seaside, CA, 1984.
2. L.L. McQuitty, Pattern-Analytical Clustering: Theory, Methods, Research and Configural Findings, University Press of America, New York, 1987.
3. M.S. Aldenderfer and R.K. Blashfield, Cluster Analysis, Sage Univ. Papers, Beverly Hills, CA, 1984.
4. H. Ch. Romesburg, Cluster Analysis for Researchers, Lifetime Learning Publ., Belmont, CA, 1983.
5. M. Lorr, Cluster Analysis for Social Scientists, Jossey-Bass, San Francisco, CA, 1983.
6. B. Everitt, Cluster Analysis, Halsted Press, New York, 1980.
7. J. Zupan, Clustering of Large Data Sets, Research Studies Press, New York, 1982.
8. J.E. Mezzich and H. Solomon, Taxonomy and Behavioral Sciences: Comparative Performance of Grouping Methods, Academic Press, New York, 1980.
9. P. Arabie, J.D. Carroll, and W.S. DeSarbo, Three-Way Scaling and Clustering, Sage Univ. Papers, Newbury Park, CA, 1987.
10. J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
11. J.E. Shore and R.M. Gray, Minimum Cross-Entropy Pattern Classification and Cluster Analysis, Naval Research Lab., Washington, DC, 1980.
12. G.J. McLachlan and K.E. Basford, Mixture Models: Inference and Applications to Clustering, Marcel Dekker, New York, 1988.
13. S. Karlin and H. Taylor, First Course in Stochastic Processes, Academic Press, New York, 1975.
14. W. Feller, An Introduction to Probability Theory and Its Applications, Vol. 1, Wiley, New York, 1957.