12
518 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001 Visualizing Latent Domain Knowledge Chaomei Chen, Jasna Kuljis, and Ray J. Paul Abstract—Knowledge discovery and data mining commonly rely on finding salient patterns of association from a vast amount of data. Traditional citation analysis of scientific literature draws insights from strong citation patterns. Latent domain knowledge, in contrast to the mainstream domain knowledge, often consists of highly relevant but relatively infrequently cited scientific works. Visualizing latent domain knowledge presents a significant chal- lenge to knowledge discovery and quantitative studies of science. In this paper, we build upon a citation-based knowledge visualiza- tion procedure and develop an approach that not only captures knowledge structures from prominent and highly cited works, but also traces latent domain knowledge through low-frequency citation chains. We apply this approach to two cases: 1) identifying cross-domain applications of Pathfinder networks (PFNETs) and 2) clarifying the current status of scientific inquiry of a possible link between Bovine spongiform encephalopathy (BSE), also known as mad cow disease, and a new variant Creutzfeldt–Jakob disease (vCJD), a type of brain disease in human. Index Terms—Citation chains, knowledge discovery, knowledge domain visualization (KDViz), latent domain knowledge. I. INTRODUCTION K NOWLEDGE tracking and technology monitoring become increasingly important for knowledge man- agement. The rapid advances of information visualization in recent years have highlighted its great potential in knowledge discovery and data mining [1]–[3]. Knowledge discovery and data mining commonly rely on finding salient patterns of association from a vast amount of data. Traditional citation analysis of scientific literature draws new insights from strong citation patterns. Such citation patterns typically reflect the view of the so-called “cream of the crop” in a knowledge domain, such as eminent scientists and their highly cited classic works. Latent domain knowledge, in contrast to the mainstream domain knowledge, often consists of highly rel- evant but relatively infrequently cited scientific works. Strong citation patterns become less common for latent domain knowl- edge: They could be completely missing or overwhelmed by the body of the mainstream domain knowledge. Visualizing latent domain knowledge presents a significant challenge to knowl- edge discovery and quantitative studies of science. There may be many reasons why a particular line of re- search may fall outside the body of the mainstream domain knowledge and become latent to a knowledge domain. In a cross-disciplinary research program, researchers face an Manuscript received June 1, 2001; revised October 1, 2001. C. Chen is with the College of Information Science and Tech- nology, Drexel University, Philadelphia, PA 19104-2875 USA (e-mail: [email protected]). J. Kuljis and R. J. Paul are with the Department of Information Systems and Computing, Brunel University, Uxbridge, U.K. (e-mail: [email protected]; [email protected]). Publisher Item Identifier S 1094-6977(01)11263-0. Fig. 1. Mainstream domain knowledge is typically high in both relevance and citation, whereas latent domain knowledge can be characterized as HR and LC. Fig. 2. Strategy of visualizing latent domain knowledge. The global context is derived from co-citation networks of highly cited works. An “exit” landmark is chosen from the global context to serve as the seeding paper in the process of domain expansion. The expanded domain consists of papers connecting to the seeding paper by citation chains of no more than two citation links. Latent domain knowledge is represented through a citation network of these papers. entirely unfamiliar scientific discipline. Tracking the latest de- velopment into a different discipline can be rather challenging. One example of such problems is the cross-disciplinary use of Pathfinder networks (PFNETs), a structural and procedural modeling method developed by cognitive psychologists in the 1980s [4], [5]. Pathfinder is a generic tool that has been adapted by several fields of study, including some quite different adap- tations from its original cognitive applications. For example, we have adapted PFNET scaling as an integral component of our generic structuring and visualization framework [1], [2], [6]. It is a challenging task to track down how applications of PFNETs have evolved over the past two decades across a number of apparently unconnected disciplines. Another type of latent domain knowledge can be explained in terms of scientific paradigms. Kuhn [7] described the devel- opment of science as interleaved phrases of normal science and 1094–6977/01$10.00 © 2001 IEEE

Visualizing latent domain knowledge - Systems, Man, and

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Visualizing latent domain knowledge - Systems, Man, and

518 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001

Visualizing Latent Domain KnowledgeChaomei Chen, Jasna Kuljis, and Ray J. Paul

Abstract—Knowledge discovery and data mining commonlyrely on finding salient patterns of association from a vast amountof data. Traditional citation analysis of scientific literature drawsinsights from strong citation patterns. Latent domain knowledge,in contrast to the mainstream domain knowledge, often consists ofhighly relevant but relatively infrequently cited scientific works.Visualizing latent domain knowledge presents a significant chal-lenge to knowledge discovery and quantitative studies of science.In this paper, we build upon a citation-based knowledge visualiza-tion procedure and develop an approach that not only capturesknowledge structures from prominent and highly cited works,but also traces latent domain knowledge through low-frequencycitation chains. We apply this approach to two cases: 1) identifyingcross-domain applications of Pathfinder networks (PFNETs) and2) clarifying the current status of scientific inquiry of a possiblelink between Bovine spongiform encephalopathy (BSE), alsoknown as mad cow disease, and a new variant Creutzfeldt–Jakobdisease (vCJD), a type of brain disease in human.

Index Terms—Citation chains, knowledge discovery, knowledgedomain visualization (KDViz), latent domain knowledge.

I. INTRODUCTION

K NOWLEDGE tracking and technology monitoringbecome increasingly important for knowledge man-

agement. The rapid advances of information visualization inrecent years have highlighted its great potential in knowledgediscovery and data mining [1]–[3].

Knowledge discovery and data mining commonly rely onfinding salient patterns of association from a vast amount ofdata. Traditional citation analysis of scientific literature drawsnew insights from strong citation patterns. Such citation patternstypically reflect the view of the so-called “cream of the crop” ina knowledge domain, such as eminent scientists and their highlycited classic works. Latent domain knowledge, in contrast tothe mainstream domain knowledge, often consists of highly rel-evant but relatively infrequently cited scientific works. Strongcitation patterns become less common for latent domain knowl-edge: They could be completely missing or overwhelmed by thebody of the mainstream domain knowledge. Visualizing latentdomain knowledge presents a significant challenge to knowl-edge discovery and quantitative studies of science.

There may be many reasons why a particular line of re-search may fall outside the body of the mainstream domainknowledge and become latent to a knowledge domain. Ina cross-disciplinary research program, researchers face an

Manuscript received June 1, 2001; revised October 1, 2001.C. Chen is with the College of Information Science and Tech-

nology, Drexel University, Philadelphia, PA 19104-2875 USA (e-mail:[email protected]).

J. Kuljis and R. J. Paul are with the Department of InformationSystems and Computing, Brunel University, Uxbridge, U.K. (e-mail:[email protected]; [email protected]).

Publisher Item Identifier S 1094-6977(01)11263-0.

Fig. 1. Mainstream domain knowledge is typically high in both relevance andcitation, whereas latent domain knowledge can be characterized as HR and LC.

Fig. 2. Strategy of visualizing latent domain knowledge. The global contextis derived from co-citation networks of highly cited works. An “exit” landmarkis chosen from the global context to serve as the seeding paper in the processof domain expansion. The expanded domain consists of papers connecting tothe seeding paper by citation chains of no more than two citation links. Latentdomain knowledge is represented through a citation network of these papers.

entirely unfamiliar scientific discipline. Tracking the latest de-velopment into a different discipline can be rather challenging.One example of such problems is the cross-disciplinary useof Pathfinder networks (PFNETs), a structural and proceduralmodeling method developed by cognitive psychologists in the1980s [4], [5]. Pathfinder is a generic tool that has been adaptedby several fields of study, including some quite different adap-tations from its original cognitive applications. For example,we have adapted PFNET scaling as an integral component ofour generic structuring and visualization framework [1], [2],[6]. It is a challenging task to track down how applicationsof PFNETs have evolved over the past two decades across anumber of apparently unconnected disciplines.

Another type of latent domain knowledge can be explainedin terms of scientific paradigms. Kuhn [7] described the devel-opment of science as interleaved phrases of normal science and

1094–6977/01$10.00 © 2001 IEEE

Page 2: Visualizing latent domain knowledge - Systems, Man, and

CHEN et al.: VISUALIZING LATENT DOMAIN KNOWLEDGE 519

Fig. 3. Spiral methodology of domain expansion and visualization.

scientific revolutions. A period of normal science is typicallymarked by the dominance of an established framework. Thefoundations of such frameworks largely remain unchallengeduntil new discoveries begin to cast doubts over fundamentalissues—science falls into a period of crises. To resolve suchcrises, radically new theories are introduced. New theories re-place with greater explanatory power the ones in trouble in arevolutionary manner. Science regains another period of normalscience.

Kuhn suggested that a paradigm shift in science should leadto a corresponding change of citation patterns in scientific litera-ture; therefore, the study of such patterns may provide indicatorsof the development of a scientific paradigm. Indeed, a numberof researchers pursued this line of research since 1970s. For ex-ample, Small [8] studied the movement of highly cited publi-cations on the topic of collagen as a means of tracking majorparadigm shifts in this particular field. White and McCain [9]used INDSCAL to depict changes in author co-citation mapsover consecutive periods . We have started to investigate howinformation visualization can help us characterize the dynamicsof scientific paradigms [10]. In particular, our focus is on con-temporary puzzle-solving topics in science and medicine: Whatcaused dinosaurs’ mass extinction? Are Bovine spongiform en-cephalopathy (BSE) and the new variant Creutzfeldt–Jakob dis-ease (vCJD) connected? What powers active galactic centers,super-massive black holes, or something else?

In this paper, we introduce an approach to visualizing latentdomain knowledge. We demonstrate how one can accommodatelatent domain knowledge and the mainstream domain knowl-edge within the same visualization framework. This paper in-cludes two case studies: 1) PFNET applications and 2) theo-

Fig. 4. Overview of the mainstream domain knowledge.

ries of BSE, commonly known as mad cow disease. The restof the paper is organized as follows. First, we outline existingwork, including citation analysis, knowledge discovery, and ex-amples. We then extend our domain visualization approach tovisualize latent domain knowledge. We apply this approach to

Page 3: Visualizing latent domain knowledge - Systems, Man, and

520 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001

Fig. 5. Global context of the Pathfinder case. Applications of PFNETs are found in a broader context of knowledge management technologies, such as knowledgeacquisition, knowledge discovery, and artificial intelligence. A majority of PFNET users are cognitive psychologists.

two cases in which visualizing latent domain knowledge is in-volved: 1) tracing applications of PFNETs and 2) connectinga controversial theory of BSE, mad cow disease, to the main-stream intellectual structure of research in BSE.

II. K NOWLEDGE DISCOVERY

The advances of information visualization have revivedthe interest in a number of challenging issues concerningknowledge tracking. We contrast two strands of research in thispaper: 1) the citation-based paradigm of knowledge discoveryand 2) the undiscovered public knowledge approach. The keyprerequisite for the citation-based paradigm is a target scientificliterature that is rich in citations, whereas the undiscoveredpublic knowledge deals with exactly the opposite situationwhen citation links are missing or are considerably rare. Asynergy of the two would lead to a more powerful tool tofacilitate knowledge discovery and knowledge management ingeneral.

A. Citation-Centric Paradigm for Knowledge Visualization

Pioneering examples of using citation data to map the struc-ture of science include the creation of the historical map of re-search in DNA [11], mapping scientific networks [12], and astudy of Nobel Prize winners’ citation profiles [13].

Since the 1970s, information scientists began to look for waysto reveal patterns and trends in science through studies of sci-entific literature. In 1974, Small and Griffith [14] addressed is-sues concerning mapping specialties based on co-citation pat-terns . In 1977, Small conducted a longitudinal study of col-lagen research and showed that some rapid changes of focustook place in the research [8]. He used data from the science ci-tation indexing (SCI) to compute co-citation strengths between

pairs of documents and subsequently clustered documents toidentify leading specialties, or paradigms. He then used multidi-mensional scaling (MDS) to map highly cited papers each yearin clusters on a two-dimensional (2-D) plane. An abrupt disap-pearance of a few key documents in the leading cluster in oneyear and a rapidly increased number of documents in the leadingcluster in the following year indicate an important type of spe-cialty change in terms of rapid shift in research focus, which isan indicator of “revolutionary” changes. Garfield [15] discussedlongitudinal mapping as a series of chronologically sequentialmaps to depict how knowledge advances.

White and Griffith [16] introduced author co-citation analysis(ACA) in 1981 as a literature measure of intellectual structure .In ACA, the unit of analysis is authors and their intellectual rela-tionships as reflected through scientific literature. McCain [17]produced a comprehensive technical review of mapping authorsin intellectual spaces. In 1998, White and McCain applied ACAon information science [9].

Intellectual groupings of highly cited authors or their papersrepresent the underlying knowledge domain. We have devel-oped a four-step domain visualization procedure to support bothACA and document co-citation analysis (DCA) [2]. Co-cita-tion patterns are typically represented by a co-citation matrixbased on citation data drawn from citation databases such asthe SCI. There are a range of valid entities for co-citation anal-ysis, e.g., author–author, document–document, journal–journal,descriptor–descriptor, and even country–country. A co-citationmatrix of more than 100 entities can be rather complex. Re-searchers routinely use clustering algorithms to divide such ma-trices into smaller ones. Matrix analysis techniques such as sin-gular value decomposition (SVD) and eigenvalue decomposi-tion are also available for this divide-and-conquer purpose. We

Page 4: Visualizing latent domain knowledge - Systems, Man, and

CHEN et al.: VISUALIZING LATENT DOMAIN KNOWLEDGE 521

incorporate PFNET scaling into our approach. The strength ofPFNET scaling is that it can simplify a complex co-citation net-work but also keep the salient structure intact.

In co-citation analysis, factor analysis is often used to reducethe dimensionality of co-citation data. Each identified factorrepresents a specialty in a given knowledge domain. Therefore,one can break a knowledge domain down to several specialtiesin such a way. Normally, we are interested in the most signifi-cant factors identified by factor analysis in terms of their expla-nation power. In contrast, co-citation analysis often uses MDSto project a high-dimensional data space to a 2-D or three-di-mensional (3-D) one so that inter-entity relationships can bedepicted in a 2-D or 3-D graphical representation. However, ithad not been a common practice in co-citation analysis to con-solidate the results of both factor analysis and MDS configura-tion into the same graphical representation. Our domain visual-ization procedure renders the distribution of specialties acrossa co-citation landscape. In addition to structural connectivity,the membership of a specialty is color-coded. Entities in similarcolors are likely to belong to the same specialty. The advantageof this approach is that the structural connectivity and the spe-cial membership color can reinforce each other and make thevisualization easy to understand. Finally, the magnitude of im-pact of a particular entity is depicted through the height of itscitation bar, showing a stack of color-coded annual citation sec-tions.

B. Undiscovered Public Knowledge—The Missing LinkParadigm

Swanson describes three aspects of the context and nature ofknowledge fragmentation [18].

1) There is an enormous and constantly growing gap be-tween the entire body of recorded knowledge and the lim-ited human capacity to make sense of it.

2) Inadequate cross-specialty communication causes knowl-edge fragmentation. In response to the information explo-sion, specialties are increasingly divided into more andmore narrowly focused subspecialties.

3) One specialty might not be aware of potentially valuableinformation in another specialty. Two specialized litera-tures may be isolated in terms of explicit citation links,but they may have implicit, latent connections at the textlevel.

Swanson has been pursuing his paradigm since 1986 when hefound two sizeable biomedical literatures: one is on the circula-tory effects of dietary fish oil and the other is on the peripheralcirculatory disorder, Raynaud’s disease. Swanson noticed thatthese two literatures were not bibliographically-related, i.e., noone from one camp cited works in the other [19]. On the otherhand, he was pondering the question that apparently no one hadasked before: Is there a connection between dietary fish oil andRaynaud’s disease?

Prior to Swanson’s research, no medical researcher had no-ticed this connection, and the indexing of these two literatureswas unlikely to facilitate the discovery of any such connections.Swanson’s approach can be represented in a generic form. Giventwo premises that causes ( ) and that causes

TABLE ILEADING ARTICLES IN THE THREE LARGESTSPECIALTIES RANKED BY THE

STRENGTH OFFACTOR LOADING. ABSOLUTE VALUES LESSTHAN 0.500ARE

SUPPRESSEDFROM THE TABLE. FACTORSF1, F2,AND F3 DEFINE THREE

SPECIALTIES. THE “EXIT” L ANDMARK BELONGS TO THEFIRST SPECIALTY

( ), the question to ask is whethercauses ( ).If the answer is positive, the causal relation has the transitiveproperty. In the biological world, such transitive properties maynot always be there. Therefore, scientists must explicitly es-

Page 5: Visualizing latent domain knowledge - Systems, Man, and

522 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001

Fig. 6. Citation map showing that the most prolific themes of PFNET applications include measuring the structure of expertise, eliciting knowledge,measuringthe organization of memory, and comparing mental models. No threshold is imposed.

tablish such transitivity relationships. Swanson [18] suggeststhat once information scientists identify such possibilities, theyshould recommend domain experts to validate them.

Swanson’s approach focuses on the discovery of such hy-potheses from the vast amount of implicit, or latent, connec-tions. Swanson and Smalheiser [20] defined the concept ofnon-interactive literature. If two literatures have never been citedtogether at a notable level, they are noninteractive, i.e., scien-tists have not considered both of them together. In the past 15years, Swanson identified several missing links of the same pat-tern, notably migraine and magnesium [21] and arginine and so-matomedin [22]. Since 1994, the collaboration between neu-rologist Smalheiser and Swanson led to a few more such cases[23], [24]. They also made their softwareArrowsmithavailableon the Internet [25].

Swanson’s approach relies on the identification of the twopremises: and . In a large knowledge domain,it is crucial for analysts to have sufficient domain knowledge.Otherwise, to find two such premises is like searching for nee-dles in a haystack. Knowledge domain visualization (KDViz)can narrow down the search space and increase the chance offinding a fruitful line of scientific inquiry.

C. Visualizing Latent Domain Knowledge

In this paper, we distinguish mainstream domain knowledgeand latent domain knowledge along two dimensions: relevanceand citation. Scientific documents in the literature can be clas-sified into four categories according to their relevance to thesubject domain and their citations received from the scientificliterature: mainstream domain knowledge, which typically con-sists of documents of HR and high citations (HC); latent do-main knowledge, which are typically made of documents of

HR but low citations (LC); and two categories of documentsof low relevance. The traditional knowledge discovery such ascitation analysis and domain visualization focuses on the main-stream domain knowledge (HRHC). The focus of latent do-main knowledge discovery and visualization is on the categoryof HR and LC. In this paper, we introduce an approach that canextend the coverage of KDViz from mainstream to latent do-main knowledge (see Fig. 1).

In our earlier work [3], we developed a four-step procedurefor visualizing mainstream domain knowledge. In particular, theprocedure consists of the following four steps.

Step 1) Select highly relevant and highly cited documentsfrom a citation database.

Step 2) Derive citation networks based on the selected pop-ulation of documents and simplify citation networksusing PFNET scaling.

Step 3) Partition the resultant PFNET according to special-ties identified through principal component analysis.

Step 4) Superimpose the citation history of a document orauthor over the citation network.

Our solution to visualizing latent domain knowledge is builtupon this four-step procedure. Instead of simply applying theprocedure on highly relevant and highly cited documents, weincorporate this procedure into a recursive process particularlysuitable for detecting patterns in highly relevant but sparselycited documents. Fig. 2 illustrates the overall strategy of our ap-proach. This approach has three subprocesses. The purpose ofthe first process is to establish a global context by subsequentanalysis and visualization. Indeed, in this process, we apply ourfour-step procedure to the mainstream domain knowledge andgenerate a citation landscape. The second process is domain ex-pansion, which means that we expand our field of view from

Page 6: Visualizing latent domain knowledge - Systems, Man, and

CHEN et al.: VISUALIZING LATENT DOMAIN KNOWLEDGE 523

mainstream domain knowledge to latent domain knowledge. Akey component in this domain expansion process is the selectionof a so-called “exit” landmark from the citation landscape. This“exit” landmark will play a pivot role in tracking latent knowl-edge by “pulling” highly relevant but relatively rarely cited doc-uments into the scene. The “exit” landmark is selected basedon both structural and topical characteristics. Structurally im-portant documents in the citation landscape include branchingpoints, from which one can reach more documents along citationpaths preserved by the network. Topically important documentsare the ones that are closely related to the subject in question.Ideally, a good “exit” landmark should be a classic work in afield of study and it can link to a cluster of closely related doc-uments by citation. We will explain in more detail through casestudies how we choose “exit” landmarks. Once an “exit” land-mark is chosen from the citation landscape, the four-step proce-dure can be applied again to all the documents within a citationchain of up to two citation links. The resultant citation networkrepresents the latent domain knowledge. Finally, we embed thislocal structure back into the global context by providing a refer-ence from the “exit” landmark in the global context to the latentknowledge structure.

In this paper, we describe how we applied this approach totwo case studies. One is about cross-domain applications ofPFNET scaling techniques. The other is about clarifying theperceived connection between BSE and vCJD in contemporaryliterature. We use the Web of Science, a Web-based interfaceto citation databases compiled by the Institute for Scientific In-formation (ISI). We start with a search in the Web of Scienceusing some broad search terms in order to generate a global con-text for subsequent visualization. For example, in the Pathfindercase, we chose to use search terms such asknowledge discovery,knowledge acquisition, knowledge modeling, and Pathfinder.Once the global context is visualized, it is straightforward toidentify an “exit” landmark. In the Pathfinder case, a classic ci-tation of PFNETs is chosen as an “exit” landmark. This “exit”landmark paper serves as the seed in a citation search within theWeb of Science. The citing space of the seeding papercon-tains papers that either cite the seeding paper directly or cite anpaper that in turn cites the paper

Such citing spaces may contain papers beyond the boundaryof the mainstream domain knowledge. The spiral shape of thediagram shown in Fig. 3 implies that one can repeatedly applythis method by identifying another “exit” landmark. Papers con-nected to the landmark by two-step citation chains are gath-ered to represent latent domain knowledge. By using differentways to select citing papers, we can visualize latent knowledgestructures with reference to highly established and frequentlycited knowledge structures. In the following two case studies,we apply the same spiral methodology to illustrate our approach.

TABLE IILEADING ARTICLES IN THETHREEMOSTPROMINENT SPECIALTIESRANKED BY

THE STRENGTH OFFACTOR LOADING. ABSOLUTE VALUES LESSTHAN 0.500ARE SUPPRESSEDFROM THE TABLE. AT LEAST ABOVE-THRESHOLDFACTOR

LOADING IS REQUIRED TO BEINCLUDED IN THE LISTING. THE FIRST

MEMBER OF THEFIRST SPECIALTY IS THE “EXIT” L ANDMARK CHOSEN

FOR DOMAIN EXPANSION

III. CASE STUDY I: PATHFINDER NETWORKS

In the Pathfinder case, we focus on cross-domain applicationsof PFNETs. It is particularly important to ensure that nonmain-stream applications of PFNETs are not overshadowed by main-stream citation peaks.

Page 7: Visualizing latent domain knowledge - Systems, Man, and

524 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001

Fig. 7. Branch representing a new paradigm of incorporating PFNETs intoGSA, a generic framework for structuring and visualization, and its applicationsespecially in strengthening traditional citation analysis.

A. Pathfinder Network (PFNETs) Scaling

PFNET scaling was originally developed by cognitive psy-chologists for structuring modeling [5]. It relies on a triangle in-equality condition to select the most salient relations from prox-imity data. PFNETs have the same set of vertices as the originalgraph. The number of edges in a PFNET, on the other hand, canbe largely reduced.

The topology of a PFNET is determined by two parameters( and ) and the corresponding network is denoted as PFNET( ). The -parameter controls the scope that the triangularinequality condition should be imposed. The-parameter refersto the Minkowski metric used for computing the distance of apath. The weight of a path with links, , is determinedby weights , of each individual link as follows:

The Minowski distance (geodetic) depends on the value of the-metric. For , the path weight is the sum of the link

weights along the path; for , the path weight is computedas Euclidean distance; and for , the path weight is thesame as the maximum weight associated with any link alongthe path

The -parameter specifies that triangle inequalities must be sat-isfied for paths with links

TABLE IIISTRONG NEGATIVE FACTOR LOADING IN FACTOR ONE SUGGESTING A

UNIQUE SPECIALTY. THESEARTICLES PATHFINDER NETWORKS AREUSED,BUT NOT IN ANY WAY SIMILAR TO A TYPICAL PUBLICATION IN THE

PATHFINDER SPECIALTY

When a PFNET satisfies the following three conditions, the dis-tance of a path is the same as the weight of the path:

1) The distance from a document to itself is zero.2) The proximity matrix for the documents is symmetric;

thus, the distance is independent of direction.3) The triangle inequality is satisfied for all paths with up to

links.

If is set to the total number of nodes less one, then thetriangle inequality is universally satisfied over the entire net-work. The number of links in a network can be reduced by in-creasing the value of parameteror . The geodesic distance be-tween two nodes in a network is the length of the minimum-costpath connecting the nodes. A minimum-cost network (MCN),PFNET ( , ), has the least number of links. See[1], [2], [5], and [6] for further details.

Typical applications of PFNETs include modeling a networkof concepts based on similarity ratings given by human ex-perts, constructing procedural and protocol analysis models ofcomplex activities such as air-traffic control, and comparinglearners’ PFNETs at various stages of their learning [4].

Page 8: Visualizing latent domain knowledge - Systems, Man, and

CHEN et al.: VISUALIZING LATENT DOMAIN KNOWLEDGE 525

Fig. 8. Schvaneveldtl’s “exit” landmark in the landscape of the thematic visualization.

In our earlier research, we incorporated PFNETs into ourgeneralized similarity analysis (GSA) framework [1], [26]. Tra-ditionally a typical application of PFNETs rely on proximitydata judged manually. The number of nodes in a typical PFNETranges from 30 to 50, although PFNETs of 2000 nodes were re-ported in one occasion in the 1980s [5]. We introduced a varietyof computer-generated proximity measures along with GSA in-cluding document-document similarity computed based on in-formation retrieval models, state transition probabilities derivedfrom a Web navigation, and co-citations of authors as well asdocuments [6]. These proximity data have extended the scopeof PFNETs to a much wider variety of phenomenon beyond theamount of proximity data one can measure by hand. This exten-sion has not only enriched the topological properties of PFNETsbut also led to valuable insights into the meaning of PFNETs.The Pathfinder case study is motivated by the question: Howdoes this extension fit into the general picture of PFNET appli-cations with reference to traditional Pathfinder applications?

B. Mainstream Domain Knowledge

The global context of the Pathfinder case shown in Fig. 5contains clusters of papers on knowledge discovery, knowledgeacquisition, classification and machine learning, artificial in-telligence, expert systems, and domain knowledge modeling.Pathfinder-related papers are located in the far side of the land-scape view, near the area labels of cognitive psychology and

expert systems (see Fig. 4). This indicates that applications ofPFNETs are closely to these two broad categories. In order topursue latent knowledge structures associated with PFNETs.Schvaneveldt’s 1985 paper was chosen as the first “exit” land-mark because it is located at a point connecting the Pathfinder“peninsula” to other areas in the landscape.

Table I lists further details concerning the structure of theglobal context as derived from factor analysis. Up to 20 leadingpapers in each of the three largest factors, or specialties, arelisted. In essence, Factor 1 corresponds to research in PFNETs.Factor 2 corresponds to classic artificial intelligence. Factor 3corresponds to expert systems and decision support systems.The higher a factor loading, the more typical an paper is as arepresentative member of the specialty. On the other hand, if anpaper has a wide impact, then its loadings on individual factorsmay not be exceedingly high.

C. Latent Domain Knowledge

Fig. 6 shows the latent knowledge structure derived from theciting space of the “exit” landmark paper. This structure is notovershadowed by HC of classic artificial intelligence papers, butit maintains a connecting point with the global context throughthe “exit” landmark, which is the highest citation bar half waydown in the branch pointing to the lower right corner. This de-tailed local structure shows more papers related to the use ofPathfinder.

Page 9: Visualizing latent domain knowledge - Systems, Man, and

526 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001

Similarly, Table II shows leading papers in this latent knowl-edge structure. The classification is more detailed than the onein the global context.

Fig. 7 shows an extended branch from the main PFNET. Thisbranch represents a new area of applying PFNETs. In fact, this isthe area in which PFNETs have been adapted for citation-basedvisualizations.

Table III reveals the fact that papers in this branch all havenegative loadings on Factor 1 and are virtually absent from theremaining factors. This is interesting because, on the one hand,the first specialty provides a dimension that can account for boththe traditional applications of Pathfinders and the new branchof applications. On the other hand, since documents in the newbranch are so consistently classified by factor loading, they canbe treated as a subspecialty.

Fig. 8 shows a simple research function which lights up all thepapers by Schvaneveldt, a central figure in the development ofPFNET scaling. The position of each lit paper and the directionof the hosting branch provide insightful information into thenature of the paper and the branch.

IV. CASE STUDY II: POSSIBLE LINKS BETWEEN

BSEAND vCJD

Prusiner, a Nobel prize winner for his discovery of prions—atype of bad protein, suggested that an abnormal form of a pro-tein is responsible for diseases such as scrapie in sheep, BSE incattle (mad cow disease), and Creutzfeldt–Jakob disease (CJD)in humans. These diseases are known as transmissible spongi-form encephalopathy (TSE).

A. Mainstream Domain Knowledge

BSE was first found in 1986 in the U.K.. A sponge-like mal-formation was found in the brain tissue from affected cattle.It was identified as a new prion disease, a new TSE disease.The BSE epidemic in the U.K. reached its peak in 1992 andhas since steadily declined. CJD was first discovered in the1920s by two German neurologists. It is the principal form of anumber of human TSE diseases. In humans, the prion-based dis-ease is related to CJD, Kuru (transmitted by cannibalism), Ger-stmann–Sträussler–Scheinker disease (GSS), and fatal familialinsomnia (FFI).

The vCJD is an unrecognized variant of CJD discovered bythe National CJD Surveillance Unit in Edinburgh, U.K. ThevCJD is characterized clinically by a progressive neuropsy-chiatric disorder. Neuropathology shows marked spongiformchange throughout the brain. The media reported a growingconcern in the general public that BSE may have passed fromcattle to humans. The British government assured the publicthat the beef is safe, but in 1996 it announced there is possiblya link between BSE and vCJD. The central question in this casestudy is what scientific literature tells us about the possible linkbetween BSE and vCJD.

First, we generated a mainstream-driven thematic landscapeof the topic of BSE and CJD by searching the Web of Sciencewith the term “BSE or CJD” (see Fig. 9). The strongest specialty,Prion protein, is colored in red; the BSE specialty is in green;and the CJD specialty is in blue. In particular, the very light color

Fig. 9. Global view of mainstream research in BSE and vCJD.

TABLE IVTHE CITATION PROFILE OFPURDEY’S ARTICLES. THE ID NUMBERS ARE

CROSS-REFERENCES TOLABELS IN FIG. 12. THE HIGHESTCITATION COUNT IS

9, WHICH IS CONSIDERABLY LESSTHAN CITATIONS OF MORE THAN 900TIMES RECEIVED BY KEY ARTICLES ONPRION THEORY

Fig. 10. Year-by-year animation highlights the dynamics of citation trends.Four snapshots are extracted from the animation sequence of the BSE data(1981–2001), containing papers on two-step citation chains to Brown’s 1997paper.

of the vCJD specialty indicates that this is an area where otherspecialties overlap.

B. Manganese–Copper Hypothesis

The mainstream view on BSE has focused on the food chain:cows got BSE by eating feed made from sheep infected withscrapie, and, similarly, humans get vCJD by eating BSE infectedbeef. However, Purdey, a British organic dairy farmer, believedthat the unbalanced manganese and copper in the brain is thereal cause of BSE and vCJD [27]. He studied the environment

Page 10: Visualizing latent domain knowledge - Systems, Man, and

CHEN et al.: VISUALIZING LATENT DOMAIN KNOWLEDGE 527

Fig. 11. Expanded knowledge domain, featuring Brown’s paper, which is the “exit” landmark, and Prusiner’s original paper on prion theory. The inset shows therelative location of this local view in the global context.

Fig. 12. All Purdey’s papers are located in the same cluster as three highly cited papers on possible connections between BSE and CJD. The upper-left frameshows the locations of Purdey’s papers in this cluster. The paper numbers are cross references to the bibliographic details in Table IV. The lower-left frame showsthe position of this cluster in the global context.

in areas known to have found spongiform diseases, such as Col-orado in the U.S., Iceland, Italy, and Slovakia. He found a highlevel of manganese and low levels of copper in all of them.

Purdey’s research on the manganese–copper hypothesisshows the sign of latent domain knowledge. He has publishedin scientific journals, but they are not highly cited by otherresearchers (see Table IV). We need to find a gateway fromwhich we can expand the global landscape of mainstreamresearch in BSE and vCJD and place Purdey’s research into

the big picture of this issue. Recall that we need to find an“exit” landmark in the global landscape to conduct the domainexpansion, but none of Purdey’s publications was featured inthe scene. To solve this problem, we need to find someone whois active in the manganese–copper paradigm and also includedin the mainstream visualization view.

Brown, a biochemist at Cambridge University, Cambridge,U.K., is among scientists who did cite Purdey’s publications.Brown provides a good candidate for an “exit” landmark.

Page 11: Visualizing latent domain knowledge - Systems, Man, and

528 IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS, VOL. 31, NO. 4, NOVEMBER 2001

On the one hand, Brown is interested in the role of themanganese–copper balance in prion diseases [28] and hecited Purdey’s papers. On the other hand, he is interested inPrusiner’s prion theory and published about 50 papers onprion diseases. Indeed two of his papers are featured in themainstream view visualization of the case study. We chose his1997 paper published inExperimental Neurologyas the “exit”landmark (see Fig. 11). The citing space to his work is shownin Figs. 10 and 11.

Fig. 11 shows the location of the “exit” landmark paper byBrown. Prusiner’s famous 1982 paper on prion theory is alsolocated nearby. Brown’s research interest in prion theory is ap-parent from its closeness to Prusiner’s paper. Fig. 12 shows moredetails of the manganese–copper paradigm, in particular the lo-cations of Purdey’s papers. Four of his papers are located withinthe same cluster as three scientists well cited for their works invCJD, namely, Collinge, Bruce, and Will.

Because of the relatively LC rates of Purdey’s papers, conven-tional citation analysis is unlikely to take them into account. Pre-dominant papers in this cluster all address the possible link be-tween BSE and vCJD. This observation suggests how Purdey’spapers might fit into the mainstream domain knowledge.

We have demonstrated it that our approach can be success-fully applied to find connections that would be otherwise ob-scured. The BSE case study has shown that Purdey’s theory isfeeding in the mainstream research on BSE and CJD throughBrown and his group.

V. DISCUSSIONS ANDCONCLUSIONS

In order to track the development of scientific paradigms, it isimportant to take into account latent domain knowledge as wellas mainstream domain knowledge. By incorporating an infor-mation visualization procedure originally developed for visual-izing mainstream domain knowledge into a recursive process,we have been able to demonstrate the potential of our approachto visualizing not only highly relevant and highly cited doc-uments, but also highly relevant but infrequently cited docu-ments.

Typical citation-based domain visualization approaches havefocused on citation frequencies of high-profiled research in aknowledge domain. Consequently, resultant visualizations arestrongly biased toward highly cited works. Although highlycited works constitute the core knowledge of a domain, theirpresence inevitably outshines the presence of latent domainknowledge if they are measured by the same yardstick. The useof two-step citation chains allows us to glean latent domainknowledge and maintain the global picture of where such latentdomain knowledge fits.

A natural extension of the research is to explore ways that cancombine approaches based on citation patterns and those basedon word-occurrence patterns to pinpoint a significant mismatchbetween the citation strength and word co-occurrence patterns.There are other potentially useful ways to uncover latent do-main knowledge. Many techniques developed in scientometricsfor quantitative studies of science can be used to generate struc-tural representations of domain knowledge. By comparing andcontrasting differences across a variety of structural representa-

tions one can expect to spot missing links and potentially note-worthy connections. For example, if a co-word analysis revealsa strong link between intellectually-related works. In contrast,if such links are absent or weak in citation networks, then itcould be important for scientists to know whether they mighthave overlooked something potentially significant.

On the one hand, visualizing domain knowledge in general isa revival of a long established quest for quantitative studies ofscientific discoveries and scientific paradigms, especially dueto the advances in enabling techniques such as digital librariesand information visualization. On the other hand, visualizingdomain knowledge should set its own research agenda in thenew era of science and technology so as to provide valuabledevices for scientists, philosophers of science, sociologists ofknowledge, librarians, government agencies, and others to graspcrucial developments in science and technology.

In this paper, we have examined the role of citation chains invisualizing latent domain knowledge. The new visualization ap-proach can not only capture the intellectual structure of highlycited works but also make it possible to uncover connections be-tween latent domain knowledge and the body of the mainstreamdomain knowledge. The two case studies have shown that thisapproach has the potential to be a new way of supporting knowl-edge tracking and knowledge management.

ACKNOWLEDGMENT

The authors would like to thank the anonymous referees andthe guest editors for their valuable comments that helped to im-prove the quality of the original manuscript.

REFERENCES

[1] C. Chen, Information Visualization and Virtual Environ-ments. London, U.K.: Springer-Verlag London, 1999.

[2] C. Chen and R. J. Paul, “Visualizing a knowledge domain’s intellectualstructure,”IEEE Computer, vol. 34, pp. 65–71, 2001.

[3] C. Chen, “Visualization of knowledge structures,” inHandbook of Soft-ware Engineering and Knowledge Engineering, S. K. Chang, Ed, Sin-gapore: World Scientific, 2002, vol. 2, p. 700.

[4] R. W. Schvaneveldt,Pathfinder Associative Networks: Studies in Knowl-edge Organization. ser. Ablex Series in Computational Sciences, D. Par-tridge, Ed. Norwood, NJ: Ablex, 1990.

[5] R. W. Schvaneveldt, F. T. Durso, and D. W. Dearholt, “Network struc-tures in proximity data,” inThe Psychology of Learning and Motivation,24, G. Bower, Ed. New York: Academic, 1989, pp. 249–284.

[6] C. Chen, “Visualising semantic spaces and author co-citation networksin digital libraries,” Inform. Process. Manage., vol. 35, pp. 401–420,1999.

[7] T. S. Kuhn,The Structure of Scientific Revolutions. Chicago, IL: Univ.Chicago Press, 1962.

[8] H. G. Small, “A co-citation model of a scientific specialty: A longitu-dinal study of collagen research,”Social Stud. Sci., vol. 7, pp. 139–166,1977.

[9] H. D. White and K. W. McCain, “Visualizing a discipline: An authorco-citation analysis of information science, 1972–1995,”J. Amer. Soc.Inform. Sci., vol. 49, pp. 327–356, 1998.

[10] C. Chen, T. Cribbin, R. Macredie, and S. Morar, “Visualizing andtracking the growth of competing paradigms: Two case studies,”J.Amer. Soc. Inform. Sci., 2001, to be published.

[11] E. Garfield, S. I. H. , and R. J. Torpie,The Use of Citation Data in Writingthe History of Science. Philadelphia, PA: Inst. Sci. Inform., 1964.

[12] D. D. Price, “Networks of scientific papers,”Sci., vol. 149, pp. 510–515,1965.

[13] I. Sher and E. Garfield, “New tools for improving and evaluating theeffectiveness of research,” inProc. Res. Program Effectiveness, Wash-ington, DC, 1966.

Page 12: Visualizing latent domain knowledge - Systems, Man, and

CHEN et al.: VISUALIZING LATENT DOMAIN KNOWLEDGE 529

[14] H. G. Small and B. C. Griffith, “The structure of scientific litera-tures—Part I: Identifying and graphing specialties,”Sci. Stud., vol. 4,pp. 17–40, 1974.

[15] E. Garfield, “Scientography: Mapping the tracks of science,”CurrentContents: Social Behavioral Sci., vol. 7, pp. 5–10, 1994.

[16] H. D. White and B. C. Griffith, “Author co-citation: A literature measureof intellectual structure,”J. Amer. Soc. Inform. Sci., vol. 32, pp. 163–172,1981.

[17] D. C. He and L. Wang, “Texture unit, texture spectrum, and texture anal-ysis,” IEEE Trans. Geosci. Remote Sensing, vol. 28, 1990.

[18] D. R. Swanson, “On the fragmentation of knowledge, the connection ex-plosion, and assembling other people’s ideas,”Bull. Amer. Soc. Inform.Sci. Technol., vol. 27, pp. 12–14, 2001.

[19] , “Fish oil, Raynauds syndrome, and undiscovered public knowl-edge,”Pers. Biol. Med., vol. 30, pp. 7–18, 1986.

[20] D. R. Swanson and N. R. Smalheiser, “An interactive system for findingcomplementary literatures: A stimulus to scientific discovery,”Artif. In-tell., vol. 91, pp. 183–203, 1997.

[21] D. R. Swanson, “Migraine and magnesium—Eleven neglected connec-tions,” Pers. Biol. Med., vol. 31, pp. 526–557, 1988.

[22] , “Somatomedin-C and arginine—Implicit connections betweenmutually isolated literatures,”Pers. Biol. Med., vol. 33, pp. 157–186,1990.

[23] N. R. Smalheiser and D. R. Swanson, “Assessing a gap in the biomedicalliterature—Magnesium-deficiency and neurologic disease,”Neurosci.Res. Commun., vol. 15, pp. 1–9, 1994.

[24] , “Indomethacin and Alzheimer’s disease,”Neurol., vol. 46, p. 583,1996.

[25] D. R. Swanson, “Computer-assisted search for novel implicit connec-tions in text databases,”Abstr. Papers Amer. Chem. Soc., vol. 217, p.010-CINF, 1999.

[26] C. Chen, “Generalized similarity analysis and pathfinder networkscaling,”Interacting Comput., vol. 10, pp. 107–128, 1998.

[27] E. Stourton,Mad Cows and an Englishman. London, U.K.: BBC2,2001.

[28] D. R. Brown, F. Hafiz, L. L. Glasssmith, B. S. Wong, I. M. Jones, C.Clive, and S. J. Haswell, “Consequences of manganese replacement ofcopper for prion protein function and proteinase resistance,”EMBO J.,vol. 19, pp. 1180–1186, 2000.

Chaomei Chenreceived the Ph.D. degree in computer science from the Uni-versity of Liverpool, Liverpool, U.K., in 1995.

Currently, he is an Associate Professor at the College of Information Sci-ence and Technology, Drexel University, Philadelphia, PA. His research interestsinclude information visualization, knowledge discovery, domain visualization,human–computer interaction, and hypertext. .

Dr. Chen is a Member of the ACM, the IEEE Computer Society, and theAmerican Society for Information Science and Technology.

Jasna Kuljis received the Ph.D. degree in information systems from the LondonSchool of Economics, University of London, London, U.K., in 1995.

Currently, she is a Senior Lecturer at the Department of Information Systemsand Computing, Brunel University, Uxbridge, U.K. Her current research inter-ests include human–computer interfaces, graphical user interface design, andthe usability of interactive computer systems.

Ray J. Paul received the Ph.D. degree in operational research from Hull Uni-versity, Hull, U.K., in 1974.

Currently, he is a Professor of simulation modeling in the Department of In-formation Systems and Computing, Brunel University, Uxbridge, U.K. His re-search interests are simulation modeling processes, software environments forsimulation modeling, and information systems development.

Dr. Paul is a Member of the ACM, the IEEE Computer Society, the Opera-tional Research Society, the British Computer Society, and the Society for Com-puter Simulations. He is the European Chapter Chairman of ACM SIGSIM.